How ML can prevent network outage in a 5G world
Estimated reading time: 5 minutes
Rules-based systems can’t cope when a network generates millions of alarms. Time to consider machine learning algorithms…
Every year, mobile network operators spend nearly a quarter of their revenue on network management and maintenance.
They have little choice.
It goes without saying that network outages annoy customers. This is not just bad for an MNO’s reputation; it is also potentially ruinous for its bottom line.
Why? Because downtime is one of the biggest drivers of churn. In one study of cable customers, network performance was the main reason cited for quitting. Among ‘conditional churners’ (those considering leaving), 75 percent reported network issues.
In addition to lost customers and damaged credibility, network outages can also lead to costly employee overtime and possibly even penalties for not meeting service level agreements (SLAs).
It all explains why one study estimated that network outages cost the world’s MNOs billions of dollars.
However, that study was carried out in 2013. Today, the stakes are so much higher.
This is because traditional telecommunications carriers have become Internet service providers (ISPs). Thanks to the widespread adoption of 4G, people and enterprises now rely on mobile networks for critical data services.
So, a network outage no longer leads merely to dropped phone calls but, potentially, to the failure of critical enterprise services.
Now, as the industry prepares to roll out standalone 5G, the volume of data signals generated by the network is set to escalate yet again.
It raises the question: how can MNOs predict and fix network faults in such a complex environment filled with so much ‘noise’?
The solution for more forward-looking carriers lies with handing over diagnostics to advanced systems that use machine learning to:
• Accurately analyse millions of network alarms
• Instantaneously identify the alarms most likely to lead to a fault
• Discover alarm relationships and alarm families
• Eliminate ‘noisy’ alarms
Before we dive deeper into these ML-based solutions, let’s look at how the majority of today’s network fault detection systems work.
Network diagnostics now: people, rules and alarms
Every day, the world’s MNOs engage in a battle to keep their networks operational. Unfortunately, there is plenty that can go wrong.
In addition to technical faults (physical link failures, traffic congestion/overload, chip failures), there are other, more unpredictable factors, from cyberattacks to thunderstorms.
When something in the system does fail, it shows up as an anomaly in the network data. So MNOs put in place monitoring systems that trigger alarms when these anomalies occur. Network Operations Centers (NOCs), staffed by teams of human analysts, then scrutinise this data and try to answer three key questions:
• What happened to the network?
• Why did it happen?
• What will happen next?
The problem with this human-centered approach is obvious. NOC agents can only process a limited amount of information, so they have to prioritise their attention. For this reason, they ‘mute’ the majority of the alarms they receive.
In so doing, they risk ignoring small problems that might develop into bigger faults and bring the network down.
Chris Neisinger, Chief Technology Officer at Guavus, elaborates on why this ‘old school’ approach is limited.
“Most fault detection systems use rules that were defined and tuned by experts based on their experience,” he says.
“But these systems can only find what they are looking for. And they rely on people to update them. The problem is that humans are unreliable. Engineers write things down and maybe don’t pass them on. Meanwhile, rules get too complex to maintain, or they change and then you get false alarms.
“Also, network engineers can’t look at every signal, so they put in silencing features. As a result, they miss things. They look through their logs after an outage, and often find a signal from weeks before that they muted.”
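To see why such systems are brittle, here is a toy Python sketch of the kind of static, expert-tuned rule set Neisinger describes, complete with a ‘silencing’ feature. Every threshold, alarm type and field name is invented for illustration; no real fault-detection product works from this exact table:

```python
# A minimal sketch of a static, expert-tuned rules engine.
# All thresholds, alarm types and field names are illustrative.

STATIC_RULES = {
    "LINK_UTILISATION": lambda a: a["value"] > 0.95,  # congestion rule
    "BIT_ERROR_RATE":   lambda a: a["value"] > 1e-3,  # link-quality rule
    "TEMPERATURE":      lambda a: a["value"] > 85.0,  # chip-failure rule
}

# The 'silencing' feature: alarm types engineers decided were too noisy.
MUTED_TYPES = {"LINK_FLAP", "MINOR_SYNC_DRIFT"}

def triage(alarm: dict) -> str:
    """Decide what a rules-based pipeline would do with a single alarm."""
    if alarm["type"] in MUTED_TYPES:
        return "muted"       # the signal found in the logs weeks later
    rule = STATIC_RULES.get(alarm["type"])
    if rule is None:
        return "unknown"     # the system can only find what it looks for
    return "escalate" if rule(alarm) else "ignore"

print(triage({"type": "TEMPERATURE", "value": 91.2}))      # escalate
print(triage({"type": "MINOR_SYNC_DRIFT", "value": 9.0}))  # muted
```

The failure modes fall straight out of the structure: anything not in the rule table is invisible, and anything in the mute list is silently discarded, however predictive it turns out to be.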
Clearly, these human-centered, alarm-based systems are struggling to cope with the volume of signals generated by today’s networks.
However, things are about to get vastly more challenging. Standalone 5G (SA) is coming, and it will drastically increase network complexity yet again.
Standalone 5G: an unprecedented explosion in network data
How much greater is the challenge of analysing the data emerging from standalone 5G networks, compared with their 4G predecessors?
Orders of magnitude greater.
Why? Because the ‘standalone’ 5G Core is virtual: its foundational technologies (Network Function Virtualization and Software Defined Networking) turn physical network components into software.
Virtualization will make 5G networks much bigger and able to support millions more connections. MNOs can use this extra capacity to allocate bandwidth to enterprises. Effectively, this ‘network slicing’ gives private companies the ability to run their own mini-networks.
All of which vastly escalates the amount of data generated by the network as a whole.
And to further complicate matters, the vast majority of connected 5G devices will be machines. So when there is an outage, these devices will not be able to report a problem.
Ken Rokoff, VP Head of Product & Strategic Alliances at Guavus, believes many industry insiders radically underestimate the extent of the 5G data deluge coming their way.
He uses a simple analogy to illustrate the depth of this misunderstanding.
“People lack perspective when it comes to 5G,” he says. “They think: well, I can already ride a bicycle, so now I am going to ride a motorcycle. But this move from 4G to 5G is more like going from a bicycle to stepping into the cockpit of an Airbus plane.
“The sheer amount of data and telemetry involved is orders of magnitude greater.”
Rokoff believes the exponential increase in complexity will compel carriers to consider a new automated approach to network analytics.
“In the past, MNOs have used rules to solve problems,” he says. “But the truth is, rules-based systems won’t work in a 5G world, where there are millions of elements rather than hundreds.
“In this world, you need systems that can do advanced network analytics. They are no longer a ‘nice to have’. They are mandatory if you are going to run your network efficiently.”
How to mitigate the risks: moving to ML-based fault detection
In the world of network fault analytics, everything comes back to three acronyms:
• MTTA (mean-time-to-acknowledge)
• MTTD (mean-time-to-diagnose)
• MTTR (mean-time-to-resolve)
MNOs have to bring these numbers down if they want to reduce the number of damaging outages.
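For illustration, all three metrics are simply averages over incident timestamps. Here is a minimal sketch, assuming each incident record carries the relevant times (the field names are invented):

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records; field names are illustrative.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    {"raised": t0, "acknowledged": t0 + timedelta(minutes=4),
     "diagnosed": t0 + timedelta(minutes=30), "resolved": t0 + timedelta(hours=2)},
    {"raised": t0, "acknowledged": t0 + timedelta(minutes=10),
     "diagnosed": t0 + timedelta(hours=1), "resolved": t0 + timedelta(hours=5)},
]

def mean_minutes(start_key: str, end_key: str) -> float:
    """Average elapsed minutes between two timestamps across all incidents."""
    return mean((i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents)

print(f"MTTA: {mean_minutes('raised', 'acknowledged'):.0f} min")
print(f"MTTD: {mean_minutes('raised', 'diagnosed'):.0f} min")
print(f"MTTR: {mean_minutes('raised', 'resolved'):.0f} min")
```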
However, we have already established how difficult it is for rules-based systems to handle the sheer volume of alarms generated by today’s mobile networks.
For this reason, many forward-looking MNOs now use ML-based systems to do the work instead.
These systems carry out the task of monitoring network activity without human intervention. They can manage millions of alarms simultaneously. Over time, they can identify which to act upon and which to ignore.
Here are four ways in which ML-based systems produce better results.
#1. They escalate alarms for predicted incidents
Probabilistic algorithms prioritize alarms that have a high probability of leading to network incidents. Typically, this is just 10 percent of all alarms. Engineers can then resolve these issues without relying on network inventory, topology or static rules.
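As a minimal sketch of the idea (not any vendor’s actual algorithm), one could train a classifier on historical alarms labelled by whether they preceded an incident, then escalate only the high-probability ones. The features, synthetic data and 0.9 threshold below are all assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic history: each row describes one past alarm with three invented
# numeric features (say severity, repeat count, affected links); the label
# records whether it actually preceded an incident.
rng = np.random.default_rng(seed=0)
X = rng.random((1000, 3))
y = (0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.1, 1000) > 0.55).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# Score a batch of live alarms and escalate only the high-risk minority.
live_alarms = rng.random((20, 3))
incident_prob = model.predict_proba(live_alarms)[:, 1]
escalated = np.where(incident_prob > 0.9)[0]
print(f"Escalating {len(escalated)} of {len(live_alarms)} alarms")
```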
#2. They reduce the noise of low-impact alarms
Conversely, ML-based systems learn over time which alarms don’t indicate serious problems. They deprioritize them. They can also suppress any alarm related to scheduled maintenance events.
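A sketch of how such noise suppression might look; the incident-rate table, maintenance schedule and 1 percent cut-off are all invented for illustration:

```python
from datetime import datetime

# Learned offline: the fraction of past occurrences of each alarm type that
# actually preceded an incident. Values here are invented.
historical_incident_rate = {"LINK_FLAP": 0.001, "FAN_SPEED": 0.002, "BER": 0.41}

# Scheduled maintenance windows: (element_id, start, end). Illustrative.
maintenance = [("cell-042", datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 6, 0))]

def is_noise(alarm: dict) -> bool:
    """Return True if the alarm can be safely deprioritized."""
    # Suppress anything raised by an element under planned maintenance.
    for element, start, end in maintenance:
        if alarm["element"] == element and start <= alarm["time"] <= end:
            return True
    # Deprioritize types that history says almost never matter.
    return historical_incident_rate.get(alarm["type"], 1.0) < 0.01

alarm = {"type": "LINK_FLAP", "element": "cell-042",
         "time": datetime(2024, 1, 1, 3, 15)}
print(is_noise(alarm))  # True: maintenance window, and low historical impact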
#3. They consolidate alarms
Sometimes a single event can trigger multiple alarms. ML-based systems can consolidate them into one. This avoids engineers wasting time investigating multiple trails.
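A minimal sketch of one of the simplest grouping strategies, time-window consolidation; a production system would also use topology and learned correlations, and the 30-second window here is an assumption:

```python
from datetime import datetime, timedelta

# Illustrative stream of alarms, already sorted by time.
alarms = [
    {"time": datetime(2024, 1, 1, 9, 0, 0),  "element": "router-7", "type": "LINK_DOWN"},
    {"time": datetime(2024, 1, 1, 9, 0, 2),  "element": "cell-101", "type": "BACKHAUL_LOSS"},
    {"time": datetime(2024, 1, 1, 9, 0, 3),  "element": "cell-102", "type": "BACKHAUL_LOSS"},
    {"time": datetime(2024, 1, 1, 9, 45, 0), "element": "cell-300", "type": "HIGH_TEMP"},
]

def consolidate(alarms, window=timedelta(seconds=30)):
    """Merge bursts of alarms that arrive close together into single incidents."""
    incidents, current = [], [alarms[0]]
    for alarm in alarms[1:]:
        if alarm["time"] - current[-1]["time"] <= window:
            current.append(alarm)      # same burst: likely one root event
        else:
            incidents.append(current)  # gap: start a new incident
            current = [alarm]
    incidents.append(current)
    return incidents

for incident in consolidate(alarms):
    print(f"1 incident covering {len(incident)} alarm(s)")
```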
#4. They reveal relationships between alarms
Similarly, ML-based systems can group related alarms into families for deeper root-cause analysis.
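One simple way to discover such families is to cluster alarm types that repeatedly co-occur in past incidents. A sketch, with an invented history and an assumed co-occurrence threshold:

```python
from collections import Counter
from itertools import combinations

# Invented history: each inner list holds the alarm types seen together
# during one past incident.
history = [
    ["LINK_DOWN", "BACKHAUL_LOSS", "CELL_OUTAGE"],
    ["LINK_DOWN", "BACKHAUL_LOSS"],
    ["HIGH_TEMP", "FAN_SPEED"],
    ["HIGH_TEMP", "FAN_SPEED", "CPU_THROTTLE"],
]

# Count how often each pair of alarm types appears in the same incident.
pair_counts = Counter()
for incident in history:
    pair_counts.update(combinations(sorted(set(incident)), 2))

# Merge strongly co-occurring types into families (tiny union-find).
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent.setdefault(parent[x], parent[x])
        x = parent[x]
    return x

for (a, b), count in pair_counts.items():
    if count >= 2:                     # assumed co-occurrence threshold
        parent[find(a)] = find(b)

families = {}
for alarm_type in {t for incident in history for t in incident}:
    families.setdefault(find(alarm_type), []).append(alarm_type)
print(list(families.values()))
# e.g. [['LINK_DOWN', 'BACKHAUL_LOSS'], ['CELL_OUTAGE'], ['HIGH_TEMP', 'FAN_SPEED'], ...]
```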
Of course, the ultimate pay-off of advanced analytics solutions is that they are self-healing, and can even anticipate faults before they occur.
Chris Neisinger says: “These models are continuously training themselves, and they get more accurate over time. So let’s say the operator adds a new cell site. This changes the network architecture, and it means a new model needs to be created.
“An ML-based system will automatically adjust itself. Within days it will be very accurate again. And no person needs to be involved.”
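A minimal sketch of that self-adjusting loop; the trigger condition, field names and retraining stub are placeholders for whatever pipeline the operator actually runs:

```python
# All names and stubs here are placeholders, not a real product's API.

known_sites = {"cell-001", "cell-002"}

def retrain_model(sites):
    """Stand-in for the operator's real training pipeline."""
    print(f"Retraining fault model on {len(sites)} sites...")
    return object()  # a freshly trained model would be returned here

model = retrain_model(known_sites)

def on_alarm(alarm):
    """Retrain automatically when the topology changes; no human involved."""
    global model
    if alarm["site"] not in known_sites:   # e.g. a new cell site appears
        known_sites.add(alarm["site"])
        model = retrain_model(known_sites)
    # ...score the alarm with the current model...

on_alarm({"site": "cell-003", "type": "LINK_DOWN"})  # triggers retraining
```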