Liquid faced another fiber break this morning that caused a nationwide internet disruption. Instead of the internet coming to a dead stop, it just got quite slow. That is the worst outcome, really, as I would rather have a total blackout than endure slow internet. But what actually goes on when the internet goes through a fault-induced blackout in the form of a fiber break?
How the network is designed
Back in computer class in primary or high school, we were taught the types of networks: bus, ring, star, mesh, and hybrid. It sounded trivial back then, but these are the same concepts applied to nationwide network design, taking advantage of what each topology offers. Ring, mesh, and hybrid are the most commonly used topologies, and this is for two main reasons: reliability and economy.
Reliability
A reliable network is one where each node has multiple links to and from it. That way, if one link fails, the others can share the load and keep the whole network operational. This is usually achieved using a mesh topology. It is the most reliable topology; however, it is also the least economical. Connecting each network element or node to every other element or node is fine up to a certain point. Beyond that, the number of links grows quadratically and the cost becomes unrealistic.
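To see why, here is a minimal sketch in Python (node counts picked purely for illustration) comparing how many links a full mesh needs against a ring:

```python
# Minimal sketch: compare how many links a full mesh needs versus a ring.
# Node counts below are illustrative, not any real network's size.

def mesh_links(n: int) -> int:
    """A full mesh connects every node to every other node: n*(n-1)/2 links."""
    return n * (n - 1) // 2

def ring_links(n: int) -> int:
    """A ring connects each node to exactly two neighbors: n links."""
    return n

for n in (5, 10, 20, 50):
    print(f"{n:>3} nodes: ring = {ring_links(n):>3} links, mesh = {mesh_links(n):>5} links")

# 5 nodes needs 10 mesh links, but 50 nodes already needs 1225 --
# the mesh link count grows quadratically while the ring grows linearly.
```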
Economy
We need a reliable network, but we also need to keep it economical. To manage that, we mix topologies so we get an acceptable level of reliability without breaking the bank. This is where hybrid topologies come into play. A hybrid topology combines characteristics of multiple topologies in one network.
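As a rough illustration of the idea, here is a small sketch, using made-up node names and links, that checks whether a hybrid network (a ring backbone with a couple of mesh-like cross-links) stays connected when one link is cut:

```python
# Minimal sketch: a hypothetical hybrid topology (a ring backbone with two
# extra cross-links), and a check that the network stays connected when a
# single link is cut. Node names and links are made up for illustration.
from collections import deque

links = {
    ("Harare", "Mutare"), ("Mutare", "Masvingo"), ("Masvingo", "Bulawayo"),
    ("Bulawayo", "Gweru"), ("Gweru", "Harare"),          # ring backbone
    ("Harare", "Masvingo"), ("Bulawayo", "Harare"),      # mesh-like cross-links
}

def is_connected(link_set):
    """Breadth-first search: can every node still reach every other node?"""
    nodes = {n for link in link_set for n in link}
    adjacency = {n: set() for n in nodes}
    for a, b in link_set:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, queue = set(), deque([next(iter(nodes))])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(adjacency[node] - seen)
    return seen == nodes

# Cut one link and see whether traffic can still be rerouted.
broken = links - {("Gweru", "Harare")}
print(is_connected(links), is_connected(broken))   # True True
```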
What goes on behind the scenes when a fault occurs
IAPs have measures in place for faults. The big ones like Liquid and TelOne will have multiple links feeding the internet into the country. Liquid Intelligent Solutions has 7 of these across the borders of Zimbabwe. However, only 5 of these connect to our neighboring countries and eventually terminate at an undersea cable, which is how the majority of the internet arrives in Africa.
All these links will have different throughput capacities, and they are organized deliberately: the main link has the highest capacity and the subsequent links have progressively lower capacities. If the main link fails, the load is shared amongst all the available links. However, it's not like these other links are kept idle. They are also operational, so if a high-capacity link fails, the remaining links might not have enough spare capacity to accommodate all the traffic the main link was carrying. That congests the network and gives all of you extremely slow speeds.
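As a rough sketch of the arithmetic, with capacities and traffic figures invented purely for illustration, this is what happens to utilization when the main link drops out:

```python
# Rough sketch of the congestion arithmetic when the main link fails.
# Capacities (Gbps) and traffic demand are invented for illustration only.

links = {"main": 100, "backup_1": 40, "backup_2": 20, "backup_3": 10}
traffic = 120  # Gbps of demand the network is carrying at the time of the fault

def utilization(available_links, demand):
    """Return demand as a fraction of the capacity that is still up."""
    capacity = sum(available_links.values())
    return demand / capacity

print(f"Before the break: {utilization(links, traffic):.0%} of capacity used")

after_break = {name: cap for name, cap in links.items() if name != "main"}
print(f"After the break:  {utilization(after_break, traffic):.0%} of capacity used")
# Anything above 100% means queues build up and everyone sees very slow speeds.
```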
Why can’t all the links just have the same capacity?
There are two sides to capacity for IAPs: the installed capacity of their links and the bandwidth capacity that they buy from other IAPs in territories with access to undersea fiber cables. Installed capacity is the easy one to sort out because it's just hardware. It can be scaled up to whatever level the IAP is comfortable with.
Bandwidth, however, is basically how much capacity you reserve for yourself, and the more capacity you reserve, the more you have to pay. This can become very uneconomical, so you'll find that an IAP can choose to give the few links carrying the highest volume of paying traffic the most bandwidth, while the backup links get very little bandwidth to play with. Since these major faults are not a frequent occurrence, they take the gamble to keep costs down.
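Here is a back-of-envelope sketch of that gamble, where all prices, capacities, and outage frequencies are hypothetical numbers just to show the shape of the trade-off:

```python
# Back-of-envelope sketch of the gamble described above. All prices, capacities
# and outage frequencies are hypothetical, just to show the shape of the trade-off.

price_per_gbps_month = 1_000   # what an upstream IAP might charge per Gbps reserved
main_link_gbps = 100
backup_reserved_gbps = 20      # what the IAP actually keeps on the backup links
outages_per_year = 1           # rare, fault-induced events like a fiber break

# Option A: reserve full capacity on the backups all year round.
full_backup_cost = main_link_gbps * price_per_gbps_month * 12

# Option B: reserve a small slice and accept congestion during the odd outage.
lean_backup_cost = backup_reserved_gbps * price_per_gbps_month * 12

print(f"Full duplication of bandwidth: ${full_backup_cost:,} per year")
print(f"Lean backup reservation:       ${lean_backup_cost:,} per year")
print(f"Savings taken as the gamble:   ${full_backup_cost - lean_backup_cost:,} "
      f"against roughly {outages_per_year} outage(s) a year")
```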
The ideal scenario is a duplication of the network where either hot standby or active standby is employed. Hot standby means the backup system sits ready and, when the main system fails, it kicks in with the same capacity as the main system. Active standby means both the main system and the backup system operate simultaneously, and if one fails, the other simply takes up the load with no drop in performance.
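A minimal sketch of the two schemes, with made-up capacities and a made-up switchover time, might look like this:

```python
# Minimal sketch of the two redundancy schemes described above. The capacity
# and the switchover delay are made-up figures, purely to illustrate the difference.

FULL_CAPACITY = 100  # Gbps each system can carry on its own (hypothetical)

def hot_standby(main_up: bool) -> dict:
    """Backup is powered and ready but idle; it takes over only after a failover."""
    if main_up:
        return {"serving": "main", "capacity": FULL_CAPACITY, "switchover_seconds": 0}
    return {"serving": "backup", "capacity": FULL_CAPACITY, "switchover_seconds": 5}

def active_standby(main_up: bool) -> dict:
    """Both systems carry traffic at once; losing one needs no switchover at all."""
    systems = 2 if main_up else 1
    return {"serving": f"{systems} system(s)", "capacity": FULL_CAPACITY,
            "switchover_seconds": 0}

print(hot_standby(main_up=False))     # brief switchover, then full capacity again
print(active_standby(main_up=False))  # no switchover, no drop in capacity
```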
These redundancy methods are already used within some core network elements like gateway routers and switches. But they are not applied to the nationwide network because of how expensive it would be to duplicate everything.
Locally hosted services will work as normal
Whilst the general internet will be down for the count, a fault like this does not affect any service that is locally hosted and managed. Company intranets, VPNs, and LANs will not see any disruption in service because the link between you and the server the service runs on remains intact.
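As a simple illustration, here is a sketch that classifies services by where they are hosted and shows which ones keep working. The service names are hypothetical examples, and the international path is treated as completely down for simplicity:

```python
# Minimal sketch of why locally hosted services keep working during the fault.
# Service names and their hosting locations are hypothetical examples.

services = {
    "company intranet":      "local",
    "office VPN":            "local",
    "banking USSD backend":  "international",
    "video streaming":       "international",
}

international_link_up = False  # the fiber break has taken the cross-border path down

for name, hosted in services.items():
    reachable = hosted == "local" or international_link_up
    print(f"{name:<22} hosted {hosted:<14} -> {'working' if reachable else 'degraded/down'}")
```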
You may have noticed, though, that some services which do not seem like internet services have challenges as well. USSD codes for some businesses, banks, and service providers that run on international servers on the back end will experience service outages. It's one of those rare cases where having some services hosted locally can be an advantage.
3 comments
Interesting. Had missed this post.
Well explained post.
Thumbs up.
To a lot of us users the internet was completely off for over 6 hours. Not even a single WhatsApp message was going out or coming in. As long as some of us experienced zero transmission, it's correct to also say it was completely off. Which then begs the question: do these backup topologies actually work? Clearly their 5 backup undersea cables are not optimal if one break causes a national outage on their network. Maybe they don't take fault tolerance that seriously because they believe such events are low probability and low risk, and they prefer to save a bit of coin. One undersea cable break should result, at worst, in slow speeds for the most affected users, not the complete blackout we experienced today, which makes us think they only have one undersea cable operating.