As many of us move from working at a college or university to working at home, the ways that data flows across the Janet network are changing.
Up until this week, the largest traffic flows across Janet were inbound from GÉANT and the major content providers towards our members and customers. The Janet network has direct connections (peerings) with the larger domestic broadband providers, plus many peerings with smaller providers and content providers across the LINX, the largest Internet Exchange Point (IXP) in the UK.
On Tuesday evening the Prime Minister told the country we should consider working from home where possible. That was reflected in the traffic we saw on the network. Between Wednesday 11th and Wednesday 18th March, we shed about a third of the traffic coming into Janet.
That’s understandable. There are fewer people taking their laptops onto campus and accessing services that are hosted off Janet, such as Google, YouTube and Office 365. In addition, an increasing number of services provided by our members are hosted with cloud providers, so when they’re accessed from home, the traffic uses the broadband provider’s own peerings with the cloud providers and doesn’t touch Janet.
So, with all of that, there’s plenty of spare capacity on the backbone, yes?
Well, yes there is, but along with an overall reduction in traffic coming in to the network, we saw a marked increase in traffic outbound to the larger domestic service providers such as BT and Virgin Media, as people accessed resources that are still hosted on campus or used VPNs that tunnelled all their traffic to their institution. This wasn’t a marginal increase either: our Private Network Interconnects (PNIs) with the broadband providers, which had happily been sitting at less than 50% utilisation at any point before (including evenings, weekends and bank holidays), started to congest.
That required some manual intervention from the Janet NOC to move traffic from the PNIs to the LINX, on which we had more spare capacity, whilst we started the process to add more capacity.
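For anyone wondering how a figure like "less than 50% utilisation" is arrived at, here is a minimal sketch in Python. It assumes you have two readings of an interface's octet counter taken a known number of seconds apart, and that the counter has not wrapped in between. The function name and the sample numbers are made up for illustration; this is not how the Janet NOC's monitoring is implemented.

```python
def utilisation_percent(octets_start, octets_end, interval_seconds, link_bps):
    """Average link utilisation (%) over one polling interval.

    Assumes the octet counter did not wrap or reset between the two readings.
    """
    if interval_seconds <= 0 or link_bps <= 0:
        raise ValueError("interval_seconds and link_bps must be positive")
    if octets_end < octets_start:
        raise ValueError("counter went backwards (wrap or reset?)")
    bits_sent = (octets_end - octets_start) * 8
    return 100.0 * bits_sent / (interval_seconds * link_bps)


# Illustrative numbers only: a 100 Gbps link polled every 5 minutes.
sample = utilisation_percent(
    octets_start=0,
    octets_end=1_875_000_000_000,  # octets sent during the interval
    interval_seconds=300,
    link_bps=100_000_000_000,
)
print(f"{sample:.1f}% utilised")
```

Note that this is an average over the polling interval, so short bursts within the five minutes can be considerably higher than the figure suggests.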
Then, on Thursday, we started getting complaints about poor performance. None of the PNIs or IXP connections were overloaded, and we couldn’t see what might have been causing it. On top of that, some members said that when they changed provider the problem disappeared, which suggested it might have been a problem with one of our peers.
Our engineers, who were themselves working from home via a number of providers (BT, Virgin Media, Andrews & Arnold), hadn’t noticed problems on their own connections either.
However, reports kept coming in from our members, and we started getting a couple of reports via providers that suggested our connection to the LINX might have been at fault. The interface was reporting no errors, and double-checking the configuration to ensure anti-DDoS measures weren’t catching the wrong traffic didn’t reveal anything either. We then logged onto the LINX’s stats portal and that showed that they had been receiving errors from us over the last couple of days, but the error rate had jumped drastically over the preceding few hours.
This is a 100 Gigabit Ethernet (100GE) interface, which uses four lanes of 25 Gbps each, and it appeared that one of the four laser diodes in the transceiver (a device about the size of a matchbox) had dropped its output power compared to the other three.
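To illustrate the kind of check that would flag a fault like this, here is a rough Python sketch. It assumes you can already read a per-lane transmit power in dBm from the transceiver's diagnostics, and it compares each lane against the median of all the lanes. The function name, the 3 dB threshold and the sample readings are hypothetical, chosen only for the example; they are not the values or tooling we used.

```python
from statistics import median


def weak_lanes(lane_power_dbm, max_drop_db=3.0):
    """Return indices of lanes whose power is more than max_drop_db below the median lane.

    lane_power_dbm: per-lane optical power readings in dBm (assumed input).
    max_drop_db: hypothetical threshold; a 3 dB drop is roughly half the power.
    """
    if not lane_power_dbm:
        return []
    reference = median(lane_power_dbm)
    return [
        lane
        for lane, power in enumerate(lane_power_dbm)
        if reference - power > max_drop_db
    ]


# Made-up readings for a four-lane transceiver: lane 2 has sagged.
readings_dbm = [0.8, 1.1, -6.5, 0.9]
print(weak_lanes(readings_dbm))  # [2]
```

Comparing against the median rather than the mean means a single failing lane doesn't drag the reference down with it.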
Usually we would then contact our maintenance providers to swap out the part, but fortunately there was an unused transceiver in one of the other routers in the PoP, and an engineer who could be there in 20 minutes (as opposed to a four-hour SLA for the maintenance provider). We swapped the transceiver and immediately got reports that the problems had eased.
There is no congestion on the Janet network or our external peerings at the moment, but as we settle into what is likely to be the ‘new normal’ for the foreseeable future, we are still in the process of adding capacity, and when that is done we’ll remove the manual steering of traffic towards the LINX.
An anecdotal observation on the traffic levels over the past couple of days: instead of us all working through lunchtime, which seems to happen when we’re in the office, we currently see a drop-off in traffic between 12pm and 2pm whilst you are running errands or having a proper lunch hour, or at least accessing fewer resources on Janet!
The next thing to watch is what happens when the schools (largely) close next week. We’ll be keeping a close eye on it.