This is one of those things that they can’t teach you in a course…
– ipvdre
It was about 8:33 am when I was talking to my manager about what I had planned for the day.
I was heading out to a site to move some computers around, do a device count, then head back to the office.
In preparation for moving the computers around, I was pushing some preliminary configuration, specifically port-security. Each device has a one-to-one relationship with its switch port, so I had to make the necessary port-security changes on the ports involved. This ensured that the computers I was moving would still have connectivity.
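If you've never touched port-security, here's a rough sketch of what that kind of per-port lockdown looks like, written in Cisco IOS-style syntax for illustration. The interface, VLAN, and MAC address are placeholders and your vendor's commands will differ, but the idea is the same: each port accepts only the one device it's supposed to.

! Illustrative Cisco IOS-style port-security (placeholder interface, VLAN and MAC)
interface GigabitEthernet1/0/10
 switchport mode access
 switchport access vlan 20
 switchport port-security
 switchport port-security maximum 1
 switchport port-security mac-address 0011.2233.4455
 switchport port-security violation restrict

Moving a computer to a different port means updating the allowed MAC address on that port first, which is the kind of prep work I was doing that morning.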
Thinking back to it now, I could have just gone over to the patch panel and swapped the connections.
Ah well.
When I was done, I saved the configuration and was about to log out of the switch.
But, since I was already working on this switch, I figured I'd update the firmware. I'd pushed and activated this firmware on about 11 of our other switches over the previous 4-5 months and figured it was stable.
Plus, this upgrade would only take 2-3 minutes tops, and the site didn't open for another hour and a half.
I had time.
So I uploaded the firmware and issued:
boot system backup
reload
y
and waited for the switch to come back online…
…2 minutes passed.
3 minutes passed…
…5 minutes passed…
Suddenly, I got notifications that the switch was offline.
I tried logging back into the switch via SSH and kept receiving:
“Connection Refused.”
The switch did NOT come back online…
So I got up, went to the network closet, grabbed a spare switch of the same make and model, and brought it back to my desk. I booted it up, whipped out my laptop, logged into GitHub where, thankfully, I had backed up the configuration for this switch, downloaded it, and loaded it onto the spare.
I grabbed a bunch of Ethernet cables, along with the spare switch and headed onsite.
…one of the most important things you can do is focus on what you control when, not if, WHEN something like this eventually happens.
– ipvdre
When I got on site, one of the staff members greeted me.
“Good morning! Right on time! Our network is down.”
“Good morning! I know. I’m sorry about that. I pushed a little update and it took the whole network down for some reason. I’ll have everything back up in a few.” I said.
“Oh! Wow!”
She said something about the weather, but I was more focused on the faces of what seemed like 8, 10, 20 or 100 staff members who couldn't log on (it was actually about 8 people, but it felt like 100 lol).
I had to walk past all of them to the back, where the network closet was.
I consoled into the switch and attempted to revert the switch back to the previous firmware.
2 minutes go by…
…3 minutes go by.
Tested network connectivity.
....
Nothing.
At this point, I thought I might have bricked the switch. But that couldn't be the case, because I was able to log into the switch via console.
I walk out of the network closet and through the office. I pass the front desk where those 200 people (lol) were standing around waiting to log in for work. I go to the car to grab the spare switch and cables.
Back in the network closet, I decided to test one more thing. I compared the running configuration with the backup I had saved on my laptop: they were identical.
My theory at this point was that the configuration must have been corrupted. I decided to log into the switch's web interface, upload the backup configuration, and reboot the switch.
The switch console came back up. I tested connectivity to good ole "8.8.8.8"
..!!
I packed up the spare switch and the cables and brought them back to the car. But not before telling everyone that they could log in.
This is one of those things that they can't teach you in a course: emotional resilience. In situations like these, you have to KNOW that you can get through it.
But one of the most important things you can do is focus on what you control when, not if, WHEN something like this eventually happens.
I always tell people who aspire to be network engineers, or IT professionals in general, that things WILL break – it’s only a matter of time. You have to be prepared for it.
How did I prepare?
Automated backup configurations
Prior to this situation happening, I decided to take a weekend to learn Ansible.
I manage about forty switches that lack a formal centralized management platform. So, I decided to address this by learning Ansible and deploying it in our environment. But not before setting it up in my home lab and running numerous tests, of course.
I wrote a few basic playbooks for:
Automating backups of configurations
MAC address table versioning
VLAN output
These are just some of the playbooks that I use for basic network maintenance. And automating backup configurations was instrumental for dealing with that unforeseen network outage.
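To give you an idea of what that looks like, here's a minimal sketch of a config-backup playbook. It assumes Cisco IOS-style switches and the cisco.ios collection; the inventory group, directory, and filename pattern are placeholders, and you'd swap the module for whatever matches your vendor.

---
# Minimal config-backup playbook (sketch). Assumes Cisco IOS-style switches,
# the cisco.ios collection, and credentials/ansible_network_os set in your inventory.
- name: Back up switch running configurations
  hosts: switches
  gather_facts: false
  connection: ansible.netcommon.network_cli

  tasks:
    - name: Save a copy of each running config on the control node
      cisco.ios.ios_config:
        backup: true
        backup_options:
          dir_path: ./backups
          filename: "{{ inventory_hostname }}.cfg"

Run it on a schedule with something like ansible-playbook -i inventory.ini backup.yml (file names here are just examples) and push the resulting files to a Git repo, which is exactly the kind of backup that saved me during this outage.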
When you have a mindset that facilitates foresight and curiosity, you’ll be on the lookout for tools and skills that you can apply to those “just in case” scenarios.
– ipvdre
Having spare switches ready to go
Having spare switches, or spare network devices in general, in your network closet is a must.
What do you do when a router, switch or firewall mysteriously stops working?
Having spare devices on hand is not enough, though; you also need to know how to do the following (see the sketch after this list):
Upload configuration
Upgrade firmware
Console into the device
Configure an IP address on the device
Configure SSH on the device
SSH into the device
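Here's a rough sketch of the last few items on that list: bringing a spare switch onto the management network with an IP address, a local user, and SSH. It's written in Cisco IOS-style syntax for illustration; the hostname, addresses, domain name, and credentials are all placeholders, and your platform's commands will differ.

configure terminal
 hostname spare-sw01
 interface vlan 1
  ip address 192.168.1.250 255.255.255.0
  no shutdown
 ip default-gateway 192.168.1.1
 ip domain-name example.lan
 username admin privilege 15 secret UseAStrongSecretHere
 crypto key generate rsa modulus 2048
 line vty 0 15
  transport input ssh
  login local
end
copy running-config startup-config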
Here’s a little scenario. What would you do if you had to go on site to address a network issue and the device, for whatever reason, wasn’t reachable via its management IP address?
This might sound like a crazy question to ask but would you know how to console into the device?
You’d be surprised to know that many IT professionals do not know how to do this.
Or, what if you've forgotten your console cable? If you know the management IP address, would you know how to gain access to the switch through SSH?
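If either of those questions made you pause, here's roughly what both look like from a laptop. The serial device name, username, and IP address are examples; most switch consoles default to 9600 baud, 8 data bits, no parity, 1 stop bit.

# Console access over a USB serial adapter (device name varies: /dev/ttyUSB0 on Linux,
# /dev/tty.usbserial-XXXX on macOS; use PuTTY's Serial mode on Windows)
screen /dev/ttyUSB0 9600

# SSH access once you know the management IP address
ssh admin@192.168.1.250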
Having these devices on hand makes you more resilient, technologically and emotionally, because you have a safety net.
I had such a safety net when this outage occurred, and it made me resilient in the face of it.
I’m always learning
The more you learn, the more tools you’ll have at your disposal. But this isn’t just about tools.
It’s about mindset.
When you have a mindset that facilitates foresight and curiosity, you’ll be on the lookout for tools and skills that you can apply to those “just in case” scenarios.
This foresight takes mishaps into account long before they happen. Again, we cannot control everything that happens.
But by knowing from experience, whether your own or other people's, what went wrong in the past, you can prepare for what will eventually happen again in the future.
For example, we know that outages happen; it is only a matter of time. How you prepare depends on which part of the network is likely to cause the outage, like a failed switch or a failed gateway, and the reasons behind those failures.
Either way, we should know from experience, and just common sense, that it is always a great idea to build redundancy into our networks.
And if we can't build redundancy into the network because of budget constraints, we should come up with creative, intuitive ways to avert complete disasters by having backup configurations, spare devices, and the tools and skills to put them to use.
Final Thoughts
If you're an aspiring network engineer, understand that it is only a matter of time before you take a whole network down with something as innocuous as a routine firmware upgrade that you've done a thousand times before.
As we become more experienced professionals, we tend to build up a hubris that forgets it's not only the complex configurations we need to be careful with, but also the routine tasks, the ones that seem innocent enough until we take down a whole site and have to make the trek up there, past all of the seemingly upset staff members.