Running software as a service can be tough. Apple Maps had widespread downtime reported today, which led lots of people to poke fun at it and its status as second to Google Maps (the "dozens of people were affected" jokes). The other day I was driving with Maps running when the GPS lost track of where I was and the arrow marker started drifting off-road, wandering randomly around town. I snapped a photo at apparently the perfect moment, just as the marker drifted by Lost Ln. I'm not sure if that was a precursor to this outage, but the photo and timing seem appropriate.
As funny as that is, somewhere there was a team of developers and server admins freaking out that things were going haywire on them. These were almost certainly very smart, very capable people who are exposed to the reality that modern software is a huge stack of extremely complex technologies, both hardware and software, and that any part of it can and will fail at some point. In fact, it’s a fundamental assumption of setting up these services that everything will fail at some point and that you need to plan for that happening.
We spend a great deal of time and money on the setup we provide for services like Dromos and EZPaperTrail. We have explored every piece of software, every framework and library we use. We are constantly patching and updating them to ensure there are no known vulnerabilities or bugs in them. We have redundant hardware and software, with monitoring at many levels so we know the moment a hard drive fails or memory is running low. We plan, back up, ensure we can fail over, and do everything we can to make sure that the services never go down.
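To give a rough feel for what "monitoring" means at the simplest level, here is a minimal Python sketch of a threshold check. It is an illustration only, not our actual tooling: the function names (`disk_used_fraction`, `find_alerts`), the `LIMITS` values, and the idea of passing readings in as a plain dictionary are all assumptions made up for this example.

```python
import shutil

# Hypothetical limits, expressed as fractions of capacity in use.
LIMITS = {"disk": 0.90, "memory": 0.85}


def disk_used_fraction(path="/"):
    """Return the fraction of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def find_alerts(readings, limits):
    """Return one message for every reading at or above its limit.

    `readings` and `limits` both map a resource name to a fraction
    between 0 and 1. Readings without a configured limit are ignored.
    """
    alerts = []
    for name, value in readings.items():
        limit = limits.get(name)
        if limit is not None and value >= limit:
            alerts.append(f"{name} at {value:.0%} (limit {limit:.0%})")
    return alerts


if __name__ == "__main__":
    # Only disk usage is read here; a memory reading would need a
    # platform-specific source and is left out of this sketch.
    readings = {"disk": disk_used_fraction("/")}
    for alert in find_alerts(readings, LIMITS):
        print("ALERT:", alert)
```

A real setup would run checks like this on a schedule, across many machines, and page someone rather than print. The point of the sketch is only that each alert starts with a measured value compared against a limit someone chose in advance.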
That’s what it takes to provide reliable server-based software, and it’s a never-ending effort. It’s also largely invisible and unnoticed until something goes wrong and, despite your best efforts, the system goes down. When that happens, the team kicks into gear to diagnose and fix the problem as quickly as possible.
After the services are restored, the team does a post-mortem, discussing what went wrong, whether it was preventable, how it can be prevented in the future, and what additional measures could be taken so that a future failure would be handled automatically and the service would stay operational. This might require new hardware, additional monitoring software, or other measures. The goal is to turn a failure into an opportunity to learn and improve.
Hopefully the Apple team is able to do that after getting Maps back up, and they can keep pushing forward to improve that service.