This week I was able to take some time to discuss data center and application development practices with a number of people, ranging from developers, to sysadmins, to directors of technology teams. One topic I was very pleased to chat about is a practice that I made a point of introducing at a couple of organizations I’ve worked at: forced maintenance windows.
While it seems common to many of us running data center operations, there are still a lot of organizations and teams that have had difficulty implementing this practice. Rebooting a server was once considered a risky task because servers were often left up as long as possible. This introduced the unnecessary risk of non-routine power cycling, done only when an application seemed to be acting up.
Some of us have a long history and recall running Unix, SunOS, and other systems which were built to stay up for incredibly long periods of time. Along the way, distributed computing arrived, and the influx of Microsoft Windows servers began to add the need for more regular rebooting of servers due to operating system patch cycles. In recent years it has become so normal for us that it actually has a name.
Microsoft has created a very regular patching routine where the second Tuesday of every month is dubbed Patch Tuesday. On each of these patch release days, operating system and application patches are made available. The week leading up to Patch Tuesday will have a list of KB articles published about upcoming patches, describing the risks, vulnerabilities, affected systems, and potential mitigation strategies.
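Because the schedule is just "the second Tuesday of the month", it is easy to compute ahead of time when planning maintenance windows. Here is a minimal sketch in Python; the function name `patch_tuesday` is my own for illustration and not part of any Microsoft tooling.

```python
import calendar
from datetime import date


def patch_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month (hypothetical helper)."""
    # weekday() counts Monday as 0, so Tuesday is 1 (calendar.TUESDAY).
    first_weekday = date(year, month, 1).weekday()
    # Number of days from the 1st of the month to the first Tuesday (0 to 6).
    days_to_first_tuesday = (calendar.TUESDAY - first_weekday) % 7
    # The second Tuesday falls exactly one week after the first.
    return date(year, month, 1 + days_to_first_tuesday + 7)


if __name__ == "__main__":
    for month in range(1, 13):
        print(patch_tuesday(2024, month).isoformat())
```

A scheduler could use a helper like this to place a forced maintenance window a fixed number of days after each release, leaving time to test the patches first.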
As a result of this very regular and predictable patch release cycle, it has become a fairly common practice to install all patches as they are released and to reboot the servers during the patch cycle. This also brings up an interesting challenge around how many have viewed Microsoft platforms. Sometimes it becomes a joke such as “have you tried rebooting it?” made famous by sysadmins everywhere and most notably used as a slogan for the show The IT Crowd.
So, if this is the case where Windows servers are so high maintenance according to the jokes of admins all over, that must mean that more people would run other operating systems to keep them online for longer periods, right? I install Linux machines and OS X machines regularly, and guess what’s been happening on those platforms much more often? Critical patches that require reboots.
Since we have to accept that the hosts and guest machines will need to be rebooted to do many of the patches that are necessary to reduce risk of vulnerabilities, it has become obvious that we need to rethink how we architect our applications.
Forced Failure Architecture: The Need for Application Resiliency
Using cloud resources introduced us to the start of this practice. Not only can your instance be rebooted, but the reboot will not follow any cycle that you can plan around in advance. It happens with limited warning, and you don’t have the option to defer it to another time just because you have a critical application that needs to stay up on that instance.
This very same realization has entered the traditional data center and we have many application architects designing with resiliency in mind. With N-Tier application architecture, there are logical separations between front-end, middle tier layers such as message queuing, and back-end storage both on filesystems and databases. Scale-out applications require making each of these layers more generic and able to be increased or decreased on-demand. Not only on-demand, but they should be able to withstand a sudden loss of a portion of the underlying infrastructure.
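To make "withstand a sudden loss" a little more concrete, here is a minimal sketch of one tier calling the next with failover. It assumes each layer has several interchangeable nodes; the names `call_with_failover`, `AllBackendsFailed`, and the host names are hypothetical and only for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every node in the tier refused the request."""


def call_with_failover(backends: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each backend in order and return the first successful result."""
    errors: List[str] = []
    for backend in backends:
        try:
            return request(backend)
        except ConnectionError as exc:
            # The node may be mid-reboot; note the error and try the next one.
            errors.append(f"{backend}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


def fake_request(host: str) -> str:
    """Stand-in for a real network call; pretends app-1 is being patched."""
    if host == "app-1":
        raise ConnectionError("rebooting")
    return f"ok from {host}"


if __name__ == "__main__":
    print(call_with_failover(["app-1", "app-2"], fake_request))
```

In a real system this logic usually lives in a load balancer, a client library, or a message queue rather than in hand-written loops, but the idea is the same: no caller should depend on one specific node being up.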
We begin with the simple application design, quite often on a single server:
This was a working design, but it needed an outage of the application during patching cycles and reboots. Obviously, we needed to build something that allowed for not only better patching, but better load management. The next natural step is to build additional web servers and then use a load balancer to distribute the incoming connections across the web server farm:
Getting warmer, but we still have to extend even further. Next we can grow out our database infrastructure. This may lead us not only to add database servers and some multi-master database replication, but also to move up to a more versatile Database-as-a-Service platform for truly resilient features:
We can keep growing every layer as needed, and as we add more scale-out capability, we have created a two-fold win for our application. Now we are able to add more server resources to handle more application load, and we are also creating the ability to add resiliency during partial outages.
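The two-fold win can be sketched in a few lines. The example below is a toy model, not a real load balancer: `ServerPool` hands out servers round-robin, and `rolling_reboot` drains and reboots one server at a time while refusing to drop below a minimum number of active nodes. All class, function, and server names are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set


class ServerPool:
    """Toy round-robin pool of interchangeable servers."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.in_rotation: Set[str] = set(self.servers)
        self._counter = 0

    def next_server(self) -> str:
        """Return the next server that is currently in rotation."""
        active = [s for s in self.servers if s in self.in_rotation]
        if not active:
            raise RuntimeError("no servers in rotation")
        server = active[self._counter % len(active)]
        self._counter += 1
        return server

    def drain(self, server: str) -> None:
        self.in_rotation.discard(server)

    def restore(self, server: str) -> None:
        self.in_rotation.add(server)


def rolling_reboot(pool: ServerPool, reboot: Callable[[str], None], min_active: int = 1) -> None:
    """Reboot every server in turn without going below min_active nodes."""
    for server in pool.servers:
        if len(pool.in_rotation) - 1 < min_active:
            raise RuntimeError(f"not enough capacity to drain {server}")
        pool.drain(server)
        reboot(server)  # if this raises, the server stays out of rotation
        pool.restore(server)


if __name__ == "__main__":
    pool = ServerPool(["web-1", "web-2", "web-3"])
    rolling_reboot(pool, lambda s: print(f"patching {s}, traffic goes to {pool.next_server()}"))
```

Note that with a single server the capacity check fails immediately, which is exactly the outage problem of the original single-server design.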
This isn’t a small step, but luckily with the tools available today the shift towards resilient infrastructure is closer than ever before. This is the ideal time for sysadmins to work together with the development teams to find better application architecture options. Whether you’re diving into Microsoft SQL Server with AlwaysOn availability, or MariaDB, or potentially some of the popular NoSQL and big data options such as Cassandra or Hadoop, there is no shortage of viable options.
It’s an exciting time, and as we move towards a more DevOps-friendly culture in the IT organization, it is a good opportunity to build for failure. It may be ironic that one day we will thank Microsoft for making us so good at bringing servers down.