At any company that leverages technology there is invariably a disconnect between the development and operations teams (hence the rise of DevOps). Both are driven by the success of the business and both have a role to play, but each team views its role, its success criteria, and its goals differently based on the unique challenges it faces, particularly in a world increasingly defined by software.
The development organization faces the following challenges:
- Scalability: how do I architect my application so that it can run on hundreds of machines just as easily as on one?
- Performance: what do I have to do to manage the performance of my application and meet my SLAs?
- Testability: how do I design my application so that it can easily be unit tested by developers and integration tested by QA?
- Extensibility: what design patterns do I choose so that my application can easily be enhanced to meet changing business requirements?
- Diagnostics: what do I have to put into my application so that I can quickly identify the root cause of both performance and functional issues?
- Release process: how quickly can I release a version of my application, from the time I check in code until it is running in production?
- Code Quality: how can I develop and test code to minimize defects?
Operations, on the other hand, faces a different set of challenges:
- Availability: are all of the applications available? What strategies have I implemented to minimize application outages?
- Load Management: how do I allocate enough resources to satisfy current load? And how can I dynamically change my environment to handle peak load?
- Diagnostics: when there is a problem in an environment with multiple virtual machines running on the same physical machines, how do I determine the offending application(s)?
- Monitoring: I need insight into how well my environment is behaving
- Cost management: when running in a virtual environment I need to minimize my cost while maintaining my application performance
- SLAs: I have availability and performance requirements that I need to monitor, manage, and maintain
So development and operations are working towards the same goal, the betterment of the company, but the two organizations have different priorities and see the world differently. Or, as a friend of mine has said, it is almost as if developers are from Venus and operations are from Mars. How can we help the two of them coexist?
This article is the first in a series that presents a view of the developer / operations gap from an architect / developer perspective. As an architect for the past 14 years, and a developer before that, I have in-depth knowledge of the development side of this equation, so I’d like to present this topic as “here is what every developer wishes operations knew so that we can work better together”. Or, stated another way: as a developer, here is the list of things that I need from operations so that we can each do our jobs more efficiently and grow the business that may end up putting our kids through college – or hopefully more.
What every architect (& dev) wished ops knew about … Application Scalability
We may spend months or even years building a new application – or maybe just weeks developing a new feature for an existing application. We choose the right design patterns, we optimize our code, we ensure that the quality and performance are outstanding. Now we deliver our masterpiece to operations to deploy to a production environment and manage from there on out. As we return to our developer cave to build our next feature, we have certain hopes for how operations will treat our application.
In order for operations to be effective at deploying and managing our application, however, they need to understand our application from our perspective. In this article I review application scalability, and what operations needs to know about how we build scalable applications – as well as some of the inherent challenges and compromises we might need to make along the way.
Performance and Scalability: Two sides of a coin
Sometimes people group performance and scalability together, but in actuality they are two sides of the same coin. Performance measures a specific quantity: for example, how quickly an application responds to a request, or how much CPU is used to satisfy a request.
Scalability, on the other hand, answers the question: how well does the application maintain its performance as load increases? For example, if we measure the response time (performance) of a single request to be 1 second, scalability asks what the response time is when 100 or 1,000 requests are running simultaneously. If the response time remains close to 1 second at 1,000 requests then the scalability is good; if the response time increases dramatically then the scalability is… well, you have all lived that experience at least once, right?
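One simple way to put a number on this comparison (an illustrative metric for this article, not an industry standard) is to divide the single-request baseline response time by the response time measured under load:

```java
public class ScalabilityCheck {

    // Illustrative "scalability efficiency": 1.0 means response time did not
    // degrade at all under load; values near 0 mean it degraded badly.
    public static double efficiency(double baselineSeconds, double underLoadSeconds) {
        return baselineSeconds / underLoadSeconds;
    }
}
```

With the numbers above, a 1-second baseline that stays at 1 second under 1,000 concurrent requests scores 1.0; the same baseline degrading to 4 seconds scores 0.25.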
Architecting a scalable application is not an easy task, but there are some core principles that can make one successful. First and foremost, the application should be as stateless as possible, meaning that the application does not remember any user state between requests. If a component is stateless then it does not matter which component instance is invoked, they are all the same. The benefit is that stateless components can run just as well on 100 machines as they do on one, all without any complicated configuration. If I tell operations that a component is statelessly scalable then operations is free to run it on as many machines as is necessary without any other configuration.
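As a minimal sketch of what stateless means in practice (the class, method, and discount rule here are hypothetical, invented for illustration), a stateless component computes its response entirely from its inputs, with no fields remembering any user between calls:

```java
// Hypothetical sketch of a stateless component: no instance fields, no
// stored user state -- the result depends only on the inputs passed in.
public class PriceQuoteHandler {

    // A pure function of its arguments: two calls with the same inputs
    // return the same result, on this server or on any of its clones.
    public static double quote(double basePrice, int quantity) {
        double discount = quantity >= 100 ? 0.10 : 0.0; // illustrative bulk-discount rule
        return basePrice * quantity * (1.0 - discount);
    }
}
```

Because nothing is remembered between calls, a load balancer can send each request to any of a hundred identical instances without any coordination between them.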
Statelessness is a great goal and, if you partition your application appropriately, much of your application can be stateless, but any meaningful application is going to maintain user state at some point. For example, if you log into a web site, the web site needs to remember who you are between requests. If one machine records your state (typically referred to as a session) then all of your subsequent requests need to be sent to that machine.
In this configuration, we would ask operations to ensure that “sticky sessions” are turned on in the load balancer, which means that once a session is established then all subsequent requests are “stuck” to that one machine. Keeping the user’s session on one machine means that subsequent requests will already know who the user is and what he is doing; sending the user to a different machine either means that the server won’t know who he is or what he is doing or that the session will have to be replicated to that server, which is the topic of the next section.
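The core idea behind stickiness can be sketched in a few lines (this is an illustration of the concept, not how any particular load balancer implements it; real balancers typically stick sessions via an inserted cookie): derive the target server deterministically from the session identifier, so every request carrying that session lands on the same machine:

```java
public class StickyRouter {

    // Hypothetical sticky routing: the same session id always maps to the
    // same server index, so follow-up requests reach the same machine.
    public static int routeToServer(String sessionId, int serverCount) {
        // Math.floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(sessionId.hashCode(), serverCount);
    }
}
```

The important property is determinism: as long as the pool of servers does not change, the mapping from session to server never changes either.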
Unfortunately, stickiness alone may not be enough, because of resiliency: what happens if that one machine goes down? If the user’s session is on that machine then the user needs to log in again, which is a sub-optimal user experience. There are strategies to make an application more resilient, some of which have little or no impact on scalability, and some of which have profound impacts on scalability. The choice we make will affect the way that operations manages our application. The strategies include:
- Session Replication (primary/secondary or more)
- Database Lookup
- Shared Data Store
- Rich Cookies
- Terracotta Server Array
- Distributed Cache
Session Replication is the most common resiliency implementation built into most application servers. In this model, when a user’s session changes, the session object is serialized and sent to one or more secondary servers. If the primary server goes down then the load balancer is configured to redirect load to the secondary server. In a simple model each primary server has one secondary server, which handles most outages. But if both servers go down at the same time then the user’s session is lost. This can be mitigated by maintaining more than one secondary server, but the more secondary servers you have, the more work you have to do and the more overhead you incur.
For example, if you replicate your data to five servers then, for every change, you need to serialize the session and send it across the network to five different servers. Session replication can dramatically reduce scalability because of the additional overhead of maintaining session replicas. In this situation we want operations to be aware of these fail-over rules and to maintain only as many servers as necessary. Furthermore, we do not want these servers to be elastic (scaling up and down to meet load), because servers would need to be strategically shut down in a precise order to guarantee that session data is not lost.
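The overhead argument can be made concrete with a back-of-the-envelope sketch (the class names and the tiny stand-in session here are hypothetical): every session change is serialized once per secondary server, so the bytes placed on the network grow linearly with the replica count:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class ReplicationCost {

    // A stand-in for a user session; real session objects are usually far larger.
    static class Session implements Serializable {
        HashMap<String, String> attributes = new HashMap<>();
    }

    // Serialize the session the way replication must before shipping it.
    static byte[] serialize(Session s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(s);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Total bytes placed on the network for one session change:
    // one serialized copy per secondary server.
    static long bytesOnWire(Session s, int replicaCount) {
        return (long) serialize(s).length * replicaCount;
    }
}
```

With five secondaries, every session write costs five times the single-replica network traffic, plus the CPU spent serializing; that multiplier is the scalability tax this section describes, and why the replica count deserves a deliberate conversation between development and operations.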