In this series I’d like to touch on several very important topics which are way too often assumed, taken for granted or misunderstood. When we refer to quality of service (QoS), service levels, QoS assurance, QoS guarantees etc., what exactly do we mean by that? Here is the definition from Wikipedia:
Quality of service (QoS) is the overall performance of a telephony or computer network, particularly the performance seen by the users of the network. To quantitatively measure quality of service, several related aspects of the network service are often considered, such as error rates, bandwidth, throughput, transmission delay, availability, jitter, etc.
First of all, we see where this definition comes from: telephony and computer networks, probably the first areas where a notion of quality became important. We expect a certain quality of voice services, and telephony and later computer networks became the conduits of these services. Second, when we talk about quality it is always quantitative.
It could and should be measured, communicated, compared. Third, it is not a single parameter or attribute; depending on the service, the quality parameters could be quite different. While end users may treat quality subjectively (“You are breaking up”), there are always numbers behind it (degree of jitter, error rates, number of dropped packets etc.).
So if there are metrics which could be measured, then we could monitor the quality, identify deviations and then correct them, right? If only it were that easy. Let’s start with measuring, probably the most advanced area of managing quality.
First, which particular parameter(s) to measure? If it is a voice service, then everything which impacts clarity of the voice plays a role. E.g., dropped packets or high error rates cause re-transmission of packets (in telephony this could manifest itself as clipped speech, “breaking up” etc.), and depending on the degree of interference jitter may occur, i.e. variance of transmission latency, which will cause unpleasant delays or echo effects.
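To make these parameters a bit more tangible, here is a minimal Python sketch of how loss rate and jitter could be derived from per-packet latency samples. Everything in it is an assumption for illustration: the function name summarize, the convention that None marks a dropped packet, and the choice of standard deviation of latency as the jitter measure (real tools often use other definitions).

```python
from statistics import mean, pstdev

def summarize(latencies_ms):
    """Summarize one measurement window (hypothetical helper).

    latencies_ms holds one entry per packet sent; None marks a dropped packet.
    """
    delivered = [x for x in latencies_ms if x is not None]
    sent = len(latencies_ms)
    # Fraction of packets that never arrived.
    loss_rate = (sent - len(delivered)) / sent if sent else 0.0
    return {
        "loss_rate": loss_rate,
        "mean_latency_ms": mean(delivered) if delivered else None,
        # Jitter taken here as the standard deviation of latency (an assumption).
        "jitter_ms": pstdev(delivered) if delivered else None,
    }

# Five packets sent, one dropped.
sample = [20.0, 22.0, None, 18.0, 20.0]
stats = summarize(sample)
print(stats)
```

Even in this toy form, the numbers say nothing yet about what the end user actually hears.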
So, one could start measuring the number of dropped packets at ingress or egress points and then derive some quality parameters. But packets travel across multiple hops, and even if you have a bad network path, you may have a redundant route, a large buffer etc. which compensate for lost packets, so the end user quality will still be acceptable. So, not every parameter which impacts quality can be easily interpreted and acted upon.
So, to get closer to an end user it could be important to measure communication delays between end points, i.e. where the service originates and where it is delivered. But then you have to establish collection points at practically every end point (hint: try to guess how many voice end points exist in today’s phone or computer networks).
But even if we overcome the challenge of scale and collect all this data, how do we use it to guarantee quality? If you look for the most reliable computer network provider, examine carefully what they put into their service level agreements. Usually it is availability (these days measuring and guaranteeing uptime is relatively easy), throughput (again, dedicating line bandwidth can be accomplished, even if it is expensive) or error rate (measured at some egress network points).
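As a small worked example of why availability is the easy part to put into an agreement, the arithmetic below turns an availability percentage into the downtime it permits. The function name, the 30-day period and the 99.9% figure are illustrative assumptions, not terms from any particular provider's agreement.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Downtime budget implied by an availability target (hypothetical helper)."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

# An assumed 99.9% target over 30 days leaves roughly 43 minutes of downtime.
budget = allowed_downtime_minutes(99.9)
print(round(budget, 1))
```

Note that this number only says whether the service was reachable; it says nothing about how the service behaved while it was up.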
Do these measurements guarantee quality? Only to a certain extent, and very rarely will a provider guarantee a fixed end-to-end response time. Why? Because there are too many factors impacting delays, and a provider cannot be responsible for, say, a bug in the communication application where a sloppy coder forgot to remove a debugging wait loop causing a 1 sec delay between every packet, or a lack of memory in a virtual machine causing swapping and huge delays.
And even if you agree to narrow the delay measurements to some well-defined portion of the environment, what would you do once you see a deviation from the norm? If you start troubleshooting, trying to figure out what is causing the delays, it is too late: the quality is already degraded. Even if you figure out why (which is extremely difficult; root cause analysis is practically impossible in modern systems), there could be a multitude of actions to correct it, and every such action may have a trickle-down effect on other parts of the system. E.g., you could conclude that the bandwidth of a port channel group is not sufficient to accommodate all the packets demanded by an application and add ports to the group, just to hit the lack of available memory for buffering these packets in the virtual machine running this application. This may cause even more dramatic delays than just dropped packets.
So, let’s think. Is guaranteeing quality of service possible at all? Does the word “guarantee” correctly describe the expectations of the service levels? Who is responsible for delivering the expected quality end-to-end?
In the next several posts we will dive deeper into these challenges, which should also help us to identify the solutions.
Image sources: http://en.wikipedia.org/wiki/File:The_Unbearable_Lightness_of_Being.jpg , Tristan Cobb, Turbonomic