Storage: elusive QoS levels
Today we will look at agility in the storage environment. As usual, it has its specific challenges, and most of them relate to guaranteeing quality of service (QoS). We reviewed some of the pain points in my previous article on Hungry peaking neighbors, including how difficult it is to separate concurrent peaking loads across multiple storage tiers.
So what do people do to implement at least some QoS guarantee in storage? As usual, divide and conquer. To provide three classes of service (Gold, Silver, and Bronze), they usually divide all their storage devices into three separate tiers and place the load accordingly. There is also a tendency to use better-performing devices in higher service class tiers. For example, Gold will likely use SSD and high-speed SAS, whereas Silver and Bronze may rely on slower SATA. This tiering could be implemented as workload placement constraints based on customer, service agreement, etc.
However, this is not as simple as it may sound. First of all, you need to know how much capacity you need to allocate to every tier and for that you need to know workload demand. It might be possible to use some average IOPS and space consumption and then come up with some average density numbers which would represent best practices in delivering reliable service. On average it could look good – but it may result in average service.
As we already know, load fluctuates and what is good for average may not be good for peaks. Providing for the sum of the peaks may lead to serious over-provisioning which could be a price to pay for higher QoS.
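To put rough numbers on the gap between sizing for averages and sizing for peaks, here is a minimal sketch. The workload names and IOPS samples are made up for illustration, and the "peak of the combined load" figure is an extra reference point I added for comparison, not something a real sizing tool necessarily reports.

```python
# Hypothetical IOPS samples for three workloads over the same four intervals.
# All names and numbers are illustrative assumptions, not measurements.
workloads = {
    "db": [200, 250, 1200, 300],
    "mail": [100, 900, 150, 120],
    "backup": [50, 60, 70, 1500],
}


def provisioning_estimates(samples_by_workload):
    """Return (sum of averages, sum of peaks, peak of the combined load)."""
    series = list(samples_by_workload.values())
    intervals = len(series[0])

    # Sizing for the average: cheap, but blind to peaks.
    sum_of_averages = sum(sum(s) / len(s) for s in series)

    # Sizing for every workload peaking at once: safe, but expensive.
    sum_of_peaks = sum(max(s) for s in series)

    # What the shared pool actually saw at its busiest interval.
    combined = [sum(s[i] for s in series) for i in range(intervals)]
    peak_of_sum = max(combined)

    return sum_of_averages, sum_of_peaks, peak_of_sum


avg, peaks, actual = provisioning_estimates(workloads)
print(f"sum of averages: {avg:.0f} IOPS")
print(f"sum of peaks:    {peaks} IOPS")
print(f"peak of the sum: {actual} IOPS")
```

With these made-up samples the peaks do not coincide, so the sum of the peaks is far above what the shared pool ever had to serve, while the sum of the averages is well below it.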
But this is not the only issue. By imposing these manual tier constraints, you seriously limit your agility and your ability to provide capacity on demand when the load needs it, and you still don’t have any way to control quality. For example, a very important quality parameter in storage is latency. You can measure it and be alerted when it rises above a threshold, but by then you have already violated your service level, and you still don’t know what to do.
Wouldn’t it be better if you could just group your workloads per class of service, explicitly stating the service levels you want to achieve? E.g., Gold storage latency should be within 10 ms, Silver within 20 ms, Bronze within 50 ms. And then share your entire storage pool and just move the load around to stay within these limits?
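As a rough illustration of the idea, the sketch below states the latency targets per class and proposes moves within one shared pool. The targets are the ones from the example above; the function name, workload names, datastore names, and latency readings are all hypothetical. It is deliberately naive: it ignores the fact that moving a workload changes the latency of both the source and the destination.

```python
# Latency targets per service class, in milliseconds (from the example above).
LATENCY_TARGET_MS = {"Gold": 10, "Silver": 20, "Bronze": 50}


def plan_moves(placements, datastore_latency_ms):
    """Propose moves for workloads whose datastore exceeds their class target.

    placements: workload name -> (service class, current datastore)
    datastore_latency_ms: datastore name -> currently observed latency
    Returns a list of (workload, from_datastore, to_datastore) tuples.
    """
    moves = []
    for workload, (service_class, datastore) in sorted(placements.items()):
        target = LATENCY_TARGET_MS[service_class]
        if datastore_latency_ms[datastore] <= target:
            continue  # service level is currently met, nothing to do

        # Any other datastore in the shared pool that currently meets the target.
        candidates = [
            name
            for name, latency in datastore_latency_ms.items()
            if name != datastore and latency <= target
        ]
        if candidates:
            best = min(candidates, key=lambda name: datastore_latency_ms[name])
            moves.append((workload, datastore, best))
    return moves


# Hypothetical pool state, for illustration only.
placements = {
    "billing": ("Gold", "ds1"),
    "crm": ("Silver", "ds2"),
    "wiki": ("Bronze", "ds1"),
}
latency = {"ds1": 18, "ds2": 12, "ds3": 6}

for workload, src, dst in plan_moves(placements, latency):
    print(f"move {workload}: {src} -> {dst}")
```

Note that the Bronze workload stays on the same datastore the Gold workload has to leave: the constraint is the service level, not a fixed tier.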
It is very difficult. You need to continuously watch the workload demand and its fluctuations, and you need to watch the latency trends and act before they violate QoS. Manually it is practically impossible; you need an intelligent software solution for that. And until you have it, you have to limit your agility by manually implementing service tiers. You spend more money than needed and still don’t guarantee QoS. Do you know how to do it better?
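One very simple way to "act before it is violated" is to extrapolate the recent latency trend and react when the projection crosses the target. The sketch below fits a straight line to recent samples with ordinary least squares. The function names, the sample values, and the three-step look-ahead are my own assumptions, and real latency is rarely this well behaved.

```python
def predicted_latency(samples_ms, steps_ahead):
    """Extrapolate latency with a least-squares line through recent samples."""
    n = len(samples_ms)
    if n < 2:
        return float(samples_ms[-1])  # not enough data for a trend

    mean_x = (n - 1) / 2
    mean_y = sum(samples_ms) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_ms))
    variance = sum((i - mean_x) ** 2 for i in range(n))

    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # The last sample sits at x = n - 1; project steps_ahead intervals past it.
    return intercept + slope * (n - 1 + steps_ahead)


def should_act(samples_ms, target_ms, steps_ahead=3):
    """True if the projected latency exceeds the target within the look-ahead."""
    return predicted_latency(samples_ms, steps_ahead) > target_ms


# Hypothetical Gold datastore: still under 10 ms, but climbing steadily.
recent = [4, 5, 6, 7, 8]
print(should_act(recent, target_ms=10))
```

Here the latest reading is still inside the Gold limit, yet the trend says the limit will be crossed within the look-ahead window, which is the moment to move load rather than after the alert fires.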
Image source: Isla Fisher dealing with watertank storage management challenges caused by manual constraints in Now You See Me