Big Data is causing my Ops teams Big Headaches. Steven and the Dev team are doing a great job of adapting the development model to utilize Big Data as a data management option for the business, but the challenges are coming as the Ops teams are trying to understand what exactly drives the Big Data machine.
How Big is Big Data?
There is a difference between uppercase Big Data and lowercase big data. It’s not really a specific measurement, but think millions of objects and rows. Big Data is usually very big, but what is more important is that Big Data is like “cloud” in that it is a methodology as much as it is a technology.
Big Data is about the way in which data is stored, organized, and retrieved. The change in methodology is rather important because of the impact that it will have on our Ops teams when running a Big Data environment. Operational models differ entirely from what we have grown up with for databases up to this point.
Ops Lessons in Big Data versus RDBMS
The traditional “database” systems that we have come to know and love (read: despise) are almost always what we know as RDBMS, or Relational Database Management Systems. These include MySQL, Microsoft SQL Server, PostgreSQL and many others. You can often see the similarity by the presence of SQL (Structured Query Language) as a part of the name.
Enter the next generation of data storage: the NoSQL movement. NoSQL (often read as “not only SQL”) is a family of systems that set aside the relational model in favor of key-value stores, document databases, graph stores, and wide-column stores. Popular NoSQL products include MongoDB, CouchDB, Solr, and more. This shift shows how changing application designs are changing our requirements for back-end data storage.
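To make the contrast concrete, here is a minimal sketch comparing the two models, using Python's built-in sqlite3 as a stand-in for an RDBMS and a plain dict as a stand-in for a key-value/document store. The table, keys, and field names are purely illustrative, not any specific product's API.

```python
import sqlite3

# Relational model: schema declared up front, rows queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'London')")
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()

# Document model: each record is a self-describing object looked up by key;
# documents in the same store can have different shapes, with no schema migration.
documents = {
    "user:1": {"name": "Ada", "city": "London"},
    "user:2": {"name": "Grace", "titles": ["RADM"]},  # different fields, same store
}
doc = documents["user:1"]

print(row[0], doc["name"])  # prints "Ada Ada"
```

The operational difference follows from the model: the relational store enforces structure centrally, while the document store pushes structure out to the application, which is part of why it scales out so readily.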
Once the applications began to move into this more scale-out architecture on both the front end and back end, we saw a fundamental change happening in how we host and manage infrastructure to support the new style of development. This is where we find an interesting challenge in front of us.
Your Storage Isn’t Ready for Big Data
This is a rather broad statement, but it is safe to say that the traditional storage models aren’t aligned with the requirements for operating and optimizing Big Data. Some of the key characteristics of Big Data include, but aren’t limited to these:
• Scalability
• Storage tiers
• Resiliency
• Performance
• Data locality
The most common hosting platform for Big Data storage has often been NAS because of the ability to scale as required. Unfortunately, not every NAS will provide the performance and tiers that we may require. Let’s have a little review of each of these requirements.
It’s called Big Data for a reason. It will grow, potentially at a rapid pace. We are quickly finding out that peta- (not the animal-rights PETA) is a prefix we will see much more often. Where terabytes were once the high water mark of storage, we have moved beyond hundreds of terabytes, and the volume of data is growing at a consistent and often rapid rate.
Big Data is no different from any other kind of data in one respect: its access profile varies. The profile will be wildly different depending on the workload from moment to moment. Where some data is accessed infrequently and can sit on slower, cheaper capacity tiers, hot data will need to be moved onto fast, responsive flash-tier storage to ensure the lowest latency.
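A tiering policy like the one just described can be sketched in a few lines. Everything here is hypothetical for illustration: the tier names, the access-count threshold, and the idea of a single promotion rule (real tiering engines weigh recency, size, and cost as well).

```python
# Minimal sketch of access-frequency tiering: objects read at least
# HOT_THRESHOLD times in the measurement window are promoted to a
# (hypothetical) flash tier; everything else stays on the capacity tier.
HOT_THRESHOLD = 100

def place(access_counts):
    """Map object id -> tier name based on recent access counts."""
    return {
        obj: ("flash" if count >= HOT_THRESHOLD else "capacity")
        for obj, count in access_counts.items()
    }

placement = place({"img-001": 4821, "log-2014-07": 3, "idx-main": 150})
print(placement)
```

Run periodically against access statistics, a rule like this is what keeps the hot working set on flash while cold data ages down to cheaper media.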
Storage systems for Big Data will not fit the models we support today, whether monolithic arrays, all-flash systems, or traditional NAS. There must be a hybrid approach that meets our scalability needs and also delivers multiple tiers for all use-cases.
Gene Kranz was famous for saying “Failure is not an option” during the Apollo 13 mission, regarding the concern that the astronauts might not make it home. In that case, the goal was to overcome adversity and prove that failure was not going to be the outcome. This holds very true with regards to Big Data storage design.
Ed Harris as Gene Kranz in the movie Apollo 13 (Image: Universal Studios)
Storage architecture designed for resiliency is all about embracing failure. Failure is absolutely an option, but it should never be felt by the applications. Hardware- and software-driven storage needs to be architected for continuous awareness of health, and to dynamically recover and rebuild as failures occur within the environment.
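Here is a minimal sketch of that recover-and-rebuild loop: given a desired replica count and a map of which nodes hold each object, find the objects left under-replicated by failed nodes and pick healthy nodes to rebuild onto. The node names, replica count, and selection rule are all assumptions for illustration, not any real system's behavior.

```python
# Desired number of copies of every object (illustrative).
REPLICAS = 3

def rebuild_plan(placement, healthy_nodes):
    """Return object -> new replica list for objects hurt by node failures."""
    plan = {}
    for obj, nodes in placement.items():
        alive = [n for n in nodes if n in healthy_nodes]
        missing = REPLICAS - len(alive)
        if missing > 0:
            # Pick replacement nodes that don't already hold a copy.
            candidates = [n for n in sorted(healthy_nodes) if n not in alive]
            plan[obj] = alive + candidates[:missing]
    return plan

placement = {"blob-a": ["n1", "n2", "n3"], "blob-b": ["n2", "n4", "n5"]}
healthy = {"n1", "n3", "n4", "n5", "n6"}  # n2 has failed
print(rebuild_plan(placement, healthy))
```

The point of the sketch is that the failure is absorbed by the storage layer itself: applications keep reading surviving replicas while the rebuild restores full redundancy in the background.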
We’ve talked about tiered storage, and the real reason that is a requirement is performance. Performance requirements based on different use-cases will drive the need to have different performance tiers on the physical storage.
Ensuring that our data is delivered with low latency and high performance to the applications is the goal in order to provide the best consumer experience. Adapting our storage tools and infrastructure will be needed both up front, and on an ongoing basis as the environment grows.
Availability in the general sense is clearly important for our data, but equally, and often more, important is data locality: where the data sits relative to the workload that uses it to render content. Whether that workload is a processing system doing calculations on the data or a simple search spread across a broad selection, data locality can affect latency and application performance.
We know that by its design, scale-out storage to support Big Data does not necessarily take data locality into account during the distribution of objects across physical storage. For this reason, we have a lot to consider during infrastructure design and requirements gathering.
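One common mitigation on the read path is locality-aware replica selection: when an object has copies on several nodes, prefer the copy closest to the caller. The rack topology and node names below are made up for illustration; real systems (HDFS's rack awareness, for example) use richer topologies than a flat rack label.

```python
def nearest_replica(replicas, racks, local_rack):
    """Prefer a replica in the caller's rack; fall back to the first replica.

    replicas:   ordered list of node names holding a copy of the object
    racks:      node name -> rack label (illustrative topology)
    local_rack: rack of the node doing the read
    """
    for node in replicas:
        if racks[node] == local_rack:
            return node
    return replicas[0]

racks = {"n1": "rack-a", "n2": "rack-b", "n3": "rack-a"}
print(nearest_replica(["n2", "n3"], racks, "rack-a"))  # picks n3, same rack
```

Even when object distribution ignores locality, a read-side rule like this keeps most traffic off the inter-rack links, which is exactly the kind of requirement worth surfacing during infrastructure design.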
Which is the Best Big Data Platform?
That is the question of the day. We have to work with our developers to really dive into the specific requirements in order to evaluate our options. That is our next step. It was important to get a sense of the basic needs of our Big Data platforms, so next we will sit down with our Dev friends and evaluate the product ecosystem. This is how we make the best decisions on the way to becoming a Big Data provider.