Ops Are from Mars, Devs Are from Venus: Orchestrating your Big Data Environment

January 9th, 2015 by

Spock is to Devs as Scotty is to Ops

We’ve established that we are going to be running our Big Data platform, and we have chosen the tools to operate it thanks to the work we have done with our Dev team to select the best platform for our purposes. So, what is the next step for the Ops team in building out the Big Data infrastructure? Orchestration!

Recall that one of the enablers of our DevOps methodology is the use of orchestration and automated testing to build out our application infrastructure. Our Big Data infrastructure should be no different.

Don’t hesitate, orchestrate

If we pick Hadoop as an example, the Hadoop Ecosystem page (http://hadoopecosystemtable.github.io/) already lists a fair number of tools under System Deployment. Not only are there many different ways to use the Hadoop ecosystem products (Cassandra, HBase, Spark, etc.), but once we attach a use case to a product, we still need to deploy and manage these tools.

We will highlight a couple of deployment and management frameworks to illustrate how they tie in with our Hadoop environment and to show some key aspects of what they do.


Deploop

Using a 3-layered approach, Deploop is a build and operations tool for a Hadoop environment. Along with the catchy name, it comes with a well laid out architecture.

From the deploop.github.io page:

  1. Batch layer: which functionality is managing the master dataset, which is an append-only set of raw data, and handle pre-computing arbitrary query functions, called batch views.

  2. Serving layer: indexes the batch views so that they can be queried in ad hoc with low latency.

  3. Speed layer: accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.

The value added by the multiple layers is the ability to perform multiple functions from the same toolkit. First, you deploy your infrastructure using the batch layer, which lets you build a Hadoop cluster from scratch. Once the cluster is deployed, you can start and stop the cluster and its security services easily using the same tool.

With the Serving layer, Deploop indexes the batch views you have already deployed so that ad hoc search queries can be answered with low latency, helped by awareness of data locality.

The Speed layer is the last of the three layers and is designed to serve requests for recent data with the lowest possible latency. Because it depends on data that has only recently been written, it relies on integration with other parts of the Deploop engine and its catalog of index data.
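To make the division of labor between the three layers more concrete, here is a minimal sketch in Python. It is not Deploop code and does not use any Deploop API; the event data, function names, and the page-hit counting use case are all assumptions chosen for illustration. It only mirrors the roles quoted above: a batch view precomputed over an append-only dataset, a speed view over recent data, and a query that merges the two.

```python
from collections import Counter

# Append-only master dataset of (timestamp, page) events. Illustrative data only.
master_dataset = [(1, "home"), (2, "docs"), (3, "home")]


def build_batch_view(events):
    """Batch layer: precompute a view (hits per page) over the full dataset."""
    return Counter(page for _, page in events)


def build_speed_view(recent_events):
    """Speed layer: incrementally count only events newer than the batch view."""
    view = Counter()
    for _, page in recent_events:
        view[page] += 1
    return view


def query(page, batch_view, speed_view):
    """Serving side: answer a query by merging precomputed and recent counts."""
    return batch_view[page] + speed_view[page]


batch_view = build_batch_view(master_dataset)
recent = [(4, "home"), (5, "blog")]  # events that arrived after the batch run
speed_view = build_speed_view(recent)

print(query("home", batch_view, speed_view))  # 2 from batch + 1 recent = 3
```

In a real deployment the batch view would be rebuilt periodically and the speed view discarded once the batch run has caught up; the sketch leaves that cycle out.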

You will notice from the diagram below that the orchestration engine is our often-discussed Puppet, already a favorite configuration management tool among many Ops teams. That familiarity with the engine that drives Deploop, and potentially many other parts of your data center, helps to build additional comfort.



(image source: http://deploop.github.io/images/puppet-mcollective.jpg)

Deploop is still under active development, so we will continue to see growth in the feature set over time. Just like with any of our Big Data environments, the use of our configuration management tool will be situation-dependent. The fit may be strong for one deployment, while another Big Data deployment scenario may require additional features that aren’t yet available within our current orchestration environment. This is why it is important for us to be aware of other alternatives as well.

Apache Ambari

Along with Hadoop itself, another Apache project is Ambari, which could be the most full-featured of the Hadoop orchestration and systems management tools available to date. Licensed under the Apache License 2.0, it is another open source project with many contributors and a very active development ecosystem.

Ambari uses a 3-tiered approach for functions including Provision, Manage, and Monitor. This extends what we saw with Deploop by adding a monitoring layer.

The modular architecture of the Ambari server separates its features into loosely coupled components, with the interdependencies between them handled through the coordinator API.

The server deploys neatly inside a VM, and many different orchestrated deployments of the Ambari framework itself are available. Once you’ve got your Ambari infrastructure up and running, you can use the RESTful API (https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md) to get underway with deploying your Hadoop infrastructure.
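As a rough idea of what a first call against that API might look like, the sketch below builds (but does not send) a request for the clusters collection using only the Python standard library. The host name ambari.example.com, port 8080, and the admin/admin credentials are placeholders and assumptions, not values from this article; check the API documentation linked above for the resources and authentication your version expects.

```python
import base64
import urllib.request


def build_clusters_request(base_url, user, password):
    """Build, without sending, a GET request for the Ambari clusters collection."""
    url = base_url.rstrip("/") + "/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    # HTTP Basic authentication header
    req.add_header("Authorization", "Basic " + token)
    # Assumption: Ambari expects this header on write requests (POST, PUT, DELETE);
    # it is included here so the same helper pattern can be reused for those calls.
    req.add_header("X-Requested-By", "ambari")
    return req


# Placeholder host and credentials; replace with your own environment's values.
req = build_clusters_request("http://ambari.example.com:8080", "admin", "admin")
print(req.get_method(), req.full_url)
```

Sending it is then a matter of passing the request to urllib.request.urlopen and parsing the JSON body. The same pattern can be wrapped by whatever configuration management tool already drives the rest of the environment.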

As mentioned, adding Nagios as a monitoring layer to the Ambari ecosystem is a nice touch. Using the pre-built Nagios, you can keep track of basic health information for your Hadoop environment.

The Ambari system also comes with its own shell environment as well as a Python shell:


(image source: https://cwiki.apache.org/confluence/display/AMBARI/Ambari+Shell)

Now that we are armed with a fully RESTful API, and a shell for interactive management, we can deploy, manage, and monitor our Hadoop clusters and create all of the orchestrated tie-ins to our existing configuration management systems as well.

Just a sample, with many more to choose from

There are other alternatives already available, and many more coming, to orchestrate the Hadoop architecture. What this shows is a strong focus on programmatic management of a Hadoop system, which is a key part of our DevOps culture: it enables the development teams to build and deploy quickly, and lets our operations teams use the same methods to run QA and production infrastructure.
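One way to picture "the same methods" for QA and production is a single cluster definition that is rendered with different parameters per environment. The sketch below is purely illustrative: the function name, the host naming scheme, and the dictionary layout are hypothetical and do not correspond to the format of any particular tool.

```python
def render_cluster(environment, worker_count):
    """Return a cluster definition; only the name and size vary per environment."""
    if worker_count < 1:
        raise ValueError("a cluster needs at least one worker")
    hosts = [{"name": f"{environment}-master-1", "role": "master"}]
    hosts += [
        {"name": f"{environment}-worker-{i}", "role": "worker"}
        for i in range(1, worker_count + 1)
    ]
    return {"cluster": f"hadoop-{environment}", "hosts": hosts}


# The same code path produces a small QA cluster and a larger production one.
qa = render_cluster("qa", 2)
prod = render_cluster("prod", 10)
print(len(qa["hosts"]), len(prod["hosts"]))  # 3 11
```

Because both environments come from the same definition, a change tested in QA is the same change that later reaches production.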

If you use a system other than Hadoop, there are also many options for each of those platforms. Programmatic management is a theme that the industry is adhering to now and going forward.

Big Data is coming, and whether you think you are ready for it or not, this is the ideal time to test the waters. Remember to think big about management rather than just data. Having an effective orchestration environment for the Big Data ecosystem at your organization could be the difference between a successful deployment and a challenged one.

Using an orchestrated environment, we ensure the consistency needed to grow and shrink our environment on demand. As the Six Million Dollar Man character Oscar Goldman was famous for saying: “We can rebuild him, we have the technology.”

It seems that Oscar had a good orchestration environment to help him out.
