The first articles in this series introduced Apache Spark, presented the traditional flow of a Spark application, and reviewed the components that make Spark work. We then examined Spark’s distributed architecture to better understand how it operates across a cluster of machines and walked through setting up a local Spark working environment.
This article builds on that foundation by reviewing Spark’s support for processing streaming data: Spark Streaming.
Introduction to Spark Streaming
In the previous articles we learned that Apache Spark’s abstraction for data contained in various systems is called a resilient distributed dataset, or RDD. We were able to create a reference to a data source in the form of an RDD, apply transformations to that RDD, including merging RDDs from different sources, and then execute actions that generate results. In the example we built, we loaded the text of a book, processed it, and then output our results.
Streams are a little different: we do not have one big piece of data we want to process, but rather we have data trickling in from the time we start listening until we stop processing. Stream processing in Spark is accomplished using the Spark Streaming extension.
The Spark Streaming programmer’s guide provides the following description of Spark Streaming:
“Spark Streaming is an extension to the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.”
This is shown graphically in figure 1.
A streaming process may receive messages published to Apache Kafka, may receive files uploaded to Amazon S3, or may receive messages tweeted to Twitter. All of these serve as input to your Spark Streaming application that can then be executed by the Spark engine to generate results.
When a Spark Streaming application receives a live input data stream it then divides that stream into small batches that are sent to the Spark Engine to generate batches of processed data, which is shown in figure 2.
To maintain consistency with the RDD model, Spark Streaming introduces a new high-level abstraction called a discretized stream, or DStream. A DStream references its source and contains a sequence of RDDs. In short, to use Spark Streaming we create a DStream over our input data stream; it delivers batches of data in the form of RDDs, and we process those RDDs just as we processed the RDDs discussed in the previous articles.
Getting Started with Spark Streaming
This column is not focused on development, but again I thought a simple example might help translate these abstract concepts into something more tangible. So let’s build an example that monitors a directory and, when a new file is added to that directory, counts all of the unique words in that file. This is similar to the example in the previous articles; the difference is that this example runs indefinitely and processes files as they appear.
You can download the example source code from GitHub. The main source code file is src/main/scala/sparkstreaming/SparkStreamingExample.scala and is shown in listing 1.
Listing 1. SparkStreamingExample.scala
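The listing is not reproduced here, but a sketch consistent with the description that follows might look like the code below (the exact names in the GitHub repository may differ):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingExample {
  def main(args: Array[String]): Unit = {
    // Run locally with two threads: one for Spark's own bookkeeping, one to process data
    val conf = new SparkConf()
      .setAppName("SparkStreamingExample")
      .setMaster("local[2]")

    // Create a streaming context with a 15-second batch interval
    val ssc = new StreamingContext(conf, Seconds(15))

    // Monitor the "data" directory for new text files; each batch delivers the new lines
    val lines = ssc.textFileStream("data")

    // Split each line into words, map each word to (word, 1),
    // then reduce by key to sum the counts for each unique word
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Print the first elements of each batch to the console
    wordCounts.print()

    // Start processing and run until the application is terminated
    ssc.start()
    ssc.awaitTermination()
  }
}
```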
The application begins by creating a Spark configuration. Setting the master name to “local[2]” means that the application runs locally and uses two threads. You need at least two threads because one thread is used by Spark to manage its tasks and one thread is used to actually execute your task. It is common to set the master name to something like “local[*]”, which tells Spark to use all cores available on the machine, but if you happen to be running on a machine with a single core (such as a small or micro EC2 instance), then the application won’t run.
With a Spark configuration in hand, the next step is to create a streaming context. In this example we create one with a 15-second batch interval: Seconds(15). The streaming context is the primary interface to Spark Streaming; we use its textFileStream() method to tell Spark Streaming to monitor the specified directory (“data”) at the configured interval (Seconds(15)) and return an input stream over text files. Each text file is automatically parsed into lines that we can operate on:
- Split the line into words
- Perform a map/reduce operation: map each word to a tuple of the word and a count of 1, and then reduce by grouping identical words together and summing their counts (adding all of the 1’s together)
Finally, we print out the results so that we can see each word and its respective count.
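The map/reduce step can be illustrated with plain Scala collections, independent of Spark (the two sample lines are hypothetical stand-ins for the lines of one batch):

```scala
// Two sample lines, standing in for the lines delivered in a batch
val lines = Seq("This is a test", "This is another test")

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Map each word to (word, 1), then reduce by summing the counts per word
val counts = words
  .map(word => (word, 1))
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// counts("test") == 2, counts("another") == 1
```

In the Spark version, the grouping and summing is performed by reduceByKey, which does the same work in a distributed fashion across the RDDs of the stream.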
With our operations defined, we start the streaming context and then call its awaitTermination() method to tell the application to run until we terminate it by pressing Ctrl-C.
Running the Example
The example above requires a little bit of setup to run. First, download and install Spark from the Apache Spark website. I chose to install the package prebuilt for “Hadoop 2.6 or later”. You can build from source instead, but then you will need to compile it against a local version of Hadoop, which is a fair amount of work. Once you have downloaded and decompressed it, set the SPARK_HOME environment variable; for example, my SPARK_HOME is defined as follows:
$ export SPARK_HOME=~/apps/spark-1.6.1-bin-hadoop2.6
You can test your installation by executing one of the examples that ships with Spark. Execute the following command from your SPARK_HOME directory:
$ ./bin/run-example SparkPi 10
If your installation was successful, you should see Spark start up, run its demo, and shut down without displaying any error messages.
The application is written in Scala and uses the Scala Simple Build Tool (SBT) to build the project. Building and running this application is beyond the scope of this article, but here is a brief description for the adventurous.
The application can be built as follows:
$ sbt package
SBT has options like “clean” to clean up the environment, “compile” to compile the source code, and “package” to build the resultant JAR file that contains the compiled source code. In this case, SBT creates the following file:

target/scala-2.11/spark-streaming-example_2.11-1.0.jar
The application can then be run by “submitting” it to Spark:
$ $SPARK_HOME/bin/spark-submit --class sparkstreaming.SparkStreamingExample target/scala-2.11/spark-streaming-example_2.11-1.0.jar
The logging is verbose, but once the application is running, copy a text file to your “data” directory (a directory named data relative to the directory you launched the application). For example, here is the file that I created in the data directory:
This is a test
This is another test
The logging showed the following output:
16/05/06 14:33:02 INFO DAGScheduler: Job 17 finished: print at SparkStreamingExample.scala:23, took 0.010863 s
Time: 1462559582000 ms
As you can see, the application outputs each unique word and its count. You can keep copying files to the data directory and, every 15 seconds, the application will pick up the new files, create RDDs from them, and process them.
This article continued our series on Apache Spark by reviewing Spark Streaming. It reviewed how Spark Streaming receives input and divides it into batches for processing and how those batches of input yield batches of results. Spark Streaming is a good solution for processing data that arrives on a semi-regular interval, such as new files being uploaded to HDFS or S3, messages published to Apache Kafka, or messages tweeted to Twitter.
For the adventurous, this article also presented a Scala example that created a directory monitor and processed files copied into that directory, performing a word count map/reduce operation.
The important thing to understand about Spark Streaming, from an operational perspective, is that Spark Streaming applications connect to an input source and run small batch processes as data arrives at those input sources. Performance and capacity will be determined by how frequently data arrives at those sources.
In the next article we will review another Spark-based technology: Spark SQL.