CueSheet – Easy spark application deployment guide

CueSheet is a framework for writing Apache Spark 2.x applications more conveniently, designed to neatly separate the concerns of the business logic and the deployment environment, as well as to minimize the usage of shell scripts which are inconvenient to write and do not support validation. To jump-start, check out cuesheet-starter-kit which provides the skeleton for building CueSheet applications. CueSheet is featured in Spark Summit East 2017.

An example of a CueSheet application is shown below. Any Scala object extending CueSheet becomes a CueSheet application; the object body can then use the variables like sc, sqlContext, and spark to write the business logic, as if it is inside spark-shell:

import com.kakao.cuesheet.CueSheet

object Example extends CueSheet {{
  val rdd = sc.parallelize(1 to 100)
  println(s"sum = ${rdd.sum()}")
  println(s"sum2 = ${rdd.map(_ + 1).sum()}")
}}

CueSheet will take care of creating SparkContext or SparkSession according to the configuration given in a separate file, so that your application code can contain just the business logic. Furthermore, CueSheet will launch the application locally or to a YARN cluster by simply running your object as a Java application, eliminating the need to use spark-submit and accompanying shell scripts.

CueSheet also supports Spark Streaming applications, via ssc. When it is used in the object body, it automatically becomes a Spark Streaming application, and ssc provides access to the StreamingContext.

Importing CueSheet

libraryDependencies += "com.kakao.cuesheet" %% "cuesheet" % "0.10.0"

CueSheet can be used in Scala projects by configuring SBT as above. Note that this dependency is not specified as"provided", which makes it possible to launch the application right in the IDE, and even debug using breakpoints in driver code when launched in client mode.

Configuration

Configurations for your CueSheet application, including Spark configurations and the arguments in spark-submit, are specified using the HOCON format. It is by default application.conf in your classpath root, but an alternate configuration file can be specified using -Dconfig.resource or -Dconfig.file. Below is an example configuration file.

spark {
  master = "yarn:classpath:com.kakao.cuesheet.launcher.test"
  deploy.mode = cluster

  hadoop.user.name = "cloudera"

  executor.instances = 2
  executor.memory = 1g
  driver.memory = 1g

  streaming.blockInterval = 10000
  eventLog.enabled = false
  eventLog.dir = "hdfs:///user/spark/applicationHistory"
  yarn.historyServer.address = "http://history.server:18088"

  driver.extraJavaOptions = "-XX:MaxPermSize=512m"
}

Unlike the standard spark configuration, spark.master for YARN should include an indicator for finding YARN/Hive/Hadoop configurations. It is the easiest to put the XML files inside your classpath, usually by putting them undersrc/main/resources, and specify the package classpath as above. Alternatively, spark.master can contain a URL to download the configuration in a ZIP file, e.g. yarn:http://cloudera.manager/hive/configuration.zip, copied from Cloudera Manager’s ‘Download Client Configuration’ link. The usual local or local[8] can also be used asspark.master.

deploy.mode can be either client or cluster, and spark.hadoop.user.name should be the username to be used as the Hadoop user. CueSheet assumes that this user has the write permission to the home directory.

Using HDFS

While submitting an application to YARN, CueSheet will copy Spark and CueSheet’s dependency jars to HDFS. This way, in the next time you submit your application, CueSheet will analyze your classpath to find and assemble only the classes that are not part of the already installed jars.

One-Liner for Easy Deployment

When given a tag name as system property cuesheet.install, CueSheet will print a rather long shell command which can launch your application from anywhere hdfs command is available. Below is an example of the one-liner shell command that CueSheet produces when given -Dcuesheet.install=v0.0.1 as a JVM argument.

rm -rf SimpleExample_2.10-v0.0.1 && mkdir SimpleExample_2.10-v0.0.1 && cd SimpleExample_2.10-v0.0.1 &&
echo '<configuration><property><name>dfs.ha.automatic-failover.enabled</name><value>false</value></property><property><name>fs.defaultFS</name><value>hdfs://quickstart.cloudera:8020</value></property></configuration>' > core-site.xml &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/applications/com.kakao.cuesheet.SimpleExample/v0.0.1/SimpleExample_2.10.jar \!SimpleExample_2.10.jar &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/lib/0.10.0-SNAPSHOT-scala-2.10-spark-2.1.0/*.jar &&
java -classpath "*" com.kakao.cuesheet.SimpleExample "hello" "world" && cd .. && rm -rf SimpleExample_2.10-v0.0.1

What this command does is to download the CueSheet and Spark jars as well as your application assembly from HDFS, and launch the application in the same environment that was launched in the IDE. This way, it is not required to haveHADOOP_CONF_DIR or SPARK_HOME properly installed and set on every node, making it much easier to use it in distributed schedulers like Marathon, Chronos, or Aurora. These schedulers typically allow a single-line shell command as their job specification, so you can simply paste what CueSheet gives you in the scheduler’s Web UI.

Additional Features

Being started as a library of reusable Spark functions, CueSheet contains a number of additional features, not in an extremely coherent manner. Many parts of CueSheet including these features are powered by Mango library, another open-source project by Kakao.

One additional quirk is the “stop” tab CueSheet adds to the Spark UI. As shown below, it features three buttons with an increasing degree of seriousness. To stop a Spark Streaming application, to possibly trigger a restart by a scheduler like Marathon, one of the left two buttons will do the job. If you need to halt a Spark application ASAP, the red button will immediately kill the Spark driver.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s