How Applications are Executed on a Spark Cluster
- Anatomy of a Spark Application
- Spark Applications Using the Standalone Scheduler
- Deployment Modes for Spark Applications Running on YARN
- Summary
Before you begin your journey as a Spark programmer, you should have a solid understanding of the Spark application architecture and how applications are executed on a Spark cluster. This chapter closely examines the components of a Spark application, looks at how these components work together, and looks at how Spark applications run on Standalone and YARN clusters.
Save 35% off the list price* of the related book or multi-format eBook (EPUB + MOBI + PDF) with discount codeARTICLE.
* Seeinformit.com/terms
It is not the beauty of a building you should look at; it’s the construction of the foundation that will stand the test of time.
David Allan Coe, American songwriter
In This Chapter:
Detailed overview of the Spark application and cluster components
Spark resource schedulers and Cluster Managers
How Spark applications are scheduled on YARN clusters
Spark deployment modes
Before you begin your journey as a Spark programmer, you should have a solid understanding of the Spark application architecture and how applications are executed on a Spark cluster. This chapter closely examines the components of a Spark application, looks at how these components work together, and looks at how Spark applications run on Standalone and YARN clusters.
Anatomy of a Spark Application
A Spark application contains several components, all of which exist whether you’re running Spark on a single machine or across a cluster of hundreds or thousands of nodes.
Each component has a specific role in executing a Spark program. Some of these roles, such as the client components, are passive during execution; other roles are active in the execution of the program, including components executing computation functions.
The components of a Spark application are theDriver,Master,Cluster Manager, and theExecutor(s), which run on worker nodes, orWorkers.Figure 3.1shows all the Spark components in the context of a Spark Standalone application. You will learn more about each component and its function in more detail later in this chapter.
Figure 3.1Spark Standalone cluster application components.
All Spark components, including the Driver, Master, and Executor processes, run in Java virtual machines (JVMs). A JVM is a cross-platform runtime engine that can execute instructions compiled into Java bytecode. Scala, which Spark is written in, compiles into bytecode and runs on JVMs.
It is important to distinguish between Spark’s runtime application components and the locations and node types on which they run. These components run in different places using different deployment modes, so don’t think of these components in physical node or instance terms. For instance, when running Spark on YARN, there would be several variations of Figure 3.1. However, all the components pictured are still involved in the application and have the same roles.
Spark Driver
The life of a Spark application starts and finishes with the Spark Driver. The Driver is the process that clients use to submit applications in Spark. The Driver is also responsible for planning and coordinating the execution of the Spark program and returning status and/or results (data) to the client. The Driver can physically reside on a client or on a node in the cluster, as you will see later.
SparkSession
The Spark Driver is responsible for creating theSparkSession. The SparkSession object represents a connection to a Spark cluster. The SparkSession is instantiated at the beginning of a Spark application, including the interactive shells, and is used for the entirety of the program.
Prior to Spark 2.0, entry points for Spark applications included the SparkContext, used for Spark core applications; the SQLContext and HiveContext, used with Spark SQL applications; and the StreamingContext, used for Spark Streaming applications. The SparkSession object introduced in Spark 2.0 combines all these objects into a single entry point that can be used for all Spark applications.
Through its SparkContext and SparkConf child objects, the SparkSession object contains all the runtime configuration properties set by the user, including configuration properties such as the Master, application name, number of Executors, and more.Figure 3.2显示了SparkSession对象和它的一些设计iguration properties within apysparkshell.
Figure 3.2SparkSession properties.
Listing 3.1 demonstrates how to create a SparkSession within a non-interactive Spark application, such as a program submitted usingspark-submit.
Listing 3.1Creating a SparkSession
from pyspark.sql import SparkSession spark = SparkSession.builder .master("spark://sparkmaster:7077") .appName("My Spark Application") .config("spark.submit.deployMode", "client") .getOrCreate() numlines = spark.sparkContext.textFile("file:///opt/spark/licenses") .count() print("The total number of lines is " + str(numlines))
Application Planning
One of the main functions of the Driver is to plan the application. The Driver takes the application processing input and plans the execution of the program. The Driver takes all the requestedtransformations(data manipulation operations) andactions(requests for output or prompts to execute programs) and creates adirected acyclic graph(DAG) ofnodes, each representing a transformational or computational step.
A Spark application DAG consists oftasksandstages. A task is the smallest unit of schedulable work in a Spark program. A stage is a set of tasks that can be run together. Stages are dependent upon one another; in other words, there arestage dependencies.
在进程调度的意义上,熟练的技艺并不是唯一的to Spark. For instance, they are used in other Big Data ecosystem projects, such as Tez, Drill, and Presto for scheduling. DAGs are fundamental to Spark, so it is worth being familiar with the concept.
Application Orchestration
The Driver also coordinates the running of stages and tasks defined in the DAG. Key driver activities involved in the scheduling and running of tasks include the following:
Keeping track of available resources to execute tasks
Scheduling tasks to run “close” to the data where possible (the concept of data locality)
Other Functions
In addition to planning and orchestrating the execution of a Spark program, the Driver is also responsible for returning the results from an application. These could be return codes or data in the case of an action that requests data to be returned to the client (for example, an interactive query).
The Driver also serves the application UI on port 4040, as shown inFigure 3.3. This UI is created automatically; it is independent of the code submitted or how it was submitted (that is, interactive usingpysparkor non-interactive usingspark-submit).
Figure 3.3Spark application UI.
If subsequent applications launch on the same host, successive ports are used for the application UI (for example, 4041, 4042, and so on).
Spark Workers and Executors
Spark Executors are the processes on which Spark DAG tasks run. Executors reserve CPU and memory resources on slave nodes, or Workers, in a Spark cluster. An Executor is dedicated to a specific Spark application and terminated when the application completes. A Spark program normally consists of many Executors, often working in parallel.
Typically, a Worker node—which hosts the Executor process—has a finite or fixed number of Executors allocated at any point in time. Therefore, a cluster—being a known number of nodes—has a finite number of Executors available to run at any given time. If an application requires Executors in excess of the physical capacity of the cluster, they are scheduled to start as other Executors complete and release their resources.
As mentioned earlier in this chapter, JVMs host Spark Executors. The JVM for an Executor is allocated aheap, which is a dedicated memory space in which to store and manage objects. The amount of memory committed to the JVM heap for an Executor is set by the propertyspark.executor.memoryor as the--executor-memoryargument to thepyspark,spark-shell, orspark-submitcommands.
Executors store output data from tasks in memory or on disk. It is important to note that Workers and Executors are aware only of the tasks allocated to them, whereas the Driver is responsible for understanding the complete set of tasks and the respective dependencies that comprise an application.
By using the Spark application UI on port 404x驱动程序的主机,您可以检查执行者the application, as shown inFigure 3.4.
Figure 3.4Executors tab in the Spark application UI.
For Spark Standalone cluster deployments, a worker node exposes a user interface on port 8081, as shown inFigure 3.5.
Figure 3.5Spark Worker UI.
The Spark Master and Cluster Manager
The Spark Driver plans and coordinates the set of tasks required to run a Spark application. The tasks themselves run in Executors, which are hosted on Worker nodes.
The Master and the Cluster Manager are the central processes that monitor, reserve, and allocate the distributed cluster resources (or containers, in the case of YARN or Mesos) on which the Executors run. The Master and the Cluster Manager can be separate processes, or they can combine into one process, as is the case when running Spark in Standalone mode.
Spark Master
The Spark Master is the process that requests resources in the cluster and makes them available to the Spark Driver. In all deployment modes, the Master negotiates resources or containers with Worker nodes or slave nodes and tracks their status and monitors their progress.
When running Spark in Standalone mode, the Spark Master process serves a web UI on port 8080 on the Master host, as shown inFigure 3.6.
Figure 3.6Spark Master UI.
Cluster Manager
The Cluster Manager is the process responsible for monitoring the Worker nodes and reserving resources on these nodes upon request by the Master. The Master then makes these cluster resources available to the Driver in the form of Executors.
As discussed earlier, the Cluster Manager can be separate from the Master process. This is the case when running Spark on Mesos or YARN. In the case of Spark running in Standalone mode, the Master process also performs the functions of the Cluster Manager. Effectively, it acts as its own Cluster Manager.
A good example of the Cluster Manager function is the YARN ResourceManager process for Spark applications running on Hadoop clusters. The ResourceManager schedules, allocates, and monitors the health of containers running on YARN NodeManagers. Spark applications then use these containers to host Executor processes, as well as the Master process if the application is running in集群mode; we will look at this shortly.