Large companies analyze massive amounts of data coming from various sources such as social nets, weblogs, or customers.
An important class of data analytics concerns large-scale set operations. Suppose you have two customer data sets A and B.
Set A contains all customers who bought in 2017. Set B contains all customers who bought in 2018.
Your boss asks you for all high-value customers that bought in both years. Easy, you filter out the customers from both sets who bought for more than $10,000 and intersect the sets A and B.
Doing these kinds of set operations requires you to have access to all data items in memory. But memory on a single machine is limited. Moreover, filtering out a subset of customers can be slow for large datasets. It could be parallelized easily but not on a single machine.
The solution proposed by the Apache Spark system is to use the memory of multiple machines to store the large sets. So Spark distributes a single data set over multiple machines. These machines then work together executing the set operations in parallel.
What Does Apache Spark Do?
The distributed data analysis system Apache Spark enables users to perform large-scale data processing tasks on Big Data. The system facilitates easy distribution among multiple machines to accelerate processing. To the programmer using Spark, it provides a simple API for set-based computations such as
union(), and many more.
Apache Spark Operations Overview
Next, we explain selected Spark set operations that can be applied to an RDD Y. Spark calls these set operations transformations.
Y.map(f): Returns a new RDD by applying function f to each RDD element
Y.reduceByKey(f): Aggregates all (K,V) pairs with the same key K to a single (K,V) as specified by function f having the form f(V,V) → V
Y.filter(f): Creates a new RDD containing only elements for which f returns true.
Y.union(X): Creates a new RDD with elements that are either in X or in Y.
Y.intersection(X): Creates a new RDD with elements that are in both, X and Y.
Set up the Spark cluster with many worker machines once. After this, you can simply create sets and do some set operations as if the sets are on a single machine. This is very convenient for the programmer. It hides the complexity of Spark being a distributed system.
To store each data set, Spark uses a new data structure called Resilient Distributed Datasets (RDDs).
RDDs are distributed across multiple worker machines. RDDs can be only read but not modified. When you need to modify an RDD, you must create a modified copy of the old RDD.
RDDs are failure tolerant. If one or more of your machines fail, there is enough information to reconstruct each RDD. Spark does not store each version of the RDD on stable storage. This would result in huge overhead. Instead, Spark stores the lineage, i.e., the operations that have lead to the creation of each RDD.
With this powerful set representation, you can implement complex algorithms on large-scale data. Examples are machine learning algorithms like logistic regression or alternating least squares.
The runtime performance is about 10x better than that of the popular MapReduce system.
Want to learn Spark? It also has a Python API. Visit our Finxter web app to test and train your Python skills.
How to Install Spark on Python?
Spark is available in Python! To install it on your computer, run
pip install pyspark. This install the Python library PySpark.
$ pip install pyspark
Getting Started with Python
To get started with Spark, run an interactive session:
This assumes that
pyspark is installed on your computer.
The Python API is quite simple to use. Here’s an example from the official docs reading a text file:
>>> df = spark.read.text("README.md")
This stores the contents of the file
"README.md" in a DataFrame
df. You can now run operations on this DataFrame.
For example, to count the number of rows of the DataFrame, use
>>> df.count() 126
While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.
To help students reach higher levels of Python success, he founded the programming education website Finxter.com. He’s author of the popular programming book Python One-Liners (NoStarch 2020), coauthor of the Coffee Break Python series of self-published books, computer science enthusiast, freelancer, and owner of one of the top 10 largest Python blogs worldwide.
His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills. You can join his free email academy here.