Large companies analyze massive amounts of data coming from various sources such as social nets, weblogs, or customers.
An important class of data analytics concerns large-scale set operations. Suppose you have two customer data sets A and B.
Set A contains all customers who bought in 2017. Set B contains all customers who bought in 2018.
Your boss asks you for all high-value customers that bought in both years. Easy, you filter out the customers from both sets who bought for more than $10,000 and intersect the sets A and B.
Doing these kinds of set operations requires you to have access to all data items in memory. But memory on a single machine is limited. Moreover, filtering out a subset of customers can be slow for large datasets. It could be parallelized easily but not on a single machine.
The solution proposed by the Apache Spark system is to use the memory of multiple machines to store the large sets. So Spark distributes a single data set over multiple machines. These machines then work together executing the set operations in parallel.
Next, we explain selected Spark set operations that can be applied to an RDD Y. Spark calls these set operations transformations.
Y.map(f): Returns a new RDD by applying function f to each RDD element
Y.reduceByKey(f): Aggregates all (K,V) pairs with the same key K to a single (K,V) as specified by function f having the form f(V,V) → V
Y.filter(f): Creates a new RDD containing only elements for which f returns true.
Y.union(X): Creates a new RDD with elements that are either in X or in Y.
Y.intersection(X): Creates a new RDD with elements that are in both, X and Y.
Set up the Spark cluster with many worker machines once. After this, you can simply create sets and do some set operations as if the sets are on a single machine. This is very convenient for the programmer. It hides the complexity of Spark being a distributed system.
To store each data set, Spark uses a new data structure called Resilient Distributed Datasets (RDDs).
RDDs are distributed across multiple worker machines. RDDs can be only read but not modified. When you need to modify an RDD, you must create a modified copy of the old RDD.
RDDs are failure tolerant. If one or more of your machines fail, there is enough information to reconstruct each RDD. Spark does not store each version of the RDD on stable storage. This would result in huge overhead. Instead, Spark stores the lineage, i.e., the operations that have lead to the creation of each RDD.
With this powerful set representation, you can implement complex algorithms on large-scale data. Examples are machine learning algorithms like logistic regression or alternating least squares.
The runtime performance is about 10x better than that of the popular MapReduce system.
Want to learn Spark? It also has a Python API. Visit our Finxter web app to test and train your Python skills.