<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>rdd Archives - Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/tag/rdd/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/tag/rdd/</link>
	<description></description>
	<lastBuildDate>Sun, 27 Mar 2022 13:45:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>rdd Archives - Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/tag/rdd/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Apache Spark &#8211; A Short Overview</title>
		<link>https://blog.finxter.com/apache-spark/</link>
					<comments>https://blog.finxter.com/apache-spark/#respond</comments>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Sun, 03 Jan 2021 18:47:00 +0000</pubDate>
				<category><![CDATA[2-min Computer Science Papers]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Structures]]></category>
		<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Python Set]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[rdd]]></category>
		<category><![CDATA[resilient distributed data sets]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=411</guid>

					<description><![CDATA[<p>Large companies analyze massive amounts of data coming from various sources such as social nets, weblogs, or customers. An important class of data analytics concerns large-scale set operations. Suppose you have two customer data sets A and B. Set A contains all customers who bought in 2017. Set B contains all customers who bought in ... <a title="Apache Spark &#8211; A Short Overview" class="read-more" href="https://blog.finxter.com/apache-spark/" aria-label="Read more about Apache Spark &#8211; A Short Overview">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/apache-spark/">Apache Spark &#8211; A Short Overview</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="aligncenter"><a href="https://blog.finxter.com/wp-content/uploads/2018/07/Spark.png"><img fetchpriority="high" decoding="async" width="960" height="540" src="https://blog.finxter.com/wp-content/uploads/2018/07/Spark.png" alt="Spark System Example" class="wp-image-412" srcset="https://blog.finxter.com/wp-content/uploads/2018/07/Spark.png 960w, https://blog.finxter.com/wp-content/uploads/2018/07/Spark-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2018/07/Spark-768x432.png 768w" sizes="(max-width: 960px) 100vw, 960px" /></a></figure></div>



<p>Large companies analyze massive amounts of data coming from various sources such as social networks, weblogs, or customers.</p>



<p>An important class of data analytics concerns large-scale set operations. Suppose you have two customer data sets A and B.</p>



<p>Set A contains all customers who bought in 2017. Set B contains all customers who bought in 2018.</p>



<p>Your boss asks you for all high-value customers that bought in both years. Easy: you filter both sets to keep only the customers who spent more than $10,000, and then you intersect the filtered sets A and B.</p>
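
<p>On a single machine, the filter-then-intersect idea can be sketched in plain Python. This is only a minimal illustration: the customer names, the purchase totals, and the helper <code>high_value()</code> are made up for this example and are not part of Spark.</p>

```python
# Hypothetical purchase totals per customer for each year (made-up data).
purchases_2017 = {"alice": 12500, "bob": 800, "carol": 15000, "dave": 11000}
purchases_2018 = {"alice": 20000, "bob": 13000, "carol": 9000, "erin": 30000}

THRESHOLD = 10_000  # the "high-value" cutoff used in the example above


def high_value(purchases, threshold=THRESHOLD):
    """Return the set of customers who spent more than the threshold."""
    return {name for name, total in purchases.items() if total > threshold}


set_a = high_value(purchases_2017)  # high-value customers in 2017
set_b = high_value(purchases_2018)  # high-value customers in 2018

both_years = set_a & set_b  # set intersection
print(sorted(both_years))
```

<p>Both the filter and the intersection need the complete data in memory here, which is exactly the limitation that motivates a distributed system.</p>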



<p>These set operations require access to all data items in memory, but memory on a single machine is limited. Moreover, filtering a large dataset can be slow. The filter is easy to parallelize, but a single machine offers only limited parallelism.</p>



<p>The solution proposed by the <a rel="noreferrer noopener" href="https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf" target="_blank">Apache Spark</a> system is to use the memory of multiple machines to store the large sets. So Spark distributes a single data set over multiple machines. These machines then work together executing the set operations in parallel.</p>



<h2 class="wp-block-heading">Related Video</h2>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Niche - GraphX (Spark) Freelancer on Upwork ... to Rahul" width="937" height="527" src="https://www.youtube.com/embed/usUdxsSEAXA?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>You can learn more about the career opportunities of Spark developers in my detailed blog guide:</p>



<ul class="wp-block-list"><li><a href="https://blog.finxter.com/apache-spark-developer-income-and-opportunity/" data-type="post" data-id="259597" target="_blank" rel="noreferrer noopener">Apache Spark &#8212; Income and Opportunity</a></li></ul>



<h2 class="wp-block-heading">What Does Apache Spark Do?</h2>



<p>The distributed data analysis system Apache Spark enables users to perform large-scale data processing tasks on Big Data. The system facilitates easy distribution among multiple machines to accelerate processing. To the programmer using Spark, it provides a simple API for set-based computations such as <code>map()</code>, <code>reduce()</code>, <code>filter()</code>, <code>union()</code>, and many more.</p>



<h2 class="wp-block-heading">Apache Spark Operations Overview</h2>



<p>Next, we explain selected Spark set operations that can be applied to an RDD Y. Spark calls these set operations transformations.</p>



<ul class="wp-block-list"><li><code>Y.map(f)</code>: Returns a new RDD by applying function f to each element of Y.</li><li><code>Y.reduceByKey(f)</code>: Aggregates all (K,V) pairs with the same key K into a single (K,V) pair using a function f of the form f(V,V) → V.</li><li><code>Y.filter(f)</code>: Creates a new RDD containing only the elements for which f returns true.</li><li><code>Y.union(X)</code>: Creates a new RDD with the elements of both X and Y. Duplicates are kept, so apply <code>distinct()</code> if you need a true set union.</li><li><code>Y.intersection(X)</code>: Creates a new RDD with the elements that are in both X and Y.</li></ul>
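
<p>To get a feeling for what these transformations compute, here is a rough single-machine sketch that mimics them on ordinary Python lists. The helper names (<code>rdd_map</code>, <code>rdd_reduce_by_key</code>, and so on) are hypothetical stand-ins invented for this example, not the PySpark API, and nothing here is distributed or lazy.</p>

```python
def rdd_map(y, f):
    """Apply f to every element."""
    return [f(e) for e in y]


def rdd_filter(y, f):
    """Keep only the elements for which f returns true."""
    return [e for e in y if f(e)]


def rdd_reduce_by_key(y, f):
    """Merge the values of all (key, value) pairs that share a key using f."""
    merged = {}
    for key, value in y:
        merged[key] = f(merged[key], value) if key in merged else value
    return list(merged.items())


def rdd_union(y, x):
    """Concatenate both collections (duplicates are kept)."""
    return y + x


def rdd_intersection(y, x):
    """Elements present in both collections, without duplicates."""
    return sorted(set(y) & set(x))


y = [1, 2, 3, 4]
x = [3, 4, 5]
pairs = [("a", 1), ("b", 2), ("a", 3)]

squares = rdd_map(y, lambda e: e * e)                  # [1, 4, 9, 16]
evens = rdd_filter(y, lambda e: e % 2 == 0)            # [2, 4]
sums = rdd_reduce_by_key(pairs, lambda a, b: a + b)    # [("a", 4), ("b", 2)]
combined = rdd_union(y, x)                             # [1, 2, 3, 4, 3, 4, 5]
common = rdd_intersection(y, x)                        # [3, 4]
```

<p>In Spark, the same calls are methods on an RDD, and the work is split across the worker machines. The element order of a real RDD result is generally not guaranteed, unlike in this list-based sketch.</p>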



<p>You set up the Spark cluster with many worker machines once. After this, you can simply create sets and run set operations as if the sets were on a single machine. This is very convenient for the programmer because it hides the complexity of Spark being a distributed system.</p>



<p>To store each data set, Spark uses a data structure called the <em>Resilient Distributed Dataset</em> (RDD).</p>



<p>RDDs are distributed across multiple worker machines. RDDs are read-only: you can read them but not modify them. To change an RDD, you create a new RDD from the old one by applying a transformation.</p>



<p>RDDs are fault tolerant. If one or more of your machines fail, there is enough information to reconstruct each RDD. Spark does not store each version of the RDD on stable storage, because this would result in huge overhead. Instead, Spark stores the lineage, i.e., the operations that have led to the creation of each RDD.</p>
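
<p>The lineage idea can be illustrated with a small toy class. This is a simplified sketch under loud assumptions: <code>ToyDataset</code> and its methods are invented for this example, everything runs on one machine, and the whole result is recomputed after a simulated failure, whereas Spark works on distributed partitions.</p>

```python
class ToyDataset:
    """A read-only dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that reloads the base data
        self._lineage = tuple(lineage)   # ordered (kind, function) steps
        self._cache = None               # computed result, may get lost

    def map(self, f):
        # Returns a NEW dataset; the old one is never modified.
        return ToyDataset(self._source, self._lineage + (("map", f),))

    def filter(self, f):
        return ToyDataset(self._source, self._lineage + (("filter", f),))

    def _compute(self):
        # Replay the recorded operations on the base data.
        data = list(self._source())
        for kind, f in self._lineage:
            if kind == "map":
                data = [f(e) for e in data]
            else:
                data = [e for e in data if f(e)]
        return data

    def collect(self):
        if self._cache is None:
            self._cache = self._compute()
        return self._cache

    def lose_cache(self):
        """Simulate a machine failure that wipes the computed result."""
        self._cache = None


base = ToyDataset(lambda: range(10))
even_squares = base.filter(lambda e: e % 2 == 0).map(lambda e: e * e)

first = even_squares.collect()       # computed from the lineage
even_squares.lose_cache()            # simulated failure
recovered = even_squares.collect()   # rebuilt by replaying the lineage
print(first == recovered)
```

<p>Only the short list of operations is stored, not every intermediate version of the data, which keeps the overhead of fault tolerance low.</p>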



<p>With this powerful set representation, you can implement complex algorithms on large-scale data. Examples are machine learning algorithms like logistic regression or alternating least squares.</p>



<p>According to the Spark paper linked above, Spark can outperform the popular Hadoop MapReduce system by about 10x on iterative machine learning jobs.</p>



<p>Want to learn Spark? It also has a Python API. Visit our <a href="https://finxter.com">Finxter web app</a> to test and train your Python skills.</p>



<h2 class="wp-block-heading">How to Install Spark for Python?</h2>



<p>Spark is available in Python! To install it on your computer, run <code>pip install pyspark</code>. This installs the Python library <a href="https://pypi.org/project/pyspark/" target="_blank" rel="noreferrer noopener" title="https://pypi.org/project/pyspark/">PySpark</a>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="powershell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install pyspark</pre>



<h2 class="wp-block-heading">Getting Started with Python</h2>



<p>To get started with Spark, run an interactive session:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pyspark</pre>



<p>This assumes that <code>pyspark</code> is installed on your computer.</p>



<p>The Python API is quite simple to use. Here&#8217;s an example from the official docs reading a text file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df = spark.read.text("README.md")</pre>



<p>This stores the contents of the file <code>"README.md"</code> in a DataFrame <code>df</code>. You can now run operations on this DataFrame.</p>



<p>For example, to count the number of rows of the DataFrame, use <code>df.count()</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> df.count()
126</pre>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.finxter.com/apache-spark/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
