<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lukas Halim, Author at Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/author/lukashalim/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/author/lukashalim/</link>
	<description></description>
	<lastBuildDate>Sun, 03 Apr 2022 16:41:51 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Lukas Halim, Author at Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/author/lukashalim/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Matplotlib Text and Annotate — A Simple Guide</title>
		<link>https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/</link>
		
		<dc:creator><![CDATA[Lukas Halim]]></dc:creator>
		<pubDate>Sat, 22 May 2021 20:02:22 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Matplotlib]]></category>
		<category><![CDATA[Python]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=30243</guid>

					<description><![CDATA[<p>You&#8217;d like to add text to your plot, perhaps to explain an outlier or label points. Matplotlib&#8217;s text method allows you to add text at specified coordinates. But what if you want the text to refer to a particular point without being centered on that point? Often you&#8217;ll want the text slightly ... <a title="Matplotlib Text and Annotate — A Simple Guide" class="read-more" href="https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/" aria-label="Read more about Matplotlib Text and Annotate — A Simple Guide">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/">Matplotlib Text and Annotate — A Simple Guide</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>You&#8217;d like to add text to your plot, perhaps to explain an outlier or label points. <a href="https://blog.finxter.com/matplotlib-full-guide/" target="_blank" rel="noreferrer noopener" title="Matplotlib — A Simple Guide with Videos">Matplotlib</a>&#8217;s text method allows you to add text at specified coordinates. But what if you want the text to refer to a particular point without being centered on that point? Often you&#8217;ll want the text slightly below or above the point it&#8217;s labeling. In that situation, you&#8217;ll want the <code>annotate</code> method. With annotate, we can specify both the point we want to label and the position of the label.</p>



<h2 class="wp-block-heading" id="Basic-text-method-example">Basic text method example</h2>



<p>Let&#8217;s start with an example of the first situation &#8211; we simply want to add text at a particular point on our plot. The text method will place text anywhere you&#8217;d like on the <a href="https://blog.finxter.com/matplotlib-how-to-change-subplot-sizes/" target="_blank" rel="noreferrer noopener" title="Matplotlib – How to Change Subplot Sizes">plot</a>, or even place text outside the plot. After the import statement, we pass the required parameters &#8211; the x and y coordinates and the text.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import matplotlib.pyplot as plt

x, y, text = .5, .5, "text on plot"

fig, ax = plt.subplots()
ax.text(x, y, text)
x, y, text = 1.3, .5, "text outside plot"
ax.text(x, y, text)</pre>



<pre class="wp-block-preformatted">Text(1.3, 0.5, 'text outside plot')</pre>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="554" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-28.png" alt="" class="wp-image-30251" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-28.png 554w, https://blog.finxter.com/wp-content/uploads/2021/05/image-28-300x136.png 300w" sizes="(max-width: 554px) 100vw, 554px" /></figure>



<h2 class="wp-block-heading" id="Changing-the-font-size-and-font-color">Changing the font size and font color</h2>



<p>We can customize the text position and format using optional parameters. The font itself can be customized using either a <code>fontdict</code> object or with individual parameters.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">x, y, text = .3, .5, "formatted with fontdict"
fontdict = {'family': 'serif', 'weight': 'bold', 'size': 16, 'color' : 'green'}
fig, ax = plt.subplots()
ax.text(x, y, text, fontdict=fontdict)
x, y, text = .2, .2, "formatted with individual parameters"
ax.text(x, y, text, fontsize = 12, color = 'red', fontstyle = 'italic')</pre>



<pre class="wp-block-preformatted">Text(0.2, 0.2, 'formatted with individual parameters')</pre>



<figure class="wp-block-image size-large"><img decoding="async" width="380" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-21.png" alt="" class="wp-image-30244" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-21.png 380w, https://blog.finxter.com/wp-content/uploads/2021/05/image-21-300x199.png 300w" sizes="(max-width: 380px) 100vw, 380px" /></figure>



<h2 class="wp-block-heading" id="How-to-change-the-text-alignment?">How to change the text alignment?</h2>



<p>We specify the <code>xy</code> coordinates for the text, but of course, the text can&#8217;t fit on a single point. So is the text centered on the point, or is the first letter in the text positioned on that point? Let&#8217;s see.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
ax.set_title("Different horizontal alignment options when x = .5")
ax.text(.5, .8, 'ha left', fontsize = 12, color = 'red', ha = 'left')
ax.text(.5, .6, 'ha right', fontsize = 12, color = 'green', ha = 'right')
ax.text(.5, .4, 'ha center', fontsize = 12, color = 'blue', ha = 'center')
ax.text(.5, .2, 'ha default', fontsize = 12)</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.2, 'ha default')</pre>



<figure class="wp-block-image size-large"><img decoding="async" width="380" height="264" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-22.png" alt="" class="wp-image-30245" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-22.png 380w, https://blog.finxter.com/wp-content/uploads/2021/05/image-22-300x208.png 300w" sizes="(max-width: 380px) 100vw, 380px" /></figure>



<p>The text is horizontally left-aligned by default. Left alignment positions the beginning of the text on the specified coordinates, center alignment positions the middle of the text on the xy coordinates, and right alignment positions the end of the text on the coordinates.</p>
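<p>Vertical alignment works the same way through the <code>va</code> parameter, which accepts <code>'top'</code>, <code>'bottom'</code>, <code>'center'</code>, and <code>'baseline'</code> (the default). The following sketch is our own illustration &#8211; the coordinates and labels are invented for demonstration, not taken from the figures above:</p>

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_title("Different vertical alignment options when y = .5")
ax.axhline(.5, color='lightgray')  # reference line at y = .5
ax.text(.2, .5, 'va top', fontsize=12, color='red', va='top')
ax.text(.4, .5, 'va bottom', fontsize=12, color='green', va='bottom')
ax.text(.6, .5, 'va center', fontsize=12, color='blue', va='center')
ax.text(.8, .5, 'va baseline', fontsize=12, va='baseline')
```

<p>With <code>va='top'</code> the text hangs below the reference line, and with <code>va='bottom'</code> it sits above it.</p>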



<h2 class="wp-block-heading" id="Creating-a-text-box">Creating a text box</h2>



<p>The <code>fontdict</code> <a href="https://blog.finxter.com/python-dictionary/" target="_blank" rel="noreferrer noopener" title="Python Dictionary – The Ultimate Guide">dictionary </a>object allows you to customize the font. Similarly, passing the <code>bbox</code> dictionary object allows you to set the properties for a box around the text. Color values between 0 and 1 determine the shade of gray, with 0 being totally black and 1 being totally white. We can also use <code>boxstyle</code> to determine the shape of the box. If the <code>facecolor</code> is too dark, it can be lightened by moving the <code>alpha</code> value closer to 0.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, text = .5, .7, "Text in grey box with\nrectangular box corners."
ax.text(x, y, text,bbox={'facecolor': '.9', 'edgecolor':'blue', 'boxstyle':'square'})
x, y, text = .5, .5, "Text in blue box with\nrounded corners and alpha of .1."
ax.text(x, y, text,bbox={'facecolor': 'blue', 'edgecolor':'none', 'boxstyle':'round', 'alpha' : 0.05})
x, y, text = .1, .3, "Text in a circle.\nalpha of .5 darker\nthan alpha of .1"
ax.text(x, y, text,bbox={'facecolor': 'blue', 'edgecolor':'black', 'boxstyle':'circle', 'alpha' : 0.5})</pre>



<pre class="wp-block-preformatted">Text(0.1, 0.3, 'Text in a circle.\nalpha of .5 darker\nthan alpha of .1')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="380" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-31.png" alt="" class="wp-image-30254" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-31.png 380w, https://blog.finxter.com/wp-content/uploads/2021/05/image-31-300x199.png 300w" sizes="auto, (max-width: 380px) 100vw, 380px" /></figure>



<h2 class="wp-block-heading" id="Basic-annotate-method-example">Basic annotate method example</h2>



<p>As we said earlier, often you&#8217;ll want the text to be below or above the point it&#8217;s labeling. We could do this with the text method, but annotate makes it easier to place text relative to a point. The annotate method allows us to specify two pairs of coordinates. One xy coordinate specifies the point we wish to label. Another xy coordinate specifies the position of the label itself. For example, here we plot a point at (.5,.5) but put the annotation a little higher, at (.5,.503).</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "annotation"
ax.set_title("Annotating point (.5,.5) with label located at (.5,.503)")
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.003))</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.503, 'annotation')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-29.png" alt="" class="wp-image-30252" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-29.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-29-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Annotate-with-an-arrow">Annotate with an arrow</h2>



<p>Okay, so we have a point at <code>xy</code> and an annotation at <code>xytext</code>. How can we connect the two? Can we draw an arrow from the annotation to the point? Absolutely! What we&#8217;ve done with annotate so far looks the same as if we&#8217;d just used the text method to put the text at (.5, .503). But annotate can also draw an arrow connecting the label to the point. The arrow is styled by passing a dictionary to <code>arrowprops</code>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "annotation"
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.003),arrowprops={'arrowstyle' : 'simple'})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.503, 'annotation')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-23.png" alt="" class="wp-image-30246" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-23.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-23-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Adjusting-the-arrow-length">Adjusting the arrow length</h2>



<p>It looks a little weird to have the arrow touch the point. How can we have the arrow go close to the point, but not quite touch it? Again, styling options are passed in a dictionary object. Larger values of <code>shrinkA</code> will move the tail farther from the label, and larger values of <code>shrinkB</code> will move the head farther from the point. The default for both <code>shrinkA</code> and <code>shrinkB</code> is 2, so by setting <code>shrinkB</code> to 5 we move the head of the arrow farther from the point.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "annotation"
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.003),arrowprops={'arrowstyle' : 'simple', 'shrinkB' : 5})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.503, 'annotation')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-24.png" alt="" class="wp-image-30247" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-24.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-24-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Does-the-annotate-method-have-the-same-styling-options-that-the-text-method-has?">Do the annotate and text methods have the same styling options?</h2>



<p>Yes, all the parameters that work with text will also work with annotate. So, for example, we can put the annotation in a text box and set the <code>fontstyle</code> as italic, the same way as we did above.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, text = .5, .7, "Italic text in grey box with\nrectangular box corner\ndemonstrating that the\nformatting options\nthat work with text\nalso work with annotate."
ax.scatter(x,y)
ax.annotate(text, xy=(x,y),xytext=(x,y+.01)
            ,fontstyle = 'italic'
            ,bbox={'facecolor': '.9', 'edgecolor':'blue', 'boxstyle':'square', 'alpha' : 0.5}
            ,arrowprops={'arrowstyle' : 'simple', 'shrinkB' : 5})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.71, 'Italic text in grey box with\nrectangular box corner\ndemonstrating that the\nformatting options\nthat work with text\nalso work with annotate.')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-30.png" alt="" class="wp-image-30253" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-30.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-30-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Are-there-any-shorthands-for-styling-the-arrow?">Are there any shorthands for styling the arrow?</h2>



<p>Yes, <code>arrowstyle</code> can be used instead of the other styling keys. More options are listed <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.patches.ArrowStyle.html?highlight=arrowstyle#matplotlib.patches.ArrowStyle" target="_blank" rel="noreferrer noopener">here</a>, including <code>'fancy'</code>, <code>'simple'</code>, <code>'-'</code> and <code>'-&gt;'</code>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "wedge style"
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.01),arrowprops={'arrowstyle':'wedge'})
another_annotation = '- style'
ax.annotate(another_annotation,xy=(x,y),xytext=(x,y-.01),arrowprops={'arrowstyle':'-'})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.49, '- style')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-27.png" alt="" class="wp-image-30250" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-27.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-27-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="How-can-we-annotate-all-the-points-on-a-scatter-plot?">How can we annotate all the points on a scatter plot?</h2>



<p>We can first create 15 test points with associated labels. Then loop through the points and use the annotate method at each point to add a label.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import random
random.seed(2)

x = range(15)
y = [element * (2 + random.random()) for element in x]
n = ['label for ' + str(i) for i in x]

fig, ax = plt.subplots()
ax.scatter(x, y)

texts = []
for i, txt in enumerate(n):
    ax.annotate(txt, xy=(x[i], y[i]), xytext=(x[i],y[i]+.3))</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="410" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-25.png" alt="" class="wp-image-30248" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-25.png 410w, https://blog.finxter.com/wp-content/uploads/2021/05/image-25-300x184.png 300w" sizes="auto, (max-width: 410px) 100vw, 410px" /></figure>



<h2 class="wp-block-heading" id="Handling-overlapping-annotations">Handling overlapping annotations</h2>



<p>The annotations are overlapping each other. How do we prevent that? You could manually adjust the location of each label, but that would be very time-consuming. Luckily the Python library <a href="https://github.com/Phlya/adjustText" target="_blank" rel="noreferrer noopener" title="https://github.com/Phlya/adjustText">adjustText </a>will do the work for us. You&#8217;ll have to <a href="https://blog.finxter.com/how-to-install-a-python-package-with-a-whl-file/" target="_blank" rel="noreferrer noopener" title="How to Install a Python Package with a .whl File?">pip install</a> it first, and we&#8217;ll need to store the annotations in a list so that we can pass them as an argument to <code>adjust_text</code>. Doing this, we see for example that &#8220;label for 6&#8221; is shifted to the left so that it no longer overlaps with &#8220;label for 7.&#8221;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from adjustText import adjust_text

fig, ax = plt.subplots()
ax.scatter(x, y)

texts = []
for i, txt in enumerate(n):
    texts.append(ax.annotate(txt, xy=(x[i], y[i]), xytext=(x[i],y[i]+.3)))
    
adjust_text(texts)</pre>



<pre class="wp-block-preformatted">226</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="368" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-26.png" alt="" class="wp-image-30249" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-26.png 368w, https://blog.finxter.com/wp-content/uploads/2021/05/image-26-300x205.png 300w" sizes="auto, (max-width: 368px) 100vw, 368px" /></figure>



<h2 class="wp-block-heading" id="Conclusion">Conclusion</h2>



<p>You should now be able to position and format text and annotations on your plots. Thanks for reading! Please check out my other work at <a href="https://www.learningtableau.com" target="_blank" rel="noreferrer noopener">LearningTableau</a>, <a href="https://www.powerbiskills.com" target="_blank" rel="noreferrer noopener">PowerBISkills</a>, and <a href="https://www.datasciencedrills.com" target="_blank" rel="noreferrer noopener">DataScienceDrills</a>.</p>
<p>The post <a href="https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/">Matplotlib Text and Annotate — A Simple Guide</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Logistic Regression Scikit-learn vs Statsmodels</title>
		<link>https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/</link>
		
		<dc:creator><![CDATA[Lukas Halim]]></dc:creator>
		<pubDate>Fri, 05 Feb 2021 15:44:50 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=22984</guid>

					<description><![CDATA[<p>What’s the difference between Statsmodels and Scikit-learn? Both have ordinary least squares and logistic regression, so it seems like Python is giving us two ways to do the same thing. Statsmodels offers modeling from the perspective of statistics. Scikit-learn offers some of the same models from the perspective of machine learning. So we need to ... <a title="Logistic Regression Scikit-learn vs Statsmodels" class="read-more" href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/" aria-label="Read more about Logistic Regression Scikit-learn vs Statsmodels">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/">Logistic Regression Scikit-learn vs Statsmodels</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>What’s the difference between Statsmodels and Scikit-learn? Both have ordinary least squares and logistic regression, so it seems like Python is giving us two ways to do the same thing. Statsmodels offers modeling from the perspective of <em>statistics</em>. Scikit-learn offers some of the same models from the perspective of <em>machine learning</em>.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Logistic Regression Scikit-learn vs Statsmodels" width="937" height="527" src="https://www.youtube.com/embed/inZpIyBm2Us?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>So we need to understand the difference between statistics and machine learning! Statistics makes mathematically valid inferences about a population based on sample data. Statistics answers the question, &#8220;What is the evidence that X is related to Y?&#8221; Machine learning has the goal of optimizing predictive accuracy rather than inference. Machine learning answers the question, &#8220;Given X, what prediction should we make for Y?&#8221;</p>



<p>In the example below, we&#8217;ll create a fake dataset with predictor variables and a binary Y variable. Then we&#8217;ll perform logistic regression with scikit-learn and statsmodels. We&#8217;ll see that scikit-learn allows us to easily tune the model to optimize predictive power. Statsmodels will provide a summary of statistical measures which will be very familiar to those who&#8217;ve used SAS or R.</p>



<p>If you need an intro to Logistic Regression, see <a href="https://blog.finxter.com/logistic-regression-in-one-line-python/" target="_blank" rel="noreferrer noopener">this Finxter post</a>.</p>



<h2 class="wp-block-heading" id="Create-Fake-Data-for-the-Logistic-Regression-Model">Create Fake Data for the Logistic Regression Model</h2>



<p>I tried using some publicly available data for this exercise but didn&#8217;t find one with the characteristics I wanted. So I decided to create some fake data by using <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">NumPy</a>! There&#8217;s a post <a href="https://data.library.virginia.edu/simulating-a-logistic-regression-model/" target="_blank" rel="noreferrer noopener">here</a> that explains the math and how to do this in R.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
import pandas as pd

#The next line is setting the seed for the random number generator so that we get consistent results
rg = np.random.default_rng(seed=0)
#Create an array with 500 rows and 3 columns
X_for_creating_probabilities = rg.normal(size=(500,3))</pre>



<p>Create an array with the first column removed. The deleted column can be thought of as random noise, or as a variable that we don&#8217;t have access to when creating the model.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X1 = np.delete(X_for_creating_probabilities,0,axis=1)
X1[:5]
"""
array([[-0.13210486,  0.64042265],
       [-0.53566937,  0.36159505],
       [ 0.94708096, -0.70373524],
       [-0.62327446,  0.04132598],
       [-0.21879166, -1.24591095]])
"""</pre>



<p>Now we&#8217;ll create two more columns correlated with X1. Datasets often have highly correlated variables. Correlation increases the likelihood of overfitting. Concatenate to get a single array.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X2 = X1 + .1 * np.random.normal(size=(500,2))
X_predictors = np.concatenate((X1,X2),axis=1)</pre>



<p>We want to create our outcome variable and have it be related to X_predictors. To do that, we use our data as inputs to the logistic regression model to get probabilities. Then we set the outcome variable, Y, to True when the probability is above .5.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">P = 1 / (1 + np.e**(-np.matmul(X_for_creating_probabilities,[1,1,1])))
Y = P > .5
#About half of cases are True
np.mean(Y)
#0.498</pre>



<p>Now divide the data into training and test data. We&#8217;ll run a logistic regression on the training data, then see how well the model performs on the test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">#Set the first 50 rows to train the model
X_train = X_predictors[:50]
Y_train = Y[:50]

#Set the remaining rows to test the model
X_test = X_predictors[50:]
Y_test = Y[50:]

print(f"X_train: {len(X_train)} X_test: {len(X_test)}")
#X_train: 50 X_test: 450</pre>



<h2 class="wp-block-heading" id="Logistic-regression-with-Scikit-learn">Logistic regression with Scikit-learn</h2>



<p>We&#8217;re ready to train and test models.</p>



<p>As we train the models, we need to take steps to avoid overfitting. A machine learning model may have very accurate results with the data used to train the model. But this does not mean it will be equally accurate when making predictions with data it hasn&#8217;t seen before. When the model fails to generalize to new data, we say it has &#8220;overfit&#8221; the training data. Overfitting is more likely when there are few observations to train on, and when the model uses many correlated predictors.</p>



<p>How to avoid overfitting? By default, <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">scikit-learn</a>&#8216;s logistic regression applies regularization. Regularization balances the need for predictive accuracy on the training data with a penalty on the magnitude of the model coefficients. Increasing the penalty reduces the coefficients and hence reduces the likelihood of overfitting. If the penalty is too large, though, it will reduce predictive power on both the training and test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_default = LogisticRegression(random_state=0).fit(X_train, Y_train)
print(f"intercept: {scikit_default.intercept_} coefficients: {scikit_default.coef_}")
print(f"train accuracy: {scikit_default.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_default.score(X_test, Y_test)}")
"""
Results will vary slightly, even when you set random_state.
intercept: [-0.44526823] coefficients: [[0.50031563 0.79636504 0.82047214 0.83635656]]
train accuracy: 0.8
test accuracy: 0.8088888888888889
"""</pre>



<p>We can turn off regularization by setting the penalty to none. Applying regularization reduces the magnitude of the coefficients, so setting the penalty to none increases them. Notice that the accuracy on the test data decreases. This indicates that our model has overfit the training data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_no_penalty = LogisticRegression(random_state=0,penalty='none').fit(X_train, Y_train)
print(f"intercept: {scikit_no_penalty.intercept_} coefficients: {scikit_no_penalty.coef_}")
print(f"train accuracy: {scikit_no_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_no_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.63388911] coefficients: [[-3.59878438  0.70813119  5.10660019  1.29684873]]
train accuracy: 0.82
test accuracy: 0.7888888888888889
"""</pre>



<p>The parameter C, which controls the inverse of the regularization strength, is 1.0 by default. Smaller values of C increase the regularization, so if we set the value to .1 we reduce the magnitude of the coefficients.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_bigger_penalty = LogisticRegression(random_state=0,C=.1).fit(X_train, Y_train)
print(f"intercept: {scikit_bigger_penalty.intercept_} \
    coefficients: {scikit_bigger_penalty.coef_}")
print(f"train accuracy: {scikit_bigger_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_bigger_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.13102803]     coefficients: [[0.3021235  0.3919277  0.34359251 0.40332636]]
train accuracy: 0.8
test accuracy: 0.8066666666666666
"""</pre>



<p>It&#8217;s nice to be able to adjust the regularization parameter, but how do we decide its optimal value? Scikit-learn&#8217;s GridSearchCV provides an effective yet easy-to-use method for choosing an optimal value. The &#8220;Grid Search&#8221; in <strong>GridSearch</strong>CV means that we supply a <a href="https://blog.finxter.com/python-dictionary/" target="_blank" rel="noreferrer noopener" title="Python Dictionary – The Ultimate Guide">dictionary </a>with the parameter values we wish to test. The model is fit with all combinations of those values. If we have 4 possible values for C and 2 possible values for solver, we will search through all 4&#215;2=8 combinations.</p>



<h3 class="wp-block-heading" id="GridSearchCV-Searches-Through-This-Grid">GridSearchCV Searches Through This Grid</h3>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>C</th><th>solver</th></tr></thead><tbody><tr><td>.01</td><td>newton-cg</td></tr><tr><td>.1</td><td>newton-cg</td></tr><tr><td>1</td><td>newton-cg</td></tr><tr><td>10</td><td>newton-cg</td></tr><tr><td>.01</td><td>lbfgs</td></tr><tr><td>.1</td><td>lbfgs</td></tr><tr><td>1</td><td>lbfgs</td></tr><tr><td>10</td><td>lbfgs</td></tr></tbody></table></figure>



<p>The &#8220;CV&#8221; in GridSearch<strong>CV</strong> stands for <strong>c</strong>ross-<strong>v</strong>alidation. Cross-validation is a method of segmenting the training data. The model is trained on all but one of the segments, and the remaining segment is used to validate the model.</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Iteration</th><th>Segment 1</th><th>Segment 2</th><th>Segment 3</th><th>Segment 4</th><th>Segment 5</th></tr></thead><tbody><tr><td>1st Iteration</td><td>Validation</td><td>Train</td><td>Train</td><td>Train</td><td>Train</td></tr><tr><td>2nd Iteration</td><td>Train</td><td>Validation</td><td>Train</td><td>Train</td><td>Train</td></tr><tr><td>3rd Iteration</td><td>Train</td><td>Train</td><td>Validation</td><td>Train</td><td>Train</td></tr><tr><td>4th Iteration</td><td>Train</td><td>Train</td><td>Train</td><td>Validation</td><td>Train</td></tr><tr><td>5th Iteration</td><td>Train</td><td>Train</td><td>Train</td><td>Train</td><td>Validation</td></tr></tbody></table></figure>
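<p>The iteration scheme in the table above is easy to sketch in plain Python. The function below is a toy illustration of how the segments are carved up; scikit-learn&#8217;s KFold performs the same bookkeeping for you, so this is for intuition only:</p>

```python
# Toy illustration of 5-fold cross-validation segmentation.
# Each iteration holds out one segment for validation and trains on the rest;
# scikit-learn's KFold handles this bookkeeping internally.

def five_fold_segments(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of the k iterations."""
    indices = list(range(n_rows))
    fold_size = n_rows // k
    for i in range(k):
        validation = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, validation

# With 50 training rows, each iteration validates on a different 10-row segment.
for train, validation in five_fold_segments(50):
    print(len(train), len(validation))  # 40 10, five times
```

<p>Every row serves as validation data exactly once across the five iterations, so no data is wasted.</p>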






<p>Grid search and cross-validation work in combination. GridSearchCV iterates through the values of C and solver over different training and validation segments. The algorithm selects the best estimator based on its performance on the validation segments.</p>



<p>Doing this allows us to determine which values of C and solver work best for our training data. This is how <a href="https://blog.finxter.com/deploying-a-machine-learning-model-in-fastapi/" target="_blank" rel="noreferrer noopener" title="Deploying a machine learning model in FastAPI">scikit-learn</a> helps us to optimize predictive accuracy.</p>



<p>Let&#8217;s see it in action.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.model_selection import GridSearchCV
parameters = {'C':[.01, .1, 1, 10],'solver':['newton-cg','lbfgs']}
Logistic = LogisticRegression(random_state=0)
scikit_GridSearchCV = GridSearchCV(Logistic, parameters)
scikit_GridSearchCV.fit(X_train, Y_train)
print(f"best estimator: {scikit_GridSearchCV.best_estimator_}")
#best estimator: LogisticRegression(C=0.1, random_state=0, solver='newton-cg')</pre>



<p>The score method returns the mean accuracy on the given test data and labels. Accuracy is the percentage of observations correctly predicted.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(f"train accuracy: {scikit_GridSearchCV.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_GridSearchCV.score(X_test, Y_test)}")
"""
train accuracy: 0.82
test accuracy: 0.8133333333333334
"""</pre>



<h2 class="wp-block-heading" id="Logistic-regression-with-Statsmodels">Logistic regression with Statsmodels</h2>



<p>Now let&#8217;s try the same, but with statsmodels. With scikit-learn, to turn off regularization we set <code>penalty='none'</code>, but with statsmodels regularization is turned off by default. A quirk to watch out for is that statsmodels does not include an intercept by default. To include an intercept, we use the <code>sm.add_constant</code> method.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import statsmodels.api as sm

#adding constant to X
X_train_with_constant = sm.add_constant(X_train)
X_test_with_constant = sm.add_constant(X_test)

# building the model and fitting the data
sm_model_all_predictors = sm.Logit(Y_train, X_train_with_constant).fit()

# printing the summary table
print(sm_model_all_predictors.params)
"""
Optimization terminated successfully.
         Current function value: 0.446973
         Iterations 7
[-0.57361523 -2.00207425  1.28872367  3.53734636  0.77494424]
"""</pre>



<p>If you&#8217;re used to doing logistic regression in R or SAS, what comes next will be familiar. Once we have trained the logistic regression model with statsmodels, the summary method will easily produce a table with statistical measures including p-values and confidence intervals.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sm_model_all_predictors.summary()</pre>



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><th>Dep. Variable:</th><td>y</td><th>No. Observations:</th><td>50</td></tr><tr><th>Model:</th><td>Logit</td><th>Df Residuals:</th><td>45</td></tr><tr><th>Method:</th><td>MLE</td><th>Df Model:</th><td>4</td></tr><tr><th>Date:</th><td>Thu, 04 Feb 2021</td><th>Pseudo R-squ.:</th><td>0.3846</td></tr><tr><th>Time:</th><td>14:33:19</td><th>Log-Likelihood:</th><td>-21.228</td></tr><tr><th>converged:</th><td>True</td><th>LL-Null:</th><td>-34.497</td></tr><tr><th>Covariance Type:</th><td>nonrobust</td><th>LLR p-value:</th><td>2.464e-05</td></tr></tbody></table></figure>



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><th>coef</th><th>std err</th><th>z</th><th>P&gt;|z|</th><th>[0.025</th><th>0.975]</th></tr><tr><th>const</th><td>-0.7084</td><td>0.478</td><td>-1.482</td><td>0.138</td><td>-1.645</td><td>0.228</td></tr><tr><th>x1</th><td>5.5486</td><td>4.483</td><td>1.238</td><td>0.216</td><td>-3.237</td><td>14.335</td></tr><tr><th>x2</th><td>10.2566</td><td>5.686</td><td>1.804</td><td>0.071</td><td>-0.887</td><td>21.400</td></tr><tr><th>x3</th><td>-3.9137</td><td>4.295</td><td>-0.911</td><td>0.362</td><td>-12.333</td><td>4.505</td></tr><tr><th>x4</th><td>-7.8510</td><td>5.364</td><td>-1.464</td><td>0.143</td><td>-18.364</td><td>2.662</td></tr></tbody></table></figure>



<p>There&#8217;s a lot here, but we&#8217;ll focus on the second table with the coefficients.</p>



<p>The first column shows the value of the coefficient. The fourth column, with the heading P&gt;|z|, shows the p-values. A p-value is a probability measure, and p-values above .05 are frequently considered &#8220;not statistically significant.&#8221; None of the predictors here are considered statistically significant! This is because we have a relatively small number of observations in our training data and because the predictors are highly correlated. Some statistical packages like R and SAS have built-in methods to select the features to include in the model based on which predictors have low (significant) p-values, but unfortunately, this isn&#8217;t available in statsmodels.</p>
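<p>If you want something like R&#8217;s or SAS&#8217;s selection, backward elimination by p-value is straightforward to loop yourself. The sketch below uses made-up p-values and skips the refitting step for brevity; in practice you would refit the statsmodels model after each drop and reread the <code>pvalues</code> attribute of the results object:</p>

```python
# Hedged sketch of backward elimination by p-value. The p-values below are
# made up for illustration; a real loop would refit the statsmodels model
# after each drop and read fresh p-values from the new results object.

def backward_eliminate(pvalues, threshold=0.05):
    """Repeatedly drop the predictor with the largest p-value above threshold."""
    kept = dict(pvalues)
    while kept:
        worst = max(kept, key=kept.get)
        if kept[worst] <= threshold:
            break  # every remaining predictor is significant
        del kept[worst]  # real version: refit the model here before continuing
    return list(kept)

# Hypothetical p-values resembling a first fit with correlated predictors:
example_pvalues = {'x1': 0.216, 'x2': 0.071, 'x3': 0.362, 'x4': 0.143}
print(backward_eliminate(example_pvalues))
```

<p>Because this sketch never refits, correlated predictors can all be eliminated even when a smaller model would make some of them significant, which is exactly why the refitting step matters in practice.</p>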



<p>If we try again with just x1 and x2, we&#8217;ll get a completely different result, with very low p-values for x1 and x2, meaning that the evidence for a relationship with the dependent variable is statistically significant. We&#8217;re cheating, though &#8211; because we created the data, we know that we only need x1 and x2.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sm_model_x1_x2 = sm.Logit(Y_train, X_train_with_constant[:,:3]).fit()
sm_model_x1_x2.summary()</pre>



<p>Now we see x1 and x2 are both statistically significant.</p>



<p>Statsmodels doesn&#8217;t have the same accuracy method that we have in scikit-learn. We&#8217;ll use the predict method to predict the probabilities. Then we&#8217;ll use the decision rule that probabilities above .5 are true and all others are false. This is the same rule used when scikit-learn calculates accuracy.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">all_predicted_train = sm_model_all_predictors.predict(X_train_with_constant)>.5
all_predicted_test = sm_model_all_predictors.predict(X_test_with_constant)>.5

x1_x2_predicted_train = sm_model_x1_x2.predict(X_train_with_constant[:,:3])>.5
x1_x2_predicted_test = sm_model_x1_x2.predict(X_test_with_constant[:,:3])>.5

#calculate the accuracy
print(f"train: {(Y_train==all_predicted_train).mean()} and test: {(Y_test==all_predicted_test).mean()}")
print(f"train: {(Y_train==x1_x2_predicted_train).mean()} and test: {(Y_test==x1_x2_predicted_test).mean()}")
"""
train: 0.8 and test: 0.8066666666666666
train: 0.8 and test: 0.8111111111111111
"""</pre>



<h2 class="wp-block-heading" id="Summarizing-The-Results">Summarizing The Results</h2>



<p>Let&#8217;s create a <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" target="_blank" rel="noreferrer noopener" title="How to Create a DataFrame in Pandas?">DataFrame </a>with the results. The models have identical accuracy on the training data, but different results on the test data. The models with all the predictors and without regularization have the worst test accuracy, suggesting that they have overfit on the training data and so do not generalize well to new data.</p>



<p>Even if we use the best methods in creating our model, there is still chance involved in how well it generalizes to the test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">lst = [['scikit-learn','default', scikit_default.score(X_train, Y_train),scikit_default.score(X_test, Y_test)],
       ['scikit-learn','no penalty', scikit_no_penalty.score(X_train, Y_train),scikit_no_penalty.score(X_test, Y_test)],
       ['scikit-learn','bigger penalty', scikit_bigger_penalty.score(X_train, Y_train),scikit_bigger_penalty.score(X_test, Y_test)],
       ['scikit-learn','GridSearchCV', scikit_GridSearchCV.score(X_train, Y_train),scikit_GridSearchCV.score(X_test, Y_test)],
       ['statsmodels','include intercept and all predictors', (Y_train==all_predicted_train).mean(),(Y_test==all_predicted_test).mean()],
       ['statsmodels','include intercept and x1 and x2', (Y_train==x1_x2_predicted_train).mean(),(Y_test==x1_x2_predicted_test).mean()]
      ]
df = pd.DataFrame(lst, columns =['package', 'setting','train accuracy','test accuracy'])
df</pre>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th></th><th>package</th><th>setting</th><th>train accuracy</th><th>test accuracy</th></tr></thead><tbody><tr><th>0</th><td>scikit-learn</td><td>default</td><td>0.80</td><td>0.808889</td></tr><tr><th>1</th><td>scikit-learn</td><td>no penalty</td><td>0.78</td><td>0.764444</td></tr><tr><th>2</th><td>scikit-learn</td><td>bigger penalty</td><td>0.82</td><td>0.813333</td></tr><tr><th>3</th><td>scikit-learn</td><td>GridSearchCV</td><td>0.80</td><td>0.808889</td></tr><tr><th>4</th><td>statsmodels</td><td>include intercept and all predictors</td><td>0.78</td><td>0.764444</td></tr><tr><th>5</th><td>statsmodels</td><td>include intercept and x1 and x2</td><td>0.80</td><td>0.811111</td></tr></tbody></table></figure>



<h2 class="wp-block-heading" id="Scikit-learn-vs-Statsmodels">Scikit-learn vs Statsmodels</h2>



<p>The upshot is that you should use scikit-learn for logistic regression unless you need the statistical results provided by statsmodels.</p>



<p>Here&#8217;s a table of the most relevant similarities and differences:</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th></th><th>Scikit-learn</th><th>Statsmodels</th></tr></thead><tbody><tr><td>Regularization</td><td>Uses L2 regularization by default, but regularization can be turned off using penalty=&#8217;none&#8217;</td><td>Does not use regularization by default</td></tr><tr><td>Hyperparameter tuning</td><td>GridSearchCV allows for easy tuning of regularization parameter</td><td>User will need to write lines of code to tune regularization parameter</td></tr><tr><td>Intercept</td><td>Includes intercept by default</td><td>Use the add_constant method to include an intercept</td></tr><tr><td>Model Evaluation</td><td>The score method reports prediction accuracy</td><td>The summary method shows p-values, confidence intervals, and other statistical measures</td></tr><tr><td>When should you use it?</td><td>For accurate predictions</td><td>For statistical inference.</td></tr><tr><td>Comparison with R and SAS</td><td>Different</td><td>Similar</td></tr></tbody></table></figure>



<p>That&#8217;s it for now! Please check out my other work at <a href="http://learningtableau.com" target="_blank" rel="noreferrer noopener">learningtableau.com</a> and my new site <a href="http://datasciencedrills.com" target="_blank" rel="noreferrer noopener">datasciencedrills.com</a>.</p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/">Logistic Regression Scikit-learn vs Statsmodels</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Execute Python from Tableau with TabPy</title>
		<link>https://blog.finxter.com/execute-python-from-tableau-with-tabpy/</link>
		
		<dc:creator><![CDATA[Lukas Halim]]></dc:creator>
		<pubDate>Sun, 13 Dec 2020 16:26:21 +0000</pubDate>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Scripting]]></category>
		<category><![CDATA[Tableau]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=18376</guid>

					<description><![CDATA[<p>Are you trying to understand how to call Python code from Tableau? Maybe you tried other online resources but ran into frustrating errors. This TabPy tutorial will show you how to get the TabPy installed and setup, and will get you running Python code in Tableau. Installing Tableau Desktop If you need Tableau Desktop, you ... <a title="Execute Python from Tableau with TabPy" class="read-more" href="https://blog.finxter.com/execute-python-from-tableau-with-tabpy/" aria-label="Read more about Execute Python from Tableau with TabPy">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/execute-python-from-tableau-with-tabpy/">Execute Python from Tableau with TabPy</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Are you trying to understand how to call Python code from Tableau? Maybe you tried other online resources but ran into frustrating errors. This TabPy tutorial will show you how to get TabPy installed and set up, and will get you running Python code in Tableau.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Execute Python from Tableau with TabPy" width="937" height="527" src="https://www.youtube.com/embed/OReJXfnjTZ0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Installing Tableau Desktop</h2>



<p>If you need Tableau Desktop, you can get a 14-day trial here: <a href="https://www.tableau.com/products/desktop/download">https://www.tableau.com/products/desktop/download</a></p>



<p><strong>Note</strong>: Tableau Public, the free license version of Tableau, <em>does not</em> support Python integration.</p>



<h2 class="wp-block-heading">TabPy Installation</h2>



<p>Reading the documentation, this should be as simple as:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install tabpy</pre>



<p>Perhaps this will be all you need to get TabPy installed. But when I tried, the install failed due to a failure to install one of the dependencies, a Python package called Twisted. A search on StackOverflow leads to this solution (<a href="https://stackoverflow.com/questions/36279141/pip-doesnt-install-twisted-on-windows" target="_blank" rel="noreferrer noopener">https://stackoverflow.com/questions/36279141/pip-doesnt-install-twisted-on-windows</a>) and to this unofficial Windows binary available at (<a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted" target="_blank" rel="noreferrer noopener">http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted</a>). I downloaded the appropriate binary for my version of Python, navigated to the download directory, and installed with this command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl</pre>



<p>That installed Twisted, and I was then able to install TabPy as expected.</p>



<h2 class="wp-block-heading">TabPy Setup</h2>



<p>With TabPy installed, starting the TabPy server can be done from the command prompt:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">TabPy</pre>



<p>You should see a message like the one below, telling you that the web service is listening on port 9004:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/Q3ZqZsUpFnVsLYxvCk3AkmJKiooU-ZkKGY30gqWRZ7WHyztzIDYu5gqzOhJocLqyjTUITegMBRoVMOGGK5kZFpzKKIzwktITcdd3V_36Q1SyTExbLmxL6eVqb0AcpZ5NAKAuoid6" alt=""/></figure></div>



<p>With TabPy running, start Tableau Desktop.</p>



<p>In Tableau Desktop, click <strong>Help</strong> on the toolbar, then <strong>Settings and Performance &gt; Manage Analytics Extension Connection</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/0Lxe_JP9CQnNkNVTE_Fc0vuOyFei15E-n_jcfKVVAXX6eCQF_2qAk_1aSEwfSybfF90AVml6LoBwz8XhYQ2Y9byiVL8vCIHtQ19glZjvQ_Lu4JIcU2gKE7L0XpcNPyX2DiMwCqYt" alt=""/></figure></div>






<p>Then select TabPy/External API, select localhost for the server, and set the port to 9004.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/KWUwtyhtY_vfOiLuP81b2lwdlaLoX92DdC6l2YwIaecKKned25aCiJgkG6hFllE2U6uw4DCtEBQCHQuDkJkIKXx7rYIwIVxpOikiS-LoB_4C4KdLflSggL_tsxqkJrKRUHHVwXs1" alt=""/></figure></div>



<h2 class="wp-block-heading">TabPy Examples</h2>



<p>The first example shows how to use a <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">NumPy </a>function on aggregated data to calculate the Pearson correlation coefficient.</p>



<p>The second example shows how to use a TabPy deployed function to do a t-test on disaggregated data.</p>



<h3 class="wp-block-heading">Example &#8211; Correlation on Aggregated Data</h3>



<p>We have TabPy running and Tableau’s analytics extension configured. Now we’ll call Python code from Tableau.</p>



<p>Download the data on the wages and education of young males (<a href="https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Males.csv">https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Males.csv</a>) and open it using the Connect to Text File option.&nbsp;</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/6A7TiltwqXOq5b_n6XUgxKKG1QpO0P-iuCOpewa-G5sYG251SkvkzbxF-hxe1ORyprU4bzZYq75ohmNfdytiNTuX_spdfqHDD0p6jN4WJiE0Xf-aTeM1tVqv6vltxLGcx5kPVSAA" alt=""/></figure></div>



<p>Select Sheet1 to start a new worksheet.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/jYeVWrHGFuJvyaPTDgF5BXG1ItvqfiXL9PxYxb1yHVOl5_aiYeudQqlguM4BfjQc7Nxri6C1Fv3PLglMeUcdddnXdDaLZlrB_lY3aujsZN8mg4i-qYJ0qOCyhwdxZqn5Z_DAuHFY" alt=""/></figure></div>



<p>The Maried field is spelled without the second ‘r’, so right-click on the field and rename it to “Married.”</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/DGDGzIUEYHi03SDcI34smCppYPBg3Kq52erWAhKo5wsJL2U60ckxV1zIhRbJgdAHcJ5IRCC_oHFLFOBni_uMqmaUJdGr2z9Tf1OnvUvvoqe__0gtBUa-Yz1fgt1TYALdgvgqnAqB" alt=""/></figure></div>



<p>Drag “Married” and “Experience” to the row shelf, and double-click on Exper and Wage:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/q3_Q8eNRvfa_icmPWStIrKbFz1nMIXy278b8XOoaMVpLyW76xUq8DSzu3MATdROoqQ5P6ET1vVlNXPp4AlpOA5F_X3Kl-OPL5Ll1924Yu_0ij6VU8oth0mH4tw48a0ImaVq2mwOR" alt=""/></figure></div>



<p>Next, change SUM(Exper) to AVG(Exper) and SUM(Wage) to AVG(Wage):</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="627" height="404" src="https://blog.finxter.com/wp-content/uploads/2020/12/image-23.png" alt="" class="wp-image-18390" srcset="https://blog.finxter.com/wp-content/uploads/2020/12/image-23.png 627w, https://blog.finxter.com/wp-content/uploads/2020/12/image-23-300x193.png 300w, https://blog.finxter.com/wp-content/uploads/2020/12/image-23-150x97.png 150w" sizes="auto, (max-width: 627px) 100vw, 627px" /></figure></div>



<p>The view should now look like this:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/8I44Z1uWhAzCftfWrlMi1FQQUE2oToP7-jYdUNM_UnF6pzHNDdb-EoLlm2z4PVR1iLJ22HHEbmIax-ruHrindjAaizQwDUzKsgIl03u2EdQXR-6rB3n28NvcpHq4wuJ9YyBQxwed" alt=""/></figure></div>



<p>Now let’s add a calculation with some Python code! You can create a calculation by clicking on the Analysis tab on the toolbar and then “Create Calculated Field.”</p>



<p>Call the calculation “TabPy Corr” and use this expression:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">SCRIPT_REAL("import numpy as np
print(f'arg1_: {_arg1}')
print(f'arg2_: {_arg2}')
print(f'return: {np.corrcoef(_arg1,_arg2)[0,1]}')
return np.corrcoef(_arg1,_arg2)[0,1]",avg([Exper]),avg([Wage])
)
</pre>



<p>The print statements allow us to see the data exchange between Tableau and the TabPy server. Switch to the command prompt to see:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/KEUqTqNQLHiErHD4ygZZb5YQIem8l5h2qz7ELoLXys0tN0dz519U-gkAx6Fp3PJU_FU6nrApx1rQ9sbR_D4Se2lhTIRWZlhUTIrezK5D-UqE84BdWORqWu4cbIGeJZHEJOzuagdb" alt=""/></figure></div>



<p>Tableau is sending two <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener" title="The Ultimate Guide to Python Lists">lists</a>, <code>_arg1</code> and <code>_arg2</code>, to the TabPy server. <code>_arg1</code> is a list with the values from <code>avg([Exper])</code> and <code>_arg2</code> is a list with the values from <code>avg([Wage])</code>.</p>



<p>TabPy returns a single value representing the correlation of <code>avg([Exper])</code> and <code>avg([Wage])</code>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/Yke-cd0tV17atxSnD_COrmXpBI3NVXjvuOQGjgtBxR9lMLN7ejxPfoPQlZlarDf9G8-f6xoqwiO0Owxe7Cii11DrvX2LaV42NkxtFBaxnfynJKhUqsipNFtAoOEemu70ywdM2PUS" alt=""/></figure></div>



<p>We return <code>np.corrcoef(_arg1,_arg2)[0,1]</code> instead of just <code>np.corrcoef(_arg1,_arg2)</code> because <code>np.corrcoef(_arg1,_arg2)</code> returns a 2&#215;2 correlation matrix, but Tableau expects either a single value or a list of values with the same length as <code>_arg1</code> and <code>_arg2</code>. If we return the 2&#215;2 matrix, Tableau will give us the error message <code>&#8220;TypeError : Object of type ndarray is not JSON serializable&#8221;</code>.</p>
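<p>The [0,1] element of that matrix is simply the Pearson correlation coefficient between the two lists. For intuition, here is a pure-Python version of the same scalar, assuming two equal-length lists like the ones Tableau sends:</p>

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient: the scalar that
    np.corrcoef(xs, ys)[0, 1] extracts from the 2x2 matrix."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly correlated: 1.0
```

<p>Because the result is a single float, it satisfies Tableau&#8217;s requirement that a <code>SCRIPT_REAL</code> table calculation return either one value or a list matching the input length.</p>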



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh6.googleusercontent.com/fZmzi1rQizLlS4feoEOu6YsO4O5pE_yCuyxaJi6o75QrKWLGSuUAoga1jEWvsnkahHFOrSFatUHIyBLMTksYFOBgVgjNnRvZSTQu8SJVgzcDUU9SlKJBcZTgFQHnOfFOqPRENE19" alt=""/></figure></div>



<p>The functions used to communicate with the TabPy server, <code>SCRIPT_REAL, SCRIPT_INT, SCRIPT_BOOL</code> and <code>SCRIPT_STR</code> are “table calculations,” which means that the input parameters must be aggregated. For example, <code>AVG([Exper])</code> is an acceptable parameter, but <code>[Exper]</code> is not. Table calculations work not on the data in the underlying dataset (<code>Males.csv</code> for this example) but on the values aggregated to the level shown in the Tableau worksheet. Tableau sends TabPy lists with the aggregated values.</p>



<p>We use <code>SCRIPT_REAL</code> rather than one of the other <code>SCRIPT_*</code> functions because our function will return a <a href="https://blog.finxter.com/decimal-pythons-float-trap-and-how-to-solve-it/" target="_blank" rel="noreferrer noopener" title="Decimal: Python’s Float Trap and How to Solve it">float</a>. If, for example, the function was instead returning a string, we would use <code>SCRIPT_STR</code>.</p>



<p>One call is made from Tableau to TabPy for each partition in the table calculation. The default is Table(down), which uses a single partition for the entire table.</p>



<p>We can change the partition by selecting edit then table calculation:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/bzEN-sjBbKcE8Tb6GwS8xYnQ58LZR-Zah7cVX8Tv3K52drXcI0RHUd0Rti_ed_rSkz_M5G9yh0N2hCoJN88FxLKw-VE9zU2xFOEPi1HoEIR5ZIK6Xm0jFyLk0_vsAvBqW2DV72LG" alt=""/></figure></div>



<p>Currently, the Table Calculation is computed using Table(down), which means that Tableau goes down all of the rows in the Table. You can see that all of the values are highlighted in yellow.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/pcIqCuAkhMtrqHUXka702KlYsKs6bWldBoYNNbSnVFiFPR6tgyZqtCEchnCNiraFkcEXuQg1XbJIqN9GWl-B6agkMFVa8vpIIB813xgmGG4GqPSCYozTOBSo288tdiZ9Niczcq3N" alt=""/></figure></div>



<p>If we change from Table(down) to Pane(down), the table calculation will be done separately for each pane. The rows of the table are divided into two panes &#8211; one for <code>married=no</code> and another for <code>married=yes</code>. Therefore, there are two separate calls to TabPy, one for <code>married=no</code> and a second for <code>married=yes</code>. Each call gets a separate response.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/vRd1QXWSbTi1C8Q1LTMHHafKyqvw8c33TLOEvYZ33Jt_9QsxiONdr45mIwP6B-2C8bdpDFySdM0vEquIo6t_H3B4UEgYoMO30Xc3omwnVjbdfgbqXZUkjiMRtFpYdRRoFfjZDg5u" alt=""/></figure></div>



<p>We can see the exchange of data by switching back to the command prompt:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/w_QqDZPByKbvgA278cu_wfco6PFLnQg48CKu30mZqCJKHXb3mg7Ru9V4Wdbg-FANyU9nqjwcOvMig5uyE_8j1zWTT3mRW49FPG5ctdVnCsfbu_NuzIDKvVTimx19MvG1a7B3di-7" alt=""/></figure></div>



<p>The print statements show what is happening. The first call to TabPy represents the partition where married=no. Lists are sent with the average wage and experience values and the value returned is -0.3382. The second call represents the partition where married=yes, the related average wage and experience values are sent, and the function returns -0.0120. Tableau displays the results.</p>



<p>We called Python code from Tableau and used the results in our worksheet. Excellent!</p>



<p>But we could have done the same thing much more easily without Python by using Tableau’s <code>WINDOW_CORR</code> function:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/LqGkhJvjzn7ThBUm-lL8ikwVZ9CN-3ph8NvPaCcjiN1XMVtJquhiTmKkNNR8AgmSUU_tDtKnBe_E-Rfev2otpOdW4vPXr0KlfpTwXMwjvxZ-93H1yFyGRXPAyCgGeTBd3Y2NR37G" alt=""/></figure></div>



<p>We can add this to the view and see that it gives <em>the same results</em> using either Table(down) or Pane(down):</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/aLKS6q8uhcRtUtVh_W25xWeV5fWG3RUu4gjY5AuhDfWRHt05G61-z1Psn2t0kIt7CqMV9A8alu4pCJmUXkflQ95S2ZZBghqQEBKvgdd6gIlnfVoN0oGfKE7Qbr96ygW_Jim2Z7nh" alt=""/></figure></div>



<p>This example is great for understanding TabPy, but we don&#8217;t need Python to calculate correlation since Tableau already has <code>WINDOW_CORR</code> built-in.</p>



<h3 class="wp-block-heading">Example &#8211; Two-Sample T-Test on Disaggregated Data</h3>



<p>If our data represents a sample of the general male population, then we can use statistics to make inferences about the population based on our sample. For example, we might want to ask whether our sample gives evidence that males in the general population who are unionized have more experience than those who are not. The test for this is a two-sample t-test. You can learn more about it here: (<a href="https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing" target="_blank" rel="noreferrer noopener">https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing</a>).</p>



<p>Unlike correlation, Tableau does not have a built-in t-test, so we will use Python to run one.</p>



<p>But first, we will set up a new worksheet. The documentation here (<a href="https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#t-test" target="_blank" rel="noreferrer noopener">https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#t-test</a>) explains what we need to pass to the t-test function: <code>_arg1</code> with the years of experience and <code>_arg2</code> as the categorical variable that maps each observation to either sample1 (Union=yes) or sample2 (Union=no).</p>
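<p>To make this calling convention concrete, here is a pure-Python sketch (my own illustration of the convention, not TabPy&#8217;s actual <code>ttest.py</code>, which also returns a p-value). The hypothetical helper <code>two_sample_t</code> splits the <code>_arg1</code> values into two samples using the <code>_arg2</code> labels and computes the pooled two-sample t statistic:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statistics import mean, variance

def two_sample_t(values, labels):
    """Split _arg1-style values by _arg2-style labels; return the pooled t statistic."""
    groups = sorted(set(labels))
    assert len(groups) == 2, "expected exactly two group labels"
    a = [v for v, g in zip(values, labels) if g == groups[0]]
    b = [v for v, g in zip(values, labels) if g == groups[1]]
    # Pooled sample variance across both groups (n - 1 denominators)
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / len(a) + 1 / len(b))) ** 0.5

exper = [1, 4, 1, 3, 4, 5]
union = ['yes', 'yes', 'yes', 'no', 'no', 'no']
print(two_sample_t(exper, union))  # pooled t statistic, roughly 1.732</pre>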



<p>Let’s start by creating a new view with Union on the row shelf and <code>AVG(Exper)</code> on the column shelf:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/oU85eSpQB0dHc1Men7hQDTJLlAa-0bQUv8YK_2JWCfUC5krrb1k7t9XTwTH_Zl5SOGPFVCVPro2QAlY_RO8R6C_x9v5tH-v4UCQ-NAATQk9Ir-iQZ46ClIA9UCOXImivPu2qx20H" alt=""/></figure></div>



<p>Disaggregate the measures by unchecking Analysis &gt; Aggregate Measures:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/oD6HRkqgF6y8tZ67G3l3bf0J8LxWg2E6sIBq5-L6GqDGOZO3rUaCarm4sRmxbvIHYr41E0YVeUQ3W-1QWwTgBXM3NXYqRW-T6e7yBgyhjBqIKwLdkZasqGp5TjydJPfkZzPerugP" alt=""/></figure></div>



<p>With aggregate measures unchecked, <code>AVG(Exper)</code> should change to <code>Exper</code>. Use the “Show me” menu to change to a box-and-whisker plot:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/oqqFj1siT9kmLh03n6MQbuvV1RARKJBcANoVEQ-mG-usrVqPLTCOZ7HKdajBjZDbK-Idzh4Re0J_YwFacSUxRaY_escqxhs9ICN39i7MFH1bSC0x0cW5E9TiSbcPylHF4ZL4pXeu" alt=""/></figure></div>



<p>Our view is set, except for the t-test. The t-test is one of the models included with TabPy, explained here (<a href="https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#predeployed-functions" target="_blank" rel="noreferrer noopener">https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#predeployed-functions</a>). We need to run a command before we can run t-tests. With the TabPy server running, open a <em>second </em>command prompt and enter the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">tabpy-deploy-models</pre>



<p>You should see a result like this:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh6.googleusercontent.com/rGKiF5NpH2Z3025xh9ZQgp8oPfXLKDvJ7SpPIvdu2xtLe_MhER139jQfzpg5fTYb2EOhYVDWJabAA3wPxUgRYTrO5N5dTK5-nIl3ZKiXBckRi-a2RcttlCmcFhc8I9cSHGKlg4XL" alt=""/></figure></div>



<p>If it’s successful, you can now call anova, PCA, Sentiment Analysis, and t-tests from Tableau!</p>



<p>Create a new calculation, &#8220;Union Exper Ttest,&#8221; which will determine whether there is a statistically significant difference in average experience between unionized and non-unionized males.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">SCRIPT_REAL("print(f'unique values: {len(set(_arg2))}')
return tabpy.query('ttest',_arg1,_arg2)['response']"
,avg([Exper]),attr([Union]))</pre>



<p>Because <code>SCRIPT_REAL</code> is a table calculation, its parameters have to be aggregated (using <code>AVG</code> and <code>ATTR</code>). But since &#8220;Aggregate Measures&#8221; is unchecked, the view shows individual observations from <code>Males.csv</code>, so the individual values are what get passed to TabPy.</p>
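<p>Behind the scenes, <code>tabpy.query</code> corresponds to an HTTP POST against the running TabPy server. A sketch of what such a request looks like, assuming TabPy&#8217;s default port 9004 and its documented <code>/query/ttest</code> route (the helper name <code>query_ttest</code> is hypothetical, and you don&#8217;t need to write this yourself; it is only to show the payload shape):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import json
from urllib import request

def query_ttest(values, labels, url="http://localhost:9004/query/ttest"):
    """POST the two argument lists to the deployed 'ttest' endpoint (hypothetical helper)."""
    payload = {"data": {"_arg1": values, "_arg2": labels}}
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires a running TabPy server
        return json.loads(resp.read())["response"]

# The JSON body Tableau builds for the t-test call:
payload = {"data": {"_arg1": [1, 4, 1, 3, 4, 5],
                    "_arg2": ["yes", "yes", "yes", "no", "no", "no"]}}
print(json.dumps(payload))</pre>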



<p>Drag the new calculation to the tooltip to show it in the view:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/10I2VUMvSKY2nGQKKQgufXCexJbzPuMGASFjHKBs9PcY9oDfFhdl05Cd10EeQ_kLOYWdE53k4vXh6THRt8IIRVYOaonRozHXagAWHk9lye7oZ7c9pCp1yAEttdgkB5teTwoCqArA" alt=""/></figure></div>



<p>The t-test returns a p-value of 0.4320. We can interpret this to mean that we do not find evidence of a difference in average years of experience between unionized and non-unionized males. The average experience in our sample data differs for unionized men compared with non-unionized men, but because the p-value is high, we don&#8217;t have evidence of a difference in the general population.</p>



<p>Tableau does not have a t-test built-in, but we have added it using Python!</p>



<h2 class="wp-block-heading">Troubleshooting</h2>



<p>You’re very likely to encounter errors when setting up calculations with TabPy. Here’s an example. If we try switching the table calculation from Table(down) to Cell, we get this message:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/JgJ0tt8-XnpEcwayEy3USN99DJMnKjVazRqa3JN18Evft8L031waKv1DJiNCAR_IbKw___lqeodwC3j_Lc09WGw7R0Vv5QY4e3j8wnYJCEZeMAy6DpyLiq-3Irn7-9Q4Qx5s7VlI" alt=""/></figure></div>



<p><code>_arg1</code> and <code>_arg2</code> are lists, so what&#8217;s the problem? The error message we see in Tableau doesn&#8217;t help us pinpoint it. If we switch to the command prompt, we can see the stack trace:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/dgZljigumcE6NnS48JrT8XK8Fmpev-sXQT8hXJBcKHZE8AyOotz7sNwI2eCBO1cNZzNL1kO7lJU4U3TpHqfdpGnxAEOepTp4c8Y2XbiiFRSBXIcJRkk1OBOw3xVXktDXmxW520xo" alt=""/></figure></div>



<p>The stack trace tells us that line 34 is throwing the error. We can look at the <code>ttest.py</code> code here <a href="https://github.com/tableau/TabPy/blob/master/tabpy/models/scripts/tTest.py" target="_blank" rel="noreferrer noopener">https://github.com/tableau/TabPy/blob/master/tabpy/models/scripts/tTest.py</a> to better understand the error.&nbsp;</p>



<p>The problem is that if we are doing a two-sample t-test, we can do it in one of two ways:</p>



<ol class="wp-block-list"><li>Send <code>_arg1</code> and <code>_arg2</code> as the two different samples. For example, <code>_arg1</code> could be <code>[1, 4, 1]</code> and <code>_arg2</code> could be <code>[3, 4, 5]</code>.</li><li>Send both samples in <code>_arg1</code> and use <code>_arg2</code> to specify which sample each observation belongs to. For example, <code>_arg1</code> could be <code>[1, 4, 1, 3, 4, 5]</code> and <code>_arg2</code> could be <code>['yes', 'yes', 'yes', 'no', 'no', 'no']</code>.</li></ol>



<p>When the table calculation was set to Table(down), <code>_arg2</code> contained both <code>Union=no</code> and <code>Union=yes</code> values. But now that we are using Cell, there are <strong>two</strong> calls to TabPy, one for <code>Union=no</code> and a second for <code>Union=yes</code>. Instead of sending <code>_arg1 = [1, 4, 1, 4, 5, 1]</code> and <code>_arg2 = ['yes', 'yes', 'yes', 'no', 'no', 'no']</code> in a single call, we are sending <code>_arg1 = [1, 4, 1]</code> and <code>_arg2 = ['yes', 'yes', 'yes']</code> in one call to TabPy and then making a second call with <code>_arg1 = [4, 5, 1]</code> and <code>_arg2 = ['no', 'no', 'no']</code>. As a result, in <code>ttest.py</code> the check <code>len(set(_arg2)) == 2</code> evaluates to false, and we end up at line 34, which throws the error.</p>
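<p>We can reproduce the failing check directly in Python. When the calculation runs per cell, each call carries only one distinct label, so the two-sample condition (paraphrased from <code>ttest.py</code>) is not met:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Table(down): one call whose labels contain both groups
table_down_arg2 = ['yes', 'yes', 'yes', 'no', 'no', 'no']
print(len(set(table_down_arg2)) == 2)  # True: a valid two-sample input

# Cell: two separate calls, each carrying a single group
cell_arg2 = ['yes', 'yes', 'yes']
print(len(set(cell_arg2)) == 2)  # False: this is what triggers the error at line 34</pre>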



<p>We can troubleshoot similar errors by checking the command prompt to find the error message and the line number that is throwing the error.</p>






<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="$100/h+ Tableau Freelancers on Upwork" width="937" height="527" src="https://www.youtube.com/embed/q7awpkX8LN8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p><a href="https://blog.finxter.com/become-python-freelancer-course/" data-type="page" data-id="2072" target="_blank" rel="noreferrer noopener">Become a Freelance Developer today!</a></p>



<p>The post <a href="https://blog.finxter.com/execute-python-from-tableau-with-tabpy/">Execute Python from Tableau with TabPy</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
