<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lukas Halim, Author at Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/author/lukashalim/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/author/lukashalim/</link>
	<description></description>
	<lastBuildDate>Sun, 03 Apr 2022 16:41:51 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Lukas Halim, Author at Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/author/lukashalim/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Matplotlib Text and Annotate — A Simple Guide</title>
		<link>https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/</link>
		
		<dc:creator><![CDATA[Lukas Halim]]></dc:creator>
		<pubDate>Sat, 22 May 2021 20:02:22 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Matplotlib]]></category>
		<category><![CDATA[Python]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=30243</guid>

					<description><![CDATA[<p>You&#8217;d like to add text to your plot, perhaps to explain an outlier or label points. Matplotlib&#8217;s text method allows you to add text at specified coordinates. But what if you want the text to refer to a particular point without being centered on that point? Often you&#8217;ll want the text slightly ... <a title="Matplotlib Text and Annotate — A Simple Guide" class="read-more" href="https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/" aria-label="Read more about Matplotlib Text and Annotate — A Simple Guide">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/">Matplotlib Text and Annotate — A Simple Guide</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>You&#8217;d like to add text to your plot, perhaps to explain an outlier or label points. <a href="https://blog.finxter.com/matplotlib-full-guide/" target="_blank" rel="noreferrer noopener" title="Matplotlib — A Simple Guide with Videos">Matplotlib</a>&#8217;s text method allows you to add text at specified coordinates. But what if you want the text to refer to a particular point without being centered on that point? Often you&#8217;ll want the text slightly below or above the point it&#8217;s labeling. In that situation, you&#8217;ll want the <code>annotate</code> method. With annotate, we can specify both the point we want to label and the position of the label.</p>



<h2 class="wp-block-heading" id="Basic-text-method-example">Basic text method example</h2>



<p>Let&#8217;s start with an example of the first situation &#8211; we simply want to add text at a particular point on our plot. The text method will place text anywhere you&#8217;d like on the <a href="https://blog.finxter.com/matplotlib-how-to-change-subplot-sizes/" target="_blank" rel="noreferrer noopener" title="Matplotlib – How to Change Subplot Sizes">plot</a>, or even place text outside the plot. After the import statement, we pass the required parameters &#8211; the x and y coordinates and the text.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import matplotlib.pyplot as plt

x, y, text = .5, .5, "text on plot"

fig, ax = plt.subplots()
ax.text(x, y, text)
x, y, text = 1.3, .5, "text outside plot"
ax.text(x, y, text)</pre>



<pre class="wp-block-preformatted">Text(1.3, 0.5, 'text outside plot')</pre>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="554" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-28.png" alt="" class="wp-image-30251" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-28.png 554w, https://blog.finxter.com/wp-content/uploads/2021/05/image-28-300x136.png 300w" sizes="(max-width: 554px) 100vw, 554px" /></figure>



<h2 class="wp-block-heading" id="Changing-the-font-size-and-font-color">Changing the font size and font color</h2>



<p>We can customize the text position and format using optional parameters. The font itself can be customized using either a <code>fontdict</code> object or with individual parameters.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">x, y, text = .3, .5, "formatted with fontdict"
fontdict = {'family': 'serif', 'weight': 'bold', 'size': 16, 'color' : 'green'}
fig, ax = plt.subplots()
ax.text(x, y, text, fontdict=fontdict)
x, y, text = .2, .2, "formatted with individual parameters"
ax.text(x, y, text, fontsize = 12, color = 'red', fontstyle = 'italic')</pre>



<pre class="wp-block-preformatted">Text(0.2, 0.2, 'formatted with individual parameters')</pre>



<figure class="wp-block-image size-large"><img decoding="async" width="380" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-21.png" alt="" class="wp-image-30244" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-21.png 380w, https://blog.finxter.com/wp-content/uploads/2021/05/image-21-300x199.png 300w" sizes="(max-width: 380px) 100vw, 380px" /></figure>



<h2 class="wp-block-heading" id="How-to-change-the-text-alignment?">How to change the text alignment?</h2>



<p>We specify the <code>xy</code> coordinates for the text, but of course, the text can&#8217;t fit on a single point. So is the text centered on the point, or is the first letter in the text positioned on that point? Let&#8217;s see.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
ax.set_title("Different horizontal alignment options when x = .5")
ax.text(.5, .8, 'ha left', fontsize = 12, color = 'red', ha = 'left')
ax.text(.5, .6, 'ha right', fontsize = 12, color = 'green', ha = 'right')
ax.text(.5, .4, 'ha center', fontsize = 12, color = 'blue', ha = 'center')
ax.text(.5, .2, 'ha default', fontsize = 12)</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.2, 'ha default')</pre>



<figure class="wp-block-image size-large"><img decoding="async" width="380" height="264" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-22.png" alt="" class="wp-image-30245" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-22.png 380w, https://blog.finxter.com/wp-content/uploads/2021/05/image-22-300x208.png 300w" sizes="(max-width: 380px) 100vw, 380px" /></figure>



<p>The text is horizontally left-aligned by default. Left alignment positions the beginning of the text on the specified coordinates, center alignment positions the middle of the text on the xy coordinates, and right alignment positions the end of the text on the coordinates.</p>
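<p>Vertical alignment works the same way through the <code>va</code> parameter, which accepts <code>'top'</code>, <code>'bottom'</code>, <code>'center'</code>, and <code>'baseline'</code> (the default). The following sketch is our own illustration &#8211; the coordinates and labels are invented for demonstration, not taken from the figures above:</p>

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_title("Different vertical alignment options when y = .5")
ax.axhline(.5, color='lightgray')  # reference line at y = .5
ax.text(.2, .5, 'va top', fontsize=12, color='red', va='top')
ax.text(.4, .5, 'va bottom', fontsize=12, color='green', va='bottom')
ax.text(.6, .5, 'va center', fontsize=12, color='blue', va='center')
ax.text(.8, .5, 'va baseline', fontsize=12, va='baseline')
```

<p>With <code>va='top'</code> the text hangs below the reference line, and with <code>va='bottom'</code> it sits above it.</p>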



<h2 class="wp-block-heading" id="Creating-a-text-box">Creating a text box</h2>



<p>The <code>fontdict</code> <a href="https://blog.finxter.com/python-dictionary/" target="_blank" rel="noreferrer noopener" title="Python Dictionary – The Ultimate Guide">dictionary </a>object allows you to customize the font. Similarly, passing the <code>bbox</code> dictionary object allows you to set the properties for a box around the text. Color values between 0 and 1 determine the shade of gray, with 0 being totally black and 1 being totally white. We can also use <code>boxstyle</code> to determine the shape of the box. If the <code>facecolor</code> is too dark, it can be lightened by moving the <code>alpha</code> value closer to 0.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, text = .5, .7, "Text in grey box with\nrectangular box corners."
ax.text(x, y, text,bbox={'facecolor': '.9', 'edgecolor':'blue', 'boxstyle':'square'})
x, y, text = .5, .5, "Text in blue box with\nrounded corners and alpha of .1."
ax.text(x, y, text,bbox={'facecolor': 'blue', 'edgecolor':'none', 'boxstyle':'round', 'alpha' : 0.05})
x, y, text = .1, .3, "Text in a circle.\nalpha of .5 darker\nthan alpha of .1"
ax.text(x, y, text,bbox={'facecolor': 'blue', 'edgecolor':'black', 'boxstyle':'circle', 'alpha' : 0.5})</pre>



<pre class="wp-block-preformatted">Text(0.1, 0.3, 'Text in a circle.\nalpha of .5 darker\nthan alpha of .1')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="380" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-31.png" alt="" class="wp-image-30254" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-31.png 380w, https://blog.finxter.com/wp-content/uploads/2021/05/image-31-300x199.png 300w" sizes="auto, (max-width: 380px) 100vw, 380px" /></figure>



<h2 class="wp-block-heading" id="Basic-annotate-method-example">Basic annotate method example</h2>



<p>As we said earlier, often you&#8217;ll want the text to be below or above the point it&#8217;s labeling. We could do this with the text method, but annotate makes it easier to place text relative to a point. The annotate method allows us to specify two pairs of coordinates. One xy coordinate specifies the point we wish to label. Another xy coordinate specifies the position of the label itself. For example, here we plot a point at (.5,.5) but put the annotation a little higher, at (.5,.503).</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "annotation"
ax.set_title("Annotating point (.5,.5) with label located at (.5,.503)")
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.003))</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.503, 'annotation')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-29.png" alt="" class="wp-image-30252" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-29.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-29-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Annotate-with-an-arrow">Annotate with an arrow</h2>



<p>Okay, so we have a point at <code>xy</code> and an annotation at <code>xytext</code>. How can we connect the two? Can we draw an arrow from the annotation to the point? Absolutely! What we&#8217;ve done with annotate so far looks the same as if we&#8217;d just used the text method to put the text at (.5, .503). But annotate can also draw an arrow connecting the label to the point. The arrow is styled by passing a dictionary to <code>arrowprops</code>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "annotation"
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.003),arrowprops={'arrowstyle' : 'simple'})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.503, 'annotation')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-23.png" alt="" class="wp-image-30246" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-23.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-23-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Adjusting-the-arrow-length">Adjusting the arrow length</h2>



<p>It looks a little weird to have the arrow touch the point. How can we have the arrow go close to the point, but not quite touch it? Again, styling options are passed in a dictionary object. Larger values of <code>shrinkA</code> will move the tail farther from the label, and larger values of <code>shrinkB</code> will move the head farther from the point. The default for both <code>shrinkA</code> and <code>shrinkB</code> is 2, so by setting <code>shrinkB</code> to 5 we move the head of the arrow farther from the point.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "annotation"
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.003),arrowprops={'arrowstyle' : 'simple', 'shrinkB' : 5})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.503, 'annotation')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-24.png" alt="" class="wp-image-30247" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-24.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-24-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Does-the-annotate-method-have-the-same-styling-options-that-the-text-method-has?">Do the annotate and text methods have the same styling options?</h2>



<p>Yes, all the parameters that work with text will also work with annotate. So, for example, we can put the annotation in a text box and set the <code>fontstyle</code> as italic, the same way as we did above.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, text = .5, .7, "Italic text in grey box with\nrectangular box corner\ndemonstrating that the\nformatting options\nthat work with text\nalso work with annotate."
ax.scatter(x,y)
ax.annotate(text, xy=(x,y),xytext=(x,y+.01)
            ,fontstyle = 'italic'
            ,bbox={'facecolor': '.9', 'edgecolor':'blue', 'boxstyle':'square', 'alpha' : 0.5}
            ,arrowprops={'arrowstyle' : 'simple', 'shrinkB' : 5})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.71, 'Italic text in grey box with\nrectangular box corner\ndemonstrating that the\nformatting options\nthat work with text\nalso work with annotate.')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-30.png" alt="" class="wp-image-30253" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-30.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-30-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="Are-there-any-shorthands-for-styling-the-arrow?">Are there any shorthands for styling the arrow?</h2>



<p>Yes, <code>arrowstyle</code> can be used instead of the other styling keys. More options are listed <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.patches.ArrowStyle.html?highlight=arrowstyle#matplotlib.patches.ArrowStyle" target="_blank" rel="noreferrer noopener">here</a>, including <code>'fancy'</code>, <code>'simple'</code>, <code>'-'</code> and <code>'-&gt;'</code>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fig, ax = plt.subplots()
x, y, annotation = .5, .5, "wedge style"
ax.scatter(x,y)
ax.annotate(annotation,xy=(x,y),xytext=(x,y+.01),arrowprops={'arrowstyle':'wedge'})
another_annotation = '- style'
ax.annotate(another_annotation,xy=(x,y),xytext=(x,y-.01),arrowprops={'arrowstyle':'-'})</pre>



<pre class="wp-block-preformatted">Text(0.5, 0.49, '- style')</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="378" height="248" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-27.png" alt="" class="wp-image-30250" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-27.png 378w, https://blog.finxter.com/wp-content/uploads/2021/05/image-27-300x197.png 300w" sizes="auto, (max-width: 378px) 100vw, 378px" /></figure>



<h2 class="wp-block-heading" id="How-can-we-annotate-all-the-points-on-a-scatter-plot?">How can we annotate all the points on a scatter plot?</h2>



<p>We can first create 15 test points with associated labels. Then loop through the points and use the annotate method at each point to add a label.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import random
random.seed(2)

x = range(15)
y = [element * (2 + random.random()) for element in x]
n = ['label for ' + str(i) for i in x]

fig, ax = plt.subplots()
ax.scatter(x, y)

texts = []
for i, txt in enumerate(n):
    ax.annotate(txt, xy=(x[i], y[i]), xytext=(x[i],y[i]+.3))</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="410" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-25.png" alt="" class="wp-image-30248" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-25.png 410w, https://blog.finxter.com/wp-content/uploads/2021/05/image-25-300x184.png 300w" sizes="auto, (max-width: 410px) 100vw, 410px" /></figure>



<h2 class="wp-block-heading" id="Handling-overlapping-annotations">Handling overlapping annotations</h2>



<p>The annotations are overlapping each other. How do we prevent that? You could manually adjust the location of each label, but that would be very time-consuming. Luckily the Python library <a href="https://github.com/Phlya/adjustText" target="_blank" rel="noreferrer noopener" title="https://github.com/Phlya/adjustText">adjustText </a>will do the work for us. You&#8217;ll have to <a href="https://blog.finxter.com/how-to-install-a-python-package-with-a-whl-file/" target="_blank" rel="noreferrer noopener" title="How to Install a Python Package with a .whl File?">pip install</a> it first, and we&#8217;ll need to store the annotations in a list so that we can pass them as an argument to <code>adjust_text</code>. Doing this, we see for example that &#8220;label for 6&#8221; is shifted to the left so that it no longer overlaps with &#8220;label for 7.&#8221;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from adjustText import adjust_text

fig, ax = plt.subplots()
ax.scatter(x, y)

texts = []
for i, txt in enumerate(n):
    texts.append(ax.annotate(txt, xy=(x[i], y[i]), xytext=(x[i],y[i]+.3)))
    
adjust_text(texts)</pre>



<pre class="wp-block-preformatted">226</pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="368" height="252" src="https://blog.finxter.com/wp-content/uploads/2021/05/image-26.png" alt="" class="wp-image-30249" srcset="https://blog.finxter.com/wp-content/uploads/2021/05/image-26.png 368w, https://blog.finxter.com/wp-content/uploads/2021/05/image-26-300x205.png 300w" sizes="auto, (max-width: 368px) 100vw, 368px" /></figure>



<h2 class="wp-block-heading" id="Conclusion">Conclusion</h2>



<p>You should now be able to position and format text and annotations on your plots. Thanks for reading! Please check out my other work at <a href="https://www.learningtableau.com" target="_blank" rel="noreferrer noopener">LearningTableau</a>, <a href="https://www.powerbiskills.com" target="_blank" rel="noreferrer noopener">PowerBISkills</a>, and <a href="https://www.datasciencedrills.com" target="_blank" rel="noreferrer noopener">DataScienceDrills</a>.</p>
<p>The post <a href="https://blog.finxter.com/matplotlib-text-and-annotate-a-simple-guide/">Matplotlib Text and Annotate — A Simple Guide</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Logistic Regression Scikit-learn vs Statsmodels</title>
		<link>https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/</link>
		
		<dc:creator><![CDATA[Lukas Halim]]></dc:creator>
		<pubDate>Fri, 05 Feb 2021 15:44:50 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=22984</guid>

					<description><![CDATA[<p>What’s the difference between Statsmodels and Scikit-learn? Both have ordinary least squares and logistic regression, so it seems like Python is giving us two ways to do the same thing. Statsmodels offers modeling from the perspective of statistics. Scikit-learn offers some of the same models from the perspective of machine learning. So we need to ... <a title="Logistic Regression Scikit-learn vs Statsmodels" class="read-more" href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/" aria-label="Read more about Logistic Regression Scikit-learn vs Statsmodels">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/">Logistic Regression Scikit-learn vs Statsmodels</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>What’s the difference between Statsmodels and Scikit-learn? Both have ordinary least squares and logistic regression, so it seems like Python is giving us two ways to do the same thing. Statsmodels offers modeling from the perspective of <em>statistics</em>. Scikit-learn offers some of the same models from the perspective of <em>machine learning</em>.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Logistic Regression Scikit-learn vs Statsmodels" width="937" height="527" src="https://www.youtube.com/embed/inZpIyBm2Us?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>So we need to understand the difference between statistics and machine learning! Statistics makes mathematically valid inferences about a population based on sample data. Statistics answers the question, &#8220;What is the evidence that X is related to Y?&#8221; Machine learning has the goal of optimizing predictive accuracy rather than inference. Machine learning answers the question, &#8220;Given X, what prediction should we make for Y?&#8221;</p>



<p>In the example below, we&#8217;ll create a fake dataset with predictor variables and a binary Y variable. Then we&#8217;ll perform logistic regression with scikit-learn and statsmodels. We&#8217;ll see that scikit-learn allows us to easily tune the model to optimize predictive power. Statsmodels will provide a summary of statistical measures which will be very familiar to those who&#8217;ve used SAS or R.</p>



<p>If you need an intro to Logistic Regression, see <a href="https://blog.finxter.com/logistic-regression-in-one-line-python/" target="_blank" rel="noreferrer noopener">this Finxter post</a>.</p>



<h2 class="wp-block-heading" id="Create-Fake-Data-for-the-Logistic-Regression-Model">Create Fake Data for the Logistic Regression Model</h2>



<p>I tried using some publicly available data for this exercise but didn&#8217;t find one with the characteristics I wanted. So I decided to create some fake data by using <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">NumPy</a>! There&#8217;s a post <a href="https://data.library.virginia.edu/simulating-a-logistic-regression-model/" target="_blank" rel="noreferrer noopener">here</a> that explains the math and how to do this in R.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
import pandas as pd

#The next line is setting the seed for the random number generator so that we get consistent results
rg = np.random.default_rng(seed=0)
#Create an array with 500 rows and 3 columns
X_for_creating_probabilities = rg.normal(size=(500,3))</pre>



<p>Create an array with the first column removed. The deleted column can be thought of as random noise, or as a variable that we don&#8217;t have access to when creating the model.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X1 = np.delete(X_for_creating_probabilities,0,axis=1)
X1[:5]
"""
array([[-0.13210486,  0.64042265],
       [-0.53566937,  0.36159505],
       [ 0.94708096, -0.70373524],
       [-0.62327446,  0.04132598],
       [-0.21879166, -1.24591095]])
"""</pre>



<p>Now we&#8217;ll create two more columns correlated with X1. Datasets often have highly correlated variables. Correlation increases the likelihood of overfitting. Concatenate to get a single array.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X2 = X1 + .1 * np.random.normal(size=(500,2))
X_predictors = np.concatenate((X1,X2),axis=1)</pre>



<p>We want to create our outcome variable and have it be related to X_predictors. To do that, we use our data as inputs to the logistic regression model to get probabilities. Then we set the outcome variable, Y, to True when the probability is above .5.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">P = 1 / (1 + np.e**(-np.matmul(X_for_creating_probabilities,[1,1,1])))
Y = P > .5
#About half of cases are True
np.mean(Y)
#0.498</pre>



<p>Now divide the data into training and test data. We&#8217;ll run a logistic regression on the training data, then see how well the model performs on the test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">#Set the first 50 rows to train the model
X_train = X_predictors[:50]
Y_train = Y[:50]

#Set the remaining rows to test the model
X_test = X_predictors[50:]
Y_test = Y[50:]

print(f"X_train: {len(X_train)} X_test: {len(X_test)}")
#X_train: 50 X_test: 450</pre>



<h2 class="wp-block-heading" id="Logistic-regression-with-Scikit-learn">Logistic regression with Scikit-learn</h2>



<p>We&#8217;re ready to train and test models.</p>



<p>As we train the models, we need to take steps to avoid overfitting. A machine learning model may have very accurate results with the data used to train the model. But this does not mean it will be equally accurate when making predictions with data it hasn&#8217;t seen before. When the model fails to generalize to new data, we say it has &#8220;overfit&#8221; the training data. Overfitting is more likely when there are few observations to train on, and when the model uses many correlated predictors.</p>



<p>How to avoid overfitting? By default, <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">scikit-learn</a>&#8216;s logistic regression applies regularization. Regularization balances the need for predictive accuracy on the training data with a penalty on the magnitude of the model coefficients. Increasing the penalty reduces the coefficients and hence reduces the likelihood of overfitting. If the penalty is too large, though, it will reduce predictive power on both the training and test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_default = LogisticRegression(random_state=0).fit(X_train, Y_train)
print(f"intercept: {scikit_default.intercept_} coefficients: {scikit_default.coef_}")
print(f"train accuracy: {scikit_default.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_default.score(X_test, Y_test)}")
"""
Results will vary slightly, even when you set random_state.
intercept: [-0.44526823] coefficients: [[0.50031563 0.79636504 0.82047214 0.83635656]]
train accuracy: 0.8
test accuracy: 0.8088888888888889
"""</pre>



<p>We can turn off regularization by setting the penalty to none. Applying regularization reduces the magnitude of the coefficients, so setting the penalty to none increases them. Notice that the accuracy on the test data decreases. This indicates that our model has overfit the training data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_no_penalty = LogisticRegression(random_state=0,penalty='none').fit(X_train, Y_train)
print(f"intercept: {scikit_no_penalty.intercept_} coefficients: {scikit_no_penalty.coef_}")
print(f"train accuracy: {scikit_no_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_no_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.63388911] coefficients: [[-3.59878438  0.70813119  5.10660019  1.29684873]]
train accuracy: 0.82
test accuracy: 0.7888888888888889
"""</pre>



<p>The parameter C, which controls the inverse of the regularization strength, is 1.0 by default. Smaller values of C increase the regularization, so if we set the value to .1 we reduce the magnitude of the coefficients.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_bigger_penalty = LogisticRegression(random_state=0,C=.1).fit(X_train, Y_train)
print(f"intercept: {scikit_bigger_penalty.intercept_} \
    coefficients: {scikit_bigger_penalty.coef_}")
print(f"train accuracy: {scikit_bigger_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_bigger_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.13102803]     coefficients: [[0.3021235  0.3919277  0.34359251 0.40332636]]
train accuracy: 0.8
test accuracy: 0.8066666666666666
"""</pre>



<p>It&#8217;s nice to be able to adjust the regularization parameter, but how do we decide its optimal value? Scikit-learn&#8217;s GridSearchCV provides an effective yet easy-to-use method for choosing an optimal value. The &#8220;Grid Search&#8221; in <strong>GridSearch</strong>CV means that we supply a <a href="https://blog.finxter.com/python-dictionary/" target="_blank" rel="noreferrer noopener" title="Python Dictionary – The Ultimate Guide">dictionary </a>with the parameter values we wish to test. The model is fit with all combinations of those values. If we have 4 possible values for C and 2 possible values for solver, we will search through all 4&#215;2=8 combinations.</p>



<h3 class="wp-block-heading" id="GridSearchCV-Searches-Through-This-Grid">GridSearchCV Searches Through This Grid</h3>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>C</th><th>solver</th></tr></thead><tbody><tr><td>.01</td><td>newton-cg</td></tr><tr><td>.1</td><td>newton-cg</td></tr><tr><td>1</td><td>newton-cg</td></tr><tr><td>10</td><td>newton-cg</td></tr><tr><td>.01</td><td>lbfgs</td></tr><tr><td>.1</td><td>lbfgs</td></tr><tr><td>1</td><td>lbfgs</td></tr><tr><td>10</td><td>lbfgs</td></tr></tbody></table></figure>



<p>The &#8220;CV&#8221; in GridSearch<strong>CV</strong> stands for <strong>c</strong>ross-<strong>v</strong>alidation. Cross-validation is a method of segmenting the training data. The model is trained on all but one of the segments, and the remaining segment is used to validate the model.</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Iteration</th><th>Segment 1</th><th>Segment 2</th><th>Segment 3</th><th>Segment 4</th><th>Segment 5</th></tr></thead><tbody><tr><td>1st Iteration</td><td>Validation</td><td>Train</td><td>Train</td><td>Train</td><td>Train</td></tr><tr><td>2nd Iteration</td><td>Train</td><td>Validation</td><td>Train</td><td>Train</td><td>Train</td></tr><tr><td>3rd Iteration</td><td>Train</td><td>Train</td><td>Validation</td><td>Train</td><td>Train</td></tr><tr><td>4th Iteration</td><td>Train</td><td>Train</td><td>Train</td><td>Validation</td><td>Train</td></tr><tr><td>5th Iteration</td><td>Train</td><td>Train</td><td>Train</td><td>Train</td><td>Validation</td></tr></tbody></table></figure>
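<p>The iteration scheme in the table above is easy to sketch in plain Python. The function below is a toy illustration of how the segments are carved up; scikit-learn&#8217;s KFold performs the same bookkeeping for you, so this is for intuition only:</p>

```python
# Toy illustration of 5-fold cross-validation segmentation.
# Each iteration holds out one segment for validation and trains on the rest;
# scikit-learn's KFold handles this bookkeeping internally.

def five_fold_segments(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of the k iterations."""
    indices = list(range(n_rows))
    fold_size = n_rows // k
    for i in range(k):
        validation = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, validation

# With 50 training rows, each iteration validates on a different 10-row segment.
for train, validation in five_fold_segments(50):
    print(len(train), len(validation))  # 40 10, five times
```

<p>Every row serves as validation data exactly once across the five iterations, so no data is wasted.</p>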






<p>Grid search and cross-validation work in combination. GridSearchCV iterates through the values of C and solver over different training and validation segments. The algorithm selects the best estimator based on its performance on the validation segments.</p>



<p>Doing this allows us to determine which values of C and solver work best for our training data. This is how <a href="https://blog.finxter.com/deploying-a-machine-learning-model-in-fastapi/" target="_blank" rel="noreferrer noopener" title="Deploying a machine learning model in FastAPI">scikit-learn</a> helps us to optimize predictive accuracy.</p>



<p>Let&#8217;s see it in action.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.model_selection import GridSearchCV
parameters = {'C':[.01, .1, 1, 10],'solver':['newton-cg','lbfgs']}
Logistic = LogisticRegression(random_state=0)
scikit_GridSearchCV = GridSearchCV(Logistic, parameters)
scikit_GridSearchCV.fit(X_train, Y_train)
print(f"best estimator: {scikit_GridSearchCV.best_estimator_}")
#best estimator: LogisticRegression(C=0.1, random_state=0, solver='newton-cg')</pre>



<p>The score method returns the mean accuracy on the given test data and labels. Accuracy is the percentage of observations correctly predicted.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(f"train accuracy: {scikit_GridSearchCV.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_GridSearchCV.score(X_test, Y_test)}")
"""
train accuracy: 0.82
test accuracy: 0.8133333333333334
"""</pre>



<h2 class="wp-block-heading" id="Logistic-regression-with-Statsmodels">Logistic regression with Statsmodels</h2>



<p>Now let&#8217;s try the same, but with statsmodels. With scikit-learn, to turn off regularization we set <code>penalty='none'</code>, but with statsmodels regularization is turned off by default. A quirk to watch out for is that statsmodels does not include an intercept by default. To include an intercept, we use the <code>sm.add_constant</code> method.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import statsmodels.api as sm

#adding constant to X
X_train_with_constant = sm.add_constant(X_train)
X_test_with_constant = sm.add_constant(X_test)

# building the model and fitting the data
sm_model_all_predictors = sm.Logit(Y_train, X_train_with_constant).fit()

# printing the summary table
print(sm_model_all_predictors.params)
"""
Optimization terminated successfully.
         Current function value: 0.446973
         Iterations 7
[-0.57361523 -2.00207425  1.28872367  3.53734636  0.77494424]
"""</pre>



<p>If you&#8217;re used to doing logistic regression in R or SAS, what comes next will be familiar. Once we have trained the logistic regression model with statsmodels, the summary method will easily produce a table with statistical measures including p-values and confidence intervals.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sm_model_all_predictors.summary()</pre>



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><th>Dep. Variable:</th><td>y</td><th>No. Observations:</th><td>50</td></tr><tr><th>Model:</th><td>Logit</td><th>Df Residuals:</th><td>45</td></tr><tr><th>Method:</th><td>MLE</td><th>Df Model:</th><td>4</td></tr><tr><th>Date:</th><td>Thu, 04 Feb 2021</td><th>Pseudo R-squ.:</th><td>0.3846</td></tr><tr><th>Time:</th><td>14:33:19</td><th>Log-Likelihood:</th><td>-21.228</td></tr><tr><th>converged:</th><td>True</td><th>LL-Null:</th><td>-34.497</td></tr><tr><th>Covariance Type:</th><td>nonrobust</td><th>LLR p-value:</th><td>2.464e-05</td></tr></tbody></table></figure>



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><th>coef</th><th>std err</th><th>z</th><th>P&gt;|z|</th><th>[0.025</th><th>0.975]</th></tr><tr><th>const</th><td>-0.7084</td><td>0.478</td><td>-1.482</td><td>0.138</td><td>-1.645</td><td>0.228</td></tr><tr><th>x1</th><td>5.5486</td><td>4.483</td><td>1.238</td><td>0.216</td><td>-3.237</td><td>14.335</td></tr><tr><th>x2</th><td>10.2566</td><td>5.686</td><td>1.804</td><td>0.071</td><td>-0.887</td><td>21.400</td></tr><tr><th>x3</th><td>-3.9137</td><td>4.295</td><td>-0.911</td><td>0.362</td><td>-12.333</td><td>4.505</td></tr><tr><th>x4</th><td>-7.8510</td><td>5.364</td><td>-1.464</td><td>0.143</td><td>-18.364</td><td>2.662</td></tr></tbody></table></figure>



<p>There&#8217;s a lot here, but we&#8217;ll focus on the second table with the coefficients.</p>



<p>The first column shows the value of the coefficient. The fourth column, with the heading P&gt;|z|, shows the p-values. A p-value is a probability measure, and p-values above .05 are frequently considered &#8220;not statistically significant.&#8221; None of the predictors here are considered statistically significant! This is because we have a relatively small number of observations in our training data and because the predictors are highly correlated. Some statistical packages like R and SAS have built-in methods to select the features to include in the model based on which predictors have low (significant) p-values, but unfortunately, this isn&#8217;t available in statsmodels.</p>
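<p>If you want something like R&#8217;s or SAS&#8217;s selection, backward elimination by p-value is straightforward to loop yourself. The sketch below uses made-up p-values and skips the refitting step for brevity; in practice you would refit the statsmodels model after each drop and reread the <code>pvalues</code> attribute of the results object:</p>

```python
# Hedged sketch of backward elimination by p-value. The p-values below are
# made up for illustration; a real loop would refit the statsmodels model
# after each drop and read fresh p-values from the new results object.

def backward_eliminate(pvalues, threshold=0.05):
    """Repeatedly drop the predictor with the largest p-value above threshold."""
    kept = dict(pvalues)
    while kept:
        worst = max(kept, key=kept.get)
        if kept[worst] <= threshold:
            break  # every remaining predictor is significant
        del kept[worst]  # real version: refit the model here before continuing
    return list(kept)

# Hypothetical p-values resembling a first fit with correlated predictors:
example_pvalues = {'x1': 0.216, 'x2': 0.071, 'x3': 0.362, 'x4': 0.143}
print(backward_eliminate(example_pvalues))
```

<p>Because this sketch never refits, correlated predictors can all be eliminated even when a smaller model would make some of them significant, which is exactly why the refitting step matters in practice.</p>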



<p>If we try again with just x1 and x2, we&#8217;ll get a completely different result, with very low p-values for x1 and x2, meaning that the evidence for a relationship with the dependent variable is statistically significant. We&#8217;re cheating, though &#8211; because we created the data, we know that we only need x1 and x2.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sm_model_x1_x2 = sm.Logit(Y_train, X_train_with_constant[:,:3]).fit()
sm_model_x1_x2.summary()</pre>



<p>Now we see x1 and x2 are both statistically significant.</p>



<p>Statsmodels doesn&#8217;t have the same accuracy method that we have in scikit-learn. We&#8217;ll use the predict method to predict the probabilities. Then we&#8217;ll use the decision rule that probabilities above .5 are true and all others are false. This is the same rule used when scikit-learn calculates accuracy.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">all_predicted_train = sm_model_all_predictors.predict(X_train_with_constant)>.5
all_predicted_test = sm_model_all_predictors.predict(X_test_with_constant)>.5

x1_x2_predicted_train = sm_model_x1_x2.predict(X_train_with_constant[:,:3])>.5
x1_x2_predicted_test = sm_model_x1_x2.predict(X_test_with_constant[:,:3])>.5

#calculate the accuracy
print(f"train: {(Y_train==all_predicted_train).mean()} and test: {(Y_test==all_predicted_test).mean()}")
print(f"train: {(Y_train==x1_x2_predicted_train).mean()} and test: {(Y_test==x1_x2_predicted_test).mean()}")
"""
train: 0.8 and test: 0.8066666666666666
train: 0.8 and test: 0.8111111111111111
"""</pre>



<h2 class="wp-block-heading" id="Summarizing-The-Results">Summarizing The Results</h2>



<p>Let&#8217;s create a <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" target="_blank" rel="noreferrer noopener" title="How to Create a DataFrame in Pandas?">DataFrame </a>with the results. The models have identical accuracy on the training data, but different results on the test data. The models with all the predictors and without regularization have the worst test accuracy, suggesting that they have overfit on the training data and so do not generalize well to new data.</p>



<p>Even if we use the best methods in creating our model, there is still chance involved in how well it generalizes to the test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">lst = [['scikit-learn','default', scikit_default.score(X_train, Y_train),scikit_default.score(X_test, Y_test)],
       ['scikit-learn','no penalty', scikit_no_penalty.score(X_train, Y_train),scikit_no_penalty.score(X_test, Y_test)],
       ['scikit-learn','bigger penalty', scikit_bigger_penalty.score(X_train, Y_train),scikit_bigger_penalty.score(X_test, Y_test)],
       ['scikit-learn','GridSearchCV', scikit_GridSearchCV.score(X_train, Y_train),scikit_GridSearchCV.score(X_test, Y_test)],
       ['statsmodels','include intercept and all predictors', (Y_train==all_predicted_train).mean(),(Y_test==all_predicted_test).mean()],
       ['statsmodels','include intercept and x1 and x2', (Y_train==x1_x2_predicted_train).mean(),(Y_test==x1_x2_predicted_test).mean()]
      ]
df = pd.DataFrame(lst, columns =['package', 'setting','train accuracy','test accuracy'])
df</pre>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th></th><th>package</th><th>setting</th><th>train accuracy</th><th>test accuracy</th></tr></thead><tbody><tr><th>0</th><td>scikit-learn</td><td>default</td><td>0.80</td><td>0.808889</td></tr><tr><th>1</th><td>scikit-learn</td><td>no penalty</td><td>0.78</td><td>0.764444</td></tr><tr><th>2</th><td>scikit-learn</td><td>bigger penalty</td><td>0.82</td><td>0.813333</td></tr><tr><th>3</th><td>scikit-learn</td><td>GridSearchCV</td><td>0.80</td><td>0.808889</td></tr><tr><th>4</th><td>statsmodels</td><td>include intercept and all predictors</td><td>0.78</td><td>0.764444</td></tr><tr><th>5</th><td>statsmodels</td><td>include intercept and x1 and x2</td><td>0.80</td><td>0.811111</td></tr></tbody></table></figure>



<h2 class="wp-block-heading" id="Scikit-learn-vs-Statsmodels">Scikit-learn vs Statsmodels</h2>



<p>The upshot is that you should use scikit-learn for logistic regression unless you need the statistical results provided by statsmodels.</p>



<p>Here&#8217;s a table of the most relevant similarities and differences:</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th></th><th>Scikit-learn</th><th>Statsmodels</th></tr></thead><tbody><tr><td>Regularization</td><td>Uses L2 regularization by default, but regularization can be turned off using penalty=&#8217;none&#8217;</td><td>Does not use regularization by default</td></tr><tr><td>Hyperparameter tuning</td><td>GridSearchCV allows for easy tuning of regularization parameter</td><td>User will need to write lines of code to tune regularization parameter</td></tr><tr><td>Intercept</td><td>Includes intercept by default</td><td>Use the add_constant method to include an intercept</td></tr><tr><td>Model Evaluation</td><td>The score method reports prediction accuracy</td><td>The summary method shows p-values, confidence intervals, and other statistical measures</td></tr><tr><td>When should you use it?</td><td>For accurate predictions</td><td>For statistical inference.</td></tr><tr><td>Comparison with R and SAS</td><td>Different</td><td>Similar</td></tr></tbody></table></figure>



<p>That&#8217;s it for now! Please check out my other work at <a href="http://learningtableau.com" target="_blank" rel="noreferrer noopener">learningtableau.com</a> and my new site <a href="http://datasciencedrills.com" target="_blank" rel="noreferrer noopener">datasciencedrills.com</a>.</p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/">Logistic Regression Scikit-learn vs Statsmodels</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Execute Python from Tableau with TabPy</title>
		<link>https://blog.finxter.com/execute-python-from-tableau-with-tabpy/</link>
		
		<dc:creator><![CDATA[Lukas Halim]]></dc:creator>
		<pubDate>Sun, 13 Dec 2020 16:26:21 +0000</pubDate>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Scripting]]></category>
		<category><![CDATA[Tableau]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=18376</guid>

					<description><![CDATA[<p>Are you trying to understand how to call Python code from Tableau? Maybe you tried other online resources but ran into frustrating errors. This TabPy tutorial will show you how to get the TabPy installed and setup, and will get you running Python code in Tableau. Installing Tableau Desktop If you need Tableau Desktop, you ... <a title="Execute Python from Tableau with TabPy" class="read-more" href="https://blog.finxter.com/execute-python-from-tableau-with-tabpy/" aria-label="Read more about Execute Python from Tableau with TabPy">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/execute-python-from-tableau-with-tabpy/">Execute Python from Tableau with TabPy</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Are you trying to understand how to call Python code from Tableau? Maybe you tried other online resources but ran into frustrating errors. This TabPy tutorial will show you how to get TabPy installed and set up, and will get you running Python code in Tableau.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Execute Python from Tableau with TabPy" width="937" height="527" src="https://www.youtube.com/embed/OReJXfnjTZ0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Installing Tableau Desktop</h2>



<p>If you need Tableau Desktop, you can get a 14-day trial here: <a href="https://www.tableau.com/products/desktop/download">https://www.tableau.com/products/desktop/download</a></p>



<p><strong>Note</strong>: Tableau Public, the free license version of Tableau, <em>does not</em> support Python integration.</p>



<h2 class="wp-block-heading">TabPy Installation</h2>



<p>Reading the documentation, this should be as simple as:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install tabpy</pre>



<p>Perhaps this will be all you need to get TabPy installed. But when I tried, the install failed due to a failure to install one of the dependencies, a Python package called Twisted. A search on StackOverflow leads to this solution (<a href="https://stackoverflow.com/questions/36279141/pip-doesnt-install-twisted-on-windows" target="_blank" rel="noreferrer noopener">https://stackoverflow.com/questions/36279141/pip-doesnt-install-twisted-on-windows</a>) and to this unofficial Windows binary available at (<a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted" target="_blank" rel="noreferrer noopener">http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted</a>). I downloaded the appropriate binary for my version of Python, navigated to the download directory, and installed with this command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl</pre>



<p>That installed Twisted, and I was then able to install TabPy as expected.</p>



<h2 class="wp-block-heading">TabPy Setup</h2>



<p>With TabPy installed, starting the TabPy server can be done from the command prompt:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">TabPy</pre>



<p>You should see a message like the one below, telling you that the web service is listening on port 9004:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/Q3ZqZsUpFnVsLYxvCk3AkmJKiooU-ZkKGY30gqWRZ7WHyztzIDYu5gqzOhJocLqyjTUITegMBRoVMOGGK5kZFpzKKIzwktITcdd3V_36Q1SyTExbLmxL6eVqb0AcpZ5NAKAuoid6" alt=""/></figure></div>



<p>With TabPy running, start Tableau Desktop.</p>



<p>In Tableau Desktop, click <strong>Help</strong> on the toolbar, then <strong>Settings and Performance &gt; Manage Analytics Extension Connection</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/0Lxe_JP9CQnNkNVTE_Fc0vuOyFei15E-n_jcfKVVAXX6eCQF_2qAk_1aSEwfSybfF90AVml6LoBwz8XhYQ2Y9byiVL8vCIHtQ19glZjvQ_Lu4JIcU2gKE7L0XpcNPyX2DiMwCqYt" alt=""/></figure></div>






<p>Then select TabPy/External API, select localhost for the server, and set the port to 9004.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/KWUwtyhtY_vfOiLuP81b2lwdlaLoX92DdC6l2YwIaecKKned25aCiJgkG6hFllE2U6uw4DCtEBQCHQuDkJkIKXx7rYIwIVxpOikiS-LoB_4C4KdLflSggL_tsxqkJrKRUHHVwXs1" alt=""/></figure></div>



<h2 class="wp-block-heading">TabPy Examples</h2>



<p>The first example shows how to use a <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">NumPy </a>function on aggregated data to calculate the Pearson correlation coefficient.</p>



<p>The second example shows how to use a TabPy deployed function to do a t-test on disaggregated data.</p>



<h3 class="wp-block-heading">Example &#8211; Correlation on Aggregated Data</h3>



<p>We have TabPy running and Tableau’s analytics extension configured. Now we’ll call Python code from Tableau.</p>



<p>Download the data on the wages and education of young males (<a href="https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Males.csv">https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Males.csv</a>) and open it using the Connect to Text File option.&nbsp;</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/6A7TiltwqXOq5b_n6XUgxKKG1QpO0P-iuCOpewa-G5sYG251SkvkzbxF-hxe1ORyprU4bzZYq75ohmNfdytiNTuX_spdfqHDD0p6jN4WJiE0Xf-aTeM1tVqv6vltxLGcx5kPVSAA" alt=""/></figure></div>



<p>Select Sheet1 to start a new worksheet.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/jYeVWrHGFuJvyaPTDgF5BXG1ItvqfiXL9PxYxb1yHVOl5_aiYeudQqlguM4BfjQc7Nxri6C1Fv3PLglMeUcdddnXdDaLZlrB_lY3aujsZN8mg4i-qYJ0qOCyhwdxZqn5Z_DAuHFY" alt=""/></figure></div>



<p>The Maried field is spelled without the second ‘r’, so right-click on the field and rename it to “Married.”</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/DGDGzIUEYHi03SDcI34smCppYPBg3Kq52erWAhKo5wsJL2U60ckxV1zIhRbJgdAHcJ5IRCC_oHFLFOBni_uMqmaUJdGr2z9Tf1OnvUvvoqe__0gtBUa-Yz1fgt1TYALdgvgqnAqB" alt=""/></figure></div>



<p>Drag “Married” and “Experience” to the row shelf, and double-click on Exper and Wage:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/q3_Q8eNRvfa_icmPWStIrKbFz1nMIXy278b8XOoaMVpLyW76xUq8DSzu3MATdROoqQ5P6ET1vVlNXPp4AlpOA5F_X3Kl-OPL5Ll1924Yu_0ij6VU8oth0mH4tw48a0ImaVq2mwOR" alt=""/></figure></div>



<p>Next, change SUM(Exper) to AVG(Exper) and SUM(Wage) to AVG(Wage):</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="627" height="404" src="https://blog.finxter.com/wp-content/uploads/2020/12/image-23.png" alt="" class="wp-image-18390" srcset="https://blog.finxter.com/wp-content/uploads/2020/12/image-23.png 627w, https://blog.finxter.com/wp-content/uploads/2020/12/image-23-300x193.png 300w, https://blog.finxter.com/wp-content/uploads/2020/12/image-23-150x97.png 150w" sizes="auto, (max-width: 627px) 100vw, 627px" /></figure></div>



<p>The view should now look like this:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/8I44Z1uWhAzCftfWrlMi1FQQUE2oToP7-jYdUNM_UnF6pzHNDdb-EoLlm2z4PVR1iLJ22HHEbmIax-ruHrindjAaizQwDUzKsgIl03u2EdQXR-6rB3n28NvcpHq4wuJ9YyBQxwed" alt=""/></figure></div>



<p>Now let’s add a calculation with some Python code! You can create a calculation by clicking on the Analysis tab on the toolbar and then “Create Calculated Field.”</p>



<p>Call the calculation “TabPy Corr” and use this expression:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">SCRIPT_REAL("import numpy as np
print(f'arg1_: {_arg1}')
print(f'arg2_: {_arg2}')
print(f'return: {np.corrcoef(_arg1,_arg2)[0,1]}')
return np.corrcoef(_arg1,_arg2)[0,1]",avg([Exper]),avg([Wage])
)
</pre>



<p>The print statements allow us to see the data exchange between Tableau and the TabPy server. Switch to the command prompt to see:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/KEUqTqNQLHiErHD4ygZZb5YQIem8l5h2qz7ELoLXys0tN0dz519U-gkAx6Fp3PJU_FU6nrApx1rQ9sbR_D4Se2lhTIRWZlhUTIrezK5D-UqE84BdWORqWu4cbIGeJZHEJOzuagdb" alt=""/></figure></div>



<p>Tableau is sending two <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener" title="The Ultimate Guide to Python Lists">lists</a>, <code>_arg1</code> and <code>_arg2</code>, to the TabPy server. <code>_arg1</code> is a list with the values from <code>avg([Exper])</code> and <code>_arg2</code> is a list with the values from <code>avg([Wage])</code>.</p>



<p>TabPy returns a single value representing the correlation of <code>avg([Exper])</code> and <code>avg([Wage])</code>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/Yke-cd0tV17atxSnD_COrmXpBI3NVXjvuOQGjgtBxR9lMLN7ejxPfoPQlZlarDf9G8-f6xoqwiO0Owxe7Cii11DrvX2LaV42NkxtFBaxnfynJKhUqsipNFtAoOEemu70ywdM2PUS" alt=""/></figure></div>



<p>We return <code>np.corrcoef(_arg1,_arg2)[0,1]</code> instead of just <code>np.corrcoef(_arg1,_arg2)</code> because <code>np.corrcoef(_arg1,_arg2)</code> returns a 2&#215;2 correlation matrix, but Tableau expects either a single value or a list of values with the same length as <code>_arg1</code> and <code>_arg2</code>. If we return the 2&#215;2 matrix, Tableau will give us the error message <code>&#8220;TypeError : Object of type ndarray is not JSON serializable&#8221;</code>.</p>
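<p>The [0,1] element of that matrix is simply the Pearson correlation coefficient between the two lists. For intuition, here is a pure-Python version of the same scalar, assuming two equal-length lists like the ones Tableau sends:</p>

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient: the scalar that
    np.corrcoef(xs, ys)[0, 1] extracts from the 2x2 matrix."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly correlated: 1.0
```

<p>Because the result is a single float, it satisfies Tableau&#8217;s requirement that a <code>SCRIPT_REAL</code> table calculation return either one value or a list matching the input length.</p>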



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh6.googleusercontent.com/fZmzi1rQizLlS4feoEOu6YsO4O5pE_yCuyxaJi6o75QrKWLGSuUAoga1jEWvsnkahHFOrSFatUHIyBLMTksYFOBgVgjNnRvZSTQu8SJVgzcDUU9SlKJBcZTgFQHnOfFOqPRENE19" alt=""/></figure></div>



<p>The functions used to communicate with the TabPy server, <code>SCRIPT_REAL, SCRIPT_INT, SCRIPT_BOOL</code> and <code>SCRIPT_STR</code> are “table calculations,” which means that the input parameters must be aggregated. For example, <code>AVG([Exper])</code> is an acceptable parameter, but <code>[Exper]</code> is not. Table calculations work not on the data in the underlying dataset (<code>Males.csv</code> for this example) but on the values aggregated to the level shown in the Tableau worksheet. Tableau sends TabPy lists with the aggregated values.</p>



<p>We use <code>SCRIPT_REAL</code> rather than one of the other <code>SCRIPT_*</code> functions because our function will return a <a href="https://blog.finxter.com/decimal-pythons-float-trap-and-how-to-solve-it/" target="_blank" rel="noreferrer noopener" title="Decimal: Python’s Float Trap and How to Solve it">float</a>. If, for example, the function was instead returning a string, we would use <code>SCRIPT_STR</code>.</p>



<p>One call is made from Tableau to TabPy for each partition in the table calculation. The default is Table(down), which uses a single partition for the entire table.</p>



<p>We can change the partition by selecting edit then table calculation:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/bzEN-sjBbKcE8Tb6GwS8xYnQ58LZR-Zah7cVX8Tv3K52drXcI0RHUd0Rti_ed_rSkz_M5G9yh0N2hCoJN88FxLKw-VE9zU2xFOEPi1HoEIR5ZIK6Xm0jFyLk0_vsAvBqW2DV72LG" alt=""/></figure></div>



<p>Currently, the Table Calculation is computed using Table(down), which means that Tableau goes down all of the rows in the Table. You can see that all of the values are highlighted in yellow.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/pcIqCuAkhMtrqHUXka702KlYsKs6bWldBoYNNbSnVFiFPR6tgyZqtCEchnCNiraFkcEXuQg1XbJIqN9GWl-B6agkMFVa8vpIIB813xgmGG4GqPSCYozTOBSo288tdiZ9Niczcq3N" alt=""/></figure></div>



<p>If we change from Table(down) to Pane(down), the table calculation will be done separately for each pane. The rows of the table are divided into two panes &#8211; one for <code>married=no</code> and another for <code>married=yes</code>. Therefore, there are two separate calls to TabPy, one for <code>married=no</code> and a second for <code>married=yes</code>. Each call gets a separate response.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/vRd1QXWSbTi1C8Q1LTMHHafKyqvw8c33TLOEvYZ33Jt_9QsxiONdr45mIwP6B-2C8bdpDFySdM0vEquIo6t_H3B4UEgYoMO30Xc3omwnVjbdfgbqXZUkjiMRtFpYdRRoFfjZDg5u" alt=""/></figure></div>



<p>We can see the exchange of data by switching back to the command prompt:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/w_QqDZPByKbvgA278cu_wfco6PFLnQg48CKu30mZqCJKHXb3mg7Ru9V4Wdbg-FANyU9nqjwcOvMig5uyE_8j1zWTT3mRW49FPG5ctdVnCsfbu_NuzIDKvVTimx19MvG1a7B3di-7" alt=""/></figure></div>



<p>The print statements show what is happening. The first call to TabPy represents the partition where married=no. Lists are sent with the average wage and experience values and the value returned is -0.3382. The second call represents the partition where married=yes, the related average wage and experience values are sent, and the function returns -0.0120. Tableau displays the results.</p>



<p>We called Python code from Tableau and used the results in our worksheet. Excellent!</p>



<p>But we could have done the same thing much more easily without Python by using Tableau’s <code>WINDOW_CORR</code> function:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/LqGkhJvjzn7ThBUm-lL8ikwVZ9CN-3ph8NvPaCcjiN1XMVtJquhiTmKkNNR8AgmSUU_tDtKnBe_E-Rfev2otpOdW4vPXr0KlfpTwXMwjvxZ-93H1yFyGRXPAyCgGeTBd3Y2NR37G" alt=""/></figure></div>



<p>We can add this to the view and see that it gives <em>the same results</em> using either Table(down) or Pane(down):</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/aLKS6q8uhcRtUtVh_W25xWeV5fWG3RUu4gjY5AuhDfWRHt05G61-z1Psn2t0kIt7CqMV9A8alu4pCJmUXkflQ95S2ZZBghqQEBKvgdd6gIlnfVoN0oGfKE7Qbr96ygW_Jim2Z7nh" alt=""/></figure></div>



<p>This example is great for understanding TabPy, but we don&#8217;t need Python to calculate correlation since Tableau already has <code>WINDOW_CORR</code> built-in.</p>



<h3 class="wp-block-heading">Example &#8211; Two-Sample T-Test on Disaggregated Data</h3>



<p>If our data represents a sample of the general male population, then we can use statistics to make inferences about the population based on our sample. For example, we might want to ask whether our sample gives evidence that males in the general population who are unionized have more experience than those who are not. The test for this is a two-sample t-test. You can learn more about it here: (<a href="https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing" target="_blank" rel="noreferrer noopener">https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing</a>).</p>



<p>Unlike correlation, Tableau does not have a built-in t-test, so we will use Python to run one.</p>



<p>But first, we will set up a new worksheet. The documentation here (<a href="https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#t-test" target="_blank" rel="noreferrer noopener">https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#t-test</a>) explains what we need to pass to the t-test function: <code>_arg1</code> with the years of experience and <code>_arg2</code> as the categorical variable that maps each observation to either sample1 (Union=yes) or sample2 (Union=no).</p>
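<p>To make this calling convention concrete, here is a pure-Python sketch (my own illustration of the convention, not TabPy&#8217;s actual <code>ttest.py</code>, which also returns a p-value). The hypothetical helper <code>two_sample_t</code> splits the <code>_arg1</code> values into two samples using the <code>_arg2</code> labels and computes the pooled two-sample t statistic:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from statistics import mean, variance

def two_sample_t(values, labels):
    """Split _arg1-style values by _arg2-style labels; return the pooled t statistic."""
    groups = sorted(set(labels))
    assert len(groups) == 2, "expected exactly two group labels"
    a = [v for v, g in zip(values, labels) if g == groups[0]]
    b = [v for v, g in zip(values, labels) if g == groups[1]]
    # Pooled sample variance across both groups (n - 1 denominators)
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / len(a) + 1 / len(b))) ** 0.5

exper = [1, 4, 1, 3, 4, 5]
union = ['yes', 'yes', 'yes', 'no', 'no', 'no']
print(two_sample_t(exper, union))  # pooled t statistic, roughly 1.732</pre>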



<p>Let’s start by creating a new view with Union on the row shelf and <code>AVG(Exper)</code> on the column shelf:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/oU85eSpQB0dHc1Men7hQDTJLlAa-0bQUv8YK_2JWCfUC5krrb1k7t9XTwTH_Zl5SOGPFVCVPro2QAlY_RO8R6C_x9v5tH-v4UCQ-NAATQk9Ir-iQZ46ClIA9UCOXImivPu2qx20H" alt=""/></figure></div>



<p>Disaggregate the measures by unchecking Analysis &gt; Aggregate Measures:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/oD6HRkqgF6y8tZ67G3l3bf0J8LxWg2E6sIBq5-L6GqDGOZO3rUaCarm4sRmxbvIHYr41E0YVeUQ3W-1QWwTgBXM3NXYqRW-T6e7yBgyhjBqIKwLdkZasqGp5TjydJPfkZzPerugP" alt=""/></figure></div>



<p>With aggregate measures unchecked, <code>AVG(Exper)</code> should change to <code>Exper</code>. Use the “Show me” menu to change to a box-and-whisker plot:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/oqqFj1siT9kmLh03n6MQbuvV1RARKJBcANoVEQ-mG-usrVqPLTCOZ7HKdajBjZDbK-Idzh4Re0J_YwFacSUxRaY_escqxhs9ICN39i7MFH1bSC0x0cW5E9TiSbcPylHF4ZL4pXeu" alt=""/></figure></div>



<p>Our view is set, except for the t-test. The t-test is one of the models included with TabPy, explained here (<a href="https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#predeployed-functions" target="_blank" rel="noreferrer noopener">https://github.com/tableau/TabPy/blob/master/docs/tabpy-tools.md#predeployed-functions</a>). We need to run a command before we can run t-tests. With the TabPy server running, open a <em>second </em>command prompt and enter the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">tabpy-deploy-models</pre>



<p>You should see a result like this:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh6.googleusercontent.com/rGKiF5NpH2Z3025xh9ZQgp8oPfXLKDvJ7SpPIvdu2xtLe_MhER139jQfzpg5fTYb2EOhYVDWJabAA3wPxUgRYTrO5N5dTK5-nIl3ZKiXBckRi-a2RcttlCmcFhc8I9cSHGKlg4XL" alt=""/></figure></div>



<p>If it’s successful, you can now call anova, PCA, Sentiment Analysis, and t-tests from Tableau!</p>



<p>Create a new calculation, &#8220;Union Exper Ttest,&#8221; which will determine whether there is a statistically significant difference in average experience between unionized and non-unionized males.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">SCRIPT_REAL("print(f'unique values: {len(set(_arg2))}')
return tabpy.query('ttest',_arg1,_arg2)['response']"
,avg([Exper]),attr([Union]))</pre>



<p>Because <code>SCRIPT_REAL</code> is a table calculation, its parameters have to be aggregated (using <code>AVG</code> and <code>ATTR</code>). But since &#8220;Aggregate Measures&#8221; is unchecked, the view shows individual observations from <code>Males.csv</code>, so the individual values are what get passed to TabPy.</p>
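<p>Behind the scenes, <code>tabpy.query</code> corresponds to an HTTP POST against the running TabPy server. A sketch of what such a request looks like, assuming TabPy&#8217;s default port 9004 and its documented <code>/query/ttest</code> route (the helper name <code>query_ttest</code> is hypothetical, and you don&#8217;t need to write this yourself; it is only to show the payload shape):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import json
from urllib import request

def query_ttest(values, labels, url="http://localhost:9004/query/ttest"):
    """POST the two argument lists to the deployed 'ttest' endpoint (hypothetical helper)."""
    payload = {"data": {"_arg1": values, "_arg2": labels}}
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires a running TabPy server
        return json.loads(resp.read())["response"]

# The JSON body Tableau builds for the t-test call:
payload = {"data": {"_arg1": [1, 4, 1, 3, 4, 5],
                    "_arg2": ["yes", "yes", "yes", "no", "no", "no"]}}
print(json.dumps(payload))</pre>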



<p>Drag the new calculation to the tooltip to show it in the view:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/10I2VUMvSKY2nGQKKQgufXCexJbzPuMGASFjHKBs9PcY9oDfFhdl05Cd10EeQ_kLOYWdE53k4vXh6THRt8IIRVYOaonRozHXagAWHk9lye7oZ7c9pCp1yAEttdgkB5teTwoCqArA" alt=""/></figure></div>



<p>The t-test returns a p-value of 0.4320. We can interpret this to mean that we do not find evidence of a difference in average years of experience between unionized and non-unionized males. The average experience in our sample data differs for unionized men compared with non-unionized men, but because the p-value is high, we don&#8217;t have evidence of a difference in the general population.</p>



<p>Tableau does not have a t-test built-in, but we have added it using Python!</p>



<h2 class="wp-block-heading">Troubleshooting</h2>



<p>You’re very likely to encounter errors when setting up calculations with TabPy. Here’s an example. If we try switching the table calculation from Table(down) to Cell, we get this message:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/JgJ0tt8-XnpEcwayEy3USN99DJMnKjVazRqa3JN18Evft8L031waKv1DJiNCAR_IbKw___lqeodwC3j_Lc09WGw7R0Vv5QY4e3j8wnYJCEZeMAy6DpyLiq-3Irn7-9Q4Qx5s7VlI" alt=""/></figure></div>



<p><code>_arg1</code> and <code>_arg2</code> are lists, so what&#8217;s the problem? The error message we see in Tableau doesn&#8217;t help us pinpoint it. If we switch to the command prompt, we can see the stack trace:</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/dgZljigumcE6NnS48JrT8XK8Fmpev-sXQT8hXJBcKHZE8AyOotz7sNwI2eCBO1cNZzNL1kO7lJU4U3TpHqfdpGnxAEOepTp4c8Y2XbiiFRSBXIcJRkk1OBOw3xVXktDXmxW520xo" alt=""/></figure></div>



<p>The stack trace tells us that line 34 is throwing the error. We can look at the <code>ttest.py</code> code here <a href="https://github.com/tableau/TabPy/blob/master/tabpy/models/scripts/tTest.py" target="_blank" rel="noreferrer noopener">https://github.com/tableau/TabPy/blob/master/tabpy/models/scripts/tTest.py</a> to better understand the error.&nbsp;</p>



<p>The problem is that if we are doing a two-sample t-test, we can do it in one of two ways:</p>



<ol class="wp-block-list"><li>Send <code>_arg1</code> and <code>_arg2</code> as the two different samples. For example, <code>_arg1</code> could be <code>[1, 4, 1]</code> and <code>_arg2</code> could be <code>[3, 4, 5]</code>.</li><li>Send both samples in <code>_arg1</code> and use <code>_arg2</code> to specify which sample each observation belongs to. For example, <code>_arg1</code> could be <code>[1, 4, 1, 3, 4, 5]</code> and <code>_arg2</code> could be <code>['yes', 'yes', 'yes', 'no', 'no', 'no']</code>.</li></ol>



<p>When the table calculation was set to Table(down), <code>_arg2</code> contained both <code>Union=no</code> and <code>Union=yes</code> values. But now that we are using Cell, there are <strong>two</strong> calls to TabPy, one for <code>Union=no</code> and a second for <code>Union=yes</code>. Instead of sending <code>_arg1 = [1, 4, 1, 4, 5, 1]</code> and <code>_arg2 = ['yes', 'yes', 'yes', 'no', 'no', 'no']</code> in a single call, we are sending <code>_arg1 = [1, 4, 1]</code> and <code>_arg2 = ['yes', 'yes', 'yes']</code> in one call to TabPy and then making a second call with <code>_arg1 = [4, 5, 1]</code> and <code>_arg2 = ['no', 'no', 'no']</code>. As a result, in <code>ttest.py</code> the check <code>len(set(_arg2)) == 2</code> evaluates to false, and we end up at line 34, which throws the error.</p>
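<p>We can reproduce the failing check directly in Python. When the calculation runs per cell, each call carries only one distinct label, so the two-sample condition (paraphrased from <code>ttest.py</code>) is not met:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Table(down): one call whose labels contain both groups
table_down_arg2 = ['yes', 'yes', 'yes', 'no', 'no', 'no']
print(len(set(table_down_arg2)) == 2)  # True: a valid two-sample input

# Cell: two separate calls, each carrying a single group
cell_arg2 = ['yes', 'yes', 'yes']
print(len(set(cell_arg2)) == 2)  # False: this is what triggers the error at line 34</pre>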



<p>We can troubleshoot similar errors by checking the command prompt to find the error message and the line number that is throwing the error.</p>






<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="$100/h+ Tableau Freelancers on Upwork" width="937" height="527" src="https://www.youtube.com/embed/q7awpkX8LN8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p><a href="https://blog.finxter.com/become-python-freelancer-course/" data-type="page" data-id="2072" target="_blank" rel="noreferrer noopener">Become a Freelance Developer today!</a></p>



<p>The post <a href="https://blog.finxter.com/execute-python-from-tableau-with-tabpy/">Execute Python from Tableau with TabPy</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
