<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Scikit-learn Library Archives - Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/category/scikit-learn-library/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/category/scikit-learn-library/</link>
	<description></description>
	<lastBuildDate>Mon, 11 Oct 2021 15:15:31 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Scikit-learn Library Archives - Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/category/scikit-learn-library/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How to Develop LARS Regression Models in Python?</title>
		<link>https://blog.finxter.com/how-to-develop-lars-regression-models-in-python/</link>
		
		<dc:creator><![CDATA[Gábor Madarász]]></dc:creator>
		<pubDate>Wed, 06 Oct 2021 16:27:58 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=36383</guid>

					<description><![CDATA[<p>What is LARS regression? Regression is the analysis of how a variable (the outcome variable) depends on the evolution of other variables (explanatory variables). In regression, we are looking for the answer to the question of what is the function that can be used to predict the value of another variable Y by knowing the ... <a title="How to Develop LARS Regression Models in Python?" class="read-more" href="https://blog.finxter.com/how-to-develop-lars-regression-models-in-python/" aria-label="Read more about How to Develop LARS Regression Models in Python?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-to-develop-lars-regression-models-in-python/">How to Develop LARS Regression Models in Python?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">What is LARS regression?</h2>



<p>Regression is the analysis of how a variable (the <em><strong>outcome </strong></em>variable) depends on the evolution of other variables (<em><strong>explanatory </strong></em>variables). </p>



<p>In regression, we are looking for the answer to the question of <strong>what is the function that can be used to predict the value of another variable Y by knowing the value of one variable X?</strong> </p>



<p>In general, regression calculations are based on the assumption that a causal and statistical relationship can be assumed or deduced between certain variables. To describe the causal relationship, we look for a functional relationship between the variables, i.e., we treat the effect as the dependent variable and the influencing variables (the causes) as independent variables. </p>



<p><a href="https://blog.finxter.com/python-linear-regression-1-liner/" title="Python Linear Regression with sklearn – A Helpful Illustrated Guide" target="_blank" rel="noreferrer noopener">Linear regression</a> is a parametric regression model that assumes a linear relationship between the explanatory (X) and the explained (Y) variables (in terms of parameters). This means that in estimating linear regression, we try to fit a line to the point cloud of the sample data. </p>
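
<p>As a minimal sketch of fitting a line to sample data, the snippet below uses NumPy's <code>polyfit</code> with degree 1, which performs an ordinary least-squares line fit. The sample points are hypothetical and were chosen to lie exactly on y = 2x + 1, so the recovered slope and intercept are easy to check.</p>

```python
import numpy as np

# Hypothetical sample: points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is an ordinary least-squares line fit
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

<p>With noisy data the fitted slope and intercept would only approximate the underlying values.</p>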



<div class="wp-block-image"><figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="384" height="288" src="https://blog.finxter.com/wp-content/uploads/2021/10/image-19.png" alt="" class="wp-image-36615" srcset="https://blog.finxter.com/wp-content/uploads/2021/10/image-19.png 384w, https://blog.finxter.com/wp-content/uploads/2021/10/image-19-300x225.png 300w" sizes="(max-width: 384px) 100vw, 384px" /><figcaption><strong>Fig 1</strong>: Linear regression of random numbers with noise</figcaption></figure></div>






<p>Least angle regression (LARS), introduced by Efron, Hastie, Johnstone, and Tibshirani in 2004, is a variant of <strong><em>forward regression</em></strong>.</p>



<p>It starts with all coefficients equal to zero and finds the predictor (x1) that correlates most strongly with the response.</p>



<p>We increase the coefficient of this predictor until another predictor (x2) shows the same degree of correlation with the current residual. LARS then moves in a direction at an equal angle between the two predictors until a third variable (x3) again shows the same degree of correlation with the current residual. LARS then moves at equal angles between x1, x2, and x3 (i.e., in the direction of least angle) until the next variable enters, and so on.</p>
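
<p>The stepwise entry of predictors can be observed with scikit-learn's <code>lars_path</code> function. The following is only a sketch: the synthetic data, the random seed, and the chosen coefficients are assumptions made for illustration. Three predictors influence the response with decreasing strength, so the strongest one should enter the model first.</p>

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic data (an assumption for illustration): three predictors
# with decreasing influence on the response
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)

# method="lar" requests plain least angle regression
alphas, active, coefs = lars_path(X, y, method="lar")

print("order in which predictors entered:", list(active))
print("coefficients at each step (one column per step):")
print(coefs)
```

<p>The first column of <code>coefs</code> holds the all-zero starting point, and each later column is one breakpoint of the piecewise linear path.</p>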



<div class="wp-block-image"><figure class="aligncenter size-full"><img decoding="async" width="281" height="165" src="https://blog.finxter.com/wp-content/uploads/2021/10/image-20.png" alt="" class="wp-image-36617"/><figcaption><strong>Fig 2:</strong> The original illustration from Stanford University of how LARS works.</figcaption></figure></div>



<h2 class="wp-block-heading">What Is LARS For?</h2>



<p>This technique is used for forecasting, modeling time series, and establishing cause and effect relationships between variables. There are several advantages to using regression analysis.</p>



<p>It indicates significant relationships between the dependent variable and the independent variable.</p>



<p>It shows the strength of the effect of several independent variables on the dependent variable.</p>



<p>Regression analysis also allows you to compare the effects of variables measured at different scales. These advantages help data scientists to evaluate candidate variables, eliminate the weak ones, and select the best set for building predictive models.</p>



<p><strong>The advantages of LARS </strong>are:</p>



<ul class="wp-block-list"><li>It is as fast as stepwise regression.</li><li>It generates a complete piecewise linear solution path, which is useful for cross-validation or similar model fitting experiments.</li><li>If two variables are related to the dependent variable to nearly the same extent, their coefficients increase at about the same rate, which makes the algorithm more stable.</li></ul>



<h2 class="wp-block-heading">LARS in Python</h2>



<p>To solve this problem, we use the &#8220;<a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html" target="_blank" rel="noreferrer noopener" title="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html">sklearn.linear_model.Lars</a>&#8221; class of the <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">scikit-learn</a> library.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from pandas import read_excel
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

# import data
dataframe = read_excel('AirQualityUCI.xls')
# clean dataframe
dataframe = dataframe[(dataframe > 0).all(axis=1)]
data = dataframe.values

# select relevant data
x, y = data[:, 11:12], data[:, 10:11]
# print(x.shape, y.shape, type(x), type(y))
# split the arrays into random train and test subsets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.1)

# model fitting
lars = linear_model.Lars().fit(xtrain, ytrain)

# predict
ypred = lars.predict(xtest)

# measure errors
print(lars.coef_)
mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
mae = mean_absolute_error(ytest, ypred)
print("MAE: %.2f" % mae)


# plot original vs predicted data
x_ax = range(len(ytest))
plt.scatter(x_ax, ytest, s=5, color="green", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()
</pre>



<p>In the above code, we first import the required libraries:</p>



<p>From <a href="https://blog.finxter.com/pandas-quickstart/" target="_blank" rel="noreferrer noopener" title="10 Minutes to Pandas (in 5 Minutes)">pandas</a>, we import <code><a href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank" rel="noreferrer noopener" title="How to Read a CSV File Into a Python List?">read_excel</a></code>, because the (example) data is in an Excel file.</p>



<p>From scikit-learn, we import <code>train_test_split</code> (for randomly splitting the data), <code>linear_model</code> for LARS regression, and <code>mean_squared_error</code> and <code>mean_absolute_error</code> for model evaluation. Finally, we import <code>matplotlib.pyplot</code> for model visualization.</p>



<p>Then import the data into a pandas <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" target="_blank" rel="noreferrer noopener" title="How to Create a DataFrame in Pandas?">DataFrame</a>. This can be a local file or a valid URL.</p>



<p>Our DataFrame (which is real-world air quality data) has some negative values because the value is -200 whenever a sensor is not working. We remove these rows and create a NumPy array from the rest. In our example, we are looking at the relationship between relative humidity and temperature.</p>
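
<p>The cleaning step can be tried in isolation on a small table. This is a sketch only: the column names <code>T</code> and <code>RH</code> and the values are hypothetical stand-ins for the real file, with -200 marking a failed reading.</p>

```python
import pandas as pd

# Hypothetical miniature of the sensor table; -200 marks a failed reading
df = pd.DataFrame({
    "T": [13.6, -200.0, 11.9, 11.0],
    "RH": [48.9, 47.7, -200.0, 60.0],
})

# Keep only the rows in which every column holds a positive value
clean = df[(df > 0).all(axis=1)]
data = clean.values  # plain NumPy array, as in the article's code

print(data)
```

<p>Note that this filter also drops rows with legitimately non-positive readings, such as sub-zero temperatures, so it is a coarse but simple choice.</p>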



<p>The corresponding columns (10, 11) are sliced from the data array and separated into training and test data. This is necessary to judge how good our model is. We split the data set into two data sets:</p>



<p>A &#8220;training&#8221; data set, which we will use to train our model, and a &#8220;test&#8221; data set, which we will use to judge the accuracy of the model.</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img decoding="async" width="306" height="215" src="https://blog.finxter.com/wp-content/uploads/2021/10/image-21.png" alt="" class="wp-image-36618" srcset="https://blog.finxter.com/wp-content/uploads/2021/10/image-21.png 306w, https://blog.finxter.com/wp-content/uploads/2021/10/image-21-300x211.png 300w" sizes="(max-width: 306px) 100vw, 306px" /></figure></div>



<p>Fit the model to the training data with <code>lars = linear_model.Lars().fit(xtrain, ytrain)</code>, using the default settings, and then perform the estimation on the test data with the <code>predict</code> method of the class.</p>
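
<p>To see the fit and predict steps without the air quality file, here is a minimal sketch on synthetic, noise-free data. The data following y = 2x + 1 is an assumption made so that the learned coefficient and the predictions are easy to verify by hand.</p>

```python
import numpy as np
from sklearn import linear_model

# Synthetic, noise-free data (an assumption): y = 2x + 1
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x[:, 0] + 1.0

lars = linear_model.Lars().fit(x, y)

# Estimate the response for two unseen inputs
ypred = lars.predict(np.array([[10.0], [11.0]]))
print(lars.coef_, lars.intercept_, ypred)
```

<p>With a single predictor, the LARS path ends at the ordinary least-squares solution, so the coefficient should be close to 2.</p>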



<p>We then examine how far the model&#8217;s prediction differs from the real data.</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="479" height="359" src="https://blog.finxter.com/wp-content/uploads/2021/10/image-22.png" alt="" class="wp-image-36619" srcset="https://blog.finxter.com/wp-content/uploads/2021/10/image-22.png 479w, https://blog.finxter.com/wp-content/uploads/2021/10/image-22-300x225.png 300w" sizes="auto, (max-width: 479px) 100vw, 479px" /></figure></div>



<p>MSE stands for mean squared error. It is the average of the squared differences between the actual values and the values predicted by the model. Squaring eliminates negative signs and gives larger errors more weight.</p>



<p>MAE is an abbreviation for mean absolute error, which is the average of the absolute deviations of the predicted values from the measured ones.</p>
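
<p>Both metrics are simple averages over the prediction errors, which the following sketch computes by hand. The measured and predicted values are hypothetical. The scikit-learn functions <code>mean_squared_error</code> and <code>mean_absolute_error</code> should give the same numbers for the same inputs.</p>

```python
# Hypothetical measured and predicted values
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 4.0, 8.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]

# MSE: average of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)
# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print("MSE: %.2f" % mse)
print("MAE: %.2f" % mae)
```

<p>Because the errors are squared, the single error of size 2 contributes four times as much to the MSE as the error of size 1, while it only counts twice as much in the MAE.</p>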



<p>In the last part, we will create a graph using matplotlib. For the full course see:</p>



<p><a href="https://academy.finxter.com/university/matplotlib-a-guide-to-becoming-a-data-visualization-wizard/" target="_blank" rel="noreferrer noopener" title="https://academy.finxter.com/university/matplotlib-a-guide-to-becoming-a-data-visualization-wizard/">*** Matplotlib &#8211; The Complete Guide to Becoming a Data Visualization Wizard ***</a></p>



<p>Please note that your graph (and your results) may differ because the data is split randomly into training and test sets. </p>
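
<p>If reproducible results are preferred, <code>train_test_split</code> accepts a <code>random_state</code> argument. The sketch below uses hypothetical data and an arbitrary seed of 42 to show that two calls with the same seed produce the same split.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with one feature each
x = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# The same random_state yields the same split on every call
split_a = train_test_split(x, y, test_size=0.1, random_state=42)
split_b = train_test_split(x, y, test_size=0.1, random_state=42)

xtrain, xtest, ytrain, ytest = split_a
print(len(xtrain), len(xtest))
print(np.array_equal(split_a[1], split_b[1]))
```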



<div class="wp-block-image"><figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="614" height="460" src="https://blog.finxter.com/wp-content/uploads/2021/10/image-23.png" alt="" class="wp-image-36620" srcset="https://blog.finxter.com/wp-content/uploads/2021/10/image-23.png 614w, https://blog.finxter.com/wp-content/uploads/2021/10/image-23-300x225.png 300w" sizes="auto, (max-width: 614px) 100vw, 614px" /><figcaption><strong>Fig 3</strong>: Original vs LARS predicted values</figcaption></figure></div>






<h2 class="wp-block-heading">Summary</h2>



<p>As we have seen, regression is a very important tool in data science, and a relatively new version of it, LARS regression, can be successfully applied in many situations. For this purpose, the Python scikit-learn library provides a convenient solution.</p>



<p>The post <a href="https://blog.finxter.com/how-to-develop-lars-regression-models-in-python/">How to Develop LARS Regression Models in Python?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Install Scikit-Learn on PyCharm?</title>
		<link>https://blog.finxter.com/how-to-install-scikit-learn-on-pycharm/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 14 Sep 2021 15:03:10 +0000</pubDate>
				<category><![CDATA[Dependency Management]]></category>
		<category><![CDATA[PyCharm]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=35239</guid>

					<description><![CDATA[<p>Scikit-Learn, often abbreviated as sklearn, is a popular machine learning library for Python. Problem Formulation: Given a PyCharm project. How to install the Scikit-Learn library in your project within a virtual environment or globally? Here&#8217;s a solution that always works: Open File &#62; Settings &#62; Project from the PyCharm menu. Select your current project. Click ... <a title="How to Install Scikit-Learn on PyCharm?" class="read-more" href="https://blog.finxter.com/how-to-install-scikit-learn-on-pycharm/" aria-label="Read more about How to Install Scikit-Learn on PyCharm?">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/how-to-install-scikit-learn-on-pycharm/">How to Install Scikit-Learn on PyCharm?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><em><a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener" title="https://scikit-learn.org/stable/">Scikit-Learn</a>, often abbreviated as </em>sklearn<em>, is a popular machine learning library for Python. </em></p>



<p class="has-pale-cyan-blue-background-color has-background"><strong>Problem Formulation:</strong> Given a <a href="https://blog.finxter.com/pycharm-a-simple-illustrated-guide/" target="_blank" rel="noreferrer noopener" title="PyCharm – A Simple Illustrated Guide">PyCharm </a>project. How to install the Scikit-Learn library in your project within a <a href="https://blog.finxter.com/python-virtual-environments-with-venv-a-step-by-step-guide/" target="_blank" rel="noreferrer noopener" title="Python Virtual Environments with “venv” — A Step-By-Step Guide">virtual environment</a> or globally?</p>



<p>Here&#8217;s a <a href="https://blog.finxter.com/how-to-install-a-library-on-pycharm/" target="_blank" rel="noreferrer noopener" title="How to Install a Library on PyCharm?">solution </a>that works in most setups: </p>



<ul class="wp-block-list"><li>Open <code><strong>File &gt; Settings &gt; Project</strong></code> from the PyCharm menu.</li><li>Select your current project.</li><li>Click the <code><strong>Python Interpreter</strong></code> tab within your project tab.</li><li>Click the small <code><strong>+</strong></code> symbol to add a new library to the project. </li><li>Now type in the library to be installed, in our example <code>"sklearn"</code> without quotes, and click <code><strong>Install Package</strong></code>. </li><li>Wait for the installation to finish and close all popup windows.</li></ul>



<p>Here&#8217;s the installation process as a short animated video&#8212;it works analogously for Scikit-Learn, just type in <em>&#8220;sklearn&#8221;</em> or <em>&#8220;scikit-learn&#8221;</em> in the search field instead:</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://media.giphy.com/media/VQoZIvOyP23tVvQlW0/source.gif" alt=""/></figure>



<p>Make sure to select only <em>&#8220;scikit-learn&#8221; </em>or <em>&#8220;sklearn&#8221;</em> because there are many other packages that are not required but also contain the same terms (false positives):</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="833" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-56-1024x833.png" alt="scikit-learn install pycharm" class="wp-image-35242" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-56-1024x833.png 1024w, https://blog.finxter.com/wp-content/uploads/2021/09/image-56-300x244.png 300w, https://blog.finxter.com/wp-content/uploads/2021/09/image-56-768x625.png 768w, https://blog.finxter.com/wp-content/uploads/2021/09/image-56.png 1158w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Alternatively, you can run the <code><strong><a href="https://blog.finxter.com/how-to-install-pip-on-windows/" title="How To Install pip On Windows?" target="_blank" rel="noreferrer noopener">pip install</a> sklearn</strong></code> or <code><strong>pip install scikit-learn</strong></code> command in your PyCharm &#8220;<strong>Terminal</strong>&#8221; view:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="1" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install sklearn       # Alternative 1
$ pip install scikit-learn  # Alternative 2</pre>



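<p>Both alternatives were designed to accomplish the same thing because <em>sklearn </em>is a dummy package pointing to <em>scikit-learn</em> (an alias). The <em>sklearn</em> name on PyPI has since been deprecated, so <code>pip install scikit-learn</code> is the safer choice. The following figure shows how to use pip to install the sklearn package:</p>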



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="570" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-58-1024x570.png" alt="&quot;pip install sklearn&quot; in PyCharm" class="wp-image-35244" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-58-1024x570.png 1024w, https://blog.finxter.com/wp-content/uploads/2021/09/image-58-300x167.png 300w, https://blog.finxter.com/wp-content/uploads/2021/09/image-58-768x427.png 768w, https://blog.finxter.com/wp-content/uploads/2021/09/image-58.png 1228w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>You can check your installation using the following two lines of Python code that print out the <a href="https://blog.finxter.com/python-check-version-of-package-with-pip/" target="_blank" rel="noreferrer noopener" title="Python Check Version of Package with pip">version of the </a>package:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import sklearn
print(sklearn.__version__)</pre>
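
<p>As an alternative sketch, the installed version can also be looked up without importing the library, using the standard library module <code>importlib.metadata</code> (Python 3.8 or newer). The helper name <code>installed_version</code> is hypothetical and only used for illustration.</p>

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# The distribution is published as "scikit-learn", although it is imported as sklearn
print(installed_version("scikit-learn"))
```

<p>A result of <code>None</code> suggests that the package is missing from the interpreter that PyCharm uses for the project.</p>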



<hr class="wp-block-separator"/>









<p>To become a PyCharm master, check out our <a href="https://academy.finxter.com/university/pycharm/" title="https://academy.finxter.com/university/pycharm/" target="_blank" rel="noreferrer noopener">full course</a> on the Finxter Computer Science Academy available for free for all <a href="https://blog.finxter.com/finxter-premium-membership/" target="_blank" rel="noreferrer noopener" title="Finxter Premium Membership">Finxter Premium Members</a>:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/pycharm/" target="_blank" rel="noopener"><img loading="lazy" decoding="async" width="363" height="650" src="https://blog.finxter.com/wp-content/uploads/2021/09/image-10.png" alt="" class="wp-image-34968" srcset="https://blog.finxter.com/wp-content/uploads/2021/09/image-10.png 363w, https://blog.finxter.com/wp-content/uploads/2021/09/image-10-168x300.png 168w" sizes="auto, (max-width: 363px) 100vw, 363px" /></a></figure></div>
<p>The post <a href="https://blog.finxter.com/how-to-install-scikit-learn-on-pycharm/">How to Install Scikit-Learn on PyCharm?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Logistic Regression in Python Scikit-Learn</title>
		<link>https://blog.finxter.com/logistic-regression-in-one-line-python/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Sat, 17 Jul 2021 12:22:00 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python One-Liners]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=2537</guid>

					<description><![CDATA[<p>Logistic regression is a popular algorithm for classification problems (despite its name indicating that it is a “regression” algorithm). It belongs to one of the most important algorithms in the machine learning space. Linear Regression Background Let’s review linear regression. Given the training data, we compute a line that fits this training data so that ... <a title="Logistic Regression in Python Scikit-Learn" class="read-more" href="https://blog.finxter.com/logistic-regression-in-one-line-python/" aria-label="Read more about Logistic Regression in Python Scikit-Learn">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-in-one-line-python/">Logistic Regression in Python Scikit-Learn</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Logistic regression is a popular algorithm for classification problems (despite its name indicating that it is a “regression” algorithm). It is one of the most important algorithms in the machine learning space.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Logistic Regression Made Simple [Python]" width="937" height="527" src="https://www.youtube.com/embed/fzxCp-f0CSw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Linear Regression Background</h2>



<p>Let’s review <a href="https://blog.finxter.com/python-linear-regression-1-liner/" target="_blank" rel="noreferrer noopener">linear regression</a>. Given the training data, we compute a line that fits this training data so that the summed squared distance between the line and the training data is minimal.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="640" height="480" src="https://blog.finxter.com/wp-content/uploads/2019/03/training_data_best_fit.png" alt="" class="wp-image-2541" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/training_data_best_fit.png 640w, https://blog.finxter.com/wp-content/uploads/2019/03/training_data_best_fit-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/training_data_best_fit-100x75.png 100w" sizes="auto, (max-width: 640px) 100vw, 640px" /></figure>



<p>This line can be used for many things – e.g. to predict the outcome for unseen input data <code>x</code>. In general, linear regression is great for predicting a continuous output value <code>y</code>, given a continuous input value <code>x</code>. A continuous value can take an infinite number of values. For example, we could predict the stock price (output <code>y</code>), given the number of social media posts mentioning the company behind the stock (input <code>x</code>). The stock price is continuous as it can take on any value, such as $123.45, $121.897, or $10,198.87.</p>



<h2 class="wp-block-heading">Logistic Regression and Sigmoid Function</h2>



<p>But what if the output is not continuous but categorical? For example, let’s say you want to predict the <strong><em>likelihood of lung cancer</em></strong>, given the number of cigarettes a patient smokes. Each patient can either have lung cancer or not. In contrast to the previous example, there are only these two possible outcomes.</p>



<p>Predicting the likelihood of categorical outcomes is the main motivation for logistic regression.</p>



<p>While linear regression fits a line to the training data, logistic regression fits an S-shaped curve, called <strong><em>“the sigmoid function”</em></strong>. Why? Because the line helps you generate a new output value for each input. On the other hand, the S-shaped curve helps you make binary decisions (e.g. yes/no). For most input values, the sigmoid function will either return a value that is very close to 0 or very close to 1. It is relatively unlikely that your given input value generates a value that is somewhere in between. </p>
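
<p>The S shape is easy to see by evaluating the standard logistic function at a few points. This is a minimal sketch, and the sample inputs are arbitrary.</p>

```python
import math


def sigmoid(z):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 4))
```

<p>Inputs far from zero give values near 0 or 1, and only inputs near zero land in the middle of the range.</p>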



<p>Here is a graphical example of such a scenario:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-17.png" alt="" class="wp-image-2538" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-17.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-17-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-17-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /><figcaption><strong><em>Sigmoid Function Example</em></strong></figcaption></figure></div>



<p>The <a href="https://en.wikipedia.org/wiki/Sigmoid_function" target="_blank" rel="noreferrer noopener">sigmoid function</a> approximates the probability that a patient has lung cancer, given the number of cigarettes they smoke. This probability helps you to make a robust decision on the subject: Does the patient have lung cancer?</p>



<p>Have a look at the following example:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-18.png" alt="" class="wp-image-2539" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-18.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-18-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-18-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /></figure></div>



<p>There are two new patients (in yellow). Let’s pretend we know nothing about them but the number of cigarettes they smoke. We have already trained our logistic regression model (the sigmoid function) that returns a probability value for any new input value <code>x</code>. Now, we can use the respective probabilities of our two inputs to make a prediction about whether the new patients have lung cancer or not.</p>



<p>If the probability given by the sigmoid function is higher than 50%, the model predicts <em><strong>“lung cancer positive”</strong></em>, otherwise, it predicts <strong><em>“lung cancer negative”</em></strong>.</p>
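
<p>The 50% rule can be sketched with scikit-learn's <code>predict_proba</code> method. The four training records and the two new patients below are toy values chosen for illustration. Thresholding the probability of the “Yes” class by hand should agree with what <code>predict</code> returns.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data (an assumption): cigarettes smoked and diagnosis
cigarettes = np.array([[0.0], [10.0], [60.0], [90.0]])
cancer = np.array(["No", "No", "Yes", "Yes"])

model = LogisticRegression().fit(cigarettes, cancer)

new_patients = np.array([[2.0], [90.0]])
# Column order of predict_proba follows model.classes_, here ["No", "Yes"]
p_yes = model.predict_proba(new_patients)[:, 1]
labels = np.where(p_yes > 0.5, "Yes", "No")

print(p_yes)
print(labels)
print(model.predict(new_patients))
```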



<p>So how do we select the sigmoid function that best fits the training data? </p>



<p>This is the main question for logistic regression. The answer is <strong><em>maximum likelihood</em></strong>. In other words, which sigmoid function would generate the observed training data with the highest probability?</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-19.png" alt="" class="wp-image-2540" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-19.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-19-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-19-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /></figure></div>



<p>To calculate the likelihood for a given set of training data, you simply calculate the likelihood for a single training data point and repeat this procedure for all training data points. Finally, you multiply those to get the likelihood for the whole set of training data.</p>



<p>Now, you repeat this same likelihood computation for different sigmoid functions (shifting the sigmoid function a little bit). From all computations, you take the sigmoid function that has “maximum likelihood”, that is, the one that would produce the training data with maximal probability.</p>
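
<p>The procedure above can be sketched in a few lines. Everything specific here is an assumption for illustration: the toy data, the fixed steepness of 0.1, and the five candidate shifts. The code multiplies the per-patient probabilities for each candidate sigmoid and keeps the candidate with the largest product.</p>

```python
import math

# Toy training data: cigarettes smoked and outcome (1 = cancer, 0 = no cancer)
cigarettes = [0, 10, 60, 90]
cancer = [0, 0, 1, 1]


def sigmoid(x, shift, steepness=0.1):
    """Sigmoid curve centered at `shift`; the steepness is fixed for simplicity."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - shift)))


def likelihood(shift):
    """Probability of observing the training labels under the shifted sigmoid."""
    result = 1.0
    for x, label in zip(cigarettes, cancer):
        p = sigmoid(x, shift)
        result *= p if label == 1 else 1.0 - p
    return result


candidate_shifts = [5, 20, 35, 50, 80]
best_shift = max(candidate_shifts, key=likelihood)
print(best_shift, round(likelihood(best_shift), 4))
```

<p>Library implementations typically do not try a fixed list of candidates. They usually optimize the shift and the steepness together with a numerical solver, working with the logarithm of the likelihood.</p>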



<h2 class="wp-block-heading">Logistic Regression with sklearn.linear_model</h2>



<p>Let’s program your first <strong><em>virtual doc app</em></strong> using logistic regression – in a <a href="https://blog.finxter.com/python-one-line-x/" target="_blank" rel="noreferrer noopener" title="Python One Line X">single line of Python code!</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="13" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
import numpy as np


## Data (#cigarettes, cancer)
X = np.array([[0, "No"],
              [10, "No"],
              [60, "Yes"],
              [90, "Yes"]])


## One-liner
model = LogisticRegression().fit(X[:,0].reshape(-1,1).astype(float), X[:,1])


## Result &amp; puzzle
print(model.predict([[2],[12],[13],[40],[90]]))
</pre>



<p><em><strong>Exercise</strong>: What is the output of this code snippet? Take a guess!</em></p>



<p>The labeled training data set <code>X</code> consists of four patient records (rows) with two columns. The first column holds the number of cigarettes the patients smoke (the input feature), and the second column holds whether they ultimately suffered from lung cancer (the label). Hence, there is a continuous input variable and a categorical output variable. It’s a classification problem!</p>



<p>We build the model by calling the <code>LogisticRegression()</code> constructor with no parameters. On this model, we call the <code><a href="https://blog.finxter.com/sklearn-fit-vs-transform-vs-fit_transform-whats-the-difference/" title="Sklearn fit() vs transform() vs fit_transform() – What’s the Difference?" target="_blank" rel="noreferrer noopener">fit</a></code> method, which takes two arguments: the input values and the output classifications (labels). The input values are expected to come as a two-dimensional array where each row holds the feature values of one sample. </p>



<p>In our case, we only have a single feature value so we transform our input into a column vector using the <code><a href="https://blog.finxter.com/numpy-reshape/" title="The Ultimate Guide to NumPy Reshape() in Python" target="_blank" rel="noreferrer noopener">reshape()</a></code> operation that generates a two-dimensional <a href="https://blog.finxter.com/numpy-tutorial/" title="NumPy Tutorial – Everything You Need to Know to Get Started" target="_blank" rel="noreferrer noopener">NumPy </a>array. The first argument specifies the number of rows, the second specifies the number of columns. We only care about the number of columns which is one. NumPy determines the number of rows automatically when using the “dummy” parameter -1.</p>
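<p>As a small illustration of the reshape step, the following sketch turns a flat array into a column vector. The variable names <code>cigarettes</code> and <code>column</code> are made up for this example.</p>

```python
import numpy as np

cigarettes = np.array([0, 10, 60, 90])      # one-dimensional, shape (4,)
column = cigarettes.reshape(-1, 1)          # -1 lets NumPy infer the row count

print(cigarettes.shape)  # (4,)
print(column.shape)      # (4, 1)
```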



<p>Here
is what the input training data (without labels) looks like after converting it
using the reshape operation:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">[[0],
 [10],
 [60],
 [90]]
</pre>



<p>Next,
we predict whether a patient has lung cancer, given the number of cigarettes
they smoke: 2, 12, 13, 40, 90 cigarettes.</p>



<p>Here is the output:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Result &amp; puzzle
print(model.predict([[2],[12],[13],[40],[90]]))
# ['No' 'No' 'Yes' 'Yes' 'Yes']
</pre>



<p>The model predicts
that the first two patients are lung cancer negative, while the latter three
are lung cancer positive.</p>



<p>Let’s explore in detail the probabilities of the sigmoid function that lead to this prediction! Simply run the following code snippet after the above definition:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">for i in range(20):
    print("x=" + str(i) + " --> " + str(model.predict_proba([[i]])))

    
'''
x=0 --> [[0.67240789 0.32759211]]
x=1 --> [[0.65961501 0.34038499]]
x=2 --> [[0.64658514 0.35341486]]
x=3 --> [[0.63333374 0.36666626]]
x=4 --> [[0.61987758 0.38012242]]
x=5 --> [[0.60623463 0.39376537]]
x=6 --> [[0.59242397 0.40757603]]
x=7 --> [[0.57846573 0.42153427]]
x=8 --> [[0.56438097 0.43561903]]
x=9 --> [[0.55019154 0.44980846]]
x=10 --> [[0.53591997 0.46408003]]
x=11 --> [[0.52158933 0.47841067]]
x=12 --> [[0.50722306 0.49277694]]
x=13 --> [[0.49284485 0.50715515]]
x=14 --> [[0.47847846 0.52152154]]
x=15 --> [[0.46414759 0.53585241]]
x=16 --> [[0.44987569 0.55012431]]
x=17 --> [[0.43568582 0.56431418]]
x=18 --> [[0.42160051 0.57839949]]
x=19 --> [[0.40764163 0.59235837]]
'''
</pre>



<p>For each value of <code>x</code> from 0 to 19 (the number of cigarettes), the code prints two probabilities: first the probability of lung cancer negative (<code>'No'</code>), then the probability of lung cancer positive (<code>'Yes'</code>). If the first probability is higher than the second, the predicted outcome is “lung cancer negative”. This happens for the last time at <code>x=12</code>. For more than 12 cigarettes, the algorithm classifies a patient as “lung cancer positive”.</p>
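<p>To check which column of <code>predict_proba()</code> belongs to which label, you can inspect the <code>classes_</code> attribute of the fitted model. The sketch below assumes the same toy data as above, stored in two separate arrays (<code>cigarettes</code> and <code>cancer</code> are names chosen for this example). The exact decision boundary may differ from the numbers printed above, depending on your scikit-learn version and its default solver.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as above, with numeric features and string labels kept apart
cigarettes = np.array([[0], [10], [60], [90]], dtype=float)
cancer = np.array(["No", "No", "Yes", "Yes"])

model = LogisticRegression().fit(cigarettes, cancer)

# classes_ gives the column order of predict_proba (labels are sorted)
print(model.classes_)

# Smallest cigarette count in 0..100 that the model classifies as "Yes"
counts = np.arange(0, 101).reshape(-1, 1)
labels = model.predict(counts)
first_yes = int(counts[labels == "Yes"][0, 0])
print(first_yes)
```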



<h2 class="wp-block-heading">LogisticRegression Methods</h2>



<p>In the previous example, you&#8217;ve created a <code>LogisticRegression</code> object using the following constructor:</p>



<pre class="wp-block-preformatted"><code>sklearn.linear_model.LogisticRegression</code>(<em>penalty='l2'</em>, <em>*</em>, <em>dual=False</em>, <em>tol=0.0001</em>, <em>C=1.0</em>, <em>fit_intercept=True</em>, <em>intercept_scaling=1</em>, <em>class_weight=None</em>, <em>random_state=None</em>, <em>solver='lbfgs'</em>, <em>max_iter=100</em>, <em>multi_class='auto'</em>, <em>verbose=0</em>, <em>warm_start=False</em>, <em>n_jobs=None</em>, <em>l1_ratio=None</em>)</pre>



<p>In most cases, you don&#8217;t need to define all arguments, or even know them by heart. Just start from the most basic example usage and customize as you go. The <code>LogisticRegression</code> class also has many helper methods. You can check them out here (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" target="_blank" rel="noreferrer noopener" title="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">source</a>):</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Name</th><th>Description</th></tr></thead><tbody><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.decision_function">decision_function</a>(X)</code></td><td>Predict confidence scores for samples.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.densify">densify</a>()</code></td><td>Convert coefficient matrix to dense array format.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit">fit</a>(X, y[, sample_weight])</code></td><td>Fit the model according to the given training data.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.get_params">get_params</a>([deep])</code></td><td>Get parameters for this estimator.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict">predict</a>(X)</code></td><td>Predict class labels for samples in <code>X</code>.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_log_proba">predict_log_proba</a>(X)</code></td><td>Predict logarithm of probability estimates.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba">predict_proba</a>(X)</code></td><td>Probability estimates.</td></tr><tr><td><code><a 
href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score">score</a>(X, y[, sample_weight])</code></td><td>Return the mean accuracy on the given test data and labels.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.set_params">set_params</a>(**params)</code></td><td>Set the parameters of this estimator.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.sparsify">sparsify</a>()</code></td><td>Convert coefficient matrix to sparse format.</td></tr></tbody></table></figure>
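<p>Here is a brief sketch of how a few of these methods might be used together. It refits the toy model from above on separate feature and label arrays (<code>X_train</code> and <code>y_train</code> are names chosen for this example) and evaluates on the training data itself, which is only meant to show the calls, not to be a proper evaluation.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0], [10], [60], [90]], dtype=float)
y_train = np.array(["No", "No", "Yes", "Yes"])
model = LogisticRegression().fit(X_train, y_train)

# score(): mean accuracy on the given samples and labels
accuracy = model.score(X_train, y_train)

# decision_function(): signed confidence score; positive means classes_[1]
scores = model.decision_function(X_train)

# predict_proba() and predict_log_proba() agree up to the logarithm
proba = model.predict_proba(X_train)
log_proba = model.predict_log_proba(X_train)

# get_params() returns the constructor arguments as a dictionary
params = model.get_params()

print(accuracy, scores, params["C"])
```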



<h2 class="wp-block-heading">Conclusion</h2>



<p>Logistic regression is a classification algorithm (despite its name). This article shows you what you need to know to get started with logistic regression. It gives you an easy way to implement logistic regression in a single line of Python code using the <a href="https://scikit-learn.org/" target="_blank" rel="noreferrer noopener">scikit-learn library</a>.</p>



<p>If you feel stuck in Python and you need to enter the next level in Python coding, feel free to enter my 100% free Python email course with lots of cheat sheets, Python lessons, code contests, and fun!</p>






<p>This tutorial is loosely based on my <a href="https://pythononeliners.com/" title="https://pythononeliners.com/" target="_blank" rel="noreferrer noopener">Python One-Liners</a> book chapter. Check it out!</p>



<h2 class="wp-block-heading">Python One-Liners Book: Master the Single Line First!</h2>



<p><strong>Python programmers will improve their computer science skills with these useful one-liners.</strong></p>



<div class="wp-block-image"><figure class="aligncenter size-medium is-resized"><a href="https://www.amazon.com/gp/product/B07ZY7XMX8" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" src="https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-1024x944.jpg" alt="Python One-Liners" class="wp-image-10007" width="512" height="472" srcset="https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-scaled.jpg 1024w, https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-300x277.jpg 300w, https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-768x708.jpg 768w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>



<p><a href="https://amzn.to/2WAYeJE" target="_blank" rel="noreferrer noopener" title="https://amzn.to/2WAYeJE"><em>Python One-Liners</em> </a>will teach you how to read and write &#8220;one-liners&#8221;: <strong><em>concise statements of useful functionality packed into a single line of code. </em></strong>You&#8217;ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.</p>



<p>The book&#8217;s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms. </p>



<p>Detailed explanations of one-liners introduce <strong><em>key computer science concepts </em></strong>and<strong><em> boost your coding and analytical skills</em></strong>. You&#8217;ll learn about advanced Python features such as <em><strong>list comprehension</strong></em>, <strong><em>slicing</em></strong>, <strong><em>lambda functions</em></strong>, <strong><em>regular expressions</em></strong>, <strong><em>map </em></strong>and <strong><em>reduce </em></strong>functions, and <strong><em>slice assignments</em></strong>. </p>



<p>You&#8217;ll also learn how to:</p>



<ul class="wp-block-list"><li>Leverage data structures to <strong>solve real-world problems</strong>, like using Boolean indexing to find cities with above-average pollution</li><li>Use <strong>NumPy basics</strong> such as <em>array</em>, <em>shape</em>, <em>axis</em>, <em>type</em>, <em>broadcasting</em>, <em>advanced indexing</em>, <em>slicing</em>, <em>sorting</em>, <em>searching</em>, <em>aggregating</em>, and <em>statistics</em></li><li>Calculate basic <strong>statistics </strong>of multidimensional data arrays and the K-Means algorithms for unsupervised learning</li><li>Create more <strong>advanced regular expressions</strong> using <em>grouping </em>and <em>named groups</em>, <em>negative lookaheads</em>, <em>escaped characters</em>, <em>whitespaces, character sets</em> (and <em>negative characters sets</em>), and <em>greedy/nongreedy operators</em></li><li>Understand a wide range of <strong>computer science topics</strong>, including <em>anagrams</em>, <em>palindromes</em>, <em>supersets</em>, <em>permutations</em>, <em>factorials</em>, <em>prime numbers</em>, <em>Fibonacci </em>numbers, <em>obfuscation</em>, <em>searching</em>, and <em>algorithmic sorting</em></li></ul>



<p>By the end of the book, you&#8217;ll know how to <strong><em>write Python at its most refined</em></strong>, and create concise, beautiful pieces of &#8220;Python art&#8221; in merely a single line.</p>



<p><strong><a href="https://amzn.to/2WAYeJE" target="_blank" rel="noreferrer noopener" title="https://amzn.to/2WAYeJE"><em>Get your Python One-Liners on Amazon!!</em></a></strong></p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-in-one-line-python/">Logistic Regression in Python Scikit-Learn</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Random Forest Classifier with sklearn</title>
		<link>https://blog.finxter.com/random-forest-classifier-made-simple/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 13 Jul 2021 14:30:00 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=2531</guid>

					<description><![CDATA[<p>Does your model&#8217;s prediction accuracy suck but you need to meet the deadline at all costs? Try the quick and dirty “meta-learning” approach called ensemble learning. In this article, you&#8217;ll learn about a specific ensemble learning technique called random forests that combines the predictions (or classifications) of multiple machine learning algorithms. In many cases, it ... <a title="Random Forest Classifier with sklearn" class="read-more" href="https://blog.finxter.com/random-forest-classifier-made-simple/" aria-label="Read more about Random Forest Classifier with sklearn">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/random-forest-classifier-made-simple/">Random Forest Classifier with sklearn</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong><em>Does your model&#8217;s prediction accuracy suck but you need to meet the deadline at all costs? </em></strong></p>



<p>Try the quick and dirty “meta-learning” approach called <strong><em>ensemble learning</em></strong>. In this article, you&#8217;ll learn about a specific ensemble learning technique called<strong><em> random forests</em></strong> that combines the predictions (or classifications) of multiple machine learning algorithms. In many cases, it will give you better last-minute results.</p>



<h2 class="wp-block-heading">Video Random Forest Classification Python</h2>



<p>This video gives you a concise introduction to ensemble learning with random forests using <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">sklearn</a>:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Random Forest Classifier Made Simple" width="937" height="527" src="https://www.youtube.com/embed/oWu5Au4VpSY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Ensemble Learning</h2>



<p>You may already have studied multiple machine learning algorithms&#8212;and realized that different algorithms have different strengths. </p>



<p>For example, <a href="https://blog.finxter.com/tutorial-how-to-create-your-first-neural-network-in-1-line-of-python-code/" target="_blank" rel="noreferrer noopener" title="Neural Networks with SKLearn MLPRegressor">neural network classifiers</a> can generate excellent results for complex problems. However, they are also prone to “<strong><em>overfitting</em></strong>” the data because of their capacity to memorize fine-grained patterns in the data.</p>



<p>The simple idea of ensemble learning for classification problems leverages the fact that <strong><em>you often don’t know in advance which machine learning technique works best.</em></strong></p>



<p><strong>How does ensemble learning work?</strong> You create a meta-classifier consisting of multiple types or instances of basic <a href="https://blog.finxter.com/top-20-machine-learning-library-cheat-sheets/" title="Top 37 Python Machine Learning Library Cheat Sheets" target="_blank" rel="noreferrer noopener">machine learning </a>algorithms. In other words, you train <em>multiple </em>models. To classify a <em>single </em>observation, you ask <em>all </em>models to classify the input independently. Now, you return the class that was returned most often, given your input, as a <em><strong>“meta-prediction”</strong></em>. This is the final output of your ensemble learning algorithm.</p>
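<p>The voting idea can be sketched in plain Python. The three rule-based “models” below are hypothetical stand-ins for trained classifiers, and the thresholds are invented for this example. Each one takes a tuple of (math, language, creativity) scores and returns a class.</p>

```python
from collections import Counter

def meta_predict(models, observation):
    """Ask every model for a class and return the most frequent answer."""
    votes = [model(observation) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical hand-written "models" over (math, language, creativity) scores
def by_math(student):
    return "computer science" if student[0] >= 7 else "literature"

def by_language(student):
    return "literature" if student[1] >= 7 else "computer science"

def by_creativity(student):
    return "art" if student[2] >= 7 else "computer science"

alice = (9, 8, 4)
print(meta_predict([by_math, by_language, by_creativity], alice))
# computer science (two of three votes)
```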



<h2 class="wp-block-heading">Random Forest Learning</h2>



<p><strong>Random forests are a special type of ensemble learning algorithm.</strong> They focus on <a href="https://blog.finxter.com/decision-tree-learning-in-one-line-python/" target="_blank" rel="noreferrer noopener" title="Python Scikit-Learn Decision Tree [Video + Blog]">decision tree </a>learning. A forest consists of many trees. Similarly, a random forest consists of many decision trees.</p>



<p>Each decision tree is built by injecting randomness into the tree generation procedure during the training phase (e.g., which training samples a tree sees and which feature it considers at a split). This leads to a variety of decision trees – exactly what we want.</p>
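<p>One common way to inject this randomness is to train every tree on a bootstrap sample of the rows and to let each split consider only a random subset of the features. The sketch below does this by hand with three trees. It is a simplified illustration with invented toy data, not the exact procedure inside <code>RandomForestClassifier</code>.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.array([[9, 5, 6], [5, 1, 5], [8, 8, 8], [1, 10, 7],
              [1, 8, 1], [5, 7, 9], [1, 1, 6]], dtype=float)
y = np.array(["computer science", "computer science", "computer science",
              "literature", "literature", "art", "art"])

trees = []
for seed in range(3):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X), size=len(X))
    # max_features=1 makes each split consider one randomly chosen feature
    tree = DecisionTreeClassifier(max_features=1, random_state=seed)
    trees.append(tree.fit(X[rows], y[rows]))

# Each tree may give a different answer for the same student
print([tree.predict([[8.0, 6.0, 5.0]])[0] for tree in trees])
```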



<p>Here is how the prediction works for a trained random forest:<br></p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh3.googleusercontent.com/TuwxEzk0Td8B_qhlgMCVRqx5ElV6bkwu18EYnzMKZxtXjcnt9b5H_WxctAvu1UoCGWqaR2qDXENk6XxFH1RiHpn3G_U-oZWkaFen_4VUH33Z_SXhbnsm7ztz-mmmm5CZWZV9e8ac" alt="Random Forest Example"/></figure></div>



<p>In the example, Alice has high<em> maths </em>and <em>language </em>skills. The “ensemble” consists of three decision trees (forming a random forest). To classify Alice, each decision tree is queried about Alice’s classification. Two of the decision trees classify Alice as a <em>computer scientist</em>. As this is the class with the most votes, it is returned as the final output of the classification.</p>



<h2 class="wp-block-heading"><strong>sklearn.ensemble.RandomForestClassifier</strong></h2>



<p>Let’s stick to this example of classifying the study field based on a student’s skill level in three different areas (math, language, creativity). You may think that implementing an ensemble learning method is complicated in Python. But it’s not – thanks to the comprehensive <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier">scikit-learn library</a>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="17" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Dependencies
import numpy as np
from sklearn.ensemble import RandomForestClassifier


## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [5, 1, 5, "computer science"],
              [8, 8, 8, "computer science"],
              [1, 10, 7, "literature"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"],
              [1, 1, 6, "art"]])


## One-liner
Forest = RandomForestClassifier(n_estimators=10).fit(X[:,:-1].astype(float), X[:,-1])

## Result &amp; puzzle
students = Forest.predict([[8, 6, 5],
                         [3, 7, 9],
                         [2, 2, 1]])
print(students)</pre>



<p><em><strong>Take a guess:</strong> what’s the output of this code snippet?</em></p>



<p>After initializing the labeled training data, the code creates a random forest using the constructor of the class <code>RandomForestClassifier</code> with one parameter, <code>n_estimators</code>, that defines the number of trees in the forest.</p>



<p>Next, we populate the model that results from the previous initialization (an empty forest) by calling the function <code><a href="https://blog.finxter.com/sklearn-fit-vs-transform-vs-fit_transform-whats-the-difference/" title="Sklearn fit() vs transform() vs fit_transform() – What’s the Difference?" target="_blank" rel="noreferrer noopener">fit()</a></code>. To this end, the input training data consists of all but the last column of array <code>X</code>, while the labels of the training data are defined in the last column. As in the previous examples, we use <a href="https://blog.finxter.com/introduction-to-slicing-in-python/" title="Introduction to Slicing in Python" target="_blank" rel="noreferrer noopener">slicing </a>to extract the respective columns from the data array <code>X</code>.</p>
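<p>The slicing step can be tried in isolation. The sketch below uses a shortened, two-row version of the data array. Because the array mixes numbers and strings, NumPy stores every entry as a string, so this example converts the feature columns back to numbers with <code>astype(float)</code>. Recent scikit-learn versions may reject string-typed features, which is why the conversion is shown here.</p>

```python
import numpy as np

# Mixing numbers and strings makes NumPy store every entry as a string
X = np.array([[9, 5, 6, "computer science"],
              [1, 10, 7, "literature"]])

features = X[:, :-1].astype(float)  # all columns except the last, as numbers
labels = X[:, -1]                   # only the last column

print(X.dtype.kind)   # U (unicode strings)
print(features)
print(labels)
```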



<p><strong>Related Tutorial:</strong> <a href="https://blog.finxter.com/introduction-to-slicing-in-python/" target="_blank" rel="noreferrer noopener" title="Introduction to Slicing in Python">Introduction to Python Slicing</a></p>



<p>The classification part is slightly different in this code snippet. I wanted to show you how to classify multiple observations instead of only one. You can simply achieve this here by creating a <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">multi-dimensional array</a> with one row per observation.</p>



<p>Here is the output of the code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Result &amp; puzzle
students = Forest.predict([[8, 6, 5],
                         [3, 7, 9],
                         [2, 2, 1]])
print(students)
# ['computer science' 'art' 'art']</pre>



<p>Note that the result is non-deterministic (which means the result may be different for different executions of the code) because the random forest algorithm relies on a <a href="https://blog.finxter.com/how-to-generate-random-integers-in-python/" title="How to Generate Random Integers in Python?" target="_blank" rel="noreferrer noopener">random number generator</a> that returns different numbers on each run. You can make the result deterministic by passing the argument <code>random_state</code> to the constructor.</p>
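<p>A minimal sketch of this idea: with a fixed <code>random_state</code>, two independently trained forests give the same predictions. The helper <code>run()</code> and the seed value 42 are arbitrary choices for this example, and the training data is stored in separate numeric and label arrays.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

features = np.array([[9, 5, 6], [5, 1, 5], [8, 8, 8], [1, 10, 7],
                     [1, 8, 1], [5, 7, 9], [1, 1, 6]], dtype=float)
fields = np.array(["computer science", "computer science", "computer science",
                   "literature", "literature", "art", "art"])
students = [[8, 6, 5], [3, 7, 9], [2, 2, 1]]

def run(seed):
    """Train a fresh forest with the given seed and classify the students."""
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    return forest.fit(features, fields).predict(students).tolist()

# Same seed, same forest, same predictions on every run
print(run(42) == run(42))  # True
```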



<h2 class="wp-block-heading">RandomForestClassifier Methods</h2>



<p>The <code>RandomForestClassifier</code> object has the following methods (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank" rel="noreferrer noopener" title="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">source</a>):</p>



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.apply">apply</a>(X)</code></td><td>Apply trees in the forest to <code>X</code> and return leaf indices.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.decision_path">decision_path</a>(X)</code></td><td>Return the decision path in the forest.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit">fit</a>(X,&nbsp;y[,&nbsp;sample_weight])</code></td><td>Build a forest of trees from the training set <code>(X, y)</code>.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params">get_params</a>([deep])</code></td><td>Get parameters for this estimator.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict">predict</a>(X)</code></td><td>Predict class for <code>X</code>.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_log_proba">predict_log_proba</a>(X)</code></td><td>Predict class log-probabilities for <code>X</code>.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba">predict_proba</a>(X)</code></td><td>Predict class probabilities for <code>X</code>.</td></tr><tr><td><code><a 
href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score">score</a>(X,&nbsp;y[,&nbsp;sample_weight])</code></td><td>Return the mean accuracy on the given test data and labels.</td></tr><tr><td><code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.set_params">set_params</a>(**params)</code></td><td>Set the parameters of this estimator.</td></tr></tbody></table></figure>
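<p>The following sketch shows how some of these methods might be called on a fitted forest. It reuses the student data from above in separate feature and label arrays (<code>features</code> and <code>fields</code> are names chosen for this example) and fixes <code>random_state</code> so that the run is repeatable.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

features = np.array([[9, 5, 6], [5, 1, 5], [8, 8, 8], [1, 10, 7],
                     [1, 8, 1], [5, 7, 9], [1, 1, 6]], dtype=float)
fields = np.array(["computer science", "computer science", "computer science",
                   "literature", "literature", "art", "art"])

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(features, fields)

# predict_proba(): one row per observation, one column per entry of classes_
proba = forest.predict_proba([[8, 6, 5]])
print(forest.classes_, proba)

# apply(): index of the leaf that each of the 10 trees assigns to the observation
leaves = forest.apply([[8, 6, 5]])
print(leaves.shape)  # (1, 10)

# score(): mean accuracy, here on the training data only to show the call
print(forest.score(features, fields))
```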



<p>To learn about the different arguments of the <code>RandomForestClassifier()</code> constructor, feel free to visit the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank" rel="noreferrer noopener" title="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">official documentation</a>. However, the default arguments are often enough to create powerful classification meta-models. </p>



<h2 class="wp-block-heading">Where to Go From Here?</h2>



<p>Random forests build upon a thorough understanding of decision tree learning. Read my <a href="https://blog.finxter.com/decision-tree-learning-in-one-line-python/" target="_blank" rel="noreferrer noopener">article about decision trees </a>to improve your understanding of this area.</p>



<p>If you feel that you need to refresh your Python skills, download your Python Cheat Sheets (and regularly get new cheat sheets) by subscribing to my email list.</p>






<p>You can level up your skills with our new Python learning system based on solving rated Python code puzzles. You do nothing but solve Python puzzles and watch how your Python rating improves. </p>



<p><a href="https://finxter.com">Test your coding skills by solving Python puzzles now!</a><br></p>



<p>This article is based on my book &#8220;Python One-Liners&#8221;. Feel free to check out the additional material to help you master the single line like nobody else!</p>



<p>The post <a href="https://blog.finxter.com/random-forest-classifier-made-simple/">Random Forest Classifier with sklearn</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>SVM sklearn: Python Support Vector Machines Made Simple</title>
		<link>https://blog.finxter.com/support-vector-machines-python/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Sun, 11 Jul 2021 10:33:00 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NumPy]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python One-Liners]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=2489</guid>

					<description><![CDATA[<p>Support Vector Machines (SVM) have gained huge popularity in recent years. The reason is their robust classification performance – even in high-dimensional spaces: SVMs even work if there are more dimensions (features) than data items. This is unusual for classification algorithms because of the curse of dimensionality – with increasing dimensionality, data becomes extremely sparse ... <a title="SVM sklearn: Python Support Vector Machines Made Simple" class="read-more" href="https://blog.finxter.com/support-vector-machines-python/" aria-label="Read more about SVM sklearn: Python Support Vector Machines Made Simple">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/support-vector-machines-python/">SVM sklearn: Python Support Vector Machines Made Simple</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Support Vector Machines</strong> (SVM) have gained huge popularity in recent years. The reason is their robust classification performance – even in high-dimensional spaces: SVMs even work if there are more dimensions (features) than data items. This is unusual for classification algorithms because of the <em>curse of dimensionality</em> – with increasing dimensionality, data becomes extremely sparse which makes it hard for algorithms to find patterns in the data set. </p>



<p>Understanding the basic ideas of SVMs is a fundamental step to becoming a <strong><em>sophisticated machine learning engineer</em></strong>.</p>



<h2 class="wp-block-heading">SVM Video</h2>



<p>Feel free to watch the following video, which briefly summarizes how SVMs work in Python:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Support Vector Machines Made Simple" width="937" height="527" src="https://www.youtube.com/embed/s11gMR_Rrpo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">SVM Cheat Sheet</h2>



<p>Here is a cheat sheet that summarizes the content of this article:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="720" height="960" src="https://blog.finxter.com/wp-content/uploads/2019/03/CheatSheet-Python-10-Machine-Learning-SVM.jpg" alt="" class="wp-image-2496" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/CheatSheet-Python-10-Machine-Learning-SVM.jpg 720w, https://blog.finxter.com/wp-content/uploads/2019/03/CheatSheet-Python-10-Machine-Learning-SVM-225x300.jpg 225w, https://blog.finxter.com/wp-content/uploads/2019/03/CheatSheet-Python-10-Machine-Learning-SVM-100x133.jpg 100w, https://blog.finxter.com/wp-content/uploads/2019/03/CheatSheet-Python-10-Machine-Learning-SVM-670x893.jpg 670w" sizes="auto, (max-width: 720px) 100vw, 720px" /></figure></div>



<p>This cheat sheet, along with additional Python cheat sheets, is also available as a high-resolution PDF.</p>






<p>Let&#8217;s first build a conceptual understanding of support vector machines before learning how to use them with <code><a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">sklearn</a></code>.</p>



<h2 class="wp-block-heading">Machine Learning Classification Overview</h2>



<p>How do classification algorithms work? They use the training data to find a decision boundary that divides the data in one class from the data in the other class. </p>



<p>Here is an example:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-15.png" alt="Classification Problem" class="wp-image-2490" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-15.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-15-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-15-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /></figure></div>



<p>Suppose you want to build a <strong><em>recommendation system</em></strong> for aspiring university students. The figure visualizes the training data consisting of users that are classified according to their skills in two areas: <strong><em>logic </em></strong>and <strong><em>creativity</em></strong>. Some people have high logic skills and relatively low creativity; others have high creativity and relatively low logic skills. The first group is labeled as <em>“computer scientists”</em> and the second group is labeled as <em>“artists”</em>. (I know that there are also creative computer scientists, but let’s stick with this example for a moment.)</p>



<p>In order to classify new users, the <a href="https://blog.finxter.com/machine-learning-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="Best 15+ Machine Learning Cheat Sheets to Pin to Your Toilet Wall">machine learning</a> model must find a <strong><em>decision boundary</em></strong> that separates the computer scientists from the artists. Roughly speaking, for each new user you check on which side of the decision boundary they fall: left or right? Users that fall into the left area are classified as computer scientists, while users that fall into the right area are classified as artists. </p>



<p>In the two-dimensional space, the decision boundary is either a line or a (higher-order) curve. The former is called a <em><strong>“linear classifier”,</strong></em> the latter is called a <strong><em>“non-linear classifier”</em></strong>. In this section, we will only explore linear classifiers.</p>
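


<p>To make the idea of a linear classifier concrete, here is a minimal sketch. It assumes a hand-picked boundary (the line where creativity equals logic) and a hypothetical helper called <code>classify</code>; neither is learned from data or taken from the figure.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

# Hypothetical linear decision boundary in the (logic, creativity) plane:
# the line creativity = logic, written as w . x + b = 0.
# These values are picked by hand for illustration, not learned from data.
w = np.array([1.0, -1.0])
b = 0.0

def classify(point):
    """Return the label for a (logic, creativity) pair based on the side of the line."""
    score = np.dot(w, point) + b
    return "computer scientist" if score > 0 else "artist"

print(classify([9, 2]))  # high logic, low creativity
print(classify([3, 8]))  # low logic, high creativity</pre>



<p>A real classifier would learn <code>w</code> and <code>b</code> from the training data instead of fixing them by hand.</p>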



<p>The figure
shows three decision boundaries that are all valid separators of the data. For a
standard classifier, it is impossible to quantify which of the given decision
boundaries is better – they all lead to perfect accuracy when classifying the
training data.</p>



<h2 class="wp-block-heading">Support Vector Machine Classification Overview</h2>



<p><strong><em>But what is the best decision boundary?</em></strong></p>



<p>Support vector machines provide a unique and beautiful answer to this question. Arguably, the best decision boundary provides a maximal margin of safety. In other words, SVMs <strong><em>maximize the distance between the closest data points and the decision boundary</em></strong>. The idea is to minimize the error of new points that are close to the decision boundary.</p>



<p>Here is an example:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-16.png" alt="SVM Decision Boundary" class="wp-image-2491" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-16.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-16-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-16-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /></figure></div>



<p>The SVM classifier finds the respective support vectors so that the zone between the support vectors of the different classes is <strong><em>as thick as possible</em></strong>. The decision boundary is the line in the middle with maximal distance to the support vectors. Because the zone between the support vectors and the decision boundary is maximized, the <strong><em>margin of safety is expected to be maximal</em></strong> when classifying new data points. In practice, this approach achieves high classification accuracy on many problems.</p>
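


<p>The thickness of this zone can be inspected directly. The following sketch assumes a tiny, made-up data set and a linear kernel with a very large <code>C</code> value to approximate a hard margin; the variable names are illustrative only.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two points per class.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard margin (no training point inside the zone).
model = SVC(kernel="linear", C=1e6).fit(X, y)

w = model.coef_[0]                      # normal vector of the decision boundary
margin_width = 2 / np.linalg.norm(w)    # thickness of the zone between the classes

print(model.support_vectors_)           # the points that pin down the boundary
print(round(margin_width, 2))</pre>



<p>For this data, the width should come out close to 2, which is the horizontal gap between the two columns of points.</p>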



<h2 class="wp-block-heading">Scikit-Learn SVM Code</h2>



<p>Let&#8217;s look at how the <code>sklearn</code> library provides a simple way to use SVM classification on your own labeled data. The lines that use sklearn are highlighted in the following code snippet:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="2, 16, 20, 23" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Dependencies
from sklearn import svm
import numpy as np


## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [10, 1, 2, "computer science"],
              [1, 8, 1, "literature"],
              [4, 9, 3, "literature"],
              [0, 1, 10, "art"],
              [5, 7, 9, "art"]])


## One-liner (features are converted from strings to floats first)
svm = svm.SVC().fit(X[:,:-1].astype(float), X[:,-1])


## Result &amp; puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)

student_1 = svm.predict([[8, 1, 1]])
print(student_1)</pre>



<p><em><strong>Guess</strong>: what is the output of this code?</em></p>



<p>The code breaks down how you can use support vector machines in Python in its most basic form. The <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">NumPy </a>array holds the labeled training data with one row per user and one column per feature (skill level in maths, language, and creativity). The last column is the label (the class).  </p>



<p>Because we have three-dimensional data, a linear separator is a <strong><em>two-dimensional plane</em></strong> rather than a one-dimensional line. Note that <code>svm.SVC()</code> uses a radial basis function (RBF) kernel by default, so the boundaries it learns here can be curved; pass <code>kernel='linear'</code> if you want flat separating planes. As you can see, it is also possible to separate three different classes rather than only two as shown in the examples above.</p>
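


<p>If you want to look at the separating planes themselves, one option is to fit the model with a linear kernel and inspect its coefficients. This is a sketch under that assumption; the names <code>features</code>, <code>labels</code>, and <code>linear_model</code> are illustrative and not part of the one-liner.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
from sklearn.svm import SVC

# Same toy data as in the article, with features and labels kept in separate arrays.
features = np.array([[9, 5, 6], [10, 1, 2], [1, 8, 1],
                     [4, 9, 3], [0, 1, 10], [5, 7, 9]], dtype=float)
labels = np.array(["computer science", "computer science", "literature",
                   "literature", "art", "art"])

linear_model = SVC(kernel="linear").fit(features, labels)

# One plane per pair of classes (one-vs-one): 3 pairs, 3 coefficients each.
print(linear_model.coef_.shape)
print(linear_model.classes_)</pre>



<p>Each row of <code>coef_</code> is the normal vector of one plane that separates a pair of classes.</p>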



<p>The <a href="https://pythononeliners.com/" target="_blank" rel="noreferrer noopener" title="https://pythononeliners.com/">one-liner</a> itself is straightforward: you first create the model using the constructor of the <code>svm.SVC</code> class (<em>SVC</em> stands for <em>support vector classification</em>). Then, you call the <code><a href="https://blog.finxter.com/sklearn-fit-vs-transform-vs-fit_transform-whats-the-difference/" title="Sklearn fit() vs transform() vs fit_transform() – What’s the Difference?">fit</a></code> function to perform the training based on your labeled training data.</p>



<p>In the results part of the code snippet, we simply call the <code>predict</code> function on new observations: </p>



<ul class="wp-block-list"><li>Because <code>student_0</code> has skills <code>maths=3</code>, <code>language=3</code>, and <code>creativity=6</code>, the support vector machine predicts that the label <strong><em>“art”</em></strong> fits this student’s skills. </li><li>Similarly, <code>student_1</code> has skills <code>maths=8</code>, <code>language=1</code>, and <code>creativity=1</code>. Thus, the support vector machine predicts that the label <strong><em>“computer science”</em></strong> fits this student’s skills.</li></ul>



<p>Here is the final output of the <a href="https://blog.finxter.com/python-one-line-x/" target="_blank" rel="noreferrer noopener" title="Python One Line X">one-liner</a>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Result &amp; puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)
# ['art']

student_1 = svm.predict([[8, 1, 1]])
print(student_1)
# ['computer science']
</pre>



<h2 class="wp-block-heading">Where to Go From Here?</h2>



<p>This tutorial gives you one of the quickest and most concise ways to get started with support vector machines (SVMs). You won&#8217;t find any easier way on the whole Internet.</p>



<p>In fact, I wrote this as a chapter draft for my book <em><strong><a href="https://pythononeliners.com/" title="https://pythononeliners.com/" target="_blank" rel="noreferrer noopener">Python One-Liners</a></strong></em> that also introduces 10 machine learning algorithms, and how to use them in a single line of Python code.</p>



<p>Here&#8217;s more about the book:</p>



<h2 class="wp-block-heading">Python One-Liners Book: Master the Single Line First!</h2>



<p><strong>Python programmers will improve their computer science skills with these useful one-liners.</strong></p>



<div class="wp-block-image"><figure class="aligncenter size-medium is-resized"><a href="https://www.amazon.com/gp/product/B07ZY7XMX8" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" src="https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-1024x944.jpg" alt="Python One-Liners" class="wp-image-10007" width="512" height="472" srcset="https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-scaled.jpg 1024w, https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-300x277.jpg 300w, https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-768x708.jpg 768w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>



<p><a href="https://amzn.to/2WAYeJE" target="_blank" rel="noreferrer noopener" title="https://amzn.to/2WAYeJE"><em>Python One-Liners</em> </a>will teach you how to read and write &#8220;one-liners&#8221;: <strong><em>concise statements of useful functionality packed into a single line of code. </em></strong>You&#8217;ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.</p>



<p>The book&#8217;s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms. </p>



<p>Detailed explanations of one-liners introduce <strong><em>key computer science concepts </em></strong>and<strong><em> boost your coding and analytical skills</em></strong>. You&#8217;ll learn about advanced Python features such as <em><strong>list comprehension</strong></em>, <strong><em>slicing</em></strong>, <strong><em>lambda functions</em></strong>, <strong><em>regular expressions</em></strong>, <strong><em>map </em></strong>and <strong><em>reduce </em></strong>functions, and <strong><em>slice assignments</em></strong>. </p>



<p>You&#8217;ll also learn how to:</p>



<ul class="wp-block-list"><li>Leverage data structures to <strong>solve real-world problems</strong>, like using Boolean indexing to find cities with above-average pollution</li><li>Use <strong>NumPy basics</strong> such as <em>array</em>, <em>shape</em>, <em>axis</em>, <em>type</em>, <em>broadcasting</em>, <em>advanced indexing</em>, <em>slicing</em>, <em>sorting</em>, <em>searching</em>, <em>aggregating</em>, and <em>statistics</em></li><li>Calculate basic <strong>statistics </strong>of multidimensional data arrays and the K-Means algorithms for unsupervised learning</li><li>Create more <strong>advanced regular expressions</strong> using <em>grouping </em>and <em>named groups</em>, <em>negative lookaheads</em>, <em>escaped characters</em>, <em>whitespaces, character sets</em> (and <em>negative characters sets</em>), and <em>greedy/nongreedy operators</em></li><li>Understand a wide range of <strong>computer science topics</strong>, including <em>anagrams</em>, <em>palindromes</em>, <em>supersets</em>, <em>permutations</em>, <em>factorials</em>, <em>prime numbers</em>, <em>Fibonacci </em>numbers, <em>obfuscation</em>, <em>searching</em>, and <em>algorithmic sorting</em></li></ul>



<p>By the end of the book, you&#8217;ll know how to <strong><em>write Python at its most refined</em></strong>, and create concise, beautiful pieces of &#8220;Python art&#8221; in merely a single line.</p>



<p><strong><a href="https://amzn.to/2WAYeJE" target="_blank" rel="noreferrer noopener" title="https://amzn.to/2WAYeJE"><em>Get your Python One-Liners on Amazon!!</em></a></strong></p>
<p>The post <a href="https://blog.finxter.com/support-vector-machines-python/">SVM sklearn: Python Support Vector Machines Made Simple</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>K-Nearest Neighbors (KNN) with sklearn in Python</title>
		<link>https://blog.finxter.com/k-nearest-neighbors-as-a-python-one-liner/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Thu, 10 Jun 2021 16:09:00 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python One-Liners]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=2445</guid>

					<description><![CDATA[<p>The popular K-Nearest Neighbors (KNN) algorithm is used for regression and classification in many applications such as recommender systems, image classification, and financial data forecasting. It is the basis of many advanced machine learning techniques (e.g., in information retrieval). There is no doubt that understanding KNN is an important building block of your proficient computer ... <a title="K-Nearest Neighbors (KNN) with sklearn in Python" class="read-more" href="https://blog.finxter.com/k-nearest-neighbors-as-a-python-one-liner/" aria-label="Read more about K-Nearest Neighbors (KNN) with sklearn in Python">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/k-nearest-neighbors-as-a-python-one-liner/">K-Nearest Neighbors (KNN) with sklearn in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The popular <strong><em>K-Nearest Neighbors</em></strong> (KNN) algorithm is used for <a href="https://blog.finxter.com/python-linear-regression-1-liner/" title="Python Linear Regression with sklearn – A Helpful Illustrated Guide" target="_blank" rel="noreferrer noopener">regression </a>and <a href="https://blog.finxter.com/random-forest-classifier-made-simple/" title="Random Forest Classifier Made Simple" target="_blank" rel="noreferrer noopener">classification </a>in many applications such as recommender systems, image classification, and financial data forecasting. It is the basis of many advanced machine learning techniques (e.g., in information retrieval). There is no doubt that understanding KNN is an important building block of a solid computer science education.</p>



<p>Watch the article as a video:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="K-Nearest Neighbors (KNN) as a Python One-liner [Easy Tutorial]" width="937" height="527" src="https://www.youtube.com/embed/0WfWcf58qtU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>K-Nearest Neighbors (KNN) is a robust, simple, and popular <a href="https://blog.finxter.com/machine-learning-cheat-sheets/" title="Best 15+ Machine Learning Cheat Sheets to Pin to Your Toilet Wall" target="_blank" rel="noreferrer noopener">machine learning</a> algorithm. It’s relatively easy to implement from scratch while being competitive and performant. </p>



<h2 class="wp-block-heading">Recap Machine Learning</h2>



<p>Machine learning is all about learning a so-called <em><strong>model </strong></em>from a given <em>training data set</em>. </p>



<p>This model can then be used for inference, i.e., predicting output values for potentially new and unseen input data. </p>



<p>A model usually is a high-level abstraction such as a mathematical function inferred from the training data. Most machine learning techniques attempt to find patterns in the data that can be captured and used for generalization and prediction on new input data.</p>



<h2 class="wp-block-heading">KNN Training</h2>



<p>However, KNN follows a quite different path. The simple idea is the following: <strong><em>the whole data set is your model.</em></strong></p>



<p>Yes, you read that right. </p>



<p>The KNN machine learning model is nothing more than a set of observations. Every single instance of your training data is a part of your model. Training becomes as simple as throwing the training data into a container data structure for later retrieval. There&#8217;s no complicated training phase with hours of distributed GPU processing to extract patterns from the data.</p>



<h2 class="wp-block-heading">KNN Inference</h2>



<p>A great advantage is that you can use the KNN algorithm for regression (predicting a number) or for classification, as you like. Given your input vector <code>x</code>, you execute the following strategy:</p>



<ul class="wp-block-list"><li>Find the K nearest neighbors of <code>x</code> according to a predefined <em>similarity metric</em>.</li><li>Aggregate the K nearest neighbors into a single “prediction” or “classification” value. You can use any aggregator function such as <a href="https://blog.finxter.com/numpy-average-along-axis/" target="_blank" rel="noreferrer noopener" title="[NumPy] How to Calculate The Average Along an Axis?">average</a>, mean, <a href="https://blog.finxter.com/python-max/" target="_blank" rel="noreferrer noopener" title="Python max() — A Simple Illustrated Guide">max</a>, <a href="https://blog.finxter.com/python-min/" target="_blank" rel="noreferrer noopener" title="Python min() — A Simple Illustrated Guide">min</a>, etc.</li></ul>



<p>That’s it. Simple, isn’t it?</p>



<p>Check out the following graphic:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-8.png" alt="" class="wp-image-2446" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-8.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-8-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-8-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /></figure></div>



<p>Suppose your company sells homes for clients. It has acquired a large database of customers and observed house prices. </p>



<p>One day, your client asks how much he can expect to pay for a 52-square-meter house. You query your KNN “model” and it immediately responds with $33,167. And indeed, your client finds a home for $33,489 the same week. How did the KNN system come to this surprisingly accurate prediction?</p>



<p>It simply found the K=3 nearest neighbors of the query “D=52 square meters” in the model with respect to Euclidean distance. The three nearest neighbors are A, B, and C with prices $34,000, $33,500, and $32,000, respectively. In the final step, KNN aggregates the three nearest neighbors by calculating their simple average: ($34,000 + $33,500 + $32,000) / 3 ≈ $33,167. As K=3 in this example, we denote the model as “3NN”.</p>
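


<p>The 3NN calculation can be sketched in a few lines of NumPy. Only the three prices of A, B, and C come from the example; the house sizes, the two extra houses, and the helper name <code>knn_predict</code> are assumptions made up for illustration.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

# Hypothetical house sizes (square meters); only the prices of A, B, and C
# come from the example. The last two houses are far-away fillers.
sizes = np.array([54.0, 51.0, 49.0, 80.0, 20.0])
prices = np.array([34000.0, 33500.0, 32000.0, 60000.0, 15000.0])

def knn_predict(query, k=3):
    """Average the prices of the k houses whose size is closest to the query."""
    distances = np.abs(sizes - query)        # Euclidean distance in one dimension
    nearest = np.argsort(distances)[:k]      # indices of the k smallest distances
    return prices[nearest].mean()

print(round(knn_predict(52.0)))</pre>



<p>With these assumed sizes, the three closest houses are A, B, and C, so the function returns the same average as in the example.</p>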



<p>Of course,
you can vary the similarity functions, the parameter K, and the aggregation
method to come up with more sophisticated prediction models. </p>



<p>Another advantage of KNN is that it can be easily adapted as new observations are made. This is not true of machine learning models in general. An obvious weakness in this regard is that finding the nearest neighbors becomes more and more computationally expensive as you add points. To compensate, you can continuously remove “stale” values from the system.</p>



<p>As I
mentioned above, you can also use KNN for classification problems. Instead of
averaging over the K nearest neighbors, you can simply use a voting mechanism
where each nearest neighbor votes for its class. The class with the most votes
wins.</p>
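


<p>The voting mechanism can be sketched with the standard library alone. The data, the size categories, and the helper name <code>knn_classify</code> below are hypothetical and only serve to illustrate the majority vote.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from collections import Counter

def knn_classify(training_data, query, k=3):
    """Return the majority label among the k training points closest to the query."""
    # Sort (feature, label) pairs by distance to the query and keep the k closest.
    nearest = sorted(training_data, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: house size in square meters and a made-up size category.
houses = [(25, "small"), (30, "small"), (35, "small"),
          (60, "large"), (70, "large"), (80, "large")]

print(knn_classify(houses, 40))</pre>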



<h2 class="wp-block-heading">Implementing KNN with SKLearn</h2>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Dependencies
from sklearn.neighbors import KNeighborsRegressor
import numpy as np


## Data (House Size (square meters) / House Price ($))
X = np.array([[35, 30000], [45, 45000], [40, 50000],
              [35, 35000], [25, 32500], [40, 40000]])


## One-liner
KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1].reshape(-1,1))


## Result &amp; puzzle
res = KNN.predict([[30]])
print(res)
</pre>



<p>The snippet above shows how to use KNN in Python in a <a href="https://pythononeliners.com/" target="_blank" rel="noreferrer noopener" title="https://pythononeliners.com/">single line of code</a>.</p>



<p>Take a
guess: what’s the output of this code snippet?</p>



<h2 class="wp-block-heading">Understanding the Code</h2>



<p>To help you see the result, let’s plot the housing data from the code:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="605" height="454" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-9.png" alt="" class="wp-image-2447" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-9.png 605w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-9-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-9-100x75.png 100w" sizes="auto, (max-width: 605px) 100vw, 605px" /></figure>



<p>Can you see the general trend? As the size of a house grows, its market price tends to grow as well. The relationship is only approximate in this small data set, but larger houses are generally more expensive. </p>



<p>In the code, the client requests your price prediction for a house with 30 square meters. What does KNN with K=3 (in short: 3NN) predict?</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="605" height="454" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-10.png" alt="" class="wp-image-2448" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-10.png 605w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-10-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-10-100x75.png 100w" sizes="auto, (max-width: 605px) 100vw, 605px" /></figure>



<p>Beautifully simple, isn’t it? The KNN algorithm finds the three closest houses with respect to house size and predicts the house price as the average price of these K=3 nearest neighbors.</p>



<p>The three closest houses cost $30,000, $35,000, and $32,500. Thus, the result is their average: $32,500.</p>
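


<p>You can check this result by hand without sklearn. The following sketch reuses the housing data from the snippet and picks the three closest houses with <code>np.argsort</code>; the variable names <code>distances</code> and <code>nearest</code> are illustrative.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

# Housing data from the snippet: (size in square meters, price in $).
X = np.array([[35, 30000], [45, 45000], [40, 50000],
              [35, 35000], [25, 32500], [40, 40000]])

distances = np.abs(X[:, 0] - 30)                    # distance of each house to the query
nearest = np.argsort(distances, kind="stable")[:3]  # indices of the 3 closest houses
print(X[nearest])                                   # the neighbors that get averaged
print(X[nearest, 1].mean())</pre>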



<p>Maybe you were confused by the data conversion part within the one-liner. Let me quickly explain what happened here:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## One-liner
KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1].reshape(-1,1))</pre>



<p>First, we create a new machine learning model called “KNeighborsRegressor”. If you want to use KNN for classification instead, use the model “KNeighborsClassifier”.
</p>
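


<p>For comparison, here is a minimal sketch of the classification variant. The size labels are hypothetical and were made up for this illustration; only the <code>KNeighborsClassifier</code> class itself comes from sklearn.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labels: each house size (square meters) is tagged as small or large.
sizes = [[25], [30], [35], [60], [70], [80]]
labels = ["small", "small", "small", "large", "large", "large"]

classifier = KNeighborsClassifier(n_neighbors=3).fit(sizes, labels)
print(classifier.predict([[40], [65]]))</pre>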



<p>Second, we “train” the model using the <code><a href="https://blog.finxter.com/sklearn-fit-vs-transform-vs-fit_transform-whats-the-difference/" title="Sklearn fit() vs transform() vs fit_transform() – What’s the Difference?">fit</a></code> function with two parameters. The first parameter defines the input (the house size) and the second parameter defines the output (the house price). Both parameters are shaped so that each observation is an array-like data structure. For example, you wouldn’t use “<code>30</code>” as an input but “<code>[30]</code>”. The reason is that, in general, the input can be multi-dimensional rather than one-dimensional. Therefore, we <a href="https://blog.finxter.com/numpy-reshape/" target="_blank" rel="noreferrer noopener" title="The Ultimate Guide to NumPy Reshape() in Python">reshape </a>the input:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(X[:,0])
# [35 45 40 35 25 40]</pre>



<p>If we passed this 1D NumPy array to the <code>fit()</code> function, scikit-learn would raise an error because it expects an array of (array-like) observations, that is, a 2D array, and not a flat array of integers.</p>



<p>Therefore, we convert the array accordingly using the <code><a href="https://blog.finxter.com/reshape-average-stock-data/" target="_blank" rel="noreferrer noopener" title="NumPy Reshape 1D to 2D">reshape()</a></code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(X[:,0].reshape(-1,1))
"""
[[35]
 [45]
 [40]
 [35]
 [25]
 [40]]
"""</pre>



<p>Now, we have six array-like observations. The value <code>-1</code> in the <code>reshape()</code> function call is our “laziness” expression: we want NumPy to determine the number of rows automatically – and only specify how many columns we need (i.e., 1 column).</p>
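


<p>A short sketch of how the <code>-1</code> placeholder behaves in both positions; the variable names <code>column</code> and <code>row</code> are illustrative.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

sizes = np.array([35, 45, 40, 35, 25, 40])

column = sizes.reshape(-1, 1)   # NumPy infers 6 rows for the single column
row = sizes.reshape(1, -1)      # here NumPy infers 6 columns for the single row

print(sizes.shape, column.shape, row.shape)</pre>



<p>The column form is the one that <code>fit()</code> and <code>predict()</code> expect for a single feature: one row per observation.</p>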



<p>This article is based on a chapter of my book <em>Python One-Liners</em>:</p>



<h2 class="wp-block-heading">Python One-Liners Book: Master the Single Line First!</h2>



<p><strong>Python programmers will improve their computer science skills with these useful one-liners.</strong></p>



<div class="wp-block-image"><figure class="aligncenter size-medium is-resized"><a href="https://www.amazon.com/gp/product/B07ZY7XMX8" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" src="https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-1024x944.jpg" alt="Python One-Liners" class="wp-image-10007" width="512" height="472" srcset="https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-scaled.jpg 1024w, https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-300x277.jpg 300w, https://blog.finxter.com/wp-content/uploads/2020/06/3D_cover-768x708.jpg 768w" sizes="auto, (max-width: 512px) 100vw, 512px" /></a></figure></div>



<p><a href="https://amzn.to/2WAYeJE" target="_blank" rel="noreferrer noopener" title="https://amzn.to/2WAYeJE"><em>Python One-Liners</em> </a>will teach you how to read and write &#8220;one-liners&#8221;: <strong><em>concise statements of useful functionality packed into a single line of code. </em></strong>You&#8217;ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.</p>



<p>The book&#8217;s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms. </p>



<p>Detailed explanations of one-liners introduce <strong><em>key computer science concepts </em></strong>and<strong><em> boost your coding and analytical skills</em></strong>. You&#8217;ll learn about advanced Python features such as <em><strong>list comprehension</strong></em>, <strong><em>slicing</em></strong>, <strong><em>lambda functions</em></strong>, <strong><em>regular expressions</em></strong>, <strong><em>map </em></strong>and <strong><em>reduce </em></strong>functions, and <strong><em>slice assignments</em></strong>. </p>



<p>You&#8217;ll also learn how to:</p>



<ul class="wp-block-list"><li>Leverage data structures to <strong>solve real-world problems</strong>, like using Boolean indexing to find cities with above-average pollution</li><li>Use <strong>NumPy basics</strong> such as <em>array</em>, <em>shape</em>, <em>axis</em>, <em>type</em>, <em>broadcasting</em>, <em>advanced indexing</em>, <em>slicing</em>, <em>sorting</em>, <em>searching</em>, <em>aggregating</em>, and <em>statistics</em></li><li>Calculate basic <strong>statistics </strong>of multidimensional data arrays and the K-Means algorithms for unsupervised learning</li><li>Create more <strong>advanced regular expressions</strong> using <em>grouping </em>and <em>named groups</em>, <em>negative lookaheads</em>, <em>escaped characters</em>, <em>whitespaces, character sets</em> (and <em>negative characters sets</em>), and <em>greedy/nongreedy operators</em></li><li>Understand a wide range of <strong>computer science topics</strong>, including <em>anagrams</em>, <em>palindromes</em>, <em>supersets</em>, <em>permutations</em>, <em>factorials</em>, <em>prime numbers</em>, <em>Fibonacci </em>numbers, <em>obfuscation</em>, <em>searching</em>, and <em>algorithmic sorting</em></li></ul>



<p>By the end of the book, you&#8217;ll know how to <strong><em>write Python at its most refined</em></strong>, and create concise, beautiful pieces of &#8220;Python art&#8221; in merely a single line.</p>



<p><strong><a href="https://amzn.to/2WAYeJE" target="_blank" rel="noreferrer noopener" title="https://amzn.to/2WAYeJE"><em>Get your Python One-Liners on Amazon!!</em></a></strong></p>



<h2 class="wp-block-heading">Where to Go From Here?</h2>



<p>Understanding algorithms is hard enough.</p>



<p>Why do so many people struggle with algorithms?</p>



<p>Yes, complexity may be an issue from time to time. But in so many cases, the real problem is that you lack a quick and confident understanding of the very basics of code.</p>



<p>Proof: have you ever observed that you can easily understand algorithms visually but not in code?</p>



<p>There is only one solution: master the basics until you don&#8217;t have to think about them. Only then can your brain handle the higher-level complexity of algorithms.</p>



<p>To help you achieve this, I invest most of my time and effort in creating the best free Python email course on the web. Join my community of more than 66,000 ambitious Python coders!</p>






<p>The post <a href="https://blog.finxter.com/k-nearest-neighbors-as-a-python-one-liner/">K-Nearest Neighbors (KNN) with sklearn in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>[Tutorial] K-Means Clustering with SKLearn in One Line</title>
		<link>https://blog.finxter.com/tutorial-how-to-run-k-means-clustering-in-1-line-of-python/</link>
					<comments>https://blog.finxter.com/tutorial-how-to-run-k-means-clustering-in-1-line-of-python/#respond</comments>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Fri, 04 Jun 2021 12:38:00 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NumPy]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=2385</guid>

					<description><![CDATA[<p>If there is one clustering algorithm you need to know – whether you are a computer scientist, data scientist, or machine learning expert – it&#8217;s the K-Means algorithm. In this tutorial drawn from my book Python One-Liners, you’ll learn the general idea and when and how to use it in a single line of Python ... <a title="[Tutorial] K-Means Clustering with SKLearn in One Line" class="read-more" href="https://blog.finxter.com/tutorial-how-to-run-k-means-clustering-in-1-line-of-python/" aria-label="Read more about [Tutorial] K-Means Clustering with SKLearn in One Line">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/tutorial-how-to-run-k-means-clustering-in-1-line-of-python/">[Tutorial] K-Means Clustering with SKLearn in One Line</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If there is one clustering algorithm you need to know – whether you are a computer scientist, data scientist, or machine learning expert – it&#8217;s the K-Means algorithm. In this tutorial drawn from my book <a href="https://pythononeliners.com/" target="_blank" rel="noreferrer noopener" title="https://pythononeliners.com/">Python One-Liners</a>, you’ll learn the general idea and when and how to use it in a single line of Python code using the <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">sklearn </a>library.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="K-Means Clustering Made Simple" width="937" height="527" src="https://www.youtube.com/embed/NPpVnFWFe4U?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Labeled vs Unlabeled Training</h2>



<p>You may know about supervised learning, where the <strong>training data is “labeled”</strong>, i.e., we know the output value of every input value in the training data. But in practice, this is not always the case. What if you have “unlabeled” data? In many data analytics applications, there is no such thing as “the optimal output”. Prediction is not the goal here, but you can still distill useful knowledge from these unlabeled data sets.</p>



<p>For example, suppose you are working in a startup that serves different target markets with various income levels and ages. Your boss tells you to find a certain number of target “personas” that best fit your different target markets.</p>



<p>It’s time to learn about “unsupervised learning” with <strong>unlabeled training data</strong>. In particular, you can use clustering methods to identify the “average customer personas” which your company serves. </p>



<p>Here is an example:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="605" height="454" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-4.png" alt="" class="wp-image-2386" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-4.png 605w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-4-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-4-100x75.png 100w" sizes="auto, (max-width: 605px) 100vw, 605px" /></figure>



<p>Visually, you can easily see three types of personas with different incomes and ages. But how can you find them algorithmically? This is the domain of clustering algorithms such as the widely popular K-Means algorithm.</p>



<h2 class="wp-block-heading">Finding the Cluster Centers</h2>



<p class="has-pale-cyan-blue-background-color has-background">Given a data set and an integer k, the K-Means algorithm partitions the data into k clusters such that the total squared distance between each data point and the center of its cluster (the centroid, i.e., the mean of the points in that cluster) is as small as possible.</p>
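<p>To make the objective concrete, here is a small sketch that computes the quantity K-Means tries to minimize: the sum of squared distances between each point and its cluster center. The toy points, the labels, and the helper name <code>within_cluster_ss</code> are assumptions made up for illustration; they are not taken from the figures in this tutorial.</p>

```python
import numpy as np

def within_cluster_ss(X, labels, centers):
    """Sum of squared distances between each point and its assigned center."""
    diffs = X - centers[labels]          # vector from each point's center to the point
    return float(np.sum(diffs ** 2))

# Hypothetical (age, income) data with two obvious groups
X = np.array([[20.0, 2000.0], [22.0, 2100.0], [40.0, 4000.0], [42.0, 4100.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

good = within_cluster_ss(X, labels, centers)
bad = within_cluster_ss(X, np.array([0, 1, 0, 1]), centers)  # deliberately mixed-up assignment
print(good, bad)
```

<p>The sensible assignment yields a far smaller value than the mixed-up one, which is exactly the signal the algorithm follows.</p>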



<p>In other words, we can find the different personas by running the K-Means algorithm on our data sets:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="605" height="454" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-5.png" alt="" class="wp-image-2387" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-5.png 605w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-5-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-5-100x75.png 100w" sizes="auto, (max-width: 605px) 100vw, 605px" /></figure>



<p>The cluster centers (black dots) fit very nicely to the overall data. Every cluster center can be viewed as one customer persona. Thus, we have three idealized personas: </p>



<ul class="wp-block-list"><li>A 20-year-old earning $2000,</li><li>A 25-year-old earning $3000, and </li><li>A 40-year-old earning $4000. </li></ul>



<p>And the great thing is that the K-Means algorithm finds those cluster centers fully automatically, even in a high-dimensional space (where it would be hard for humans to find the personas visually).</p>



<p>As a small side note: the K-Means algorithm requires the number of cluster centers k as an input. In this case, we use domain knowledge and “magically” define <em>k=3</em>. There are <a href="https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set" target="_blank" rel="noreferrer noopener">more advanced algorithms</a> that determine the number of cluster centers automatically.</p>
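<p>One common heuristic for choosing k is to fit the model for several values of k and compare the <code>inertia_</code> attribute (the sum of squared distances of the points to their closest center). The following is a minimal sketch under stated assumptions: the data points are made up, and <code>n_init</code> and <code>random_state</code> are set only to make the run repeatable.</p>

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical (age, income) data with three loose groups
X = np.array([[20, 2000], [21, 2100], [19, 1900],
              [25, 3000], [26, 3100], [24, 2900],
              [40, 4000], [41, 4100], [39, 3900]], dtype=float)

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

<p>Inertia keeps shrinking as k grows, so the usual reading is to look for the k after which it stops dropping sharply (the “elbow”). This is a rule of thumb, not a guarantee.</p>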



<h2 class="wp-block-heading">K-Means Algorithm Overview</h2>



<p>So how does the K-Means algorithm work? In a nutshell, it performs the following procedure:</p>



<ol class="wp-block-list"><li>Initialize random cluster centers (centroids).</li><li>Repeat until convergence:<ul><li>Assign every data point to its closest cluster center.</li><li>Recompute each cluster center as the centroid of all data points assigned to it.</li></ul></li></ol>
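<p>The procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not scikit-learn&#8217;s implementation (which, among other things, uses the smarter k-means++ initialization). The function name <code>kmeans_sketch</code> and its parameters are assumptions chosen for this example.</p>

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means loop; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2a. Assign every point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels

X = np.array([[35, 7000], [45, 6900], [70, 7100],
              [20, 2000], [25, 2200], [15, 1800]], dtype=float)
centers, labels = kmeans_sketch(X, k=2)
print(centers)
```

<p>The order of the returned centers depends on the random initialization, so compare them as a set rather than by position.</p>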



<h2 class="wp-block-heading">KMeans Code Using Sklearn</h2>



<p>How can we do all of this in a single line of code? Fortunately, the <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have" target="_blank" rel="noreferrer noopener">Scikit-learn</a> library in Python has already implemented the K-Means algorithm in a very efficient manner. </p>



<p>So here is the one-liner code snippet that does K-Means clustering for you:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="12" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Dependencies
from sklearn.cluster import KMeans
import numpy as np


## Data (Work (h) / Salary ($))
X = np.array([[35, 7000], [45, 6900], [70, 7100],
              [20, 2000], [25, 2200], [15, 1800]])


## One-liner
kmeans = KMeans(n_clusters=2).fit(X)


## Result &amp; puzzle
cc = kmeans.cluster_centers_
print(cc)</pre>



<p><strong>Python Puzzle: What’s the output of this code snippet? </strong></p>



<p>Try to guess a solution without understanding every syntactical element!</p>



<p><em>(In the next paragraphs, I will give you the result of this code puzzle. In my opinion, puzzle-based learning is one of the best ways to acquire the basics of programming. That&#8217;s why I have written the book &#8220;</em><a href="https://blog.finxter.com/coffee-break-python/" target="_blank" rel="noreferrer noopener"><em>Coffee Break Python</em></a><em>&#8221; to learn Python faster and to fit learning into any daily schedule.)</em></p>



<h2 class="wp-block-heading">Code Explanation</h2>



<p>In the first lines, we import the <code>KMeans</code> class from the <code>sklearn.cluster</code> module. This class takes over the clustering itself. Also, we need to import the <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">NumPy </a>library because we pass the data to <code>KMeans</code> as a NumPy array.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="605" height="454" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-6.png" alt="" class="wp-image-2388" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-6.png 605w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-6-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-6-100x75.png 100w" sizes="auto, (max-width: 605px) 100vw, 605px" /></figure>



<p>The data is two-dimensional. Each data point pairs the number of working hours of a worker with that worker’s salary. There are six data points in this employee data set.</p>



<p>The goal is to find the two cluster centers that best fit this data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## One-liner
kmeans = KMeans(n_clusters=2).fit(X)</pre>



<p>In the one-liner, we explicitly define the number of cluster centers using the constructor argument <code>n_clusters</code>. First, we create a new <code>KMeans</code> object that handles the algorithm for us. Then, we call the instance method <code><a href="https://blog.finxter.com/sklearn-fit-vs-transform-vs-fit_transform-whats-the-difference/" target="_blank" rel="noreferrer noopener" title="Sklearn fit() vs transform() vs fit_transform() – What’s the Difference?">fit(X)</a></code> to run the K-Means algorithm on our input data <code>X</code>. The <code>KMeans</code> object now holds all the results. All that is left is to retrieve the results from its <a href="https://blog.finxter.com/python-attributes/" target="_blank" rel="noreferrer noopener" title="Python Class vs Instance Attributes [Tutorial+Video]">attributes</a>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cc = kmeans.cluster_centers_
print(cc)</pre>



<p>So, what are the cluster centers and what is the output of this code snippet?</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="605" height="454" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-7.png" alt="" class="wp-image-2389" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-7.png 605w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-7-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-7-100x75.png 100w" sizes="auto, (max-width: 605px) 100vw, 605px" /></figure>



<p>In the graphic, you can see that the two cluster centers are (20, 2000) and (50, 7000). This is also the result of the Python one-liner. </p>
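<p>Besides <code>cluster_centers_</code>, a fitted <code>KMeans</code> object also exposes the attribute <code>labels_</code> and the method <code>predict()</code>. The sketch below reuses the data from the one-liner. The two new points passed to <code>predict()</code> are hypothetical, and <code>n_init</code> and <code>random_state</code> are set only so that repeated runs behave the same.</p>

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[35, 7000], [45, 6900], [70, 7100],
              [20, 2000], [25, 2200], [15, 1800]])

# n_init and random_state are set only to make the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sort the centers by working hours so the printout does not depend on cluster numbering
cc = kmeans.cluster_centers_
cc_sorted = cc[np.argsort(cc[:, 0])]
print(cc_sorted)

# labels_ holds the cluster index of every training point
print(kmeans.labels_)

# predict() assigns new, unseen points to the nearest center
new_labels = kmeans.predict(np.array([[30, 6800], [18, 2100]]))
print(new_labels)
```

<p>Which cluster gets index 0 and which gets index 1 is arbitrary, so rely on the grouping rather than on the concrete label values.</p>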






<h2 class="wp-block-heading">Where to go from here?</h2>



<p>In this article, you have learned how to run the popular K-Means algorithm in Python &#8212; using only a single line of code.</p>



<p>I know that it can be hard to understand Python code snippets. Every coder is constantly challenged by the difficulty of code. Don&#8217;t let anybody tell you otherwise.</p>



<p>To make learning Python less of a pain, I have created a Python cheat sheet course where I&#8217;ll send you a concise, fresh cheat sheet every week. Join my Python course for free!</p>






<p>The post <a href="https://blog.finxter.com/tutorial-how-to-run-k-means-clustering-in-1-line-of-python/">[Tutorial] K-Means Clustering with SKLearn in One Line</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.finxter.com/tutorial-how-to-run-k-means-clustering-in-1-line-of-python/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Python Linear Regression with sklearn &#8211; A Helpful Illustrated Guide</title>
		<link>https://blog.finxter.com/python-linear-regression-1-liner/</link>
					<comments>https://blog.finxter.com/python-linear-regression-1-liner/#comments</comments>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Mon, 26 Apr 2021 11:49:00 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NumPy]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Python One-Liners]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1920</guid>

					<description><![CDATA[<p>This tutorial will show you the simplest and most straightforward way to implement linear regression in Python: by using scikit-learn&#8217;s linear regression functionality. I have written this tutorial as part of my book Python One-Liners where I present how expert coders accomplish a lot in a little bit of code. Feel free to bookmark and download ... <a title="Python Linear Regression with sklearn &#8211; A Helpful Illustrated Guide" class="read-more" href="https://blog.finxter.com/python-linear-regression-1-liner/" aria-label="Read more about Python Linear Regression with sklearn &#8211; A Helpful Illustrated Guide">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/python-linear-regression-1-liner/">Python Linear Regression with sklearn &#8211; A Helpful Illustrated Guide</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-pale-cyan-blue-background-color has-background">This tutorial will show you the <strong>simplest and most straightforward way to implement linear regression in Python</strong>: by using scikit-learn&#8217;s linear regression functionality. I have written this tutorial as part of my book <a href="https://pythononeliners.com/" target="_blank" rel="noreferrer noopener" title="https://pythononeliners.com/">Python One-Liners</a>, where I present how expert coders accomplish a lot in a little bit of code. </p>



<p><a href="https://pythononeliners.com/" target="_blank" rel="noreferrer noopener" title="https://pythononeliners.com/">Feel free to bookmark and download the Python One-Liner freebies here. </a></p>



<p>It is really simple to implement linear regression with the <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have" target="_blank" rel="noreferrer noopener">sklearn</a> (short for <em>scikit-learn</em>) library. Have a quick look at this code snippet&#8212;we&#8217;ll explain everything afterward!</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LinearRegression
import numpy as np

## Data (Apple stock prices)
apple = np.array([155, 156, 157])
n = len(apple)


## One-liner
model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)


## Result &amp; puzzle
print(model.predict([[3],[4]]))
# What is the output of this code?
</pre>



<p>This one-liner uses two Python libraries: <a href="http://www.numpy.org/" target="_blank" rel="noreferrer noopener">NumPy </a>and <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">scikit-learn</a>. The former is the de facto standard library for numerical computations (e.g., matrix operations). The latter is a comprehensive machine learning library that implements a wide range of machine learning algorithms and techniques.</p>



<p><strong>So let’s explore the code snippet step by step.</strong></p>



<p>We create a simple dataset of three values: three stock prices of the Apple stock on three consecutive days. The variable <code>apple</code> holds this dataset as a <a href="https://blog.finxter.com/numpy-tutorial/" title="NumPy Tutorial – Everything You Need to Know to Get Started" target="_blank" rel="noreferrer noopener">one-dimensional NumPy array.</a> We also store the length of the NumPy array in the variable <code>n</code>. </p>



<p>Let’s say the goal is to predict the stock value for the next two days. Such an algorithm could be useful as a benchmark for algorithmic trading applications (using larger datasets, of course).</p>



<p>To achieve this goal, the one-liner uses linear regression and creates a model via the method <code>fit()</code>. But what exactly is a model?</p>



<h2 class="wp-block-heading">Background: What is a Model?</h2>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="627" height="353" src="https://blog.finxter.com/wp-content/uploads/2019/02/image.png" alt="" class="wp-image-1921" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image.png 627w, https://blog.finxter.com/wp-content/uploads/2019/02/image-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-100x56.png 100w" sizes="auto, (max-width: 627px) 100vw, 627px" /></figure>



<p>Think of a <strong>machine learning model as a black box.</strong> You put stuff into the box. We call the input “<strong><em>features</em></strong>” and denote them using the variable <code>x</code> which can be a single value or a multi-dimensional vector of values. Then the box does its magic and processes your input. After a bit of time, you get back the result <code>y</code>. </p>



<p>Now, there are two separate phases: <strong>the training phase and the inference phase</strong>. During the training phase, you tell your model your “dream” output <code>y’</code>. You change the model as long as it does not generate your dream output <code>y’</code>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="626" height="352" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-1.png" alt="" class="wp-image-1922" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-1.png 626w, https://blog.finxter.com/wp-content/uploads/2019/02/image-1-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-1-100x56.png 100w" sizes="auto, (max-width: 626px) 100vw, 626px" /></figure>



<p>As you keep telling the model your “dream” outputs for many different inputs, you “<strong><em>train</em></strong>” the model using your <strong><em>“training data”</em></strong>. Over time, the model will learn which output you would like to get for certain inputs. </p>
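<p>To illustrate the idea of “changing the model until it produces the dream output”, here is a tiny training loop that adjusts two parameters with gradient descent. Note the assumptions: scikit-learn&#8217;s <code>LinearRegression</code> solves the least-squares problem directly rather than with a loop like this, and the learning rate and number of steps are hypothetical values picked for this three-point dataset.</p>

```python
import numpy as np

# Training data from the example: day index -> stock price
x = np.array([0.0, 1.0, 2.0])
y_dream = np.array([155.0, 156.0, 157.0])

a, b = 0.0, 0.0           # model parameters: prediction is a * x + b
learning_rate = 0.1       # hypothetical step size chosen for this tiny dataset

for _ in range(2000):
    y_pred = a * x + b
    error = y_pred - y_dream
    # Gradients of the mean squared error with respect to a and b
    grad_a = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Nudge the parameters in the direction that reduces the error
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(round(a, 3), round(b, 3))
```

<p>After enough steps, the parameters should settle close to a slope of 1 and an offset of 155, the line that reproduces all three dream outputs.</p>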



<p>That’s why data is so important in the 21st century: your model will only be as good as its training data. <strong>Without good training data, it is guaranteed to fail.</strong></p>



<p>So why is machine learning such a big deal nowadays? The main reason is that models “generalize”, i.e., they can use their experience from the training data to predict outcomes for completely new inputs which they have never seen before. If the model generalizes well, these outputs can be surprisingly accurate compared to the “real” but unknown outputs.</p>



<h2 class="wp-block-heading">Code Explanation</h2>



<p>Now, let’s deconstruct the one-liner which creates the model:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)</pre>



<p>First, we create a new “empty” model by calling <code>LinearRegression()</code>. What does this model look like?</p>



<p><strong>Every linear regression model consists of certain parameters. For linear regression, the parameters are called “coefficients” because each parameter is the coefficient in a linear equation combining the different input features.</strong></p>



<p>With this information, we can shed some light on our black box.</p>



<p>Given the input features <code>x_1</code>, <code>x_2</code>, &#8230;, <code>x_k</code>, the linear regression model combines them with the coefficients <code>a_1</code>, <code>a_2</code>, &#8230;, <code>a_k</code> to calculate the predicted output <code>y</code> using the formula:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="548" height="35" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-4.png" alt="" class="wp-image-1926" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-4.png 548w, https://blog.finxter.com/wp-content/uploads/2019/02/image-4-300x19.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-4-100x6.png 100w" sizes="auto, (max-width: 548px) 100vw, 548px" /></figure>
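<p>In code, this weighted combination is a dot product plus an offset. The sketch below assumes the formula has the usual form <code>y = a_0 + a_1*x_1 + &#8230; + a_k*x_k</code>, where <code>a_0</code> is the intercept. The helper name <code>linear_predict</code> and all numbers are hypothetical.</p>

```python
import numpy as np

def linear_predict(x, coefs, intercept):
    """Predicted output y = intercept + coefs[0]*x[0] + ... + coefs[k-1]*x[k-1]."""
    return intercept + float(np.dot(coefs, x))

# Hypothetical model with k = 3 features
coefs = np.array([2.0, -1.0, 0.5])   # a_1, a_2, a_3
intercept = 10.0                      # a_0
x = np.array([1.0, 4.0, 6.0])        # x_1, x_2, x_3

y = linear_predict(x, coefs, intercept)
print(y)
```

<p>A fitted scikit-learn model stores these two pieces in the attributes <code>coef_</code> and <code>intercept_</code>.</p>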



<p>In our example, we have only a single input feature <code>x</code> so the formula becomes easier:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="216" height="38" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-3.png" alt="" class="wp-image-1925" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-3.png 216w, https://blog.finxter.com/wp-content/uploads/2019/02/image-3-100x18.png 100w" sizes="auto, (max-width: 216px) 100vw, 216px" /></figure>



<p>In other words, our linear regression model describes a line in the two-dimensional space. The first axis describes the input <code>x</code>. The second axis describes the output <code>y</code>. The line describes the (linear) relationship between input and output. </p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="615" height="461" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-5.png" alt="" class="wp-image-1927" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-5.png 615w, https://blog.finxter.com/wp-content/uploads/2019/02/image-5-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-5-100x75.png 100w" sizes="auto, (max-width: 615px) 100vw, 615px" /></figure>



<p>What is the training data in this space? In our case, the inputs of the model are simply the indices of the days: <code>[0, 1, 2]</code>, one day for each stock price in <code>[155, 156, 157]</code>. To put it differently:</p>



<ul class="wp-block-list"><li>Input <code>x=0</code> should cause output <code>y=155</code></li><li>Input <code>x=1</code> should cause output <code>y=156</code></li><li>Input <code>x=2</code> should cause output <code>y=157</code></li></ul>



<p>Now, which line fits best to our training data <code>[155, 156, 157]</code>?</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="615" height="461" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-6.png" alt="" class="wp-image-1928" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-6.png 615w, https://blog.finxter.com/wp-content/uploads/2019/02/image-6-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-6-100x75.png 100w" sizes="auto, (max-width: 615px) 100vw, 615px" /></figure>



<p>Here is what the linear regression model computes:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Data (Apple stock prices)
apple = np.array([155, 156, 157])
n = len(apple)


## One-liner
model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)


## Result
print(model.coef_)
# [1.]
print(model.intercept_)
# 155.0
</pre>



<p>You can see that the model has two parameters: the coefficient (slope) 1.0 and the intercept 155.0. Let’s put them in our formula for linear regression:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="389" height="49" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-7.png" alt="" class="wp-image-1929" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-7.png 389w, https://blog.finxter.com/wp-content/uploads/2019/02/image-7-300x38.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-7-100x13.png 100w" sizes="auto, (max-width: 389px) 100vw, 389px" /></figure>



<p>Let’s plot both the line and the training data in the same space:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="615" height="461" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-8.png" alt="" class="wp-image-1930" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-8.png 615w, https://blog.finxter.com/wp-content/uploads/2019/02/image-8-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-8-100x75.png 100w" sizes="auto, (max-width: 615px) 100vw, 615px" /></figure>



<p>A perfect fit! Using this model, we can predict the stock price for any value of <code>x</code>. Of course, whether this prediction accurately reflects the real world is another story.</p>



<p>After training the model, we use it to predict the next two days. The Apple dataset consists of three values: 155, 156, and 157. We want to know the fourth and fifth values in this series, so we predict the values for indices 3 and 4.</p>
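<p>As a minimal sketch of this prediction step, the snippet below refits the model on the three prices and asks for indices 3 and 4. The variable names <code>next_days</code> and <code>predicted</code> are chosen here for illustration only.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: three Apple stock prices, indexed by day 0, 1, 2.
apple = np.array([155, 156, 157])
n = len(apple)
model = LinearRegression().fit(np.arange(n).reshape((n, 1)), apple)

# Indices 3 and 4 are the two days that follow the training data.
next_days = np.array([[3], [4]])
predicted = model.predict(next_days)
print(predicted)
```

<p>Because the training data lies exactly on a line with slope 1, the predictions simply continue that line.</p>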



<p>Note that both the function <code>fit()</code> and the function <code>predict()</code> require an array with the following format:</p>



<pre class="wp-block-preformatted"> [&lt;training_data_1&gt;,<br> &lt;training_data_2&gt;,<br> …,<br> &lt;training_data_n&gt;] </pre>



<p>Each training data value is a sequence of feature values:</p>



<pre class="wp-block-preformatted">&lt;training_data&gt; = [feature_1, feature_2, …,
feature_k]</pre>



<p>Again, here is our one-liner:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)</pre>



<p>In our case, we only have a single feature <code>x</code>. Therefore, we <a href="https://blog.finxter.com/numpy-reshape/" target="_blank" rel="noreferrer noopener" title="The Ultimate Guide to NumPy Reshape() in Python">reshape </a>the NumPy array to the strange looking matrix form:</p>



<pre class="wp-block-preformatted"> [[0],<br> [1],<br> [2]] </pre>
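<p>A quick way to see what the reshape does is to compare the shapes before and after. This is only a small check, with <code>days</code> and <code>X</code> as illustrative names: the input passed to <code>fit()</code> is the column of day indices, one row per sample and one column per feature.</p>

```python
import numpy as np

n = 3
days = np.arange(n)        # shape (3,): a flat sequence of day indices
X = days.reshape((n, 1))   # shape (3, 1): one row per sample, one column per feature
print(days.shape, X.shape)
print(X.tolist())
```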



<p>The <code>fit()</code> function takes two arguments: the input features of the training data (see the last paragraph) and the “dream outputs” of these inputs. Of course, our dream outputs are the real stock prices of the Apple stock. The function then finds the model parameters (i.e., the line) for which the difference between the predicted model values and the “dream outputs” is minimal. This is called <strong><em>“error minimization”</em></strong>. (To be more precise, the function minimizes the sum of squared differences between the predicted model values and the “dream outputs”, so outliers have a larger impact on the error. For linear regression, this least-squares problem can be solved directly, without trial and error.)</p>
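<p>To make the error minimization concrete, here is a hedged sketch that solves the same least-squares problem with NumPy and compares the result with <code>LinearRegression</code>. The price array is hypothetical and chosen only for illustration; the point is that both routes should arrive at the same slope and intercept.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical prices that do not lie exactly on a line.
prices = np.array([155.0, 158.0, 157.0])
x = np.arange(len(prices))

# Design matrix with a column of ones so the intercept is estimated too.
A = np.column_stack([x, np.ones(len(x))])
(slope, intercept), *_ = np.linalg.lstsq(A, prices, rcond=None)

model = LinearRegression().fit(x.reshape((-1, 1)), prices)
print(slope, intercept)
print(model.coef_[0], model.intercept_)
```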



<p>In our case, the model perfectly fits the training data, so the error is zero. But often it is not possible to find such a linear model. Here is an example of training data that cannot be fit by a single straight line:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt


## Data (Apple stock prices)
apple = np.array([157, 156, 159])
n = len(apple)


## One-liner
model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)


## Result
print(model.predict([[3],[4]]))
# [159.33333333 160.33333333]

x = np.arange(5)
plt.plot(x[:len(apple)], apple, "o", label="apple stock price")
plt.plot(x, model.intercept_ + model.coef_[0]*x, ":",
         label="prediction")
plt.ylabel("y")
plt.xlabel("x")
plt.ylim((154,164))
plt.legend()
plt.show()
</pre>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="615" height="461" src="https://blog.finxter.com/wp-content/uploads/2019/02/image-9.png" alt="" class="wp-image-1931" srcset="https://blog.finxter.com/wp-content/uploads/2019/02/image-9.png 615w, https://blog.finxter.com/wp-content/uploads/2019/02/image-9-300x225.png 300w, https://blog.finxter.com/wp-content/uploads/2019/02/image-9-100x75.png 100w" sizes="auto, (max-width: 615px) 100vw, 615px" /></figure>



<p class="has-text-align-left">In this case, the <code>fit()</code> function finds the line that minimizes the squared error between the training data and the predictions as described above.</p>
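<p>One way to convince yourself that the fitted line is the best one is to compare its squared error with that of other lines. The sketch below does this for the data <code>[157, 156, 159]</code>; the two alternative lines are arbitrary hand-picked candidates, and <code>sum_squared_error</code> is a helper defined here for illustration.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

apple = np.array([157, 156, 159])
x = np.arange(len(apple))
model = LinearRegression().fit(x.reshape((-1, 1)), apple)

def sum_squared_error(slope, intercept):
    """Sum of squared differences between the line and the training data."""
    return float(np.sum((intercept + slope * x - apple) ** 2))

fitted = sum_squared_error(model.coef_[0], model.intercept_)
# Two hand-picked alternative lines for comparison (arbitrary choices).
through_first_and_last = sum_squared_error(1.0, 157.0)
flat_at_mean = sum_squared_error(0.0, apple.mean())
print(fitted, through_first_and_last, flat_at_mean)
```

<p>The fitted line has the smallest error of the three, even though it passes through none of the data points exactly.</p>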



<h2 class="wp-block-heading">Where to Go from Here?</h2>



<p>Do you feel like you need to brush up your coding skills? Then join my <a href="https://blog.finxter.com/subscribe/">free &#8220;Coffee Break Python Email Course&#8221;</a>. I&#8217;ll send you cheat sheets, daily Python lessons, and code contests. It&#8217;s fun!</p>
<p>The post <a href="https://blog.finxter.com/python-linear-regression-1-liner/">Python Linear Regression with sklearn &#8211; A Helpful Illustrated Guide</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.finxter.com/python-linear-regression-1-liner/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Logistic Regression Scikit-learn vs Statsmodels</title>
		<link>https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/</link>
		
		<dc:creator><![CDATA[Lukas Halim]]></dc:creator>
		<pubDate>Fri, 05 Feb 2021 15:44:50 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=22984</guid>

					<description><![CDATA[<p>What’s the difference between Statsmodels and Scikit-learn? Both have ordinary least squares and logistic regression, so it seems like Python is giving us two ways to do the same thing. Statsmodels offers modeling from the perspective of statistics. Scikit-learn offers some of the same models from the perspective of machine learning. So we need to ... <a title="Logistic Regression Scikit-learn vs Statsmodels" class="read-more" href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/" aria-label="Read more about Logistic Regression Scikit-learn vs Statsmodels">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/">Logistic Regression Scikit-learn vs Statsmodels</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>What’s the difference between Statsmodels and Scikit-learn? Both have ordinary least squares and logistic regression, so it seems like Python is giving us two ways to do the same thing. Statsmodels offers modeling from the perspective of <em>statistics</em>. Scikit-learn offers some of the same models from the perspective of <em>machine learning</em>.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Logistic Regression Scikit-learn vs Statsmodels" width="937" height="527" src="https://www.youtube.com/embed/inZpIyBm2Us?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>So we need to understand the difference between statistics and machine learning! Statistics makes mathematically valid inferences about a population based on sample data. Statistics answers the question, &#8220;What is the evidence that X is related to Y?&#8221; Machine learning has the goal of optimizing predictive accuracy rather than inference. Machine learning answers the question, &#8220;Given X, what prediction should we make for Y?&#8221;</p>



<p>In the example below, we&#8217;ll create a fake dataset with predictor variables and a binary Y variable. Then we&#8217;ll perform logistic regression with scikit-learn and statsmodels. We&#8217;ll see that scikit-learn allows us to easily tune the model to optimize predictive power. Statsmodels will provide a summary of statistical measures which will be very familiar to those who&#8217;ve used SAS or R.</p>



<p>If you need an intro to Logistic Regression, see <a href="https://blog.finxter.com/logistic-regression-in-one-line-python/" target="_blank" rel="noreferrer noopener">this Finxter post</a>.</p>



<h2 class="wp-block-heading" id="Create-Fake-Data-for-the-Logistic-Regression-Model">Create Fake Data for the Logistic Regression Model</h2>



<p>I tried using some publicly available data for this exercise but didn&#8217;t find one with the characteristics I wanted. So I decided to create some fake data by using <a href="https://blog.finxter.com/numpy-tutorial/" target="_blank" rel="noreferrer noopener" title="NumPy Tutorial – Everything You Need to Know to Get Started">NumPy</a>! There&#8217;s a post <a href="https://data.library.virginia.edu/simulating-a-logistic-regression-model/" target="_blank" rel="noreferrer noopener">here</a> that explains the math and how to do this in R.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
import pandas as pd

#The next line is setting the seed for the random number generator so that we get consistent results
rg = np.random.default_rng(seed=0)
#Create an array with 500 rows and 3 columns
X_for_creating_probabilities = rg.normal(size=(500,3))</pre>



<p>Create an array with the first column removed. The deleted column can be thought of as random noise, or as a variable that we don&#8217;t have access to when creating the model.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">X1 = np.delete(X_for_creating_probabilities,0,axis=1)
X1[:5]
"""
array([[-0.13210486,  0.64042265],
       [-0.53566937,  0.36159505],
       [ 0.94708096, -0.70373524],
       [-0.62327446,  0.04132598],
       [-0.21879166, -1.24591095]])
"""</pre>



<p>Now we&#8217;ll create two more columns correlated with X1. Datasets often have highly correlated variables. Correlation increases the likelihood of overfitting. Concatenate to get a single array.</p>



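<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Note: np.random.normal draws from NumPy's global, unseeded generator, so X2
# (and every result below) varies between runs; use rg.normal for reproducibility.
X2 = X1 + .1 * np.random.normal(size=(500,2))
X_predictors = np.concatenate((X1,X2),axis=1)</pre>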



<p>We want to create our outcome variable and have it be related to X_predictors. To do that, we use our data as inputs to the logistic regression model to get probabilities. Then we set the outcome variable, Y, to True when the probability is above .5.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">P = 1 / (1 + np.e**(-np.matmul(X_for_creating_probabilities,[1,1,1])))
Y = P > .5
#About half of cases are True
np.mean(Y)
#0.498
</pre>
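<p>The formula for <code>P</code> is the logistic (sigmoid) function applied to a linear score. As a small sketch, with <code>sigmoid</code> and <code>scores</code> as names introduced here, the snippet below shows that thresholding the probability at .5 amounts to checking whether the linear score is positive.</p>

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rg = np.random.default_rng(seed=0)
X = rg.normal(size=(500, 3))
scores = X @ np.array([1, 1, 1])   # linear score: the sum of the three columns
P = sigmoid(scores)

# Thresholding the probability at 0.5 is the same as checking the sign of the score.
same = np.array_equal(P > 0.5, scores > 0)
print(same)
```

<p>This also explains why about half of the cases are True: the sum of three standard normal variables is positive about half the time.</p>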



<p>Now divide the data into training and test data. We&#8217;ll run a logistic regression on the training data, then see how well the model performs on the test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">#Set the first 50 rows to train the model
X_train = X_predictors[:50]
Y_train = Y[:50]

#Set the remaining rows to test the model
X_test = X_predictors[50:]
Y_test = Y[50:]

print(f"X_train: {len(X_train)} X_test: {len(X_test)}")
#X_train: 50 X_test: 450</pre>
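<p>The same split can also be written with scikit-learn&#8217;s <code>train_test_split</code>. This is a sketch on stand-in arrays (<code>X_demo</code> and <code>Y_demo</code> are hypothetical, not the data from this post); with <code>shuffle=False</code> the first 50 rows become the training set, as in the slicing above.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

rg = np.random.default_rng(seed=0)
X_demo = rg.normal(size=(500, 4))    # stand-in for X_predictors
Y_demo = X_demo.sum(axis=1) > 0      # stand-in for Y

# shuffle=False keeps the row order, so the first 50 rows form the training set.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, train_size=50, shuffle=False
)
print(len(X_tr), len(X_te))
```

<p>Leaving <code>shuffle</code> at its default of <code>True</code> would draw the 50 training rows at random instead, which is usually preferable when the row order carries meaning.</p>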



<h2 class="wp-block-heading" id="Logistic-regression-with-Scikit-learn">Logistic regression with Scikit-learn</h2>



<p>We&#8217;re ready to train and test models.</p>



<p>As we train the models, we need to take steps to avoid overfitting. A machine learning model may have very accurate results with the data used to train the model. But this does not mean it will be equally accurate when making predictions with data it hasn&#8217;t seen before. When the model fails to generalize to new data, we say it has &#8220;overfit&#8221; the training data. Overfitting is more likely when there are few observations to train on, and when the model uses many correlated predictors.</p>



<p>How to avoid overfitting? By default, <a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">scikit-learn</a>&#8216;s logistic regression applies regularization. Regularization balances the need for predictive accuracy on the training data with a penalty on the magnitude of the model coefficients. Increasing the penalty reduces the coefficients and hence reduces the likelihood of overfitting. If the penalty is too large, though, it will reduce predictive power on both the training and test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_default = LogisticRegression(random_state=0).fit(X_train, Y_train)
print(f"intercept: {scikit_default.intercept_} coefficients: {scikit_default.coef_}")
print(f"train accuracy: {scikit_default.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_default.score(X_test, Y_test)}")
"""
Results will vary slightly, even when you set random_state.
intercept: [-0.44526823] coefficients: [[0.50031563 0.79636504 0.82047214 0.83635656]]
train accuracy: 0.8
test accuracy: 0.8088888888888889
"""</pre>



<p>We can turn off regularization by setting <code>penalty='none'</code>. Applying regularization reduces the magnitude of the coefficients, so turning it off lets the coefficients grow. Notice that the accuracy on the test data decreases. This indicates our model has overfit the training data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
# Note: scikit-learn 1.2 and later expect penalty=None; the string 'none' was removed in 1.4.
scikit_no_penalty = LogisticRegression(random_state=0,penalty='none').fit(X_train, Y_train)
print(f"intercept: {scikit_no_penalty.intercept_} coefficients: {scikit_no_penalty.coef_}")
print(f"train accuracy: {scikit_no_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_no_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.63388911] coefficients: [[-3.59878438  0.70813119  5.10660019  1.29684873]]
train accuracy: 0.82
test accuracy: 0.7888888888888889
"""
</pre>



<p>The parameter <code>C</code> is the inverse of the regularization strength and is 1.0 by default. Smaller values of <code>C</code> increase the regularization, so if we set the value to .1 we reduce the magnitude of the coefficients.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.linear_model import LogisticRegression
scikit_bigger_penalty = LogisticRegression(random_state=0,C=.1).fit(X_train, Y_train)
print(f"intercept: {scikit_bigger_penalty.intercept_} \
    coefficients: {scikit_bigger_penalty.coef_}")
print(f"train accuracy: {scikit_bigger_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_bigger_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.13102803]     coefficients: [[0.3021235  0.3919277  0.34359251 0.40332636]]
train accuracy: 0.8
test accuracy: 0.8066666666666666
"""
</pre>
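<p>To see the effect of <code>C</code> more systematically, the following sketch fits the model for several values and prints the overall size of the coefficient vector. The data (<code>X_demo</code>, <code>Y_demo</code>) is hypothetical and generated only for this illustration, so the numbers will not match the ones above.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rg = np.random.default_rng(seed=0)
X_demo = rg.normal(size=(50, 4))    # hypothetical predictors
Y_demo = (X_demo[:, 0] + X_demo[:, 1] + rg.normal(size=50)) > 0

norms = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C).fit(X_demo, Y_demo)
    # Euclidean length of the coefficient vector as a single size measure.
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm {norms[C]:.3f}")
```

<p>The norm should grow as <code>C</code> grows, because a larger <code>C</code> means a weaker penalty on the coefficients.</p>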



<p>It&#8217;s nice to be able to adjust the regularization strength, but how do we decide the optimal value? Scikit-learn&#8217;s GridSearchCV provides an effective and easy-to-use method for choosing it. The &#8220;Grid Search&#8221; in <strong>GridSearch</strong>CV means that we supply a <a href="https://blog.finxter.com/python-dictionary/" target="_blank" rel="noreferrer noopener" title="Python Dictionary – The Ultimate Guide">dictionary</a> with the parameter values we wish to test. The model is fit with all combinations of those values. If we have 4 possible values for C and 2 possible values for solver, we will search through all 4 × 2 = 8 combinations.</p>



<h3 class="wp-block-heading" id="GridSearchCV-Searches-Through-This-Grid">GridSearchCV Searches Through This Grid</h3>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>C</th><th>solver</th></tr></thead><tbody><tr><td>.01</td><td>newton-cg</td></tr><tr><td>.1</td><td>newton-cg</td></tr><tr><td>1</td><td>newton-cg</td></tr><tr><td>10</td><td>newton-cg</td></tr><tr><td>.01</td><td>lbfgs</td></tr><tr><td>.1</td><td>lbfgs</td></tr><tr><td>1</td><td>lbfgs</td></tr><tr><td>10</td><td>lbfgs</td></tr></tbody></table></figure>
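<p>The grid in the table can be enumerated with scikit-learn&#8217;s <code>ParameterGrid</code>. This is only a sketch for inspecting the combinations; GridSearchCV builds the grid from the dictionary itself, so you do not need this step in practice.</p>

```python
from sklearn.model_selection import ParameterGrid

# The same parameter values as in the table above.
parameters = {'C': [.01, .1, 1, 10], 'solver': ['newton-cg', 'lbfgs']}
combinations = list(ParameterGrid(parameters))
for combo in combinations:
    print(combo)
print(len(combinations))
```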



<p>The &#8220;CV&#8221; in GridSearch<strong>CV</strong> stands for <strong>c</strong>ross-<strong>v</strong>alidation. Cross-validation splits the training data into segments. In each iteration, the model is trained on all but one of the segments, and the remaining segment validates the model.</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th>Iteration</th><th>Segment 1</th><th>Segment 2</th><th>Segment 3</th><th>Segment 4</th><th>Segment 5</th></tr></thead><tbody><tr><td>1st Iteration</td><td>Validation</td><td>Train</td><td>Train</td><td>Train</td><td>Train</td></tr><tr><td>2nd Iteration</td><td>Train</td><td>Validation</td><td>Train</td><td>Train</td><td>Train</td></tr><tr><td>3rd Iteration</td><td>Train</td><td>Train</td><td>Validation</td><td>Train</td><td>Train</td></tr><tr><td>4th Iteration</td><td>Train</td><td>Train</td><td>Train</td><td>Validation</td><td>Train</td></tr><tr><td>5th Iteration</td><td>Train</td><td>Train</td><td>Train</td><td>Train</td><td>Validation</td></tr></tbody></table></figure>
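<p>The rotation in the table can be reproduced with <code>KFold</code>. This is a simplified sketch: for classifiers, GridSearchCV uses a stratified variant by default, which also balances the classes within each segment, but the train and validation rotation is the same idea.</p>

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 50               # same size as the training set in this post
kfold = KFold(n_splits=5)    # no shuffling: segments are contiguous blocks
splits = list(kfold.split(np.zeros((n_samples, 1))))
for i, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"iteration {i}: validate on rows {val_idx[0]}-{val_idx[-1]}, "
          f"train on {len(train_idx)} rows")
```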






<p>Grid search and cross-validation work in combination. GridSearchCV iterates through the values of C and solver for each split into training and validation segments. The algorithm selects the best estimator based on performance on the validation segments.</p>



<p>Doing this allows us to determine which values of C and solver work best for our training data. This is how <a href="https://blog.finxter.com/deploying-a-machine-learning-model-in-fastapi/" target="_blank" rel="noreferrer noopener" title="Deploying a machine learning model in FastAPI">scikit-learn</a> helps us to optimize predictive accuracy.</p>



<p>Let&#8217;s see it in action.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from sklearn.model_selection import GridSearchCV
parameters = {'C':[.01, .1, 1, 10],'solver':['newton-cg','lbfgs']}
Logistic = LogisticRegression(random_state=0)
scikit_GridSearchCV = GridSearchCV(Logistic, parameters)
scikit_GridSearchCV.fit(X_train, Y_train)
print(f"best estimator: {scikit_GridSearchCV.best_estimator_}")
#best estimator: LogisticRegression(C=0.1, random_state=0, solver='newton-cg')</pre>



<p>The <code>score</code> method returns the mean accuracy on the given test data and labels. Accuracy is the percentage of observations correctly predicted.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(f"train accuracy: {scikit_GridSearchCV.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_GridSearchCV.score(X_test, Y_test)}")
"""
train accuracy: 0.82
test accuracy: 0.8133333333333334
"""</pre>



<h2 class="wp-block-heading" id="Logistic-regression-with-Statsmodels">Logistic regression with Statsmodels</h2>



<p>Now let&#8217;s try the same, but with statsmodels. With scikit-learn, to turn off regularization we set <code>penalty='none'</code>, but with statsmodels regularization is turned off by default. A quirk to watch out for is that statsmodels does not include an intercept by default. To include an intercept, we use the <code>sm.add_constant</code> function.</p>
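<p>Adding a constant means adding a column of ones to the predictors, whose coefficient becomes the intercept. The sketch below mimics this with plain NumPy on a tiny hypothetical array (<code>X_demo</code>); it assumes the default behavior of <code>sm.add_constant</code>, which puts the constant in the first column.</p>

```python
import numpy as np

X_demo = np.array([[0.5, -1.2],
                   [1.5, 0.3],
                   [-0.7, 2.0]])   # hypothetical predictors

# Prepend a column of ones; its coefficient becomes the intercept.
X_with_constant = np.column_stack([np.ones(len(X_demo)), X_demo])
print(X_with_constant)
```

<p>With the constant in column 0, a slice such as <code>[:, :3]</code> keeps the constant plus the first two predictors.</p>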



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import statsmodels.api as sm

#adding constant to X
X_train_with_constant = sm.add_constant(X_train)
X_test_with_constant = sm.add_constant(X_test)

# building the model and fitting the data
sm_model_all_predictors = sm.Logit(Y_train, X_train_with_constant).fit()

# printing the fitted parameters (the intercept comes first)
print(sm_model_all_predictors.params)
"""
Optimization terminated successfully.
         Current function value: 0.446973
         Iterations 7
[-0.57361523 -2.00207425  1.28872367  3.53734636  0.77494424]
"""</pre>



<p>If you&#8217;re used to doing logistic regression in R or SAS, what comes next will be familiar. Once we have trained the logistic regression model with statsmodels, the summary method will easily produce a table with statistical measures including p-values and confidence intervals.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sm_model_all_predictors.summary()</pre>



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><th>Dep. Variable:</th><td>y</td><th>No. Observations:</th><td>50</td></tr><tr><th>Model:</th><td>Logit</td><th>Df Residuals:</th><td>45</td></tr><tr><th>Method:</th><td>MLE</td><th>Df Model:</th><td>4</td></tr><tr><th>Date:</th><td>Thu, 04 Feb 2021</td><th>Pseudo R-squ.:</th><td>0.3846</td></tr><tr><th>Time:</th><td>14:33:19</td><th>Log-Likelihood:</th><td>-21.228</td></tr><tr><th>converged:</th><td>True</td><th>LL-Null:</th><td>-34.497</td></tr><tr><th>Covariance Type:</th><td>nonrobust</td><th>LLR p-value:</th><td>2.464e-05</td></tr></tbody></table></figure>



<figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><th>coef</th><th>std err</th><th>z</th><th>P&gt;|z|</th><th>[0.025</th><th>0.975]</th></tr><tr><th>const</th><td>-0.7084</td><td>0.478</td><td>-1.482</td><td>0.138</td><td>-1.645</td><td>0.228</td></tr><tr><th>x1</th><td>5.5486</td><td>4.483</td><td>1.238</td><td>0.216</td><td>-3.237</td><td>14.335</td></tr><tr><th>x2</th><td>10.2566</td><td>5.686</td><td>1.804</td><td>0.071</td><td>-0.887</td><td>21.400</td></tr><tr><th>x3</th><td>-3.9137</td><td>4.295</td><td>-0.911</td><td>0.362</td><td>-12.333</td><td>4.505</td></tr><tr><th>x4</th><td>-7.8510</td><td>5.364</td><td>-1.464</td><td>0.143</td><td>-18.364</td><td>2.662</td></tr></tbody></table></figure>



<p>There&#8217;s a lot here, but we&#8217;ll focus on the second table with the coefficients.</p>



<p>The first column shows the value for the coefficient. The fourth column, with the heading P&gt;|z|, shows the p-values. A p-value is a probability measure, and p-values above .05 are frequently considered &#8220;not statistically significant.&#8221; None of the predictors are considered statistically significant! This is because we have a relatively small number of observations in our training data and because the predictors are highly correlated. Some statistical packages like R and SAS have built-in methods to select the features to include in the model based on which predictors have low (significant) p-values, but unfortunately, this isn&#8217;t available in statsmodels.</p>
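<p>The columns of the coefficient table are related by simple arithmetic. As a sketch, the snippet below recomputes z, the p-value, and the 95% interval for the x2 row from its coefficient and standard error, using only the standard library. The helper names are introduced here for illustration, and the interval uses the rounded critical value 1.96, so the last digits can differ slightly from the table.</p>

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def confidence_interval(coef, std_err, z_crit=1.96):
    """Approximate 95% interval: coefficient plus or minus 1.96 standard errors."""
    return coef - z_crit * std_err, coef + z_crit * std_err

# Values copied from the x2 row of the summary table above.
coef, std_err = 10.2566, 5.686
z = coef / std_err
p = two_sided_p_value(z)
low, high = confidence_interval(coef, std_err)
print(round(z, 3), round(p, 3), round(low, 3), round(high, 3))
```

<p>The interval contains zero exactly when the p-value is above .05, which is why both columns tell the same story about significance.</p>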



<p>If we try again with just x1 and x2, we&#8217;ll get a completely different result, with very low p-values for x1 and x2, meaning that the evidence for a relationship with the dependent variable is statistically significant. We&#8217;re cheating, though &#8211; because we created the data, we know that we only need x1 and x2.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sm_model_x1_x2 = sm.Logit(Y_train, X_train_with_constant[:,:3]).fit()
sm_model_x1_x2.summary()</pre>



<p>Now we see x1 and x2 are both statistically significant.</p>



<p>Statsmodels doesn&#8217;t have the same accuracy method that we have in scikit-learn. We&#8217;ll use the predict method to predict the probabilities. Then we&#8217;ll use the decision rule that probabilities above .5 are true and all others are false. This is the same rule used when scikit-learn calculates accuracy.</p>
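<p>The decision rule and the accuracy calculation can be checked on a tiny example. The probabilities and labels below are hypothetical; the sketch compares the manual calculation with scikit-learn&#8217;s <code>accuracy_score</code>.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score

probabilities = np.array([0.9, 0.4, 0.6, 0.2, 0.5])    # hypothetical model outputs
y_true = np.array([True, False, False, False, True])

y_pred = probabilities > 0.5    # probabilities above .5 count as True
manual = (y_true == y_pred).mean()
print(manual, accuracy_score(y_true, y_pred))
```

<p>Note that a probability of exactly .5 is classified as False under this rule.</p>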



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">all_predicted_train = sm_model_all_predictors.predict(X_train_with_constant)>.5
all_predicted_test = sm_model_all_predictors.predict(X_test_with_constant)>.5

x1_x2_predicted_train = sm_model_x1_x2.predict(X_train_with_constant[:,:3])>.5
x1_x2_predicted_test = sm_model_x1_x2.predict(X_test_with_constant[:,:3])>.5

#calculate the accuracy
print(f"train: {(Y_train==all_predicted_train).mean()} and test: {(Y_test==all_predicted_test).mean()}")
print(f"train: {(Y_train==x1_x2_predicted_train).mean()} and test: {(Y_test==x1_x2_predicted_test).mean()}")
"""
train: 0.8 and test: 0.8066666666666666
train: 0.8 and test: 0.8111111111111111
"""</pre>



<h2 class="wp-block-heading" id="Summarizing-The-Results">Summarizing The Results</h2>



<p>Let&#8217;s create a <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" target="_blank" rel="noreferrer noopener" title="How to Create a DataFrame in Pandas?">DataFrame</a> with the results. The models have similar, though not identical, accuracy on both the training and the test data. The models with all the predictors and without regularization have the worst test accuracy, suggesting that they have overfit the training data and so do not generalize well to new data.</p>



<p>Even if we use the best methods in creating our model, there is still chance involved in how well it generalizes to the test data.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">lst = [['scikit-learn','default', scikit_default.score(X_train, Y_train),scikit_default.score(X_test, Y_test)],
       ['scikit-learn','no penalty', scikit_no_penalty.score(X_train, Y_train),scikit_no_penalty.score(X_test, Y_test)],
       ['scikit-learn','bigger penalty', scikit_bigger_penalty.score(X_train, Y_train),scikit_bigger_penalty.score(X_test, Y_test)],
       ['scikit-learn','GridSearchCV', scikit_GridSearchCV.score(X_train, Y_train),scikit_GridSearchCV.score(X_test, Y_test)],
       ['statsmodels','include intercept and all predictors', (Y_train==all_predicted_train).mean(),(Y_test==all_predicted_test).mean()],
       ['statsmodels','include intercept and x1 and x2', (Y_train==x1_x2_predicted_train).mean(),(Y_test==x1_x2_predicted_test).mean()]
      ]
df = pd.DataFrame(lst, columns =['package', 'setting','train accuracy','test accuracy'])
df</pre>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th></th><th>package</th><th>setting</th><th>train accuracy</th><th>test accuracy</th></tr></thead><tbody><tr><th>0</th><td>scikit-learn</td><td>default</td><td>0.80</td><td>0.808889</td></tr><tr><th>1</th><td>scikit-learn</td><td>no penalty</td><td>0.78</td><td>0.764444</td></tr><tr><th>2</th><td>scikit-learn</td><td>bigger penalty</td><td>0.82</td><td>0.813333</td></tr><tr><th>3</th><td>scikit-learn</td><td>GridSearchCV</td><td>0.80</td><td>0.808889</td></tr><tr><th>4</th><td>statsmodels</td><td>include intercept and all predictors</td><td>0.78</td><td>0.764444</td></tr><tr><th>5</th><td>statsmodels</td><td>include intercept and x1 and x2</td><td>0.80</td><td>0.811111</td></tr></tbody></table></figure>



<h2 class="wp-block-heading" id="Scikit-learn-vs-Statsmodels">Scikit-learn vs Statsmodels</h2>



<p>The upshot is that you should use scikit-learn for logistic regression unless you need the statistical results provided by statsmodels.</p>



<p>Here&#8217;s a table of the most relevant similarities and differences:</p>



<figure class="wp-block-table is-style-stripes"><table><thead><tr><th></th><th>Scikit-learn</th><th>Statsmodels</th></tr></thead><tbody><tr><td>Regularization</td><td>Uses L2 regularization by default, but regularization can be turned off using <code>penalty='none'</code></td><td>Does not use regularization by default</td></tr><tr><td>Hyperparameter tuning</td><td>GridSearchCV allows for easy tuning of regularization parameter</td><td>User will need to write lines of code to tune regularization parameter</td></tr><tr><td>Intercept</td><td>Includes intercept by default</td><td>Use the add_constant method to include an intercept</td></tr><tr><td>Model Evaluation</td><td>The score method reports prediction accuracy</td><td>The summary method shows p-values, confidence intervals, and other statistical measures</td></tr><tr><td>When should you use it?</td><td>For accurate predictions</td><td>For statistical inference</td></tr><tr><td>Comparison with R and SAS</td><td>Different</td><td>Similar</td></tr></tbody></table></figure>



<p>That&#8217;s it for now! Please check out my other work at <a href="http://learningtableau.com" target="_blank" rel="noreferrer noopener">learningtableau.com</a> and my new site <a href="http://datasciencedrills.com" target="_blank" rel="noreferrer noopener">datasciencedrills.com</a>.</p>
<p>The post <a href="https://blog.finxter.com/logistic-regression-scikit-learn-vs-statsmodels/">Logistic Regression Scikit-learn vs Statsmodels</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Decision Tree Learning &#8212; A Helpful Illustrated Guide in Python</title>
		<link>https://blog.finxter.com/decision-tree-machine-learning/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Thu, 07 Jan 2021 20:07:00 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-learn Library]]></category>
		<category><![CDATA[decision tree]]></category>
		<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=424</guid>

					<description><![CDATA[<p>This tutorial will show you everything you need to get started training your first models using decision tree learning in Python. To help you grasp this topic thoroughly, I attacked it from different perspectives: textual, visual, and audio-visual. So, let&#8217;s get started! Why Decision Trees? Deep learning has become the megatrend within artificial intelligence and ... <a title="Decision Tree Learning &#8212; A Helpful Illustrated Guide in Python" class="read-more" href="https://blog.finxter.com/decision-tree-machine-learning/" aria-label="Read more about Decision Tree Learning &#8212; A Helpful Illustrated Guide in Python">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/decision-tree-machine-learning/">Decision Tree Learning &#8212; A Helpful Illustrated Guide in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>This tutorial will show you everything you need to get started training your first models using decision tree learning in Python. To help you grasp this topic thoroughly, I attacked it from different perspectives: textual, visual, and audio-visual. So, let&#8217;s get started!</p>



<h2 class="wp-block-heading">Why Decision Trees?</h2>



<p>Deep learning has become the megatrend within <a href="https://blog.finxter.com/introduction-to-machine-learning-and-its-applications/" title="Introduction To Machine Learning And Its Applications" target="_blank" rel="noreferrer noopener">artificial intelligence and machine learning</a>. Yet, training large neural networks is <strong>not</strong> always the best choice. It&#8217;s the bazooka in machine learning, effective but not efficient.</p>



<p>A human will not understand in practice why the neural network classifies one way or the other. It is just a black box. Should you blindly invest your money into a stock recommended by a neural network? As you do not know the basis of the decision of a neural network, it can be hard to blindly trust its recommendations.</p>



<p>Many ML divisions in large companies must be able to <strong><em>explain the reasoning of their ML algorithms</em></strong>. Deep learning models fail to do this, but this is where decision trees excel!</p>



<p>This is one reason for the popularity of decision trees. <strong><em>Decision trees are more human-friendly and intuitive.</em></strong> You know exactly how the decisions emerged. And you can even hand-tune the ML model if you want to.</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="https://blog.finxter.com/wp-content/uploads/2018/07/ML_DecisionTree.png"><img loading="lazy" decoding="async" width="960" height="540" src="https://blog.finxter.com/wp-content/uploads/2018/07/ML_DecisionTree.png" alt="Decision Tree Image" class="wp-image-425" srcset="https://blog.finxter.com/wp-content/uploads/2018/07/ML_DecisionTree.png 960w, https://blog.finxter.com/wp-content/uploads/2018/07/ML_DecisionTree-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2018/07/ML_DecisionTree-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></a></figure></div>



<p>The decision tree consists of branching nodes and leaf nodes. A branching node is a variable (also called <em>feature</em>) that is given as input to your decision problem. For each possible value of this feature, there is a <em>child node</em>. </p>



<p>A <em>leaf node</em> represents the predicted class given the feature values along the path from the root. Each leaf node has an associated probability, i.e., how often we have seen this particular instance (choice of feature values) in the training data. Moreover, each leaf node has an associated class or output value, which is the predicted class for the input described by the branching nodes.</p>



<h2 class="wp-block-heading">Video: Decision Trees</h2>



<p>I explain decision trees in this video:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Decision Tree Learning Made Simple (Python)" width="937" height="527" src="https://www.youtube.com/embed/KuwrWb1fj-4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>In case you need to refresh your Python skills, feel free to deepen your Python code understanding with the <a href="https://finxter.com" target="_blank" rel="noreferrer noopener">Finxter web app</a>.</p>



<h2 class="wp-block-heading">Explanation With a Simple Example</h2>



<p>You already know decision trees very well from your own experience. They represent a <strong><em>structured way of making decisions</em></strong> – each decision opening new branches. By answering a bunch of questions, you will finally land on the recommended outcome. </p>



<p>Here is an example:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-13.png" alt="Decision Tree Example" class="wp-image-2480" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-13.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-13-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-13-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /></figure>



<p>Decision trees are used for classification problems such as <em>“which subject should I study, given my interests?”</em>. You start at the top. Now, you repeatedly answer questions (select the choices that describe your features best). Finally, you reach a leaf node of the tree. This is the recommended class based on your feature selection.</p>
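
<p>To make the walk from the top of the tree to a leaf concrete, here is a minimal sketch in plain Python. The tree below is hypothetical (its questions and subjects are made up for illustration and do not reproduce the figure exactly): branching nodes are dictionaries, and leaf nodes are plain strings.</p>

```python
# Hypothetical decision tree: branching nodes are dicts, leaf nodes are strings
subject_tree = {
    "feature": "likes maths",
    "children": {
        "yes": {
            "feature": "likes language",
            "children": {"yes": "computer linguistics", "no": "computer science"},
        },
        "no": {
            "feature": "likes painting",
            "children": {"yes": "art", "no": "linguistics"},
        },
    },
}


def classify(node, answers):
    """Follow the answers from the root until a leaf (a string) is reached."""
    while isinstance(node, dict):
        answer = answers[node["feature"]]
        node = node["children"][answer]
    return node


recommendation = classify(
    subject_tree, {"likes maths": "yes", "likes language": "no"}
)
print(recommendation)
```

<p>Each loop iteration answers one question and moves one level down, which is the same top-to-leaf walk described above.</p>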



<p>There are many nuances to decision tree learning. For example, in the above figure, the first question carries more weight than the last question. If you like <a href="https://blog.finxter.com/python-math-module/" target="_blank" rel="noreferrer noopener" title="Python Math Module [Ultimate Guide]">maths</a>, the decision tree will never recommend art or linguistics to you. This is useful because some features may be much more important for the classification decision than others. For example, a classification system that predicts your current health may use your sex (feature) to practically rule out many diseases (classes).</p>



<p>Hence, the order of the decision nodes lends itself to performance optimizations: <strong><em>place the features that have a high impact on the final classification at the top.</em></strong> Decision tree learning will then aggregate the questions that do not have a high impact on the final classification, as shown in the next graphic:</p>
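
<p>One common way to measure the "impact" of a feature is information gain, i.e., how much the entropy of the class labels drops after splitting on that feature. The following is a minimal sketch under that assumption; the tutorial does not prescribe a specific criterion, and the tiny dataset is made up for illustration.</p>

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: (likes maths, likes language) --> study field
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
labels = ["computer science", "computer science", "art", "art"]

gains = [information_gain(rows, labels, i) for i in range(2)]
print(gains)
```

<p>In this toy dataset, the first feature separates the classes perfectly while the second tells you nothing, so the first one belongs at the top of the tree.</p>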



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="604" height="340" src="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-14.png" alt="Decision Tree Entropy Example" class="wp-image-2481" srcset="https://blog.finxter.com/wp-content/uploads/2019/03/grafik-14.png 604w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-14-300x169.png 300w, https://blog.finxter.com/wp-content/uploads/2019/03/grafik-14-100x56.png 100w" sizes="auto, (max-width: 604px) 100vw, 604px" /></figure>



<p>Suppose the full decision tree looks like the tree on the left. For any combination of features, there is a separate classification outcome (the tree leaves). However, some features may not give you any additional information with respect to the classification problem (e.g. the first “Language” decision node in the example). Decision tree learning would effectively get rid of these nodes for efficiency reasons. This is called “pruning”.</p>
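
<p>The effect of an uninformative feature can be sketched with <code>sklearn</code>. In this made-up dataset, the first column fully determines the label and the second column adds no information. Note that this is an illustration of the idea, not of an explicit pruning step: the learner simply never splits on the uninformative feature.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: column 0 decides the label, column 1 carries no information
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array(["art", "art", "computer science", "computer science"])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Share of the total impurity reduction contributed by each feature
importances = clf.feature_importances_
# Total number of nodes (branching nodes plus leaves) in the fitted tree
node_count = clf.tree_.node_count
print(importances, node_count)
```

<p>The fitted tree consists of a single branching node and two leaves, and the second feature receives an importance of zero.</p>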



<h2 class="wp-block-heading">Decision Tree Code in Python</h2>



<p>Here&#8217;s some code that shows how to run a decision tree in Python using the <code><a href="https://blog.finxter.com/scikit-learn-cheat-sheets/" target="_blank" rel="noreferrer noopener" title="[Collection] 10 Scikit-Learn Cheat Sheets Every Machine Learning Engineer Must Have">sklearn</a></code> library for machine learning:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">## Dependencies
import numpy as np
from sklearn import tree


## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"]])


## One-liner
## NumPy stores the mixed array as strings, so cast the scores back to floats
Tree = tree.DecisionTreeClassifier().fit(X[:,:-1].astype(float), X[:,-1])

## Result &amp; puzzle
student_0 = Tree.predict([[8, 6, 5]])
print(student_0)

student_1 = Tree.predict([[3, 7, 9]])
print(student_1)</pre>



<p>The data in the code snippet describes three students with their estimated skill level (a score between 1 and 10) in three areas: math, language, and creativity. We also know the study subjects of these students. For example, the first student is highly skilled in math and studies computer science. The second student is much more skilled in language than in the other two areas and studies literature. The third student is strong in creativity and studies art.</p>



<p>The <a href="https://blog.finxter.com/decision-tree-learning-in-one-line-python/" target="_blank" rel="noreferrer noopener" title="Decision Tree Learning in One Line Python">one-liner</a> creates a new decision tree object and trains the model using the <code>fit</code> function on the labeled training data (the last column is the label). Internally, it builds a tree whose branching nodes each test one of the features math, language, or creativity.</p>



<p>When predicting the class of <code>student_0 (math=8, language=6, creativity=5)</code>, the decision tree returns <code>“computer science”</code>. It has learned that this feature pattern <em><strong>(high, medium, medium)</strong></em> is an indicator for the first class. On the other hand, when asked for <code>(3, 7, 9)</code>, the decision tree predicts <code>“art”</code> because it has learned that the score <strong><em>(low, medium, high)</em></strong> hints at the third class.</p>
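
<p>If you want to see which rules the tree has actually learned, <code>sklearn</code> can print them as text. The following is a minimal sketch that keeps the scores and the labels in two separate arrays (a variation on the snippet above); the variable names and the fixed <code>random_state</code> are choices made for this example.</p>

```python
import numpy as np
from sklearn import tree

# Same three students as above, with scores and labels in separate arrays
X = np.array([[9, 5, 6], [1, 8, 1], [5, 7, 9]])
y = np.array(["computer science", "literature", "art"])

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Human-readable view of the learned branching nodes and leaves
rules = tree.export_text(clf, feature_names=["math", "language", "creativity"])
print(rules)

leaf_count = clf.get_n_leaves()
prediction = clf.predict([[8, 6, 5]])[0]
print(leaf_count, prediction)
```

<p>With three training examples from three different classes, the tree needs one leaf per class, and the printed rules show which score thresholds separate them.</p>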



<p>Note that the algorithm is non-deterministic. In other words, when executing the same code twice, different results may arise. This is common for machine learning algorithms that work with <a href="https://blog.finxter.com/python-random-module/" target="_blank" rel="noreferrer noopener" title="Python’s Random Module – Everything You Need to Know to Get Started">random</a> generators. In this case, the features are randomly permuted at each split, so ties between equally good splits may be broken differently and the final decision tree may have a different structure. To get reproducible results, pass a fixed integer as the <code>random_state</code> argument of <code>DecisionTreeClassifier</code>.</p>
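
<p>Here is a minimal sketch of how to pin down the randomness. The helper function <code>fit_and_describe</code> and the seed value are hypothetical and exist only for this illustration.</p>

```python
import numpy as np
from sklearn import tree

X = np.array([[9, 5, 6], [1, 8, 1], [5, 7, 9]])
y = np.array(["computer science", "literature", "art"])


def fit_and_describe(seed):
    """Train a tree with a fixed seed and return its rules as text."""
    clf = tree.DecisionTreeClassifier(random_state=seed).fit(X, y)
    return tree.export_text(clf, feature_names=["math", "language", "creativity"])


# Two runs with the same seed build exactly the same tree
first_run = fit_and_describe(seed=42)
second_run = fit_and_describe(seed=42)
print(first_run == second_run)
```

<p>With the same seed, both runs produce identical rules. Without a fixed seed, the rules may differ from run to run.</p>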



<h2 class="wp-block-heading">Where to Go From Here?</h2>



<p>Enough theory. Let’s get some practice!</p>



<p>Coders get paid six figures and more because they can solve problems more effectively using machine intelligence and automation. </p>



<p>To become more successful in coding, solve more real problems for real people. That’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?</p>



<p><strong>You build high-value coding skills by working on practical coding projects!</strong></p>



<p>Do you want to stop learning with toy projects and focus on practical code projects that earn you money and solve real problems for people?</p>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /> If your answer is <strong><em>YES!</em></strong>, consider becoming a <a rel="noreferrer noopener" href="https://blog.finxter.com/become-python-freelancer-course/" data-type="page" data-id="2072" target="_blank">Python freelance developer</a>! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.</p>



<p>If you just want to learn about the freelancing opportunity, feel free to watch my free webinar <a rel="noreferrer noopener" href="https://blog.finxter.com/webinar-freelancer/" target="_blank">“How to Build Your High-Income Skill Python”</a> and learn how I grew my coding business online and how you can, too—from the comfort of your own home.</p>



<p><a href="https://blog.finxter.com/webinar-freelancer/" target="_blank" rel="noreferrer noopener">Join the free webinar now!</a></p>
<p>The post <a href="https://blog.finxter.com/decision-tree-machine-learning/">Decision Tree Learning &#8212; A Helpful Illustrated Guide in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
