SciPy Archives - Be on the Right Side of Change

How to Fit a Curve to Power-law Distributed Data in Python

Chris — Sun, 31 Mar 2024 07:21:01 +0000

In this tutorial, you’ll learn how to generate synthetic data that follows a power-law distribution, plot its cumulative distribution function (CDF), and fit a power-law curve to this CDF using Python. This process is useful for analyzing datasets that follow power-law distributions, which are common in natural and social phenomena.

Prerequisites

Ensure you have Python installed, along with the numpy, matplotlib, and scipy libraries. If not, you can install them using pip:

pip install numpy matplotlib scipy

Step 1: Generate Power-law Distributed Data

First, we’ll generate a dataset that follows a power-law distribution using numpy.

import numpy as np

# Parameters
alpha = 3.0  # Exponent of the distribution
size = 1000  # Number of data points

# Generate power-law distributed data
data = np.random.power(a=alpha, size=size)
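It may help to know what np.random.power actually samples: the power function distribution on the interval [0, 1] with density a·x^(a-1), whose CDF is x^a. The following sketch checks this by comparing direct sampling with inverse-transform sampling. It assumes NumPy's newer Generator API and a fixed seed, neither of which is used in the snippet above.

```python
import numpy as np

alpha = 3.0
size = 100_000

rng = np.random.default_rng(seed=42)  # assumption: seeded for repeatability

# Direct sampling from the power function distribution on [0, 1]
direct = rng.power(alpha, size=size)

# Inverse-transform sampling: if U is uniform on [0, 1), U**(1/alpha) has CDF x**alpha
inverse = rng.random(size) ** (1.0 / alpha)

# Both sample means should be close to the theoretical mean alpha / (alpha + 1)
theoretical_mean = alpha / (alpha + 1)
print(direct.mean(), inverse.mean(), theoretical_mean)
```

Because the CDF is x^a, a fit of a·x^b to the empirical CDF should recover a close to 1 and b close to alpha.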

The data looks like this:

Let’s make some sense out of it and plot it in 2D space:

Step 2: Plot the Cumulative Distribution Function (CDF)

Next, we’ll plot the CDF of the generated data on a log-log scale to visualize its power-law distribution.

import matplotlib.pyplot as plt

# Prepare data for the CDF plot
sorted_data = np.sort(data)
yvals = np.arange(1, len(sorted_data) + 1) / float(len(sorted_data))

# Plot the CDF
plt.plot(sorted_data, yvals, marker='.', linestyle='none', color='blue')
plt.xlabel('Value')
plt.ylabel('Cumulative Frequency')
plt.title('CDF of Power-law Distributed Data')
plt.xscale('log')
plt.yscale('log')
plt.grid(True, which="both", ls="--")
plt.show()

The plot:

Step 3: Fit a Power-law Curve to the CDF

To understand the underlying power-law distribution better, we fit a curve to the CDF using the curve_fit function from scipy.optimize.

from scipy.optimize import curve_fit

# Power-law fitting function
def power_law_fit(x, a, b):
    return a * np.power(x, b)

# Fit the power-law curve
params, covariance = curve_fit(power_law_fit, sorted_data, yvals)

# Generate fitted values
fitted_yvals = power_law_fit(sorted_data, *params)
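To judge the fit numerically rather than only by eye, the covariance matrix returned by curve_fit can be turned into standard errors for the two parameters. The sketch below repeats the steps above in a self-contained form; the seeded generator and the variable names a_hat, b_hat, and std_errors are assumptions added for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(seed=0)  # assumption: seeded so the run is repeatable
alpha = 3.0
data = rng.power(alpha, size=1000)

# Empirical CDF of the sample
sorted_data = np.sort(data)
yvals = np.arange(1, len(sorted_data) + 1) / len(sorted_data)

def power_law_fit(x, a, b):
    return a * np.power(x, b)

params, covariance = curve_fit(power_law_fit, sorted_data, yvals)

# The square roots of the diagonal are the one-sigma parameter uncertainties
std_errors = np.sqrt(np.diag(covariance))
a_hat, b_hat = params
print(f"a = {a_hat:.3f} +/- {std_errors[0]:.3f}")
print(f"b = {b_hat:.3f} +/- {std_errors[1]:.3f}")
```

For data drawn with alpha = 3.0, the fitted exponent b should land near 3 and the prefactor a near 1.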

Step 4: Plot the Fitted Curve with the CDF

Finally, we’ll overlay the fitted power-law curve on the original CDF plot to visually assess the fit.

# Plot the original CDF and the fitted power-law curve
plt.plot(sorted_data, yvals, marker='.', linestyle='none', color='blue', label='Original Data')
plt.plot(sorted_data, fitted_yvals, 'r-', label='Fitted Power-law Curve')
plt.xlabel('Value')
plt.ylabel('Cumulative Frequency')
plt.title('CDF with Fitted Power-law Curve')
plt.xscale('log')
plt.yscale('log')
plt.grid(True, which="both", ls="--")
plt.legend()
plt.show()

Voilà!

This visualization helps in assessing the accuracy of the power-law model in describing the distribution of the data.

Sample a Random Number from a Probability Distribution in Python

Shubham Sayon — Sun, 19 Jun 2022 05:28:58 +0000

Problem Formulation

Challenge: Given a list of numbers and a probability for each of them, how do you select a number randomly from the list according to this probability distribution?

When you select a number randomly from a list using a given probability distribution, each number is returned with a likelihood that corresponds to its relative weight (probability). Let’s visualize this with the help of an example.

Example:

Given:
numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]

Expected Output: Choose the elements randomly from the given list and display 5 elements in the output list: 
[30, 10, 20, 30, 30] 

Note: The output can vary.

The expected output has the number ’30’ three times since it has the highest weight/probability. The relative weights assigned are 0.3, 0.2 and 0.5, respectively. This means:

Chances of selecting 10 are 30%.
Chances of selecting 20 are 20%.
Chances of selecting 30 are 50%.

Note: We will first have a look at the numerous ways of solving the given question and then dive into a couple of exercises for further clarity. So without further delay, let’s dive into our mission-critical question and solve it.

Method 1: Using random.choices

choices() is a function of Python’s random module that returns a list of items selected at random, with replacement, from the specified sequence. This sequence can be a list, tuple, string, or any other kind of sequence.
The probability of picking each item can be specified using the weights or the cum_weights parameter.

Syntax:
random.choices(population, weights=None, *, cum_weights=None, k=1)

Parameter	Description
population	– It is a mandatory parameter. – Represents a sequence like a range of numbers, a list, a tuple, etc.
weights	– It is an optional parameter. – Represents a list of relative weights, one for each value. – By default, it is None.
cum_weights	– It is an optional parameter. – Represents a list of weights, one for each value, but accumulated. For example: the normal weights `[2, 3, 5]` are equivalent to the cum_weights `[2, 5, 10]`. – By default, it is None.
k	– It is an optional parameter. – Represents an integer that determines the length of the returned list. – By default, it is 1.

Approach: Call the random.choices() function and feed in the given list and the weights/probability distributions as parameters.

Code:

import random
numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]
random_number = random.choices(numbers, distributions, k=5)
print(random_number)

Output:

[10, 30, 30, 10, 20]

Caution:

If the relative or cumulative weight is not specified, then the random.choices() function will automatically select elements with equal probability.
The specified weights should always be of the same length as the specified sequence.
If you specify relative weights as well as cumulative weight at the same time, you will get a TypeError (TypeError: Cannot specify both weights and cumulative weights). Hence, to avoid the error, do not specify both at the same time.
The cum_weights or weights can only be integers, floats, and fractions. They cannot be decimals. Also, you must ensure that the weights are non-negative.
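The relationship between weights and cum_weights, and the effect of the weights on the result, can be checked with a short experiment. This is a minimal sketch; the seed, the sample size of 10,000, and the variable names are assumptions chosen for illustration.

```python
import itertools
import random
from collections import Counter

numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]

# Running totals of the relative weights
cumulative = list(itertools.accumulate(distributions))

# Same seed, once with relative and once with cumulative weights
random.seed(1)
by_weights = random.choices(numbers, weights=distributions, k=10_000)
random.seed(1)
by_cum_weights = random.choices(numbers, cum_weights=cumulative, k=10_000)

# Empirical frequencies should be close to 0.3, 0.2 and 0.5
counts = Counter(by_weights)
frequencies = {n: counts[n] / len(by_weights) for n in numbers}
print(frequencies)
print(by_weights == by_cum_weights)
```

With many draws, the observed frequencies approach the specified probabilities, which is a quick sanity check for any weighted sampler.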

Method 2: Using numpy.random.choice

Another way to sample a random number from a probability distribution is to use the numpy.random.choice() function.

choice() is a function of the numpy.random module that draws random values from a given array. It accepts an array as a parameter and, by default, returns one randomly chosen value from it.

Syntax:
numpy.random.choice(a, size=None, replace=True, p=None)

Parameter	Description
a	– Represents the array to sample from.
size	– Represents the number of values to draw, which determines the length of the returned array.
replace	– Determines whether a value can be drawn more than once. By default, it is True.
p	– Represents the probability of each value of the given array. The probabilities must sum to 1.

Approach: Use the numpy.random.choice(a, size, replace, p) function with replace set to True to return an array of the required size, drawn from the list a according to the corresponding probabilities p.

Code:

import numpy as np
numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]
random_number = np.random.choice(numbers, 5, True, distributions)
print(random_number)

Output:

[30 20 30 10 30]
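NumPy also offers a newer Generator interface, created with np.random.default_rng(), whose choice() method takes the same arguments. The following sketch uses it with keyword arguments for readability; the seed value is an assumption, added only so the run is repeatable.

```python
import numpy as np

numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]

rng = np.random.default_rng(seed=7)  # assumption: fixed seed for repeatability

# Five weighted draws with replacement
sample = rng.choice(numbers, size=5, replace=True, p=distributions)
print(sample)

# Without replacement, each value can appear at most once
without_replacement = rng.choice(numbers, size=3, replace=False, p=distributions)
print(without_replacement)
```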

Method 3: Using Scipy

Scipy is another handy library to deal with random weighted distributions.

rv_discrete is a base class that is used to construct specific distribution instances and classes for discrete random variables. It is also used to construct an arbitrary distribution defined by a list of support points and corresponding probabilities. [source: Official Documentation]

Explanation: In the following code snippet, rv_discrete() takes the sequence of integer values contained in the list numbers as the first element of the values tuple and the probability distributions/weights as the second element. The rvs() method then returns random values from the list based on their relative weights/probability distributions.

Code:

from scipy.stats import rv_discrete
numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]
d = rv_discrete(values=(numbers, distributions))
print(d.rvs(size=5))

Output:

[30 10 30 30 20]
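Besides sampling, the object returned by rv_discrete() can be queried like any other SciPy distribution. The sketch below shows a few such queries on the same example; the random_state argument is an assumption added to make the sample repeatable.

```python
from scipy.stats import rv_discrete

numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]
d = rv_discrete(values=(numbers, distributions))

print(d.pmf(30))   # probability of drawing exactly 30
print(d.cdf(20))   # probability of drawing a value <= 20
print(d.mean())    # expected value: 0.3*10 + 0.2*20 + 0.5*30

# Repeatable sample of five values
sample = d.rvs(size=5, random_state=0)
print(sample)
```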

Method 4: Using Lea

Another effective Python library that helps us work with probability distributions is Lea. It is specifically designed to let you model a broad range of random phenomena, like dice throwing, coin tossing, gambling results, weather forecasts, finance, etc.

Note: Since lea is an external library, you must install it before using it. Here’s the command to install lea on your system: pip install lea

Code:

import lea

numbers = [10, 20, 30]
distributions = [0.3, 0.2, 0.5]
d = tuple(zip(numbers, distributions))
print(lea.pmf(d).random(5))

Output:

(30, 30, 30, 10, 20)

Exercises

Question 1: Our friend Harry has eight coloured crayons: [“red”, “green”, “blue”, “yellow”, “black”, “white”, “pink”, “orange”]. Harry has the weighted preference for selecting each color as: [1/24, 1/6, 1/6, 1/12, 1/12, 1/24, 1/8, 7/24]. He is only allowed to select three colors at once. Find the various combinations he can select in 10 attempts.

Solution:

import random
colors = ["red", "green", "blue", "yellow", "black", "white", "pink", "orange"]
distributions = [1/24, 1/6, 1/6, 1/12, 1/12, 1/24, 1/8, 7/24]
for i in range(10):
    choices = random.choices(colors, distributions, k=3)
    print(choices)

Output:

['orange', 'pink', 'green']
['blue', 'yellow', 'yellow']
['orange', 'green', 'black']
['blue', 'red', 'blue']
['orange', 'orange', 'red']
['orange', 'green', 'blue']
['orange', 'black', 'blue']
['black', 'yellow', 'green']
['pink', 'orange', 'orange']
['blue', 'blue', 'white']

Question 2:

Given:
cities = ["Frankfurt", "Stuttgart", "Freiburg", "München", "Zürich", "Hamburg"]
populations = [736000, 628000, 228000, 1450000, 409241, 1841179]

The probability of a particular city being chosen depends on its population: the larger the population of a city, the higher the probability of the city being chosen. Based on this condition, find the probability distribution of the cities and display the cities selected in 10 attempts.

Solution:

import random
cities = ["Frankfurt", "Stuttgart", "Freiburg", "München", "Zürich", "Hamburg"]
populations = [736000, 628000, 228000, 1450000, 409241, 1841179]
distributions = [round(pop / sum(populations), 2) for pop in populations]
print(distributions)
for i in range(10):
    print(random.choices(cities, distributions)[0])

Output:

[0.14, 0.12, 0.04, 0.27, 0.08, 0.35]
Freiburg
Frankfurt
Zürich
Hamburg
Stuttgart
Frankfurt
München
Frankfurt
München
München
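Since random.choices() works with relative weights, the populations can also be passed directly, which avoids the rounding step. This is a minimal variation on the solution above; the seed and the names probabilities and picks are assumptions for illustration.

```python
import random

cities = ["Frankfurt", "Stuttgart", "Freiburg", "München", "Zürich", "Hamburg"]
populations = [736000, 628000, 228000, 1450000, 409241, 1841179]

# Exact probabilities, shown only for inspection
total = sum(populations)
probabilities = [pop / total for pop in populations]
print([round(p, 3) for p in probabilities])

# The raw populations serve as relative weights
random.seed(0)  # assumption: fixed seed for repeatability
picks = random.choices(cities, weights=populations, k=10)
print(picks)
```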

With that we come to the end of this tutorial. I hope it has helped you. Please subscribe and stay tuned for more interesting tutorials and solutions. Happy learning!

The post Sample a Random Number from a Probability Distribution in Python appeared first on Be on the Right Side of Change.

Factorials – NumPy, Scipy, Math, Python

Chris — Sun, 05 Jun 2022 10:07:00 +0000

Factorial Definition and Example

The factorial n! gives the number of permutations of a set with n elements, that is, the number of ways to order them.

Say you want to rank three soccer teams Manchester United, FC Barcelona, and FC Bayern München — how many possible rankings exist?

The answer is 3! = 3 x 2 x 1 = 6.

Practical Example: Say, there are 20 football teams in England’s premier league. Each team can possibly reach any of the 20 ranks at the end of the season. How many possible rankings exist in the premier league, given 20 fixed teams?

Figure: Example of three possible rankings of the football teams in England’s premier league.

In general, to calculate the factorial n!, you multiply all positive integers that are smaller than or equal to n.

For example, if you have 5 soccer teams, there are 5! = 5 x 4 x 3 x 2 x 1 = 120 different rankings.

There are many different ways to calculate the factorial function in Python easily, see alternatives below.

How to Calculate the Factorial in NumPy?

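NumPy exposes basic math functions such as the factorial function under numpy.math, for example numpy.math.factorial(n). Note that the numpy.math alias was deprecated in NumPy 1.25 and removed in NumPy 2.0, so the example below requires an older NumPy version.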

Here’s an example of how to calculate the factorial 3! with NumPy:

>>> import numpy as np
>>> np.math.factorial(3)
6

The factorial function in NumPy has only one integer argument n. If the argument is negative, Python raises a ValueError. A non-integer argument raises a ValueError or a TypeError, depending on the Python version.

Here’s how you can calculate this in Python for 3 teams:
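A minimal sketch, assuming the standard library's math.factorial() as the implementation and a hypothetical helper name count_rankings:

```python
import math

def count_rankings(num_teams):
    """Hypothetical helper: number of possible rankings of num_teams teams."""
    return math.factorial(num_teams)

# Three teams can be ranked in 3 x 2 x 1 ways
print(count_rankings(3))  # 6
```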

Exercise: Modify the code to calculate the number of rankings for 20 teams!

How to Calculate the Factorial in Scipy?

The popular scipy library is a collection of modules that help you with scientific computing.

Scipy contains a powerful collection of functionality built upon the NumPy library. Thus, it is not surprising that scipy.math.factorial(), in the SciPy versions that expose it, is actually a reference to the same function as numpy.math.factorial().

In fact, if you compare the two using the is operator, it turns out that both refer to the same function object:

>>> import scipy, numpy
>>> scipy.math.factorial(3)
6
>>> numpy.math.factorial(3)
6
>>> scipy.math.factorial is numpy.math.factorial
True

So you can use both scipy.math.factorial(3) and numpy.math.factorial(3) to compute the factorial function 3!.

As both functions point to the same object, the performance characteristics are the same — one is not faster than the other one.
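SciPy also ships a factorial function of its own, scipy.special.factorial(), which is separate from the alias discussed above and works on arrays. The following sketch shows how it could be used; it is offered as an additional option and is not part of the comparison in this article.

```python
import numpy as np
from scipy.special import factorial

# exact=True uses integer arithmetic instead of floating point
print(factorial(3, exact=True))

# Array input is processed element by element and returns floats
print(factorial(np.array([3, 4, 5])))
```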

Let’s have a look at math.factorial() — the mother of all factorial functions.

How to Calculate the Factorial in Python’s Math Library?

As it turns out, not only NumPy and Scipy come with a packaged “implementation” of the factorial function, but also Python’s powerful math library.

You can use the math.factorial(n) function to compute the factorial n!.

Here’s an example:

>>> import math
>>> math.factorial(3)
6

The factorial of 3 is 6 — nothing new here.

Let’s check whether this is actually the same implementation as NumPy’s and Scipy’s factorial functions:

>>> import scipy, numpy, math
>>> scipy.math.factorial is math.factorial
True
>>> numpy.math.factorial is math.factorial
True

Ha! Both libraries NumPy and Scipy rely on the same factorial function of the math library.

Note: numpy.math and scipy.math are only aliases, and numpy.math was removed in NumPy 2.0. The most portable choice is therefore to import the math library and call math.factorial() directly.

So far we’ve seen the same old wine in three different bottles: NumPy, Scipy, and math libraries all refer to the same factorial function implementation.

How to Calculate the Factorial in Python?

It’s often a good idea to implement a function by yourself. This will help you understand the underlying details better and gives you confidence and expertise.

So let’s implement the factorial function in Python!

To calculate the number of permutations of a given set of n elements, you use the factorial function n!. The factorial is defined as follows:

n! = n × (n – 1) × ( n – 2) × . . . × 1

For example:

1! = 1
3! = 3 × 2 × 1 = 6
10! = 10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 3,628,800
20! = 20 × 19 × 18 × . . . × 3 × 2 × 1 = 2,432,902,008,176,640,000

Recursively, the factorial function can also be defined as follows:

n! = n × (n – 1)!

The recursion base cases are defined as shown here:

1! = 0! = 1

The intuition behind these base cases is that a set with one element has one permutation, and a set with zero elements has one permutation (there is one way of assigning zero elements to zero buckets).

Now, we can use this recursive definition to calculate the factorial function in a recursive manner:

>>> factorial = lambda n: n * factorial(n-1) if n > 1 else 1
>>> factorial(3)
6

The lambda keyword is used to define an anonymous function in a single line.

Learning Resource: You can learn everything you need to know about the lambda function in this comprehensive tutorial on the Finxter blog.

You create a lambda function with one argument n and assign the lambda function to the name factorial. Finally, you call the named function factorial(n-1) to calculate the result of the function call factorial(n).

Roughly speaking, you can use the simpler solution for factorial(n-1) to construct the solution of the harder problem factorial(n) by multiplying the former with the input argument n.

As soon as you reach the recursion base case n <= 1, you simply return the hard-coded solution factorial(1) = factorial(0) = 1.
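One practical limit of the recursive version is Python's recursion depth: every call to factorial(n-1) adds a stack frame, so large inputs fail with a RecursionError. The sketch below demonstrates this; the names too_deep and hit_limit are assumptions for illustration.

```python
import sys

factorial = lambda n: n * factorial(n - 1) if n > 1 else 1

# Small inputs work as expected
print(factorial(5))  # 120

# An input just above the interpreter's recursion limit cannot be computed
too_deep = sys.getrecursionlimit() + 10
try:
    factorial(too_deep)
    hit_limit = False
except RecursionError:
    hit_limit = True
print(hit_limit)
```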

An alternative is to use the iterative computation like this:

def factorial(n):
    fac = 1
    for i in range(2, n + 1):
        fac *= i
    return fac

print(factorial(3))
# 6

print(factorial(5))
# 120

In the function factorial(n), we initialize the variable fac to 1. Then, we iterate over all values i between 2 and n (inclusive) and multiply them with the value currently stored in the variable fac. The result is the factorial of the integer value n. For n = 0 and n = 1 the loop does not run, so the function correctly returns 1.
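The same iterative idea can be written more compactly with math.prod(), which multiplies all items of an iterable (available from Python 3.8). This is a sketch of an alternative, with the hypothetical name factorial_prod:

```python
import math

def factorial_prod(n):
    """Product of 1..n; the empty product for n = 0 is 1."""
    return math.prod(range(1, n + 1))

print(factorial_prod(0))  # 1
print(factorial_prod(5))  # 120
```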

Speed Comparison

Let’s compare all three different ways to calculate the factorial function regarding speed.

Note that the NumPy, Scipy, and math factorial functions are referencing the same function object—they have the same speed properties.

Thus, we only compare the math.factorial() function with our two implementations in Python (recursive and iterative).

Want to take a guess first?

I used my own notebook computer (Quadcore, Intel Core i7, 8th Generation) with Python 3.7 to run 900 factorial computations for each method using the following code:

import time

num_runs = 900
speed = []


## SPEED TEST MATH.FACTORIAL ##
import math


start = time.time()
for i in range(num_runs):
    math.factorial(i)
stop = time.time()

speed.append(stop-start)

    
## SPEED TEST RECURSIVE ##
factorial = lambda n: n * factorial(n-1) if n > 1 else 1

start = time.time()
for i in range(num_runs):
    factorial(i)
stop = time.time()

speed.append(stop-start)

    
## SPEED TEST ITERATIVE ##
def factorial(n):
    fac = 1
    for i in range(2, n + 1):
        fac *= i
    return fac


start = time.time()
for i in range(num_runs):
    factorial(i)
stop = time.time()

speed.append(stop-start)


## RESULT
print(speed)
# [0.011027336120605469, 0.10074210166931152, 0.0559844970703125]
import matplotlib.pyplot as plt
plt.bar(["Math", "Recursive", "Iterative"], height=speed)
plt.show()

Wow, the clear winner is the math module! A clear sign that you should prefer library code over your own implementations!

In this run, the math library’s implementation is about 5 times as fast as the iterative one and about 9 times as fast as the recursive implementation.

Method	`math.factorial`	Recursive	Iterative
Seconds	0.01	0.10	0.06
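Single measurements with time.time() can be noisy. A more robust way to repeat the comparison is the standard library's timeit module, which runs each candidate many times. The sketch below is one possible setup; the input value 100, the repeat counts, and the function names are assumptions, and the absolute numbers will differ from machine to machine.

```python
import math
import timeit

def factorial_recursive(n):
    return n * factorial_recursive(n - 1) if n > 1 else 1

def factorial_iterative(n):
    fac = 1
    for i in range(2, n + 1):
        fac *= i
    return fac

candidates = {
    "math": math.factorial,
    "recursive": factorial_recursive,
    "iterative": factorial_iterative,
}

timings = {}
for name, func in candidates.items():
    # Best of 5 repeats, each calling func(100) 1000 times
    timings[name] = min(timeit.repeat(lambda: func(100), number=1000, repeat=5))

print(timings)
```

Taking the minimum over several repeats reduces the influence of other processes running on the same machine.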

Where to Go From Here

The three library implementations numpy.math.factorial(), scipy.math.factorial(), and math.factorial() point to the same function object in memory. They are identical, so on library versions that still provide the aliases you can use any of them; math.factorial() works everywhere.

On a higher level, you’ve learned that library implementations, such as those in NumPy or the standard library, are fast and efficient. Do yourself a favor and use library implementations wherever possible.

A good place to start is the NumPy library which is the basis of many more advanced data science and machine learning libraries in Python such as matplotlib, pandas, tensorflow, and scikit-learn. Learning NumPy will set the foundation on which you can build your Python career.

Tutorial: NumPy — Everything you need to know to get started

Programmer Humor

Q: How do you tell an introverted computer scientist from an extroverted computer scientist?

A: An extroverted computer scientist looks at your shoes when he talks to you.

Normal Distribution and Shapiro-Wilk Test in Python

Rebecca Nowack — Sat, 04 Jun 2022 07:05:42 +0000

Normal distribution is a statistical prerequisite for parametric tests like Pearson’s correlation, t-tests, and regression.

Testing for normal distribution can be done visually with sns.displot(x, kde=True).
The Shapiro-Wilk test for normality can be done quickest with pingouin’s pg.normality(x).

Note: Several publications note that normal distribution is the least important prerequisite for parametric tests and with large sample sizes you can assume normal distribution. Check this paper for more details.

Python Libraries for Normal Distribution and Shapiro-Wilk

We import pingouin, seaborn and SciPy. SciPy is the standard package for statistical tests and pingouin is a package for quick one-line statistical tests.

import pandas as pd
import pingouin as pg
import seaborn as sns
import scipy.stats

Method 1: Seaborn

We load the dataset about different species and sizes of penguins from seaborn.

penguins = sns.load_dataset('penguins')
penguins.head()

We’ll check out the bill length of the penguins more closely. With Seaborn, we can plot a distribution curve over our data.

A normal distribution has the shape of the Gaussian bell curve. That is why a distribution plot is a great way to check for normal distribution visually: you can see right away whether it is a bell curve or not.

sns.displot(penguins["bill_length_mm"], kde=True)

Output:

This curve does not look normally distributed, but close.

The Shapiro-Wilk test is a test for normal distribution and can confirm our assumption.

The hypotheses for the test are:

H0: Our data is normally distributed.
H1: Our data is not normally distributed.

If the test is significant (conventionally p < 0.05), we reject H0 and conclude that the data is not normally distributed.
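The decision rule can be tried out on synthetic data before turning to a real dataset. The sketch below applies scipy.stats.shapiro() to one normally distributed and one clearly skewed sample; the sample sizes, the seed, and the 0.05 threshold are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # assumption: fixed seed for repeatability
normal_sample = rng.normal(loc=0.0, scale=1.0, size=300)
skewed_sample = rng.exponential(scale=1.0, size=300)

alpha = 0.05  # conventional significance level
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    statistic, p_value = stats.shapiro(sample)
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(name, round(float(statistic), 3), p_value, decision)
```

The exponential sample is strongly right-skewed, so the test should reject H0 for it with a very small p-value.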

Method 2: Shapiro-Wilk Test with Pingouin

With the package pingouin, we can have a quick test output. For instance, the function call pg.normality(x) will give us the results of the Shapiro-Wilk test while automatically dropping missing values.

Here’s an example for testing normality on the penguins dataset previously instantiated:

pg.normality(penguins["bill_length_mm"])

The p-value is significant, so we will reject the H0 assumption that our data is normally distributed and confirm our visual assumption of non-normal distribution.

Method 3: Shapiro-Wilk Test in SciPy

The Shapiro-Wilk test can also be done with scipy.stats.shapiro(x). However, SciPy does not automatically drop missing values, and a result computed on data that contains NaN is not meaningful. Therefore, we must drop them beforehand.

bill_length = penguins["bill_length_mm"].dropna()
scipy.stats.shapiro(bill_length)

Output:

This delivers the same results and confirms our assumption of a not normally distributed variable.

Normal Distribution on the Iris Dataset

A normally distributed variable would look more like the sepal width from the iris dataset:

iris = sns.load_dataset('iris')
sns.displot(iris["sepal_width"], kde=True)

Output:

pg.normality(iris["sepal_width"])

Output:

scipy.stats.shapiro(iris["sepal_width"])

Output:

Here, the Shapiro-Wilk test is not significant, so we do not reject H0 and treat the data as normally distributed.

If you want to apply parametric tests like a Pearson correlation to your data, you mostly still can, as normal distribution is not a hard prerequisite and parametric tests are fairly robust with large samples.

You can also z-transform (standardize) your data so that the variables have the same mean and standard deviation. Note that this rescales the data but does not change the shape of its distribution. Standardizing is especially useful for machine learning algorithms.

Pearson Correlation in Python

Rebecca Nowack — Sat, 04 Jun 2022 06:36:21 +0000

A good way to calculate Pearson’s r together with the p-value, which reports the significance of the correlation, is scipy.stats.pearsonr(x, y). Pingouin’s pg.corr(x, y) delivers a nice overview of the results.

What is Pearson’s “r” Measure?

A statistical correlation with Pearson’s r measures the linear relationship between two numerical variables.

The correlation coefficient r tells us how the values lie on a descending or ascending line. r can take on values between 1 (positive correlation) and -1 (negative correlation) and 0 would be no correlation.
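To make the definition concrete, r can be computed by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below does this with NumPy on a few made-up values and compares the result with np.corrcoef(); the data is hypothetical.

```python
import numpy as np

# Hypothetical, roughly linear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r from its definition
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Reference value from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```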

The prerequisites for the Pearson correlation are normally distributed, metric data (e.g., measurements of height, distance, income, or age).

For ordinal (ranked) data you should use Spearman’s rho rank correlation.

However, the normal distribution is the least important prerequisite, and for larger datasets, parametric tests are robust so they can still be used. Keep in mind that normality tests become very sensitive on large datasets and may reject normality because of minor deviations.

Note: Be careful not to confuse correlation with causality. Two variables that correlate do not necessarily have a causal relationship. A third, unobserved variable could explain the correlation, or it could be just by chance. This is called a spurious relationship.

Python Libraries to Calculate Correlation Coefficient “r”

We will calculate the correlation coefficient r with several packages on the iris dataset.

First, we load the necessary packages.

import pandas as pd
import numpy as np
import pingouin as pg
import seaborn as sns
import scipy.stats

Pearson Correlation in Seaborn

Many packages have built-in datasets. You can import iris from Seaborn.

iris = sns.load_dataset('iris')
iris.head()

Output:

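With seaborn’s sns.heatmap() we can get a quick correlation matrix if we pass df.corr() into the function. Because the iris data frame also contains the text column species, we restrict the correlation to the numeric columns (the numeric_only argument requires pandas 1.5 or newer).

sns.heatmap(iris.corr(numeric_only=True))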

Output:

This tells us that we have a high correlation between petal length and petal width, so we will test these variables separately.

First, we inspect the two variables with a seaborn sns.scatterplot() to visually determine a linear relationship.

sns.scatterplot(data=iris, x="petal_length", y="petal_width")

Output:

There is a clear linear relationship so we go on calculating our correlation coefficient.

Pearson Correlation in NumPy

NumPy will deliver the correlation coefficient Pearson’s r with np.corrcoef(x, y).

np.corrcoef(iris["petal_length"], iris["petal_width"])

Output:

Pearson Correlation in Pandas

Pandas also has a correlation function. With df.corr() you can get a correlation matrix for the whole dataframe. Or you can test the correlation between two variables with x.corr(y) like this:

iris["petal_length"].corr(iris["petal_width"])

Output:

Note: NumPy and pandas do not deliver p-values which is important if you want to report the findings. The following two solutions are better for this.

Pearson Correlation in SciPy

With scipy.stats.pearsonr(x, y) we receive r just as quickly, plus a p-value.

scipy.stats.pearsonr(iris["petal_length"], iris["petal_width"])

SciPy delivers just two values, but these are important: the first is the correlation coefficient r and the second is the p-value that determines significance.
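The two return values can be unpacked directly. The following sketch shows this on synthetic data, so it runs without downloading the iris dataset; the data, the seed, and the noise level are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # assumption: fixed seed for repeatability

# Hypothetical data: a linear relationship plus random noise
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=1.0, size=x.size)

# First value: correlation coefficient r, second value: p-value
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```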

Pearson Correlation in Pingouin

My favorite solution is the statistical package pingouin because it delivers all values you would need for interpretation.

If you’re not familiar with pingouin check it out! It has great functions for complete test statistics.

pg.corr(iris["petal_length"], iris["petal_width"])

Output:

The output tells us the number of cases n, the coefficient r, the confidence intervals, the p-value, the Bayes factor, and the power.

The power tells us the probability of detecting a true and strong relationship between variables. If the power is high, we are likely to detect a true effect.

Interpretation:

The most important values are the correlation coefficient r and the p-value. Pingouin also delivers some more useful test statistics.

If p < 0.05 we assume a significant test result.

r is 0.96, which is a strong positive correlation, given that 1 is the maximum and means a perfect correlation.

Based on r, we can determine the effect size, which tells us the strength of the relationship, by interpreting r according to Cohen’s effect size conventions. There are also other interpretations for the effect size but Cohen’s is widely used.

According to Cohen, a value of r around 0.1 to 0.3 shows a weak relationship, from 0.3 on it would be an average effect, and from 0.5 upwards a strong effect. With r = 0.96 we interpret a strong relationship.
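These thresholds are easy to encode in a small helper. The function below is a hypothetical convenience, not part of any of the libraries used here; the label "negligible" for values below 0.1 is an assumption, since the text above does not name that range.

```python
def cohen_label(r):
    """Hypothetical helper: map a correlation coefficient to an effect size label."""
    size = abs(r)  # the sign only gives the direction of the relationship
    if size >= 0.5:
        return "strong"
    if size >= 0.3:
        return "average"
    if size >= 0.1:
        return "weak"
    return "negligible"  # assumption: label for values below 0.1

print(cohen_label(0.96))  # strong
```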

Programmer Humor

“Real programmers set the universal constants at the start such that the universe evolves to contain the disk with the data they want.” — xkcd

How to Calculate z-scores in Python?

Rebecca Nowack — Sat, 28 May 2022 14:11:32 +0000

The z-scores can be used to compare data with different measurements and for normalization of data for machine learning algorithms and comparisons.

Note: There are different methods to calculate the z-score. The quickest and easiest one is: scipy.stats.zscore().

What is the z-score?

The z-score is used for normalization or standardization to make differently scaled variables with different means and categories comparable.

The formula for the z score is easy, so it is not a complicated transformation:

z-score = (datapoint – mean)/standard deviation

The statistical expression is

z = (X – μ) / σ

The z-score then tells us how many standard deviations a value lies above or below the mean. After the transformation, the mean of the z-scores is always 0 and both the variance and the standard deviation are 1. This way, values from two differently scaled variables become comparable.
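The formula can be checked with nothing but the standard library. The sketch below standardizes a few hypothetical values, chosen so that the mean is 5 and the population standard deviation is 2, and confirms that the resulting z-scores have mean 0 and standard deviation 1.

```python
import statistics

# Hypothetical measurements with mean 5 and population standard deviation 2
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation (divides by N)

# z = (x - mean) / standard deviation
z_scores = [(x - mean) / std for x in values]
print(z_scores)

# After standardizing, the mean is 0 and the standard deviation is 1
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```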

This is useful for different measurements of the same item for example comparing measurements like mm and inch or comparing test results with different max scores.

So we’ll actually try this on an example.

Example z-score

This term, Frank has reached 48, 33 and 41 points on the tests in math and 82, 98 and 75 points on the tests in English.

Question: Is Frank better in English than in math?

We don’t know because the max points in the math tests are 50 points and 100 for the English tests so we cannot directly compare these results.

But we can test our question with the z-score by normalizing and comparing the means.

First, we load our packages and create a data frame with the test results.

import pandas as pd
import numpy as np
import scipy.stats as stats

test_scores = pd.DataFrame(
    {"math":[48, 33, 41],
     "english":[82, 98, 75]},
    index=[1, 2, 3])

The data frame with the test results looks like this:

How to Calculate z-scores with Pandas?

To calculate the z-scores in pandas we just apply the formula to our data.

z_test_scores = (test_scores-test_scores.mean())/(test_scores.std())

We have now standardized each column and can tell, for each test result, how many standard deviations it lies from the column mean.

print(z_test_scores)

Important: By default, pandas calculates the standard deviation with ddof=1 (dividing by N-1, the sample estimator), while NumPy uses ddof=0 (dividing by N). You can pass ddof=0 to std() in pandas to match NumPy, or ddof=1 to np.std() to match pandas.

In pandas, the default setting for the standard deviation is normalization by N-1.

For NumPy and scipy.stats.zscore, which is based on NumPy, the default is ddof=0, so the divisor is N.

Just be aware of where this difference comes from.
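To make the effect of ddof concrete, here is a small sketch that uses only NumPy and the math scores from the example. The variable names are chosen for illustration; the point is that the two conventions yield z-scores that differ only by a constant factor.

```python
import numpy as np

scores = np.array([48, 33, 41])  # math scores from the example

std_population = np.std(scores)        # ddof=0: divide by N (NumPy default)
std_sample = np.std(scores, ddof=1)    # ddof=1: divide by N-1 (pandas default)

z_population = (scores - scores.mean()) / std_population
z_sample = (scores - scores.mean()) / std_sample

n = len(scores)
# The two sets of z-scores differ only by the constant factor sqrt((N-1)/N).
same_up_to_factor = np.allclose(z_sample, z_population * np.sqrt((n - 1) / n))

print(std_population, std_sample)
print(same_up_to_factor)
```

Because the factor approaches 1 as N grows, the difference matters mostly for small samples such as this one.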

How to z-transform in Python with SciPy.Stats?

SciPy provides a ready-made function for this in its stats module: scipy.stats.zscore(data). We’ll use this on our test scores.

stats.zscore(test_scores)

This will standardize each column. The output shows slightly different values than the pandas formula above, because zscore() uses ddof=0 by default.

Applying the zscore() function column by column to the pandas data frame delivers the same results.

test_scores.apply(stats.zscore)

If we set the delta degrees of freedom to 1 (dividing by N-1, as pandas does), we get the same results as with the pandas formula above.

stats.zscore(test_scores, ddof=1)

Output:

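To answer the question of which subject Frank is better in this term, we take the mean score per subject and pass the means into the same function.

stats.zscore(test_scores.mean())

The English mean receives the higher z-score, which suggests that Frank was better in English than in math. Keep in mind that with only two values the z-scores are always -1 and 1, so this step only shows which raw mean is larger. Relative to the maximum points, Frank reached about 81% in math and 85% in English, which points the same way.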

How to Calculate z-scores with NumPy?

The z-transformation in NumPy works similarly to pandas.

First, we turn our data frame into a NumPy array and apply the same formula. We have to pass axis=0 to get the same results as with stats.zscore(): by default, np.mean() and np.std() aggregate over the flattened array rather than per column.

test_scores_np = test_scores.to_numpy()
z_test_scores_np = (test_scores_np - np.mean(test_scores_np, axis=0)) / np.std(test_scores_np, axis=0)

Output:

How to Calculate z-scores with sklearn Standard Scaler?

For normalization and standardization in machine learning workflows, scikit-learn also provides a z-transform, the StandardScaler class.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  
scaler.fit_transform(test_scores)

Output:

This will also return an array with the same values.
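As a quick cross-check of the approaches above, the following sketch compares the manual NumPy formula with scipy.stats.zscore on the same test scores. It assumes the default ddof=0 in both cases and rebuilds the scores as a plain array so that it runs on its own.

```python
import numpy as np
from scipy import stats

# Test scores from the example: columns are math and English.
scores = np.array([[48, 82], [33, 98], [41, 75]])

# Manual z-transform per column (population standard deviation, ddof=0).
manual = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# SciPy's zscore standardizes along axis=0 by default, also with ddof=0.
from_scipy = stats.zscore(scores)

print(np.allclose(manual, from_scipy))
# Each standardized column has mean 0 and standard deviation 1.
print(manual.mean(axis=0), manual.std(axis=0))
```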

Summary

We have now looked at four different ways to standardize data in Python with the z-score, and one of them will surely work for you.


How to Calculate a Logistic Sigmoid Function in Python?

Shubham Sayon — Tue, 24 May 2022 20:25:40 +0000

Summary: You can calculate the logistic sigmoid function in Python using:

The math Module: 1 / (1 + math.exp(-x))
The NumPy Library: 1 / (1 + np.exp(-x))
The SciPy Library: scipy.special.expit(x)

Problem: Given the logistic sigmoid function F(x) = 1 / (1 + e^(-x)):

If the value of x is given, how will you calculate F(x) in Python? Let’s say x=0.458.

Note: The logistic sigmoid function is defined as 1 / (1 + e^(-x)), where x is the input variable and can be any real number. The function returns a value strictly between 0 and 1. It forms an S-shaped curve when plotted on a graph.

Method 1: Sigmoid Function in Python Using Math Module

Approach: Define a function that accepts x as an input and returns F(x) as 1/(1 + math.exp(-x)).

import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


print(sigmoid(0.458))

# OUTPUT: 0.6125396134409151

Caution: The above solution is a simple one-to-one translation of the given sigmoid expression into Python code. It is not numerically robust: for large negative x, math.exp(-x) overflows and raises an OverflowError.

If you need a more robust implementation, some of the solutions that follow may serve you better.

Here’s a more stable implementation of the above solution, which only ever exponentiates a non-positive number:

import math


def sigmoid(x):
    if x >= 0:
        k = math.exp(-x)
        res = 1 / (1 + k)
        return res
    else:
        k = math.exp(x)
        res = k / (1 + k)
        return res


print(sigmoid(0.458))

Note: exp() is a function of Python’s math module that returns e raised to the power of x. Here, x is the value passed to exp(), and e is the base of the natural logarithm (approximately 2.718282).
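To illustrate why the sign-aware version matters, here is a small sketch that compares both variants on an input of very large magnitude. The function names naive_sigmoid and stable_sigmoid are chosen for this illustration only.

```python
import math


def naive_sigmoid(x):
    # Direct translation of 1 / (1 + e^(-x)).
    return 1 / (1 + math.exp(-x))


def stable_sigmoid(x):
    # Only exponentiate non-positive numbers so exp() cannot overflow.
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    k = math.exp(x)
    return k / (1 + k)


try:
    naive_sigmoid(-1000)
    naive_overflowed = False
except OverflowError:
    # math.exp(1000) exceeds the largest representable float.
    naive_overflowed = True

print(naive_overflowed)
print(stable_sigmoid(-1000), stable_sigmoid(1000))
```

For moderate inputs such as 0.458, both functions return the same value; they differ only in how they behave at the extremes.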

Method 2: Sigmoid Function in Python Using Numpy

The sigmoid function can also be implemented using the exp() function of the NumPy library. numpy.exp() works just like math.exp(), with the additional advantage that it handles arrays as well as integer and float values.

Example 1: Let’s have a look at an example to visualize how to implement the sigmoid function using numpy.exp():

import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


print(sigmoid(0.458))

# OUTPUT: 0.6125396134409151

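The following version selects, for each element, the form of the expression that is accurate for its sign. Note that np.where evaluates both branches for every element, so NumPy may still emit an overflow RuntimeWarning for inputs of very large magnitude, even though the selected results are correct: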

import numpy as np


def sigmoid(x):
    return np.where(x < 0, np.exp(x) / (1 + np.exp(x)), 1 / (1 + np.exp(-x)))


print(sigmoid(0.458))

# OUTPUT: 0.6125396134409151
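If the warnings are a concern, one possible alternative is to evaluate each form only on the elements where it is safe, using boolean masks. This is a sketch under that assumption, not the only way to do it; the helper name stable_sigmoid is hypothetical, and it always returns an array (scalars become arrays of length 1).

```python
import numpy as np


def stable_sigmoid(x):
    # Hypothetical helper: evaluates each formula only where it is safe.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) is at most 1, so it cannot overflow.
    out[pos] = 1 / (1 + np.exp(-x[pos]))
    # For negative x, exp(x) is below 1, so it cannot overflow either.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1 + exp_x)
    return out


print(stable_sigmoid(np.array([-1000.0, 0.0, 0.458, 1000.0])))
```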

Example 2: Let’s look at an implementation of the sigmoid function upon an array of evenly spaced values with the help of a graph.

import numpy as np
import matplotlib.pyplot as plt


def sigmoid(x):
    return np.where(x < 0, np.exp(x) / (1 + np.exp(x)), 1 / (1 + np.exp(-x)))


val = np.linspace(start=-10, stop=10, num=200)
sigmoid_values = sigmoid(val)
plt.plot(val, sigmoid_values)
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.show()

Output:

Explanation:

First, we created val, an array of 200 evenly spaced values between -10 and 10, with the linspace function of the NumPy library.
We then applied the sigmoid function to these values. If you print them out, you will find that those far from x = 0 are very close to 0 or 1, while values near x = 0 pass smoothly through 0.5. This can also be seen once the graph is plotted.
Finally, we plotted the sigmoid function graph that we previously computed with the help of the function. The x-axis maps the values contained in val, while the y-axis maps the values returned by the sigmoid function.


Method 3: Sigmoid Function in Python Using the Scipy Library

Another efficient way to calculate the sigmoid function in Python is to use the SciPy library’s expit function.

Example 1: Calculating logistic sigmoid for a given value

from scipy.special import expit
print(expit(0.458))

# OUTPUT: 0.6125396134409151

Example 2: Calculating logistic sigmoid for multiple values

from scipy.special import expit
x = [-2, -1, 0, 1, 2]
for value in expit(x):
    print(value)

Output:

0.11920292202211755
0.2689414213699951
0.5
0.7310585786300049
0.8807970779778823
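As a further sketch, expit also accepts NumPy arrays and stays finite for inputs of very large magnitude. The snippet below additionally uses scipy.special.logit, the inverse of expit, as a round-trip check; the input values are chosen only for illustration.

```python
import numpy as np
from scipy.special import expit, logit

x = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
y = expit(x)   # stays finite even for inputs of very large magnitude
print(y)

# logit is the inverse of expit for probabilities strictly between 0 and 1.
moderate = np.array([-2.0, 0.0, 2.0])
round_trip_ok = np.allclose(logit(expit(moderate)), moderate)
print(round_trip_ok)
```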

Recommended Read: Logistic Regression in Python Scikit-Learn

Method 4: Transform the tanh() Function

Another workaround to compute the sigmoid function is to transform the tanh function of the math module as shown below:

import math

sigmoid = lambda x: .5 * (math.tanh(.5 * x) + 1)
print(sigmoid(0.458))

# OUTPUT: 0.6125396134409151

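This works because, mathematically, sigmoid(x) = (1 + tanh(x/2)) / 2, so the above implementation is a valid solution. The tanh form cannot overflow, but for strongly negative x it computes a tiny result as 1 + tanh(x/2) with tanh(x/2) close to -1, which loses relative precision. The sign-aware exp() implementation from Method 1 and scipy.special.expit are therefore preferable when accuracy in the tails matters.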
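The identity can be checked numerically with a short sketch that uses only the math module. The function names and the sample points are chosen for illustration.

```python
import math


def sigmoid_exp(x):
    # Sign-aware form based on exp(), as in Method 1.
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    k = math.exp(x)
    return k / (1 + k)


def sigmoid_tanh(x):
    # Identity: sigmoid(x) = (1 + tanh(x / 2)) / 2
    return 0.5 * (math.tanh(0.5 * x) + 1)


points = [-5.0, -1.0, 0.0, 0.458, 1.0, 5.0]
max_diff = max(abs(sigmoid_exp(x) - sigmoid_tanh(x)) for x in points)

# The largest difference is tiny, on the order of floating-point rounding.
print(max_diff)
```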

How to Calculate the Sigmoid for Arrays with Size Bigger Than 1 in Python?

You can calculate the sigmoid function for 2D arrays (and even higher dimensional arrays) using NumPy. The NumPy library applies operations element-wise, so the shape of the array does not affect the ability to apply the sigmoid function.

Here’s an example of a 2D array:

import numpy as np

# Define the sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Create a 2D array
arr = np.array([[0, 1, 2], [3, 4, 5]])

sigmoid_arr = sigmoid(arr)

print(sigmoid_arr)

In this example, sigmoid_arr will be a 2D array with the same shape as arr, but with the sigmoid function applied to each element.

The output will be:

[[0.5        0.73105858 0.88079708]
 [0.95257413 0.98201379 0.99330715]]

These numbers are the sigmoid values of the corresponding elements in the input array. Each number lies between 0 and 1, as expected for the sigmoid function.

Conclusion

Well, that’s it for this tutorial. We have discussed as many as four ways of calculating the logistic sigmoid function in Python. Feel free to use the one that suits your requirements.

I hope this article has helped you. Please subscribe and stay tuned for more interesting solutions and tutorials. Happy learning!



How to Install SciPy on PyCharm?

Chris — Mon, 13 Sep 2021 08:27:22 +0000

SciPy is an open-source Python library for math, science, and engineering. It builds on the widely used NumPy library and is commonly used alongside Matplotlib.

Problem Formulation: Given a PyCharm project, how do you install the SciPy library in your project, within a virtual environment or globally?

Here’s a solution that works in most setups:

Open File > Settings > Project from the PyCharm menu.
Select your current project.
Click the Python Interpreter tab within your project tab.
Click the small + symbol to add a new library to the project.
Now type in the library to be installed, in our example "scipy" without quotes, and click Install Package.
Wait for the installation to finish and close all popup windows.

Here’s the installation process as a short animated video—it works analogously for SciPy, just type in “scipy” in the search field instead:

Make sure to select only “scipy”, because the search also lists many other packages that contain the term “scipy” in their names but are not required:

Alternatively, you can run the pip install scipy command in your PyCharm “Terminal” view:

$ pip install scipy
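After either installation route, you may want to confirm that the interpreter selected in PyCharm can actually see the package. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    # find_spec returns None when the top-level package cannot be found
    # by the interpreter that runs this script.
    return importlib.util.find_spec(package_name) is not None


if is_installed("scipy"):
    import scipy
    print("SciPy version:", scipy.__version__)
else:
    print("SciPy is not installed for this interpreter.")
```

If the script reports that SciPy is missing although you installed it, the project is probably using a different interpreter or virtual environment than the one you installed into.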



Python – Inverse of Normal Cumulative Distribution Function (CDF)

Chris — Tue, 03 Aug 2021 19:22:31 +0000

Problem Formulation

How to calculate the inverse of the normal cumulative distribution function (CDF) in Python?

Method 1: scipy.stats.norm.ppf()

In Excel, NORMSINV is the inverse of the CDF of the standard normal distribution.

In Python’s SciPy library, the ppf() method of the scipy.stats.norm object is the percent point function, which is another name for the quantile function. This ppf() method is the inverse of the cdf() function in SciPy.

norm.cdf() is the inverse function of norm.ppf()
norm.ppf() is the inverse function of norm.cdf()

You can see this in the following code snippet:

from scipy.stats import norm

print(norm.cdf(norm.ppf(0.5)))
print(norm.ppf(norm.cdf(0.5)))

The output is as follows:

0.5
0.5000000000000001
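For non-standard normal distributions, ppf() accepts the loc and scale parameters for the mean and standard deviation. As a brief sketch with example values chosen for illustration:

```python
from scipy.stats import norm

# Quantile of the standard normal distribution (mean 0, standard deviation 1).
z_975 = norm.ppf(0.975)
print(z_975)

# loc and scale set the mean and standard deviation of the distribution.
median = norm.ppf(0.5, loc=1, scale=0.5)
print(median)
```

The first value is the familiar 97.5% quantile of about 1.96, and the second is the median of a normal distribution with mean 1, which equals its mean.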

An alternative is given next:

Method 2: statistics.NormalDist.inv_cdf()

Since Python 3.8, the statistics module of the standard library provides the NormalDist class. It includes the inverse cumulative distribution function inv_cdf(). To use it, pass the mean (mu) and standard deviation (sigma) into the NormalDist() constructor to adapt it to the concrete normal distribution at hand.

Have a look at the following code:

from statistics import NormalDist

res = NormalDist(mu=1, sigma=0.5).inv_cdf(0.5)
print(res)
# 1.0
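As with SciPy, cdf() and inv_cdf() of a NormalDist object undo each other. Here is a small round-trip sketch for the standard normal distribution, with the probability 0.975 chosen as an example:

```python
from statistics import NormalDist

standard = NormalDist()            # defaults: mu=0, sigma=1

z_975 = standard.inv_cdf(0.975)    # 97.5% quantile
round_trip = standard.cdf(z_975)   # should recover the probability

print(z_975, round_trip)
```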

A great resource on the topic is given next.

References:

https://stackoverflow.com/questions/20626994/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p



Best 10 Scipy Cheat Sheets

Amber Mercado — Fri, 29 Jan 2021 15:48:34 +0000

Hey Finxters! Here are another 10 of the best cheat sheets for you to peruse and hang on your wall next to your other Python cheat sheets! Today, we are going to browse cheat sheets for SciPy. For a quick explanation, SciPy is a scientific computing library that builds on NumPy. SciPy stands for Scientific Python. It provides additional utility functions for optimization, statistics, and signal processing. Now that we have a brief explanation of what it is, let us dive right into these cheat sheets, which you can keep handy when learning to use SciPy in Python!

Cheat Sheet 1: DataCamp

The first cheat sheet is from DataCamp.com and is chock full of information for you to consume. You will learn how to interact with NumPy and which functions and methods to use for linear algebra, and there is of course a help section. This is one I would hang on the wall behind my monitor!

Pros: Rated ‘E’ for everyone.

Cons: None that I can see.

Cheat Sheet 2: Quandl

This cheat sheet covers the three main data science libraries used in Python: Pandas, NumPy, and SciPy. It goes over the function calls and explains each one. Near the end, it shows how to import data sets for you to use! Great for a beginner project!

Pros: Rated ‘E’ for everyone. Bonus Python project included!

Cons: None that I can see.

Cheat Sheet 3: Elite Data Science

This cheat sheet will walk you through some of the most common and useful functionality of these libraries. From importing data to a taste of machine learning, the code examples give you a feel for what Python can do.

Pros: Rated ‘E’ for everyone.

Cons: None that I can see.

Cheat Sheet 4: Cheatography

If you ever needed help understanding how to test a hypothesis in SciPy, this cheat sheet offers code examples and clear explanations of what happens when you write the code.

Pros: Rated ‘E’ for everyone.

Cons: None that I can see.

Cheat Sheet 5: Intellipaat

This cheat sheet is more of a tutorial from Intellipaat.com. It has full explanations with working code examples and covers SciPy, Python’s scientific and technical library, in sufficient depth. It is more than worth your time to investigate and learn SciPy.

Pros: Rated ‘E’ for everyone.

Cons: It is more a tutorial than a cheat sheet.

Cheat Sheet 6: Scipy.org

Straight from the mouth of SciPy, this cheat sheet will show you the methods needed to perform different functions in SciPy and Python, with explanations. This comprehensive list has everything sorted neatly by function to make it easy to look things up as you are working in SciPy. This is one you will want in your notebook on the desk as an easy reference guide.

Pros: Rated ‘E’ for everyone. Recommended for the wall or notebook for daily use!

Cons: None that I can see.

Cheat Sheet 7: Packt>

This is more of a book than a cheat sheet. It focuses hard on mastering SciPy, giving you a project to work through so you can really get a grasp of SciPy and how it is used in Python. I recommend subscribing to the website for all of the information you will receive.

Pros: Rated ‘E’ for everyone.

Cons: It is an ebook not a cheat sheet, but worth your time.

Cheat Sheet 8: Scipy.org

This is another ebook that I recommend keeping on hand to learn SciPy from beginner to advanced level. The book contains code for you to work through so you can build your SciPy skills in Python, which is important for a data science career. I suggest reading the book, highlighting the parts you don’t understand, and printing the code examples to pin to the wall for help and to minimize searching.

Pros: Rated ‘E’ for everyone.

Cons: This is an ebook, but one of the best ways to learn.

Cheat Sheet 9: Packt>

This one is also an ebook from Packt>. It will teach you numerical and scientific computing in Python. You will also learn how to use SciPy in signal processing and how SciPy can be used to collect, organize, analyze, and interpret data. By the end of the book, you will have fast, accurate, and easy-to-code solutions for numerical and scientific computing applications.

Pros: Rated ‘E’ for everyone.

Cons: This is an ebook so you will be spending time reading and coding.

Cheat Sheet 10: Packt>

Recipes are great in that you can find the exact one you are looking for without having to wade through all the other code snippets you do not need. In this ebook, you can play around with each of these recipes and gain a hands-on understanding of SciPy and its real-world applications.

Pros: Rated ‘E’ for everyone. The independent nature of the recipes allows you to hop around from each example making this book very versatile.

Cons: It is an ebook but a great one if you want to practice the different stacks of Scipy in Python.


