Luis Bruemmer, Author at Be on the Right Side of Change

JavaScript Primitive Data Types

Luis Bruemmer — Fri, 04 Mar 2022 20:05:15 +0000

In this tutorial, we will learn about JavaScript data types. We will get to know the different kinds of data types, how to determine the data type of a value, and how variables can take on values of other data types.

In short, JavaScript has seven primitive data types, namely: String, Number, BigInt, Boolean, undefined, null, and Symbol.

Besides that, there are JavaScript objects, but they will be part of another JavaScript tutorial. This tutorial will focus on primitive data types exclusively.

We determine the data type of a variable with the typeof operator.

And since JavaScript is a dynamically typed language, we do not have to specify the data type of a newly created variable.

Also, a variable is not tied to one data type: when we reassign it a value of a different data type, the variable simply holds a value of the new type.

Great documentation about JavaScript data types can be found here.

This is part of our Learn JavaScript series:

Primitive Data Types

Primitive data types are types that are already built into the programming language. They are the most basic form of data type, and they are very different from each other. JavaScript treats variables differently depending on the data type these variables have.

There are seven primitive data types which we will now discuss one after the other.

String

A string is just a series of characters used to represent some form of text.

Let’s have a look at an example:

let word1 = "Hello!";

We create the variable word1 using the keyword let. We assign this variable the value "Hello!" which is a string. We put the content of the text inside double-quotes.

Alternatively, we can use single quotes as well:

let word2 = 'House';

Whether we use double quotes or single quotes is a matter of taste. However, when we have apostrophes in our text, the simplest option is to use double quotes:

let sentence = "It's nice to meet you!";

because this would not work:

let sentence = 'It's nice to meet you!';

The computer thinks that the string ends after "It" and does not know what to do with the rest of the line of code.

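Thus, if we create our string with double quotes, we cannot use unescaped double quotes within the string, and if we create our string with single quotes, we cannot use unescaped single quotes within the string. To include the same kind of quote anyway, we escape it with a backslash, for example \' inside a single-quoted string.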
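
As a small illustration of these quoting rules, the sketch below shows three ways to keep an apostrophe inside a string. The variable names are made up for this example, and the backtick form is a template literal, which this tutorial does not otherwise cover.

```javascript
// Three ways to keep an apostrophe inside a string
const escaped = 'It\'s nice to meet you!';      // backslash escape inside single quotes
const doubleQuoted = "It's nice to meet you!";  // double quotes around the apostrophe
const template = `It's nice to meet you!`;      // template literal (backticks)

// All three variables hold exactly the same text
console.log(escaped === doubleQuoted && doubleQuoted === template); // true
```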

Number

The Number data type is one of two numeric data types that JavaScript provides us with.

A number can be an integer or a decimal:

let x = 5;
let y = 2.1;

Here, we declare two variables x and y. Variable x is assigned the integer value 5 and y is assigned the decimal value 2.1.

With numbers, we can perform arithmetic operations, like addition, subtraction, multiplication, division, etc.

let z = x + y;
console.log(z);
>>> 7.1

We declare the variable z and assign it the sum of x and y. When we output z with console.log(), we can see that the sum of x and y is 7.1.

BigInt

The other numeric data type is BigInt, which can represent integer values of arbitrary length. The Number data type, by contrast, can only represent integers exactly up to 2⁵³ - 1 (available as Number.MAX_SAFE_INTEGER).

We create BigInt numbers by adding the letter n to the end of an integer value:

let a = 43158490244031231567560143649357809134n;
console.log(a);
>>> 43158490244031231567560143649357809134n

We create this huge integer number and add the letter n as a suffix. When we output the value of the variable that we assigned this value, the output shows the exact same number.

When we do the same without an n, this is what we get:

let b = 43158490244031231567560143649357809134;
console.log(b);
>>> 4.315849024403123e+37

Without the n suffix, the value is stored as a Number, so only roughly 16 significant digits survive and the output is much less precise than before.
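
To make the precision limit of Number more concrete, here is a minimal sketch that compares both numeric types right at the edge of the safe integer range. The variable names are chosen for this example only.

```javascript
// Largest integer a Number can represent without losing precision: 2^53 - 1
const maxSafe = Number.MAX_SAFE_INTEGER;
console.log(maxSafe); // 9007199254740991

// Beyond that limit, two different integers collapse into the same Number...
const numbersCollide = maxSafe + 1 === maxSafe + 2;
console.log(numbersCollide); // true

// ...while BigInt still keeps them apart.
const bigIntsCollide = BigInt(maxSafe) + 1n === BigInt(maxSafe) + 2n;
console.log(bigIntsCollide); // false
```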

Boolean

A Boolean can have two values: true or false.

let booleanValue1 = true;
let booleanValue2 = false;

The first variable has the value true and the second one has the value false. We can use boolean expressions in if statements or in loops.

For example:

if (booleanValue2) {
    console.log("Hey guys!");
}
>>>

As we can see, this code snippet does not produce any output.

The if statement checks if the expression inside the parentheses is true or false. Since booleanValue2 is set to false, the console.log() statement is not executed.

Let’s have a look at another example:

if (booleanValue1) {
    console.log("Hey guys!");
}
>>> Hey guys!

The only change here is that we replaced booleanValue2 with booleanValue1. Since booleanValue1 is true, the console.log() statement is executed and we get an output.

Undefined

When we declare a variable and we do not assign it a value, the value of this variable is undefined:

let d;
console.log(d);
>>> undefined

When we assign this variable a value, for example a number, it is not undefined anymore:

d = 4;
console.log(d);
>>> 4

However, we can set the variable back to undefined:

d = undefined;
console.log(d);
>>> undefined

Null

The null value in JavaScript represents a value that is nonexistent.

let n = null;
console.log(n);
>>> null

One might think that null is the same as undefined.

But there are differences between these two. We must explicitly set a variable to null to make it null, whereas a variable is automatically set to undefined when we do not assign it a value.

So, null values are nonexistent and undefined values are not yet existent.
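
The difference between the two values also shows up when we compare them. The following sketch uses made-up variable names and compares undefined and null with both the loose equality operator == and the strict equality operator ===.

```javascript
let notAssigned;              // declared without a value, so it is undefined
const emptyOnPurpose = null;  // explicitly set to null

// Loose equality treats null and undefined as equal to each other
const looseEqual = notAssigned == emptyOnPurpose;
console.log(looseEqual); // true

// Strict equality also compares the types, which differ
const strictEqual = notAssigned === emptyOnPurpose;
console.log(strictEqual); // false
```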

Symbol

Symbols can be used as unique values. They are created with the Symbol() function:

let symbol1 = Symbol("Symbol");
let symbol2 = Symbol("Symbol");

Here, we create two symbols: symbol1 and symbol2. Both symbols contain the same description ("Symbol").

Let’s check if they are unique. We achieve that with the strict equality operator ===:

console.log(symbol1 === symbol2);
>>> false

When we compare the symbols with the strict equality operator, we get the result false. That shows us that these two symbols are unique.

This is where Symbols differ from strings because if we do the same with strings, this happens:

let string1 = "Hi!";
let string2 = "Hi!";
console.log(string1 === string2);
>>> true

We create two identical strings. When we compare them with the strict equality operator, the result is true since these strings are actually identical.

Determine Data Type with typeof Operator

We use the typeof operator to determine the data type of a variable. This is useful when we don’t know what data type a certain variable has.

let x1 = 5;
console.log(typeof (x1));
>>> number

We assign the newly declared variable x1 the number 5.

Then we check the type of x1 with the typeof operator and put it inside a console.log() statement to output the type.

The output says that the type of x1 is number.
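
As a quick overview, the sketch below applies typeof to one sample value of each primitive data type. The sample values are arbitrary. Note the well-known quirk that typeof null reports "object" rather than "null".

```javascript
// One sample value per primitive data type
const samples = ["Hello!", 5, 10n, true, undefined, null, Symbol("id")];

// typeof returns the type name as a string
const types = samples.map((value) => typeof value);

console.log(types.join(", "));
// string, number, bigint, boolean, undefined, object, symbol
```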

Data Type Conversion

Since JavaScript is a dynamic language, we do not have to specify the type of a variable when declaring it:

let v = 3;
console.log(typeof(v));
>>> number

We declare the variable v and assign it the value 3. When we output the type of this variable, we can see that the type of v is number although we never specified that.

Due to these dynamic features, we can reassign the variable v a value of another type easily:

v = "Hello";
console.log(typeof(v));
>>> string

We assign the variable the string "Hello". And when we output the type of v now, we can see that it is string: the variable itself has no fixed type, it simply reports the type of the value it currently holds.
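
Reassigning a variable replaces its value, but it does not convert the old value. If we actually want to convert a value into another data type, one common approach is to call the built-in functions String(), Number(), Boolean(), and BigInt(). The following is a minimal sketch with arbitrary sample values, not a complete list of conversion rules.

```javascript
// Explicit conversions with built-in functions
const asString = String(42);       // "42"
const asNumber = Number("3.5");    // 3.5
const notANumber = Number("abc");  // NaN, because "abc" is not numeric
const asBoolean = Boolean("");     // false, because the string is empty
const asBigInt = BigInt(7);        // 7n

// Implicit conversion: + with a string operand concatenates
const implicit = "5" + 1;          // "51"

console.log(typeof asString, typeof asNumber, typeof asBoolean, typeof asBigInt);
// string number boolean bigint
```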

Summary

All in all, we learned about the different data types that JavaScript provides us with. We learned each data type’s characteristics, how to determine the data type of a variable, and how to reassign variables to values of another data type.

If you wish to learn more about JavaScript, stay tuned for the other tutorials that are being released to Finxter.

And for more tutorials about other computer and data science-related topics, check out the Finxter email academy!

Happy Coding!

The post JavaScript Primitive Data Types appeared first on Be on the Right Side of Change.

JavaScript Assignment and Arithmetic Operators

Luis Bruemmer — Sat, 26 Feb 2022 16:50:28 +0000

In this JavaScript tutorial, we will learn all about the assignment and arithmetic operators that we can use when programming with JavaScript.

In short, with the arithmetic operators, we can perform arithmetic operations like addition, subtraction, multiplication, division, remainder, and exponentiation.

We use the assignment operators to assign values to variables.

Great documentation about JavaScript operators can be found here and here.

This is part of our Learn JavaScript series:

Arithmetic Operators

We will start with the arithmetic operators. As stated in the introduction, these operators are used to perform arithmetic operations.

We create two initial variables and give each a numerical value:

let x = 2;
let y = 7;

So, x has the value 2 and y has the value 7. We will work with these values throughout this section.

We also create the variable z which is supposed to hold the resulting values:

let z;

We only need to declare the variables once with the keyword let. For the following examples, we can just reassign them to other values.

Now, we can perform the actual operations. We start with addition:

Addition Operator

z = x + y;

We assign the variable z the addition of x and y. x and y are the operands and + is the operator.

We will now output z to see what value it has:

console.log(z);
// 9

As we can see, we successfully added x and y since 2+7 is indeed 9.

Likewise, we can perform the other arithmetic operations:

Subtraction Operator

z = y - x;
console.log(z);
// 5

Multiplication Operator

z = y * x;
console.log(z);
// 14

Division Operator

z = y / x;
console.log(z);
// 3.5

So, the only difference between these statements is the arithmetic operator: we use + for addition, - for subtraction, * for multiplication, and / for division.

There are two remaining arithmetic operators: exponentiation and remainder.

Exponentiation Operator

The exponentiation operator raises the first operand to the power of the second one:

z = y ** x;
console.log(z);
// 49

So, we calculate 7² which is 49.

Remainder Operator

The last operator is the remainder operator:

z = y % x;
console.log(z);
// 1

This operator is used to calculate the remaining value after a division.

The example says the following: 7 divided by 2 is 3 with a remainder of 1. So, 2 fits into 7 three whole times and there is 1 remaining. The remainder operation only outputs the remainder itself.

Maybe you have come across the remainder operator under its other name, “modulo”.
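
One detail worth knowing: in JavaScript, the result of % takes the sign of the first operand, so it can be negative. The sketch below shows this, together with a small helper named mod (a hypothetical name, not a built-in function) that always returns a non-negative result for a positive divisor.

```javascript
console.log(7 % 2);   // 1
console.log(-7 % 2);  // -1 (sign follows the first operand)
console.log(7 % -2);  // 1

// Hypothetical helper: modulo that is never negative when m is positive
function mod(n, m) {
  return ((n % m) + m) % m;
}

console.log(mod(-7, 2)); // 1
```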

Increment and Decrement

For incrementing and decrementing, JavaScript provides us with the operators of the same name. But there is a difference between using them as prefixes or postfixes.

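Let’s give x a new value. Since x and y were already declared with let above, we only reassign them here (declaring them again with let in the same scope would cause a SyntaxError):

x = 5;

Now, we perform a postfix incrementation of x and assign the result to y:

y = x++;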

Let’s see what value y has:

console.log(y);
// 5

x has the following value:

console.log(x);
// 6

So, with postfix incrementation, y received the value that x had before the increment, and x was then incremented by one.

Let’s do the same again, but this time with prefix incrementation:

x = 5;
y = ++x;

We set x back to 5 and we perform a prefix incrementation of x and assign this value to y.

y now has this value:

console.log(y);
// 6

And x has this value:

console.log(x);
// 6

As we can see, y now holds the incremented value as well, because prefix incrementation increments x first and then returns the new value.

Postfix Decrementation Operator

The same applies to decrementing. We start with postfix decrementation:

x = 5;
y = x--;

We set x back to 5 and we perform a postfix decrementation that we assign to y.

y has this value:

console.log(y);
// 5

And x has this value:

console.log(x);
// 4

So, y received the value that x had before the decrement.

Prefix Decrementation Operator

Whereas, with the prefix decrementation, it looks like this:

x = 5;
y = --x;

x is again set back to 5 and we perform a prefix decrementation whose result we assign to y.

y has this value:

console.log(y);
// 4

And x has this value:

console.log(x);
// 4

Thus, y receives the already decremented value, as was the case with prefix incrementation.

Assignment Operators

We use assignment operators to assign values to variables.

Simple Assignment Operator

The equal sign = is known as the assignment operator:

let a = 4;

Here, we create the variable a, and we assign it the numeric value 4.

Addition Assignment Operator

We can also combine the assignment with an addition by using the addition assignment operator +=:

a += 1;

Here, we assign the variable a the new value which is the initial value plus 1. Let’s check what value a now has:

console.log(a);
// 5

a has the value 5 because it was initially 4 and we added 1 to it.

The statement

a += 1;

is the same as

a = a + 1;

because we take the variable’s current value, add 1, and assign the result back to the variable.

Subtraction Assignment Operator

With subtraction, it looks similar:

a = 4;
a -= 1;
console.log(a);
// 3

First, we set a back to 4. Then we subtract 1 from a and assign the new value to a. Then we output the value of a which is 3. And as with addition, the statement

a -= 1;

is the same as

a = a - 1;

Multiplication Assignment Operator

Multiplication looks like this:

a = 4;
a *= 2;
console.log(a);
// 8

Again, we set a back to the value 4. Then, we multiply a by 2 and assign the new value to a. Next, we output the new value of a and we can see that it is now 8.

Division Assignment Operator

Division works the same way:

a = 4;
a /= 2;
console.log(a);
// 2

After setting a back to 4, we divide it by 2 and assign the new value to a. The output shows that we did this successfully as we get 2 as the output which is the result of dividing 4 by 2.

Remainder Assignment Operator

We do the same with the remainder:

a = 4;
a %= 2;
console.log(a);
// 0

We set a back to 4 and then we calculate the remainder of dividing 4 by 2 and set this value to a.

Since 2 fits into 4 two times and there is no remaining value, a has the value 0 which is confirmed by outputting a.

Exponentiation Assignment Operator

The exponentiation assignment works the same way:

a = 4;
a **= 3;
console.log(a);
// 64

a is set back to 4. Afterward, we calculate 4 to the power of 3 and assign the new value to the variable a. The output is 64 because 4³ equals 64.

The remaining assignment operators are a bit different from the ones we have seen by now.

Left-Shift Assignment Operator

We will start with the left shift assignment:

a = 5;
a <<= 2;
console.log(a);
// 20

So, what is happening here? We set the variable a that we already declared to the value 5. Then we perform the left shift. In this case, we move the bits two positions to the left and assign the new value to our variable a.

The binary representation of the number 5 is 00101 because 1*2² + 1*2⁰ = 5.

When we shift these bits two positions to the left, we get the binary number 10100, which is 1*2⁴ + 1*2² = 20.

Right-Shift Assignment Operator

Similarly, we can perform a right shift assignment:

a = 9;
a >>= 2;
console.log(a);
// 2

We set a to 9. Then, we perform the right shift assignment where we shift the bits two positions to the right. The binary representation of 9 is 01001.

When we shift the bits two positions to the right, we get 00010, which is 2. The rightmost 1 is shifted out and discarded.
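
To see the bits move, we can print the binary digits before and after each shift. The helper toBits below is a hypothetical name introduced only for this sketch; it pads the output of toString(2) to five digits and is meant for small non-negative integers.

```javascript
// Hypothetical helper: binary digits of a small non-negative integer, padded to 5 places
function toBits(value) {
  return value.toString(2).padStart(5, "0");
}

console.log(toBits(5), "->", toBits(5 << 2)); // 00101 -> 10100
console.log(toBits(9), "->", toBits(9 >> 2)); // 01001 -> 00010

// For these small non-negative values, shifting matches multiplying
// or dividing (and rounding down) by a power of two
const leftIsMultiply = (5 << 2) === 5 * 2 ** 2;
const rightIsFloorDivide = (9 >> 2) === Math.floor(9 / 2 ** 2);
console.log(leftIsMultiply, rightIsFloorDivide); // true true
```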

The next assignment operators combine an assignment with a bitwise AND, OR, or XOR operation.

Bitwise AND Assignment Operator

The first one is the bitwise AND assignment operator:

a = 5;
a &= 2;
console.log(a);
// 0

a is set to 5. Then we perform a bitwise AND operation of 5 and 2 and the result is set as the new value for a. The output value of a is 0.

The binary representation of 5 is 00101 and the binary representation of 2 is 00010. When we put these binary numbers above each other and perform a bitwise AND operation, a result bit is 1 only where both input bits are 1. That is not the case at any position here, so we get 00000, which is 0.

Bitwise OR Assignment Operator

Likewise, we do the same for the bitwise OR assignment operation.

a = 5;
a |= 4;
console.log(a);
// 5

We set a back to 5 again and then we perform a bitwise OR operation of 5 and 4 and the result is assigned to a. The result is 5.

The binary representation of 5 is 00101 and the binary representation of 4 is 00100. When we put these binary numbers above each other and perform a bitwise OR operation, a result bit is 1 where at least one of the input bits is 1, so we get 00101, which is 5.

Bitwise XOR Assignment Operator

There is also a bitwise XOR assignment operation:

a = 5;
a ^= 4;
console.log(a);
// 1

a is set back to 5 again. Then we perform the bitwise XOR operation of 5 and 4 and we assign the result to a.

A bitwise XOR operation sets a result bit to 1 where exactly one of the two input bits is 1. For 5 (00101) and 4 (00100), the result is binary 00001, which is decimal 1.
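
The three bitwise operations from this section can be lined up next to each other. The sketch below uses two helpers, bits and explain, whose names are made up for this example; they only format the operands and results as five-digit binary numbers.

```javascript
// Hypothetical helpers for printing small non-negative integers in binary
const bits = (value) => value.toString(2).padStart(5, "0");

function explain(symbol, left, right, result) {
  return `${bits(left)} ${symbol} ${bits(right)} = ${bits(result)} (${result})`;
}

const andLine = explain("&", 5, 2, 5 & 2);
const orLine = explain("|", 5, 4, 5 | 4);
const xorLine = explain("^", 5, 4, 5 ^ 4);

console.log(andLine); // 00101 & 00010 = 00000 (0)
console.log(orLine);  // 00101 | 00100 = 00101 (5)
console.log(xorLine); // 00101 ^ 00100 = 00001 (1)
```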

Summary

In this tutorial, we learned all about JavaScript’s assignment and arithmetic operators. We learned how to perform different kinds of arithmetic operations, and how to assign values in different ways.

If you wish to learn more about JavaScript, stay tuned for the other tutorials that are being released to Finxter.

And for more tutorials about other computer and data science-related topics, check out the Finxter email academy!

Happy Coding!

The post JavaScript Assignment and Arithmetic Operators appeared first on Be on the Right Side of Change.

JavaScript: Syntax, Statements, Variables, and Comments

Luis Bruemmer — Sun, 13 Feb 2022 12:15:34 +0000

In this JavaScript tutorial, we will learn about the language’s syntax and statements, how to declare variables, and how to create comments.

In short, statements are separated by semicolons (";"), but the semicolons can be omitted when the statements are on separate lines.
Variables are mostly declared with either the keyword “const” or “let”.
Regarding comments, there are two kinds: one-line comments ("//") and multi-line comments ("/* */").

A great reference to read more about these topics can be found here.

Syntax and statements

The syntax of a programming language is a set of rules that describe how a program should be structured. A program consists of a sequence of instructions that the computer executes.

In JavaScript, these instructions are called statements.

Let’s have a look at an example statement:

let x = 42;

Here, we declare the variable “x” with the keyword “let” and assign it the value “42“.

Note: If you do not understand what “let“, “x“, and “42” mean here, don’t worry. We will look at variables in more detail in the next section.

The important thing here is the semicolon (“;“) at the end of the code line.

A semicolon marks the end of a statement.

However, if a line of code contains only one statement, we do not necessarily have to write the semicolon, because JavaScript usually inserts it automatically at the end of the line:

let x = 42

That being said, if we put multiple statements into one line of code, the statements have to be separated by semicolons:

let x = 42; let y = 0;
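
Relying on line breaks instead of semicolons works in most cases, but there are a few surprises. A well-known one is a line break directly after return. The two functions below are hypothetical examples written for this sketch.

```javascript
function brokenReturn() {
  return   // a semicolon is inserted automatically right after "return"
  42;      // this line is never reached
}

function workingReturn() {
  return 42;
}

console.log(brokenReturn());  // undefined
console.log(workingReturn()); // 42
```

This is one reason why many style guides recommend writing the semicolons explicitly.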

Variables

As mentioned in the previous section, we will now shift our focus towards variables.

A variable is used to store data.

The name of a variable can be freely chosen but it has to start with a letter, an underscore, or a dollar sign.

Since JavaScript is case-sensitive, a variable called “jeff” is not the same as a variable called “Jeff”: the two names differ in the case of their first letter.

In JavaScript there are four ways to declare variables:

with the keywords “var“,
“let“,
“const“, or
with no keyword at all.

When we declare a variable with no keyword, it looks like this:

a = 10;

The name of the variable here is “a” and we assign it the number “10” using the assignment operator “=”. However, assigning to a name that was never declared silently creates a global variable, and in strict mode it throws a ReferenceError, so this style should be avoided.

That’s why we should always use a keyword for the variable declaration.
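
To illustrate the kind of error this can cause, here is a minimal sketch that assigns to an undeclared name inside a function running in strict mode. The function name and the variable name are made up for this example.

```javascript
function assignWithoutKeyword() {
  "use strict"; // strict mode forbids assigning to undeclared variables
  try {
    undeclaredCounter = 10; // no var, let, or const
    return "assigned";
  } catch (error) {
    return error.name;
  }
}

console.log(assignWithoutKeyword()); // ReferenceError
```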

One keyword is “var“:

var b = "Hello";

The “var” keyword is the original way to declare a variable in JavaScript. Using this keyword, we create function-scoped variables. Here, we declare a variable called “b” and assign it the string "Hello".

We can check the current value for “b” by outputting the value of the variable:

console.log(b);

Result: Hello

We use the “console.log()” statement for producing output. And as we can see, the value of “b” is "Hello".

With “var”, we can also change the value of an existing variable:

b = 40;

We reassign the value of “b” to the number “40“. Let’s see what the output is now:

console.log(b);

Result: 40

We successfully reassigned the variable “b” to another value.

Also, when declaring a variable with “var“, we do not immediately have to set a value for that variable:

var c;

When we output “c“, this is what happens:

console.log(c);

Output: undefined

We get “undefined” as output because the variable does not hold any value yet. But we can, of course, give that variable a value now:

c = 5;
console.log(c);
// Output: 5

There are two remaining ways to declare a variable: “const” and “let”. These keywords were added to JavaScript with ECMAScript 2015 (ES6) and they declare block-scoped variables.

Declaring a variable with “let” seems to be the same as declaring with “var“:

let d = true;
console.log(d);
// true

Here, we initially gave the new variable “d” the boolean value “true“. And as with “var“, we can change this variable’s value to another value:

d = false;
console.log(d);
// false

We can also declare a variable without an initial value:

let e;
console.log(e);
// undefined

So, we declare a variable that we optionally give a value and that we can change later just as we did with “var“.

The remaining keyword “const” is used to declare a constant:

const pi = 3.14;
console.log(pi);
// 3.14

We declare a variable called “pi” and give it the value “3.14“.

Unlike with “var” and “let”, we are not able to reassign another value to the variable because it is a constant.

In addition to that, we immediately have to give a “const” variable a value. We cannot declare it without a value.
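
What happens when we try to reassign a constant anyway? The sketch below catches the resulting error so that we can inspect it; the variable outcome is introduced only for this example.

```javascript
const pi = 3.14;
let outcome;

try {
  pi = 3.14159;            // reassigning a constant throws at runtime
  outcome = "reassigned";
} catch (error) {
  outcome = error.name;
}

console.log(outcome); // TypeError
console.log(pi);      // 3.14
```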

When to use which variable declaration

We learned different ways to declare variables in JavaScript. So, when should we use which type of declaration?

A general rule of thumb is to always declare variables with the keyword “const” first. And if we think a variable’s value should be able to change and be reassigned, we declare this variable with “let“.

But why not with “var“?

“var” and “let” seem to work the same way. However, the scoping behavior of “var” can lead to errors which is why “let” and “const” were created. Thus, it is not necessary to use “var” at all. Any variable declaration can be done with either “const” or “let“.
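
The scoping difference can be sketched as follows. Inside the hypothetical function scopeDemo, one variable is declared with var and one with let within an if block, and we then check from outside the block which of them is still visible.

```javascript
function scopeDemo() {
  if (true) {
    var functionScoped = "declared with var";
    let blockScoped = "declared with let";
  }
  // Outside the block, only the var variable still exists.
  // typeof is used because it does not throw for unknown names.
  return [typeof functionScoped, typeof blockScoped];
}

console.log(scopeDemo().join(", ")); // string, undefined
```

The var variable leaks out of the block into the whole function, which is the kind of behavior that can lead to hard-to-find bugs.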

Comments

We always want to document what our code is doing and why it is doing it. Comments are great for explaining our code and they are valuable in any code project, especially in large ones.

We can insert comments anywhere in our code and that does not change our program in any way. They are just there for making the code easier to understand.

In JavaScript, we have got two types of comments:

one-line comments and
multi-line comments.

One-line comments start with two slashes:

//This is a one-line comment

As we stated above, comments can be inserted anywhere in our code:

const pi = 3.14; // declare variable "pi" as constant

The comment does not change the code in any way.

Multi-line comments, as the name suggests, extend over several lines.

They can be used to create a comment block where we explain something in more detail. They start like this “/*” and end like this “*/” and anything that stands between that will be treated as a comment:

/*
The following code will
calculate some values and 
assign these values to 
new variables.
*/

Summary

All in all, we learned the basics about the programming language JavaScript in this article. We learned about statements, how to declare variables, and how to apply comments to our code.

For more tutorials about other computer and data science-related topics, check out the Finxter email academy!

Happy Coding!

The post JavaScript: Syntax, Statements, Variables, and Comments appeared first on Be on the Right Side of Change.

[JavaScript Intro] How to See Your Code Output?

Luis Bruemmer — Sun, 06 Feb 2022 10:54:11 +0000

In this article, we will get an overview of the programming language JavaScript and we will learn how to see our code output and results when working with JavaScript.

In short, the easiest and most convenient way to see our JavaScript output is to use the console.log() method which shows a message in our web console:

console.log('Hello, world!');

Great JavaScript documentation can be found here.

This article is the first part of a JavaScript series here on Finxter. Feel free to check out the other articles after going through this one!

Feel free to skip ahead to the different ways to see your output in JavaScript—we’ll start slowly with some introductory material next.

What is JavaScript?

JavaScript is a programming language that was created in 1995 to make websites interactive and more dynamic. Together with HTML and CSS, it is one of the three core technologies for the front-end of the web. If you have ever interacted with a website in some way, you have almost certainly come across JavaScript, maybe without knowing it.

So, why should you learn JavaScript? According to the “Stack Overflow Developer Survey 2021” JavaScript continues to be the most commonly used programming language.

Thus, although the language was created almost 30 years ago, it is still very much in demand.

There are countless libraries and frameworks built on top of JavaScript, such as React.js or Vue.js.

Hence, there are plenty of options to dive into after learning the language itself.

JavaScript and HTML

As we already know, JavaScript is mainly used for web development. So, when we are working on a web project, we have an HTML file that looks something like this:

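<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>My homepage</title>
</head>
<body>
    <h1>Hello, there!</h1>
    <p id="p1">A paragraph.</p>
    <script src="main.js"></script>
</body>
</html>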

This is not an HTML tutorial. However, we will briefly go through what’s happening here.

The HTML file starts with a document type declaration which tells the browser what kind of document to expect.

All the actual information about our website is nested in the <html> tag.

The <head> tag contains the <meta> tag with the charset attribute, which specifies the character encoding.

It also contains the </code> tag which is the title of our homepage and can be seen in the browser’s title bar. </p> <p class="wp-block-paragraph">Then comes the <code><body></code> tag where we find the actual content of our webpage. Here, we can see a header tag <code><h1></code>, a paragraph tag <code><p></code> with the ID <code>p1</code>, and the <code><script></code> tag.</p> <p class="wp-block-paragraph">If we applied some form of CSS, we would link it in the “<code><head></code>” tag, but we will not do that here since it would be a bit too much for this tutorial.</p> <p class="wp-block-paragraph">The really interesting part of the HTML file for us is the “<code><script></code>” tag because it links to our JavaScript file with the <code>src</code> attribute. As we can see, we set the <code>src</code> attribute equal to <code>"main.js"</code> since this is the name of our JavaScript file. When we link a file this way, it means that the HTML file and the JavaScript file have to be in the same folder:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">I’m using the code editor <a href="https://code.visualstudio.com/" target="_blank" rel="noreferrer noopener">Visual Studio Code</a> for development. </p> <p class="wp-block-paragraph">Here, I created the folder “<code>JavaScript</code>” and created two files: “<code>index.html</code>” which contains the content we saw above, and the “<code>main.js</code>” file.</p> <h2 class="wp-block-heading" id="modify-the-html-document-with-javascript">Modify the HTML Document with JavaScript</h2> <p class="wp-block-paragraph">In this section, we will see how to change the HTML document using JavaScript. </p> <p class="wp-block-paragraph">To be able to see the changes, we will have to open the HTML file in some way. 
There are different ways to do so: We can open the folder where our HTML file and JavaScript file are stored.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">Here, we can just double-click on the “<code>index.html</code>” file. On my computer, this opens the Chrome browser by default. However, if you want to use another browser, you can right-click on the “<code>index.html</code>” file and open the file with another browser.</p> <p class="wp-block-paragraph">Doing it this way, we always have to reload our browser window when we make changes to our files. Thus, this method is not that convenient.</p> <p class="wp-block-paragraph">Using the code editor VS code that I already referred to, we can download the extension “Live Server”. To do so, we go to the “<code>Extensions</code>” symbol on the left-hand side and type in “<code>Live Server</code>“.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">After clicking “<code>Install</code>” you might have to restart VS code in order to make the extension work properly. </p> <p class="wp-block-paragraph">Once we did that, this symbol should appear on the bottom-right corner:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">When we click that, it opens the HTML file in our browser. 
</p> <p class="wp-block-paragraph">And now, anytime we make changes to the files and save the changes, the changes appear automatically without refreshing the page manually.</p> <p class="wp-block-paragraph">This is what the web page looks like at the moment:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">Now, we head over to the “<code>main.js</code>” file and type in our first line of JavaScript code:</p> <pre class="EnlighterJSRAW" data-enlighter-language="js" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">document.write('JavaScript is awesome!');</pre> <p class="wp-block-paragraph">This method writes the String <code>"JavaScript is awesome!"</code> to our HTML document. </p> <p class="has-global-color-8-background-color has-background wp-block-paragraph"> Notice that we end the command line with a semicolon to define the end of this statement.</p> <p class="wp-block-paragraph">So, when we save the “<code>main.js</code>” file and have a look at our browser again, we can see the following web page:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">Compared to the initial web page, we can see that we added the sentence <code>"JavaScript is awesome!"</code> below the rest of the document’s content. Thus, we successfully modified our HTML document using JavaScript.</p> <p class="wp-block-paragraph">We can also change an existing HTML element. 
Therefore, we use this method:</p> <pre class="EnlighterJSRAW" data-enlighter-language="js" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">document.getElementById('p1').innerHTML = 'A great paragraph';</pre> <p class="wp-block-paragraph">Here we access an HTML element by its ID, namely “<code>p1</code>” and change the inner HTML to something new, in this case: <code>"A great paragraph"</code>.</p> <p class="wp-block-paragraph">When we save the JavaScript file and have a look at our browser again, we can see that the paragraph indeed has changed from <code>"A paragraph"</code> to <code>"A great paragraph"</code>:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <h2 class="wp-block-heading" id="display-output-with-the-console">Display Output with the Console</h2> <p class="wp-block-paragraph">By now, we saw how to use JavaScript to modify our HTML files and view the changes in our browser. </p> <p class="wp-block-paragraph">However, we will often want to get some JavaScript output without modifying an HTML file. </p> <p class="wp-block-paragraph">In fact, we do not always work with any kind of HTML when working with JavaScript. A great way to do that is to use the “<code>console</code>” object which gives us access to our browser’s debugging console.</p> <p class="wp-block-paragraph">Let’s see how that works. We add a line to our “<code>main.js</code>” file:</p> <pre class="EnlighterJSRAW" data-enlighter-language="js" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">console.log('I am in the console!');</pre> <p class="wp-block-paragraph">We apply the console method “<code>log</code>” here which outputs a message to the browser’s console. 
</p> <p class="has-global-color-8-background-color has-background wp-block-paragraph"> The <code>console.log()</code> method is the most commonly used way to produce JavaScript output and it is especially useful for testing and debugging.</p> <p class="wp-block-paragraph">When we have a look at our web page again, we can see that nothing has changed:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">That’s because we did not modify the HTML file in any way. </p> <p class="wp-block-paragraph">But when we press <code>CTRL + SHIFT + J</code> on Windows or <code>CMD + SHIFT + J</code> on Mac, the browser console opens:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">And there we can see the text <code>"I am in the console!"</code> that we put in the “<code>console.log()</code>” statement. </p> <p class="wp-block-paragraph">Inside the console, we can also input JavaScript code:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">However, especially for larger projects, it is recommended to use a code editor for development. 
Using the browser console for coding is good for quick testing.</p> <p class="wp-block-paragraph">If we do not want to use the browser at all for getting JavaScript output, we can produce the output within VS Code directly.</p> <p class="wp-block-paragraph">For that, we need to install the “Code Runner” extension:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">And we need <a href="https://nodejs.org/en/" target="_blank" rel="noreferrer noopener">Node.js</a> to be installed.</p> <p class="wp-block-paragraph">Our “<code>main.js</code>” file now only contains this one line of code:</p> <pre class="EnlighterJSRAW" data-enlighter-language="js" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">console.log('I am in the console!');</pre> <p class="wp-block-paragraph">With the JavaScript file open, we press CTRL + ALT + N on Windows or CMD + OPTION + N on Mac. This runs the JavaScript code, and the output is shown in the output window at the bottom of VS Code:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">This way of creating output is also great when we have a lot of code but only want to run a part of it.</p> <p class="wp-block-paragraph">We will now add a new line of code to the “<code>main.js</code>” file:</p> <pre class="EnlighterJSRAW" data-enlighter-language="js" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">console.log('I am in the console!');
console.log('Hi console!');</pre> <p class="wp-block-paragraph">If we ran the whole code, both messages would be shown. </p> <p class="wp-block-paragraph">However, we might only want to show the second line. 
To do that, we highlight the second line, and then we hit <code>CTRL + ALT + N</code> on Windows or <code>CMD + OPTION + N</code> on Mac.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">This way, we are able to run only a fraction of our code, which is especially useful when we are working on a big project and want to test a part of our code.</p> <h2 class="wp-block-heading" id="summary">Summary</h2> <p class="wp-block-paragraph">In this tutorial, we learned what JavaScript is, how to modify HTML documents with it, and how to view JavaScript output in the browser and in our code editor.</p> <p class="wp-block-paragraph">If you wish to learn more about JavaScript, stay tuned for the other tutorials that are being released on Finxter.</p> <p class="wp-block-paragraph">And for more tutorials about other computer and data science-related topics, check out the <a href="https://blog.finxter.com/email-academy/" data-type="page" data-id="12278" target="_blank" rel="noreferrer noopener">Finxter email academy</a>!</p> <p class="wp-block-paragraph">Happy Coding!</p> <hr class="wp-block-separator"/> <p>The post <a href="https://blog.finxter.com/javascript-intro-how-to-see-your-code-output/">[JavaScript Intro] How to See Your Code Output?</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p> </article> <article> <h1>Pandas get_dummies() – A Simple Guide with Video</h1> <p>Luis Bruemmer — Sat, 22 Jan 2022 15:07:07 +0000</p> <figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper"> </div></figure> <p class="wp-block-paragraph">In this tutorial, we will learn all about the <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas</a> function <code>get_dummies()</code>. 
This method converts categorical data into dummy or indicator variables.</p> <p class="wp-block-paragraph">Here are the parameters from the <a href="https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html" target="_blank" rel="noreferrer noopener">official documentation</a>:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><strong>Parameter</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>data</code></td><td>array-like, Series, or DataFrame</td><td>Data of which to get the dummy indicators.</td></tr><tr><td><code>prefix</code></td><td><code><a href="https://blog.finxter.com/python-str-function/" data-type="post" data-id="23735" target="_blank" rel="noreferrer noopener">str</a></code>, list of <code>str</code>, or <code><a href="https://blog.finxter.com/python-dict/" data-type="post" data-id="19866" target="_blank" rel="noreferrer noopener">dict</a></code> of <code>str</code>,<br>default <code>None</code></td><td>String to append to DataFrame column names. Pass a list with the length equal to the number of columns when calling <code>get_dummies()</code> on a DataFrame. Alternatively, <code>prefix</code> can be a <a href="https://blog.finxter.com/python-dictionary/" data-type="post" data-id="5232" target="_blank" rel="noreferrer noopener">dictionary</a> mapping the column names to the prefixes.</td></tr><tr><td><code>prefix_sep</code></td><td><code>str</code>, default ‘_’</td><td>Separator/delimiter to use if <code>prefix</code> is appended. 
Or pass a list or dictionary, as with <code>prefix</code>.</td></tr><tr><td><code>dummy_na</code></td><td><code><a href="https://blog.finxter.com/python-bool/" data-type="post" data-id="17841" target="_blank" rel="noreferrer noopener">bool</a></code>, default <code>False</code></td><td>Add a column to indicate <code>NaN</code> values. If <code>False</code>, <code>NaN</code> values are ignored.</td></tr><tr><td><code>columns</code></td><td>list-like, default <code>None</code></td><td>Column names in the DataFrame to be encoded. If <code>columns</code> is <code>None</code>, all the columns with object or category <code>dtype</code> will be converted.</td></tr><tr><td><code>sparse</code></td><td><code>bool</code>, default <code>False</code></td><td>Whether the dummy-encoded columns should be backed by a <code>SparseArray</code> (<code>True</code>) or by a regular NumPy array (<code>False</code>).</td></tr><tr><td><code>drop_first</code></td><td><code>bool</code>, default <code>False</code></td><td>Whether to get k-1 dummies out of k categorical levels by removing the first level.</td></tr><tr><td><code>dtype</code></td><td>dtype, default <code>np.uint8</code> (<code>bool</code> since pandas 2.0)</td><td>Data type for the new columns. 
Only a single <code>dtype</code> is allowed.</td></tr><tr><td></td><td></td><td></td></tr><tr><td><strong>Returns</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td></td><td>DataFrame</td><td>Dummy-coded data</td></tr></tbody></table></figure> <h2 class="wp-block-heading">The Basic Functionality of get_dummies()</h2> <p class="wp-block-paragraph">We will start with a simple example to understand how and where we can apply the <code>get_dummies()</code> function and how exactly it works:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
x = ['a', 'b', 'c', 'a', 'c']
pd.get_dummies(x)</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>a</strong></td><td><strong>b</strong></td><td><strong>c</strong></td></tr><tr><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>1</td><td>0</td><td>1</td><td>0</td></tr><tr><td>2</td><td>0</td><td>0</td><td>1</td></tr><tr><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr><td>4</td><td>0</td><td>0</td><td>1</td></tr></tbody></table></figure> <p class="wp-block-paragraph">First, we import the Pandas library to be able to use the function. </p> <p class="wp-block-paragraph">Second, we create a simple <a rel="noreferrer noopener" href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank">Python list</a> that contains several characters and we assign this list to the variable “<code>x</code>”. </p> <p class="wp-block-paragraph">Third, we call the <code>get_dummies()</code> function and, inside the function’s parentheses, we pass the list “<code data-enlighter-language="generic" class="EnlighterJSRAW">x</code>” as the argument. 
</p> <p class="wp-block-paragraph">The output is a Pandas data frame.</p> <p class="wp-block-paragraph">The data frame consists of the columns “<code>a</code>”, “<code>b</code>”, and “<code>c</code>” and the rows “<code>0</code>”, “<code>1</code>”, “<code>2</code>”, “<code>3</code>”, and “<code>4</code>”. The cell entries are either “<code>0</code>” or “<code>1</code>”. </p> <p class="wp-block-paragraph"><strong>So, what exactly is happening here?</strong></p> <p class="wp-block-paragraph">The column labels <code>"a"</code>, <code>"b"</code>, and <code>"c"</code> are the unique characters from the list that we passed (<code>['a', 'b', 'c', 'a', 'c']</code>). </p> <p class="wp-block-paragraph">The number of rows in the data frame equals the length of the list. There are five rows and five characters. The ones and zeros in the data frame are the actual dummy variables. </p> <p class="wp-block-paragraph">When we have a look at the first entry (column: “<code>a</code>”, row: “<code>0</code>”), we observe that this value is a “<code>1</code>”. Because this “<code>1</code>” sits in row “<code>0</code>” (remember: a computer program starts counting at 0) and in column “<code>a</code>”, it tells us that the first entry of the list is the character <code>"a"</code>.</p> <p class="wp-block-paragraph">Another example is the data frame entry in row “<code>2</code>” and column “<code>c</code>”: This entry is also “<code>1</code>” because in the list there is a <code>"c"</code> in third place. </p> <h2 class="wp-block-heading">Handling NaN Values</h2> <p class="wp-block-paragraph">In this section, we will find out how the <code>get_dummies()</code> function handles <code>NaN</code> values. </p> <p class="wp-block-paragraph">For that purpose, we create another Python list. 
This list contains the same values as the one from the first example, except that the last character is replaced with a <code>NaN</code> value:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
y = ['a', 'b', 'c', 'a', np.nan]
pd.get_dummies(y)</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>a</strong></td><td><strong>b</strong></td><td><strong>c</strong></td></tr><tr><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>1</td><td>0</td><td>1</td><td>0</td></tr><tr><td>2</td><td>0</td><td>0</td><td>1</td></tr><tr><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr><td>4</td><td>0</td><td>0</td><td>0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The new list is assigned to the variable <code>y</code>. </p> <p class="wp-block-paragraph">As we can see, the list contains the unique values <code>"a"</code>, <code>"b"</code>, <code>"c"</code>, and <code>np.nan</code>. The latter is a <code>NaN</code> value that we created using the <a href="https://blog.finxter.com/numpy-tutorial/" data-type="post" data-id="1356" target="_blank" rel="noreferrer noopener">NumPy library</a>, which is why we had to import that library here. </p> <p class="wp-block-paragraph">The <code>get_dummies()</code> function creates a data frame just like in the first example. </p> <p class="wp-block-paragraph">Again, we get three columns <code>"a"</code>, <code>"b"</code>, and <code>"c"</code> and five rows. The only difference compared to the first example is the last row. Here, we have zeros exclusively. 
That’s because the last value from the list is a <code>NaN</code> value which we can’t assign to either <code>"a"</code>, <code>"b"</code>, or <code>"c"</code>.</p> <p class="wp-block-paragraph">However, we can make the <code>NaN</code> value visible in the resulting data frame by applying the <code>dummy_na</code> parameter:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.get_dummies(y, dummy_na=True)</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>a</strong></td><td><strong>b</strong></td><td><strong>c</strong></td><td><strong>NaN</strong></td></tr><tr><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr><tr><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>2</td><td>0</td><td>0</td><td>1</td><td>0</td></tr><tr><td>3</td><td>1</td><td>0</td><td>0</td><td>0</td></tr><tr><td>4</td><td>0</td><td>0</td><td>0</td><td>1</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We set this parameter to <code>True</code>. That way, we add another column with the label <code>NaN</code>. </p> <p class="wp-block-paragraph">In the resulting data frame, the last row’s <code>NaN</code> entry is now <code>1</code> because of the <code>NaN</code> value in the list.</p> <h2 class="wp-block-heading">Apply get_dummies() to a DataFrame</h2> <p class="wp-block-paragraph">By now, we have seen how to apply the <code>get_dummies()</code> function on lists. </p> <p class="wp-block-paragraph">However, we can also apply this function to DataFrames. 
So, let’s create a simple data frame:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})
print(df)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>A</strong></td><td><strong>B</strong></td><td><strong>C</strong></td><td><strong>D</strong></td></tr><tr><td>0</td><td>a</td><td>a</td><td>1</td><td>4</td></tr><tr><td>1</td><td>b</td><td>c</td><td>2</td><td>5</td></tr><tr><td>2</td><td>b</td><td>b</td><td>3</td><td>6</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We get four columns <code>"A"</code>, <code>"B"</code>, <code>"C"</code>, and <code>"D"</code> and three rows <code>"0"</code>, <code>"1"</code>, and <code>"2"</code>. The columns <code>"A"</code> and <code>"B"</code> contain characters, whereas columns <code>"C"</code> and <code>"D"</code> contain integer values.</p> <p class="wp-block-paragraph">Now, we apply <code>get_dummies()</code> to this DataFrame:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.get_dummies(df)</pre> <p class="wp-block-paragraph"><strong>Result:</strong></p> <figure class="wp-block-table 
is-style-stripes"><table><tbody><tr><td></td><td><strong>C</strong></td><td><strong>D</strong></td><td><strong>A_a</strong></td><td><strong>A_b</strong></td><td><strong>B_a</strong></td><td><strong>B_b</strong></td><td><strong>B_c</strong></td></tr><tr><td>0</td><td>1</td><td>4</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>1</td><td>2</td><td>5</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr><tr><td>2</td><td>3</td><td>6</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The columns <code>"C"</code> and <code>"D"</code> remain unchanged because only columns with either “object” or “category” data type will be converted.</p> <p class="wp-block-paragraph">We also get two <code>"A_"</code> columns and three <code>"B_"</code> columns. That’s because in the initial data frame there are only two unique values in column <code>"A"</code> and three unique values in column <code>"B"</code>. </p> <p class="wp-block-paragraph">The ones and zeros in the resulting data frame are the dummy variables, just as in the examples above where we applied the <code>get_dummies()</code> function on lists. </p> <p class="wp-block-paragraph">For example, the <code>"1"</code> in the first row of the <code>"A_a"</code> column means that the first value from the <code>"A"</code> column in the initial data frame is the character <code>"a"</code>.</p> <h2 class="wp-block-heading">The “columns” parameter</h2> <p class="wp-block-paragraph">Especially in large data frames, it might be that we only want to convert specific columns instead of converting every possible column. 
For that, we use the “<code>columns</code>” parameter, to which we pass the labels of the columns that we want to convert.</p> <p class="wp-block-paragraph">We again use the data frame that we created in the previous section:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>A</strong></td><td><strong>B</strong></td><td><strong>C</strong></td><td><strong>D</strong></td></tr><tr><td>0</td><td>a</td><td>a</td><td>1</td><td>4</td></tr><tr><td>1</td><td>b</td><td>c</td><td>2</td><td>5</td></tr><tr><td>2</td><td>b</td><td>b</td><td>3</td><td>6</td></tr></tbody></table></figure> <p class="wp-block-paragraph">But now, when calling the <code>get_dummies()</code> function, we add the “<code>columns</code>” parameter and pass it a list with the single entry <code>"B"</code> to state that we only want the dummy variables of this column:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.get_dummies(df, columns=['B'])</pre> <p class="wp-block-paragraph"><strong>Result:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>A</strong></td><td><strong>C</strong></td><td><strong>D</strong></td><td><strong>B_a</strong></td><td><strong>B_b</strong></td><td><strong>B_c</strong></td></tr><tr><td>0</td><td>a</td><td>1</td><td>4</td><td>1</td><td>0</td><td>0</td></tr><tr><td>1</td><td>b</td><td>2</td><td>5</td><td>0</td><td>0</td><td>1</td></tr><tr><td>2</td><td>b</td><td>3</td><td>6</td><td>0</td><td>1</td><td>0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The first three columns of the resulting data frame (<code>"A"</code>, <code>"C"</code>, and <code>"D"</code>) are left unchanged. They are the same as in the initial data frame. 
</p> <p class="wp-block-paragraph">The columns <code>"C"</code> and <code>"D"</code> are unchanged because they are neither from the “<code>object</code>” data type nor from the “<code>category</code>” data type. </p> <p class="wp-block-paragraph">And <code>"A"</code> remains unchanged because we did not add it to our “<code>columns</code>” parameter’s list.</p> <p class="wp-block-paragraph">The last three columns in the resulting data frame are the encoded variables from column <code>"B"</code>.</p> <p class="wp-block-paragraph">By default, the <code>columns</code> parameter is set to <code>None</code>. This way, all columns with either “<code>object</code>” or “<code>category</code>” data type will be converted. We saw that in the previous examples where we did not set the <code>columns</code> parameter.</p> <h2 class="wp-block-heading">Changing the Prefixes</h2> <p class="wp-block-paragraph">We can change the prefixes for our new columns in the resulting data frames by adding the <code>prefix</code> parameter. </p> <p class="wp-block-paragraph">Again, we use the data frame <code>df</code> for this purpose:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>A</strong></td><td><strong>B</strong></td><td><strong>C</strong></td><td><strong>D</strong></td></tr><tr><td>0</td><td>a</td><td>a</td><td>1</td><td>4</td></tr><tr><td>1</td><td>b</td><td>c</td><td>2</td><td>5</td></tr><tr><td>2</td><td>b</td><td>b</td><td>3</td><td>6</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Now, we perform the <code>get_dummies()</code> operation on this data frame and add the <code>prefix</code> parameter which we assign a list with the prefix labels for the converted columns. 
</p> <p class="wp-block-paragraph">This list should be the same <a href="https://blog.finxter.com/python-len/" data-type="post" data-id="22386" target="_blank" rel="noreferrer noopener">length</a> as the number of columns that get converted:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.get_dummies(df, prefix=['column1', 'column2'])</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>C</strong></td><td><strong>D</strong></td><td><strong>column1_a</strong></td><td><strong>column1_b</strong></td><td><strong>column2_a</strong></td><td><strong>column2_b</strong></td><td><strong>column2_c</strong></td></tr><tr><td>0</td><td>1</td><td>4</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>1</td><td>2</td><td>5</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr><tr><td>2</td><td>3</td><td>6</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Since two columns get encoded (<code>"A"</code> and <code>"B"</code>), we apply two prefixes to the <code>prefix</code> parameter, <code>"column1"</code> and <code>"column2"</code>.</p> <p class="wp-block-paragraph">The resulting data frame shows the new prefixes for the encoded columns.</p> <p class="wp-block-paragraph">If we want to, we can also change the prefix separator by adding the <code>prefix_sep</code> parameter:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.get_dummies(df, prefix=['column1', 'column2'], prefix_sep=':')</pre> <p class="wp-block-paragraph"><strong>Result:</strong></p> <figure class="wp-block-table 
is-style-stripes"><table><tbody><tr><td></td><td><strong>C</strong></td><td><strong>D</strong></td><td><strong>column1:a</strong></td><td><strong>column1:b</strong></td><td><strong>column2:a</strong></td><td><strong>column2:b</strong></td><td><strong>column2:c</strong></td></tr><tr><td>0</td><td>1</td><td>4</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr><tr><td>1</td><td>2</td><td>5</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr><tr><td>2</td><td>3</td><td>6</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We perform the same <code>get_dummies()</code> operation as before, but we add the <code>prefix_sep</code> parameter and set it to <code>":"</code>. </p> <p class="wp-block-paragraph">By default, the separator is <code>"_"</code>, but we can change it to whatever we want.</p> <h2 class="wp-block-heading">Summary</h2> <p class="wp-block-paragraph">All in all, we learned all about the Pandas function <code>get_dummies()</code>. </p> <p class="wp-block-paragraph">We learned the basic functionality of this method, how to handle <code>NaN</code> values, how to perform the function on data frames as well as lists, how to only encode specific columns, and how to set different prefixes.</p> <p class="wp-block-paragraph">For more tutorials about <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas</a>, Python libraries, Python in general, or other computer science-related topics, check out the <a href="https://blog.finxter.com/email-academy/" data-type="page" data-id="12278" target="_blank" rel="noreferrer noopener">Finxter email academy</a>.</p> <p class="wp-block-paragraph">Happy Coding! 
</p> <p>The post <a href="https://blog.finxter.com/pandas-get_dummies-a-simple-guide-with-video/">Pandas get_dummies() – A Simple Guide with Video</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p> </article> <article> <h1>Pandas factorize() – A Simple Guide with Video</h1> <p>Luis Bruemmer — Fri, 14 Jan 2022 12:25:46 +0000</p> <p class="wp-block-paragraph">In this tutorial, we will learn how to apply the <a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank">Pandas</a> function <code>factorize()</code>. This function encodes an object as an <strong><em>enumerated type</em></strong> and determines the unique values.</p> <figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper"> </div></figure> <p class="wp-block-paragraph">Here are the parameters from the <a href="https://pandas.pydata.org/docs/reference/api/pandas.factorize.html" target="_blank" rel="noreferrer noopener">official documentation:</a></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><strong>Parameter</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>values</code></td><td><a href="https://blog.finxter.com/iterators-iterables-and-itertools/" data-type="post" data-id="29507" target="_blank" rel="noreferrer noopener">Sequence</a></td><td>A one-dimensional sequence. 
Sequences that aren’t Pandas objects are coerced to <code>ndarrays</code> before the factorization.</td></tr><tr><td><code>sort</code></td><td><code><a href="https://blog.finxter.com/python-bool/" data-type="post" data-id="17841" target="_blank" rel="noreferrer noopener">bool</a></code>, default: <code>False</code></td><td><a href="https://blog.finxter.com/python-list-sort/" data-type="post" data-id="7176" target="_blank" rel="noreferrer noopener">Sort</a> the uniques and shuffle the codes to maintain the relationship.</td></tr><tr><td><code>na_sentinel</code></td><td><code><a href="https://blog.finxter.com/python-int-function/" data-type="post" data-id="22715" target="_blank" rel="noreferrer noopener">int</a></code> or <code>None</code>, default: -1</td><td>Value to mark <code>NaN</code> values. If set to <code>None</code>, <code>NaN</code> is not dropped from the <code>uniques</code> of the values. Note: this parameter was deprecated in pandas 1.5 in favor of <code>use_na_sentinel</code>.</td></tr><tr><td><code>size_hint</code></td><td><code>int</code>, optional</td><td>Hint to the hash table sizer.</td></tr><tr><td></td><td></td><td></td></tr><tr><td><strong>Returns</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>codes</code></td><td><code>ndarray</code></td><td>An integer <code>ndarray</code> that’s an indexer into <code>uniques</code>.</td></tr><tr><td><code>uniques</code></td><td><code>ndarray</code>, <code>Index</code>, or<br><code>Categorical</code></td><td>The unique values. When <code>values</code> is Categorical, <code>uniques</code> is a Categorical. When <code>values</code> is another Pandas object, an <code>Index</code> is returned. 
Otherwise, a one-dimensional <code>ndarray</code> is returned.</td></tr></tbody></table></figure> <h2 class="wp-block-heading">The Basic Functionality of factorize()</h2> <p class="wp-block-paragraph">To get started, we look at a coding example that shows how the <code>factorize()</code> function works:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'])</pre> <p class="wp-block-paragraph">First, we import the <a href="https://blog.finxter.com/how-to-install-pandas-in-python/" data-type="post" data-id="35926" target="_blank" rel="noreferrer noopener">Pandas library</a>. Then, we call the <code>factorize()</code> function and pass it a <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a> of characters. We assign the result to the two variables “<code>codes</code>” and “<code>uniques</code>” because the function returns two values.</p> <p class="wp-block-paragraph">This is what the return values look like:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> codes
array([0, 0, 1, 2, 3, 0, 3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)</pre> <p class="wp-block-paragraph">Variable <code>codes</code> is an array that contains one numeric code for each value in the initial list. 
</p> <p class="wp-block-paragraph">The best way to see what these numeric values represent is to put the numeric array below the initial list:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">['c', 'c', 'b', 'd', 'a', 'c', 'a']
[0, 0, 1, 2, 3, 0, 3]</pre> <p class="wp-block-paragraph">We observe that each unique character in the original list is assigned its own numeric value, in order of first appearance. Since <code>"c"</code> is the first value in the original list, it is assigned the numeric value “<code>0</code>”, <code>"b"</code> is assigned “<code>1</code>”, and so on. </p> <p class="has-base-background-color has-background wp-block-paragraph"> <strong>Remember</strong>: a computer program starts counting at “0”.</p> <p class="wp-block-paragraph">The data type for the “<code>codes</code>” array is “<code>int64</code>” because we get integer values exclusively.</p> <p class="wp-block-paragraph">Variable “<code>uniques</code>” shows the unique values from the initial list which are <code>"c"</code>, <code>"b"</code>, <code>"d"</code>, and <code>"a"</code>. </p> <p class="wp-block-paragraph">The unique values are presented in that order because they occur in that order in the initial list.</p> <h2 class="wp-block-heading">The “sort” Parameter</h2> <p class="wp-block-paragraph">The list we put in the <code>factorize()</code> function in the previous section (<code>['c', 'c', 'b', 'd', 'a', 'c', 'a']</code>) represents some <a href="https://blog.finxter.com/how-to-detect-lowercase-letters-in-python/" data-type="post" data-id="26765" target="_blank" rel="noreferrer noopener">letters</a> from the alphabet. 
However, the letters here are not ordered alphabetically.</p> <p class="wp-block-paragraph">When we apply the <code>sort</code> parameter, <code>factorize()</code> still returns one code per element in the original order, but it numbers the unique characters in sorted order:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'], sort=True)
>>> codes
array([2, 2, 1, 3, 0, 2, 0], dtype=int64)
</pre> <p class="wp-block-paragraph">We call the same <code>factorize()</code> function as before, but this time, we use the <code>sort</code> parameter and set it equal to <code>True</code>.</p> <p class="wp-block-paragraph">The variable <code>codes</code> now holds the codes that result from numbering the unique characters in alphabetical order. </p> <p class="wp-block-paragraph">For example, <code>"c"</code> is assigned the numeric value <code>2</code> because it is the third value once the unique characters <code>"a"</code>, <code>"b"</code>, <code>"c"</code>, and <code>"d"</code> are sorted alphabetically. 
</p> <p class="has-base-background-color has-background wp-block-paragraph"> <strong>Remember</strong>: computer programs start counting at 0, so 2 is the third value and not the second one.</p> <p class="wp-block-paragraph">The variable <code>uniques</code> now shows the unique values in an alphabetically sorted way:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> uniques array(['a', 'b', 'c', 'd'], dtype=object)</pre> <h2 class="wp-block-heading">Handling Missing Values</h2> <p class="wp-block-paragraph">It might be the case that we have some missing values in our list that we want to perform the <code>factorize()</code> operation on.</p> <p class="wp-block-paragraph">We will change our initial list by replacing one character with a <code>None</code> value. Let’s see how the <code>factorize()</code> method handles this case:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a']) >>> codes array([ 0, -1, 1, 2, 3, 0, 3], dtype=int64) >>> uniques array(['c', 'b', 'd', 'a'], dtype=object)</pre> <p class="wp-block-paragraph">The second value in the initial list is <code>None</code>. </p> <p class="wp-block-paragraph">In the outputted <code>codes</code> array we can see that the <code>None</code> value gets assigned the numeric value <code>-1</code>. </p> <p class="wp-block-paragraph">The function’s parameter <code>na_sentinel</code> is used to handle missing values. 
And since we do not specify this parameter here, the function takes the parameter’s default value which is <code>-1</code>.</p> <p class="wp-block-paragraph">However, we can change this value by applying the <code>na_sentinel</code> parameter and assigning it a custom value:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=-10) >>> codes array([ 0, -10, 1, 2, 3, 0, 3], dtype=int64) >>> uniques array(['c', 'b', 'd', 'a'], dtype=object) </pre> <p class="wp-block-paragraph">Here, the <code>None</code> value from the initial list was assigned the numeric value <code>-10</code> because we set <code>na_sentinel</code> equal to <code>-10</code>. </p> <p class="wp-block-paragraph">In both examples, the <code>uniques</code> array was the same <code>['c', 'b', 'd', 'a']</code> because the <code>None</code> value does not count as a unique value.</p> <p class="wp-block-paragraph">We can also set the <code>na_sentinel</code> parameter equal to <code>None</code>:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=None) >>> codes array([0, 4, 1, 2, 3, 0, 3], dtype=int64) >>> uniques array(['c', 'b', 'd', 'a', nan], dtype=object)</pre> <p class="wp-block-paragraph">Doing so, the <code>None</code> value in the initial <a rel="noreferrer noopener" href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank">list</a> gets assigned the numeric value <code>4</code> in the <code>codes</code> array. 
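</p>
<p class="wp-block-paragraph">A note for newer Pandas versions, offered as a hedged sketch: the <code>na_sentinel</code> parameter was deprecated in Pandas 1.5 and removed in Pandas 2.0 in favor of the boolean <code>use_na_sentinel</code>. Assuming Pandas 1.5 or later, counting the missing value in can be written as shown here. The position of the missing value inside <code>uniques</code> may differ between versions, so the sketch does not rely on it, and it passes a NumPy object array because recent versions may warn about plain-list input:</p>

```python
import numpy as np
import pandas as pd

values = np.array(['c', None, 'b', 'd', 'a', 'c', 'a'], dtype=object)

# use_na_sentinel=False keeps the missing value as its own entry in
# `uniques` instead of marking it with -1 (assumes pandas >= 1.5).
codes, uniques = pd.factorize(values, use_na_sentinel=False)

print(codes)
print(uniques)
```

<p class="wp-block-paragraph">With either spelling, the missing value receives a regular, non-negative code instead of a sentinel.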
</p> <p class="wp-block-paragraph">That’s because by setting the <code>na_sentinel</code> parameter equal to <code>None</code> we do not drop the <code>None</code> value, but we count it in. </p> <p class="wp-block-paragraph">Since the other characters <code>"c"</code>, <code>"b"</code>, <code>"d"</code>, and <code>"a"</code> get the numeric values 0, 1, 2, and 3 respectively, the <code>None</code> value gets the next numeric value which is 4. Thus, in the <code>uniques</code> array, we can find the value <code>nan</code> after the other characters.</p> <h2 class="wp-block-heading">Factorizing Other Pandas Objects</h2> <p class="wp-block-paragraph">By now, we have only factorized lists. When we factorize other Pandas objects, we get a different type for <code>uniques</code>:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> series = pd.Series(['a', 'b', 'a', 'd']) >>> codes, uniques = pd.factorize(series) >>> codes array([0, 1, 0, 2], dtype=int64) >>> uniques Index(['a', 'b', 'd'], dtype='object')</pre> <p class="wp-block-paragraph">Here, we factorize a Pandas series. </p> <p class="wp-block-paragraph">The resulting <code>codes</code> array is structured the same way as in the examples before since we get numeric representations for our characters combined in an array. 
</p> <p class="wp-block-paragraph">However, the <code>uniques</code> output has changed because the <a href="https://blog.finxter.com/python-type/" data-type="post" data-id="23967" target="_blank" rel="noreferrer noopener">type</a> of the output is now <code>Index</code> instead of a NumPy array as in the examples above.</p> <p class="wp-block-paragraph">We can also factorize a <code>Categorical</code> object:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> category = pd.Categorical(['a', 'b', 'a', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (3, object): ['a', 'b', 'd']</pre> <p class="wp-block-paragraph">Again, <code>codes</code> is a NumPy array just like before. But <code>uniques</code> is now a <code>Categorical</code>, which is why its output ends with a <code>Categories</code> line.</p> <p class="wp-block-paragraph">One special thing about <code>Categorical</code> happens when we pass the parameter <code>categories</code> to it:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> category = pd.Categorical(['a', 'b', 'a', 'd'], categories=['a', 'b', 'c', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
</pre> <p class="wp-block-paragraph">We take the same characters for the factorization as in the two examples before. </p> <p class="wp-block-paragraph">But this time, we apply the <code>categories</code> parameter and assign it the list <code>['a', 'b', 'c', 'd']</code> to determine which categories we want to get. 
</p> <p class="wp-block-paragraph">As we can see, in the category list, there is a <code>"c"</code>. However, there is no <code>"c"</code> in the list that gets factorized <code>['a', 'b', 'a', 'd']</code>.</p> <p class="wp-block-paragraph">Variable <code>codes</code> remains unchanged, but <code>uniques</code> now has the <code>c</code> added to the <code>Categories</code> list although there is no <code>c</code> to be factorized.</p> <h2 class="wp-block-heading">Summary</h2> <p class="wp-block-paragraph">All in all, we learned all about the Pandas function <code>factorize()</code> in this tutorial. We learned the basic functionality of this method, how to sort the values, how to handle missing values, and how to factorize different kinds of Pandas objects.</p> <p class="wp-block-paragraph">For more tutorials about <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas</a>, <a href="https://blog.finxter.com/the-complete-python-library-guide/" data-type="post" data-id="3414" target="_blank" rel="noreferrer noopener">Python libraries</a>, <a href="https://blog.finxter.com/python-crash-course/" data-type="post" data-id="3951" target="_blank" rel="noreferrer noopener">Python in general</a>, or other computer science-related topics, check out the <a href="https://blog.finxter.com/email-academy/" data-type="page" data-id="12278" target="_blank" rel="noreferrer noopener">Finxter email academy</a>.</p> <p class="wp-block-paragraph">Happy Coding!</p> <p>The post <a href="https://blog.finxter.com/pandas-factorize-a-simple-guide-with-video/">Pandas factorize() – A Simple Guide with Video</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p> </article> <article> <h1>Pandas merge_asof() – A Simple Guide with Video</h1> <p>Luis Bruemmer — Sat, 08 Jan 2022 16:36:45 +0000</p> <figure class="wp-block-embed is-type-video is-provider-youtube 
wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper"> </div></figure> <p class="wp-block-paragraph">In this tutorial, we will learn how to apply the <code>merge_asof()</code> function. Described in one sentence, this method performs a merge similar to a left join where we match on near keys instead of equal keys. Thus, the function is especially useful when working with time-series data.</p> <p class="wp-block-paragraph">Here are the parameters from the <a href="https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html" target="_blank" rel="noreferrer noopener">official documentation</a>:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><strong>Parameter</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>left</code></td><td>DataFrame or named Series</td><td></td></tr><tr><td><code>right</code></td><td>DataFrame or named Series</td><td></td></tr><tr><td><code>on</code></td><td>label</td><td>Field name to join on. Must be contained in both<br>DataFrames. Data must be ordered. Must be a numeric column. 
On or left_on/right_on must be used.</td></tr><tr><td><code>left_on</code></td><td>label</td><td>Field name to join on in the left DataFrame.</td></tr><tr><td><code>right_on</code></td><td>label</td><td>Field name to join on in the right DataFrame.</td></tr><tr><td><code>left_index</code></td><td><code><a href="https://blog.finxter.com/python-bool/" data-type="post" data-id="17841" target="_blank" rel="noreferrer noopener">bool</a></code></td><td>Use the index of the left DataFrame as the join key.</td></tr><tr><td><code>right_index</code></td><td><code>bool</code></td><td>Use the index of the right DataFrame as the join key.</td></tr><tr><td><code>by</code></td><td>column name or list of column<br>names</td><td>Match on these columns before performing the<br>merge operation.</td></tr><tr><td><code>left_by</code></td><td>column name</td><td>Field names to match on in the left DataFrame.</td></tr><tr><td><code>right_by</code></td><td>column name</td><td>Field names to match on in the right DataFrame.</td></tr><tr><td><code>suffixes</code></td><td>2-length sequence: <a href="https://blog.finxter.com/the-ultimate-guide-to-python-tuples/" data-type="post" data-id="12043" target="_blank" rel="noreferrer noopener">tuple</a>, <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>, etc.</td><td>Suffix to apply to the overlapping column names in the left and right side, respectively.</td></tr><tr><td><code>tolerance</code></td><td><code><a href="https://blog.finxter.com/python-int-function/" data-type="post" data-id="22715" target="_blank" rel="noreferrer noopener">int</a></code> or <code>Timedelta</code>,<br>optional, default: <code>None</code></td><td>Select a tolerance within this range. 
Must be compatible with the merge index.</td></tr><tr><td><code>allow_exact_matches</code></td><td>bool, default <code data-enlighter-language="generic" class="EnlighterJSRAW">True</code></td><td>If set to <code>True</code>: allow matching with the same ‘<code>on</code>’ value (i.e. <a href="https://blog.finxter.com/python-less-than-or-equal-to/" data-type="post" data-id="30938">less-th</a><a href="https://blog.finxter.com/python-less-than-or-equal-to/" data-type="post" data-id="30938" target="_blank" rel="noreferrer noopener">a</a><a href="https://blog.finxter.com/python-less-than-or-equal-to/" data-type="post" data-id="30938">n-or-equal-to</a> / <a href="https://blog.finxter.com/python-greater-than-or-equal-to/" data-type="post" data-id="30888" target="_blank" rel="noreferrer noopener">greater-than-or-equal-to</a>)<br>If set to <code>False</code>: don’t match the same ‘<code>on</code>’ value (i.e., strictly <a href="https://blog.finxter.com/python-less-than/" data-type="post" data-id="30841" target="_blank" rel="noreferrer noopener">less-than</a> / strictly <a href="https://blog.finxter.com/python-greater-than/" data-type="post" data-id="30762" target="_blank" rel="noreferrer noopener">greater-than</a>).</td></tr><tr><td><code>direction</code></td><td><code>'backward'</code>, <code>'forward'</code>, or<br><code>'nearest'</code>, default: <code>'backward'</code></td><td>Whether to search for the prior, subsequent, or<br>closest matches.</td></tr><tr><td></td><td></td><td></td></tr><tr><td><strong>Returns</strong></td><td><strong>Type</strong></td><td></td></tr><tr><td>merged</td><td>DataFrame</td><td></td></tr></tbody></table></figure> <h2 class="wp-block-heading">Basic Functionality of merge_asof()</h2> <p class="wp-block-paragraph">To get started, we will create two data frames “<code>df1</code>” and “<code>df2</code>“:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" 
data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd df1 = pd.DataFrame({ 'time':[pd.Timestamp('2021-12-15 12:00:00'), pd.Timestamp('2021-12-15 12:00:01'), pd.Timestamp('2021-12-15 12:00:05'), pd.Timestamp('2021-12-15 12:00:07'), pd.Timestamp('2021-12-15 12:00:09'), pd.Timestamp('2021-12-15 12:00:12')], 'price': [12, 14, 8, 7, 11, 15] }) print(df1) </pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Let’s have a look at the second DataFrame:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df2 = pd.DataFrame({ 'time':[pd.Timestamp('2021-12-15 12:00:00'), pd.Timestamp('2021-12-15 12:00:02'), pd.Timestamp('2021-12-15 12:00:04'), pd.Timestamp('2021-12-15 12:00:08'), pd.Timestamp('2021-12-15 12:00:10'), pd.Timestamp('2021-12-15 12:00:11')], 'price': [5, 7, 9, 12, 8, 12] }) print(df2) </pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 12:00:02</td><td>7</td></tr><tr><td>2</td><td>2021-12-15 
12:00:04</td><td>9</td></tr><tr><td>3</td><td>2021-12-15 12:00:08</td><td>12</td></tr><tr><td>4</td><td>2021-12-15 12:00:10</td><td>8</td></tr><tr><td>5</td><td>2021-12-15 12:00:11</td><td>12</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Both data frames contain a “<code>time</code>” column and a “<code>price</code>” column respectively. However, the prices and timestamps from both data frames differ from each other.</p> <p class="wp-block-paragraph">Now that we created the data frames, we are ready to do our first <code>merge_asof()</code> operation:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time')</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>5</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>9</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>9</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>12</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>12</td></tr><tr><td> </td><td> </td><td> </td><td> </td></tr></tbody></table></figure> <p class="wp-block-paragraph">We put three arguments inside the <code>merge_asof()</code> function. The first two arguments are the two data frames that we want to merge, “<code>df1</code>” and “<code>df2</code>“. The third argument is the “<code>on</code>” parameter which expects the label of the column that we want to merge on. 
We set this parameter equal to “<code>time</code>“, thus we want to merge on the “<code>time</code>” column.</p> <p class="wp-block-paragraph">The resulting data frame has two price columns “<code>price_x</code>” and “<code>price_y</code>“. The “<code>time</code>” column here contains the same timestamps as “<code>df1</code>“. That’s because we set this data frame as the first argument and thus the left data frame. And since the <strong>asof merge</strong> is similar to a <strong>left join</strong>, we get the values from the left data frame.</p> <p class="wp-block-paragraph">When we take a look at the new price columns, we observe that the “<code>price_x</code>” values equal the price values from “<code>df1</code>“. That’s also the case because “<code>df1</code>” is the left data frame.</p> <p class="wp-block-paragraph">The interesting column here is the “<code>price_y</code>” column. The “<code>price_y</code>” value in the first row equals the price value from “<code>df2</code>” in that same row. That’s because the first timestamps from “<code>df1</code>” and “<code>df2</code>” match (they are both <code>"2021-12-15 12:00:00"</code>). </p> <p class="wp-block-paragraph">However, the timestamps in the second row from both data frames differ from each other (<code>"df1": "2021-12-15 12:00:01", "df2": "2021-12-15 12:00:02"</code>). </p> <p class="wp-block-paragraph">The “<code>price_y</code>” value in the second row in the resulting data frame is <code>5</code> and thus unequal to the price value in “<code>df2</code>” in the same row. </p> <p class="wp-block-paragraph">By default, the <code>merge_asof()</code> function performs a backward search. 
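</p>
<p class="wp-block-paragraph">To make the backward search concrete, here is a minimal sketch that reproduces it by hand. It is not how Pandas implements the merge internally. It uses integer seconds in place of the timestamps above, and the names <code>left</code>, <code>right</code>, <code>pos</code>, and <code>manual</code> are our own:</p>

```python
import numpy as np
import pandas as pd

# Seconds after 12:00:00, standing in for the timestamps of df1 and df2.
left = pd.DataFrame({'time': [0, 1, 5, 7, 9, 12], 'price': [12, 14, 8, 7, 11, 15]})
right = pd.DataFrame({'time': [0, 2, 4, 8, 10, 11], 'price': [5, 7, 9, 12, 8, 12]})

# For every left key, find the position of the last right key that is
# less than or equal to it. (A result of -1 would mean "no match";
# that case does not occur in this data and is not handled here.)
pos = np.searchsorted(right['time'].to_numpy(), left['time'].to_numpy(), side='right') - 1
manual = right['price'].to_numpy()[pos]

merged = pd.merge_asof(left, right, on='time')
print(manual.tolist())
print(merged['price_y'].tolist())
```

<p class="wp-block-paragraph">Both printed lists should contain the same prices, because the hand-written lookup and the default <code>merge_asof()</code> call pick the same rows of the right data frame.</p>
<p class="wp-block-paragraph">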
Thus, it takes the price of the closest earlier timestamp in “<code>df2</code>” (<code>"2021-12-15 12:00:00"</code>), which is <code>5</code> in this case:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>5</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Similarly, the “<code>price_y</code>” value in the third row is <code>9</code> because “<code>df2</code>” has no timestamp equal to the one in that row of “<code>df1</code>”, so the function looks backward for the closest earlier one. </p> <p class="wp-block-paragraph">Since the timestamp for the third row in the resulting data frame is <code>"2021-12-15 12:00:05"</code>, the function looks for the latest timestamp in “<code>df2</code>” that is not later than it. That timestamp is <code>"2021-12-15 12:00:04"</code>. </p> <p class="wp-block-paragraph">Thus, the function takes this row’s price value:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price</strong></td></tr><tr><td>2</td><td>2021-12-15 12:00:04</td><td>9</td></tr></tbody></table></figure> <h2 class="wp-block-heading">Parameter “direction”</h2> <p class="wp-block-paragraph">As mentioned in the previous section, the <code>merge_asof()</code> operation performs a backward search by default because the <code>direction</code> parameter is set to <code>"backward"</code> unless we specify otherwise.</p> <p class="wp-block-paragraph">The other two options for the <code>direction</code> parameter are <code>"forward"</code> and <code>"nearest"</code>. We will start with <code>"forward"</code>. 
As the name suggests, this is the opposite of the backward search, so, we are looking for subsequent matches instead of prior ones.</p> <p class="wp-block-paragraph">We will perform the same <code>merge_asof()</code> operation as before, but this time, with a forward search:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time', direction='forward')</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>5.0</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>7.0</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>12.0</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>12.0</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>8.0</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>NaN</td></tr></tbody></table></figure> <p class="wp-block-paragraph">And we compare it to the initial <code>merge_asof()</code> operation with a default backward search:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time')</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 
12:00:01</td><td>14</td><td>5</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>9</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>9</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>12</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>12</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The “<code>time</code>” and “<code>price_x</code>” columns remain unchanged. However, some values in the “<code>price_y</code>” column are different now. </p> <p class="wp-block-paragraph">For example, the “<code>price_y</code>” value in the second row is now 7 instead of 5. The timestamp <code>"2021-12-15 12:00:01"</code> does not exist in “<code>df2</code>“, so the function now looks for the next timestamp in “<code>df2</code>” instead of the previous one. </p> <p class="wp-block-paragraph">The next timestamp is <code>"2021-12-15 12:00:02"</code>, so the function takes this row’s price value which is 7.</p> <p class="wp-block-paragraph">The last option for the “<code>direction</code>” parameter is <code>"nearest"</code>:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time', direction='nearest')</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>5</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>9</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>12</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>12</td></tr><tr><td>5</td><td>2021-12-15 
12:00:12</td><td>15</td><td>12</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The only value that changed here compared to the backward search is the “<code>price_y</code>” value in the fourth row which is now 12 instead of 9. </p> <p class="wp-block-paragraph">That’s because the <code>"nearest"</code> search looks for the closest match. The timestamp <code>"2021-12-15 12:00:07"</code> does not exist in “<code>df2</code>“, so the function looks for the timestamp that is closest to <code>"2021-12-15 12:00:07"</code>. And that is timestamp <code>"2021-12-15 12:00:08"</code>. </p> <p class="wp-block-paragraph">So, the function takes this row’s price value which is 12.</p> <h2 class="wp-block-heading">Allowing Exact Matches</h2> <p class="wp-block-paragraph">It might be that we do not want to include exact matches in our merges, for example, if we only want to get values from unique timestamps. Therefore, we apply the “<code>allow_exact_matches</code>” parameter. This parameter expects a boolean value and is set to “<code>True</code>” by default.</p> <p class="wp-block-paragraph">Again, we perform the initial <code>merge_asof()</code> operation but this time with the “<code>allow_exact_matches</code>” parameter set to “<code>False</code>“:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time', allow_exact_matches=False)</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>NaN</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>5.0</td></tr><tr><td>2</td><td>2021-12-15 
12:00:05</td><td>8</td><td>9.0</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>9.0</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>12.0</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>12.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">And we compare it to the initial <code>merge_asof()</code> operation:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time')</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>5</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>9</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>9</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>12</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>12</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The only value that changed is the first “<code>price_y</code>” value, which is “<code>NaN</code>” instead of 5. That’s because the first timestamp of “<code>df1</code>” and the first timestamp of “<code>df2</code>” match exactly. Since we do not allow exact matches here and “<code>df2</code>” has no earlier timestamp to fall back on, the resulting value is “<code>NaN</code>”. The remaining values are shown as floats (<code>5.0</code> instead of <code>5</code>) because a column that contains “<code>NaN</code>” is stored as a float column.</p> <h2 class="wp-block-heading">Selecting the Tolerance</h2> <p class="wp-block-paragraph">The <code>merge_asof()</code> function provides the “<code>tolerance</code>” parameter. 
Using this parameter, we can determine how much tolerance we want to allow between our timestamps:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time', tolerance=pd.Timedelta('1s'))</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>5.0</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>5.0</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>9.0</td></tr><tr><td><a>3</a></td><td>2021-12-15 12:00:07</td><td>7</td><td>NaN</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>12.0</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>12.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The initial <code>merge_asof()</code>:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time')</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>5</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>9</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>9</td></tr><tr><td>4</td><td>2021-12-15 
12:00:09</td><td>11</td><td>12</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>12</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The only difference is that we apply the “<code>tolerance</code>” parameter and set it equal to a <code>Timedelta</code> of one second. </p> <p class="wp-block-paragraph">In the fourth row, at the timestamp <code>"2021-12-15 12:00:07"</code>, we find a “<code>NaN</code>” value where the initial <code>merge_asof()</code> operation returned 9. The reason is that the timestamp <code>"2021-12-15 12:00:07"</code> does not exist in “<code>df2</code>”. </p> <p class="wp-block-paragraph">So, <code>merge_asof()</code> looks for the previous timestamp in “<code>df2</code>”. However, the previous timestamp is <code>"2021-12-15 12:00:04"</code>, which is three seconds earlier and therefore not within the tolerance of one second. Thus, the price value from that row is not used, and we get a “<code>NaN</code>” value in the resulting data frame.</p> <h2 class="wp-block-heading">Matching by a Specific Column</h2> <p class="wp-block-paragraph">For this section, we will modify “<code>df1</code>” and “<code>df2</code>” a little bit:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df1['category'] = ['A', 'A', 'A', 'A', 'B', 'B']
print(df1)</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price</strong></td><td><strong>category</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>A</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>A</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>A</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>A</td></tr><tr><td>4</td><td>2021-12-15
12:00:09</td><td>11</td><td>B</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>B</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Also:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df2['category'] = ['A', 'A', 'B', 'B', 'B', 'B']
print(df2)</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price</strong></td><td><strong>category</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>5</td><td>A</td></tr><tr><td>1</td><td>2021-12-15 12:00:02</td><td>7</td><td>A</td></tr><tr><td>2</td><td>2021-12-15 12:00:04</td><td>9</td><td>B</td></tr><tr><td>3</td><td>2021-12-15 12:00:08</td><td>12</td><td>B</td></tr><tr><td>4</td><td>2021-12-15 12:00:10</td><td>8</td><td>B</td></tr><tr><td>5</td><td>2021-12-15 12:00:11</td><td>12</td><td>B</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We assigned both data frames a “<code>category</code>” column and added the categories “A” and “B”.
Notice that the categories are different in “<code>df1</code>” and “<code>df2</code>”.</p> <p class="wp-block-paragraph">Now, we perform a <code>merge_asof()</code> operation again and add the “<code>by</code>” parameter:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time', by='category')</pre> <p class="wp-block-paragraph"><strong>Result:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>category</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15 12:00:00</td><td>12</td><td>A</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>A</td><td>5</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>A</td><td>7</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>A</td><td>7</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>B</td><td>12</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>B</td><td>12</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We merge the data frames on the “<code>time</code>” column and by the “<code>category</code>” column.</p> <p class="wp-block-paragraph">This was the initial <code>merge_asof()</code> operation:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_asof(df1, df2, on='time')</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>time</strong></td><td><strong>price_x</strong></td><td><strong>price_y</strong></td></tr><tr><td>0</td><td>2021-12-15
12:00:00</td><td>12</td><td>5</td></tr><tr><td>1</td><td>2021-12-15 12:00:01</td><td>14</td><td>5</td></tr><tr><td>2</td><td>2021-12-15 12:00:05</td><td>8</td><td>9</td></tr><tr><td>3</td><td>2021-12-15 12:00:07</td><td>7</td><td>9</td></tr><tr><td>4</td><td>2021-12-15 12:00:09</td><td>11</td><td>12</td></tr><tr><td>5</td><td>2021-12-15 12:00:12</td><td>15</td><td>12</td></tr></tbody></table></figure> <p class="wp-block-paragraph">As we can see, some “<code>price_y</code>” values have changed again. </p> <p class="wp-block-paragraph">For example, the value in the third row is now 7 instead of 9. The category in the third row in the resulting data frame is “<code>A</code>“. And since the timestamp <code>"2021-12-15 12:00:05"</code> from the third row in the resulting data frame does not exist in “<code>df2</code>“, the function looks backward for the next timestamp. The next timestamp backward would be <code>"2021-12-15 12:00:04"</code> and the assigned price value for this timestamp in “<code>df2</code>” is 9. But the category for this timestamp is “<code>B</code>“. </p> <p class="wp-block-paragraph">However, the function looks for the next timestamp from the same category which is “<code>A</code>“. And this timestamp is <code>"2021-12-15 12:00:02"</code> with a price value of 7. Thus, the “<code>price_y</code>” value in the merged data frame in the third row is 7 and not 9.</p> <h2 class="wp-block-heading">Summary</h2> <p class="wp-block-paragraph">All in all, we studied the Pandas function <code>merge_asof()</code>. 
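</p> <p class="wp-block-paragraph">As a compact recap, the following sketch rebuilds the two data frames from the tables shown in this tutorial and applies the “<code>tolerance</code>” and “<code>by</code>” parameters once more. The variable names <code>within_1s</code> and <code>by_category</code> are illustrative assumptions, not part of the original example:</p>

```python
import pandas as pd

# Data rebuilt from the tables shown in this tutorial
df1 = pd.DataFrame({
    'time': pd.to_datetime(['2021-12-15 12:00:00', '2021-12-15 12:00:01',
                            '2021-12-15 12:00:05', '2021-12-15 12:00:07',
                            '2021-12-15 12:00:09', '2021-12-15 12:00:12']),
    'price': [12, 14, 8, 7, 11, 15],
})
df2 = pd.DataFrame({
    'time': pd.to_datetime(['2021-12-15 12:00:00', '2021-12-15 12:00:02',
                            '2021-12-15 12:00:04', '2021-12-15 12:00:08',
                            '2021-12-15 12:00:10', '2021-12-15 12:00:11']),
    'price': [5, 7, 9, 12, 8, 12],
})

# Backward search, but only accept matches at most one second away
within_1s = pd.merge_asof(df1, df2, on='time', tolerance=pd.Timedelta('1s'))

# Backward search restricted to rows with the same category
by_category = pd.merge_asof(
    df1.assign(category=['A', 'A', 'A', 'A', 'B', 'B']),
    df2.assign(category=['A', 'A', 'B', 'B', 'B', 'B']),
    on='time', by='category',
)

print(within_1s)
print(by_category)
```

<p class="wp-block-paragraph">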
We learned the basic functionality of this function, how to search in different directions, whether to allow exact matches or not, how to specify a tolerance, and how to perform <code>merge_asof()</code> by specific columns.</p> <p class="wp-block-paragraph">For more tutorials about <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas</a>, Python <a href="https://blog.finxter.com/the-complete-python-library-guide/" data-type="post" data-id="3414" target="_blank" rel="noreferrer noopener">libraries</a>, <a href="https://blog.finxter.com/python-crash-course/" data-type="post" data-id="3951" target="_blank" rel="noreferrer noopener">Python</a> in general, or other computer science-related topics, check out the <a href="https://blog.finxter.com/blog/" data-type="URL" data-id="https://blog.finxter.com/blog/" target="_blank" rel="noreferrer noopener">Finxter Blog page</a>.</p> <p class="wp-block-paragraph">Happy Coding!</p> <p>The post <a href="https://blog.finxter.com/pandas-merge_asof-a-simple-guide-with-video/">Pandas merge_asof() – A Simple Guide with Video</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p> </article> <article> <h1>Pandas merge_ordered() – A Simple Guide with Video</h1> <p>Luis Bruemmer — Thu, 30 Dec 2021 17:47:53 +0000</p> <p class="wp-block-paragraph">In this tutorial, we will learn about the Pandas function <code>merge_ordered()</code>. This method performs a merge with optional <a href="https://blog.finxter.com/scipy-interpolate-1d-2d-and-3d/" data-type="post" data-id="17935" target="_blank" rel="noreferrer noopener">interpolation</a>. 
It is especially useful for ordered data like time series data.</p> <figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper"> </div></figure> <h2 class="wp-block-heading">Syntax and Parameters</h2> <p class="wp-block-paragraph">Here are the parameters from the <a rel="noreferrer noopener" href="https://pandas.pydata.org/docs/reference/api/pandas.merge_ordered.html" target="_blank">official documentation</a>:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><strong>Parameter</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>left</code></td><td>DataFrame</td><td></td></tr><tr><td><code>right</code></td><td>DataFrame</td><td></td></tr><tr><td><code>on</code></td><td>label or <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a></td><td>Field names to join on. 
Must be contained in both<br>DataFrames.</td></tr><tr><td><code>left_on</code></td><td>label or list, or array-like</td><td>Field names to join on in the left DataFrame.</td></tr><tr><td><code>right_on</code></td><td>label or list, or array-like</td><td>Field names to join on in the right DataFrame.</td></tr><tr><td><code>left_by</code></td><td>column name or list of<br>column names</td><td>Group the left DataFrame by group columns and<br>merge piece by piece with the right DataFrame.</td></tr><tr><td><code>right_by</code></td><td>column name or list of<br>column names</td><td>Group the right DataFrame by group columns and<br>merge piece by piece with the left DataFrame.</td></tr><tr><td><code>fill_method</code></td><td><code>{'ffill', None}</code>,<br>default: <code>None</code></td><td>Interpolation method for data.</td></tr><tr><td><code>suffixes</code></td><td>list-like, default is<br>(<code>"_x"</code>, <code>"_y"</code>)</td><td>A length-2 sequence where each element is<br>optionally a string indicating the suffix to add to the overlapping column names in left and right respectively. A value of <code>None</code> instead of a string indicates that the column name from left or right should be left as it is. 
At least one of the values must not be <code>None</code>.</td></tr><tr><td><code>how</code></td><td><code>{'left', 'right', 'outer', 'inner'}</code>,<br>default <code>'outer'</code></td><td><code>left</code>: use keys from left data frame only<br><code>right</code>: use keys from right data frame only<br><code>outer</code>: use union of keys from both data frames<br><code>inner</code>: use intersection of keys from both data frames</td></tr><tr><td></td><td></td><td></td></tr><tr><td><strong>Returns</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td></td><td>DataFrame</td><td>The merged DataFrame output type will be the same<br>as ‘<code>left</code>’, if it is a subclass of DataFrame.</td></tr></tbody></table></figure> <h2 class="wp-block-heading">Basic Example</h2> <p class="wp-block-paragraph">To get started, we will create two data frames:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

df1 = pd.DataFrame({
    'Date': ['15.01.2019', '16.01.2019', '17.01.2019', '18.01.2019', '19.01.2019', '20.01.2019'],
    'Price': [16.7, 18.4, 20.0, 19.3, 17.1, 21.2]
})
print(df1)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td></tr><tr><td>3</td><td>18.01.2019</td><td>19.3</td></tr><tr><td>4</td><td>19.01.2019</td><td>17.1</td></tr><tr><td>5</td><td>20.01.2019</td><td>21.2</td></tr></tbody></table></figure> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df2 = pd.DataFrame({
    'Date': ['15.01.2019', '17.01.2019', '18.01.2019', '20.01.2019', '21.01.2019', '22.01.2019'],
    'Price': [14.6, 19.8, 21.9, 20.2, 17.4, 18.0]
})
print(df2)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>14.6</td></tr><tr><td>1</td><td>17.01.2019</td><td>19.8</td></tr><tr><td>2</td><td>18.01.2019</td><td>21.9</td></tr><tr><td>3</td><td>20.01.2019</td><td>20.2</td></tr><tr><td>4</td><td>21.01.2019</td><td>17.4</td></tr><tr><td>5</td><td>22.01.2019</td><td>18.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Here, we import the Pandas library as the first step. Then, we create the two data frames “<code>df1</code>” and “<code>df2</code>”, each of which contains a “<code>Date</code>” column and a “<code>Price</code>” column.</p> <p class="wp-block-paragraph">Now that we have created these data frames, we can perform our first <code>merge_ordered()</code> operation:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_ordered(df1, df2, on='Date')</pre> <figure class="wp-block-table
is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_x</strong></td><td><strong>Price_y</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>14.6</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td><td>NaN</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td><td>19.8</td></tr><tr><td>3</td><td>18.01.2019</td><td>19.3</td><td>21.9</td></tr><tr><td>4</td><td>19.01.2019</td><td>17.1</td><td>NaN</td></tr><tr><td>5</td><td>20.01.2019</td><td>21.2</td><td>20.2</td></tr><tr><td>6</td><td>21.01.2019</td><td>NaN</td><td>17.4</td></tr><tr><td>7</td><td>22.01.2019</td><td>NaN</td><td>18.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We apply the <code>merge_ordered()</code> function and put in the two data frames as the first two arguments of the function. That’s because these are the data frames that we want to merge. The third parameter is the “<code>on</code>” parameter. This parameter expects the column or a list of columns that we want to perform the merge on. We choose the “<code>Date</code>” column here.</p> <p class="wp-block-paragraph">The outputted data frame is longer than each of the two initial data frames. That’s because, by default, the <code>merge_ordered()</code> function performs a so-called “<code>outer</code>” join. That means, we use the union of keys from both our data frames. Since there are eight unique dates, the resulting data frame has eight rows in total.</p> <p class="wp-block-paragraph">We also get two price columns: “<code>Price_x</code>” and “<code>Price_y</code>“. For each date, we get a price from the left data frame (“<code>Price_x</code>“) and the right data frame (“<code>Price_y</code>“). If there is a “<code>NaN</code>” value, that means, for this specific date, we have only one price value. 
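</p> <p class="wp-block-paragraph">To see which dates are covered by only one of the two data frames, the missing values can be queried directly. This is a minimal sketch that rebuilds the two data frames with the dates as they appear in the tables above; the names <code>only_left</code> and <code>only_right</code> are illustrative assumptions:</p>

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['15.01.2019', '16.01.2019', '17.01.2019',
             '18.01.2019', '19.01.2019', '20.01.2019'],
    'Price': [16.7, 18.4, 20.0, 19.3, 17.1, 21.2],
})
df2 = pd.DataFrame({
    'Date': ['15.01.2019', '17.01.2019', '18.01.2019',
             '20.01.2019', '21.01.2019', '22.01.2019'],
    'Price': [14.6, 19.8, 21.9, 20.2, 17.4, 18.0],
})

merged = pd.merge_ordered(df1, df2, on='Date')

# Dates without a Price_y exist only in the left data frame
only_left = merged.loc[merged['Price_y'].isna(), 'Date'].tolist()
# Dates without a Price_x exist only in the right data frame
only_right = merged.loc[merged['Price_x'].isna(), 'Date'].tolist()

print(only_left)   # ['16.01.2019', '19.01.2019']
print(only_right)  # ['21.01.2019', '22.01.2019']
```

<p class="wp-block-paragraph">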
For example, for the <code>"16.01.2019"</code>, we do not get a “<code>Price_y</code>” value because this date is only found in the first data frame.</p> <h2 class="wp-block-heading">The “fill_method” parameter</h2> <p class="wp-block-paragraph">As we saw in the example above, there were some missing values labeled with “<code>NaN</code>“:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_x</strong></td><td><strong>Price_y</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>14.6</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td><td>NaN</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td><td>19.8</td></tr><tr><td>3</td><td>18.01.2019</td><td>19.3</td><td>21.9</td></tr><tr><td>4</td><td>19.01.2019</td><td>17.1</td><td>NaN</td></tr><tr><td>5</td><td>20.01.2019</td><td>21.2</td><td>20.2</td></tr><tr><td>6</td><td>21.01.2019</td><td>NaN</td><td>17.4</td></tr><tr><td>7</td><td>22.01.2019</td><td>NaN</td><td>18.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We can get rid of these “<code>NaN</code>” values by replacing these values with the previous value. 
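</p> <p class="wp-block-paragraph">Replacing each missing value with the last valid value before it is commonly called a forward fill. As a minimal sketch of the idea on its own, separate from <code>merge_ordered()</code>, the general-purpose <code>Series.ffill()</code> method does the same on a single column; the values below are made up for illustration:</p>

```python
import pandas as pd

# Hypothetical price column with gaps (None becomes NaN)
prices = pd.Series([14.6, None, 19.8, None, None])

# Forward fill: every NaN takes the last valid value above it
filled = prices.ffill()

print(filled.tolist())  # [14.6, 14.6, 19.8, 19.8, 19.8]
```

<p class="wp-block-paragraph">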
In <code>merge_ordered()</code>, we achieve that by setting the “<code>fill_method</code>” parameter to “<code>ffill</code>”:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_ordered(df1, df2, on='Date', fill_method='ffill')</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_x</strong></td><td><strong>Price_y</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>14.6</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td><td>14.6</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td><td>19.8</td></tr><tr><td>3</td><td>18.01.2019</td><td>19.3</td><td>21.9</td></tr><tr><td>4</td><td>19.01.2019</td><td>17.1</td><td>21.9</td></tr><tr><td>5</td><td>20.01.2019</td><td>21.2</td><td>20.2</td></tr><tr><td>6</td><td>21.01.2019</td><td>21.2</td><td>17.4</td></tr><tr><td>7</td><td>22.01.2019</td><td>21.2</td><td>18.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Now, we do not have any more missing values here. For example, the “<code>NaN</code>” value in the “<code>Price_y</code>” column for the date <code>"16.01.2019"</code> was replaced with the previous value from that column (<code>14.6</code>).</p> <p class="wp-block-paragraph">If multiple values are missing directly one after the other, all of them get replaced by the last available value. For example, the last two values from the “<code>Price_x</code>” column were missing. They were both replaced by the value from the third-to-last row, which is <code>21.2</code>.</p> <h2 class="wp-block-heading">The “suffixes” parameter</h2> <p class="wp-block-paragraph">In the previous example, the two price columns were named “<code>Price_x</code>” and “<code>Price_y</code>” by default.
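</p> <p class="wp-block-paragraph">These default suffixes are only added to columns whose names overlap in both data frames; the key column and uniquely named columns keep their labels. A minimal sketch with made-up frames, where the <code>Volume</code> column is an assumption added purely for illustration:</p>

```python
import pandas as pd

left = pd.DataFrame({'Date': ['15.01.2019', '16.01.2019'],
                     'Price': [16.7, 18.4]})
right = pd.DataFrame({'Date': ['15.01.2019', '16.01.2019'],
                      'Price': [14.6, 15.0],
                      'Volume': [100, 120]})  # hypothetical extra column

merged = pd.merge_ordered(left, right, on='Date')

# Only the overlapping 'Price' column receives the '_x' / '_y' suffixes
print(list(merged.columns))  # ['Date', 'Price_x', 'Price_y', 'Volume']
```

<p class="wp-block-paragraph">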
However, we can change these labels by applying the “<code>suffixes</code>” parameter:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', suffixes=['_leftDF', '_rightDF'])</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_leftDF</strong></td><td><strong>Price_rightDF</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>14.6</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td><td>14.6</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td><td>19.8</td></tr><tr><td>3</td><td>18.01.2019</td><td>19.3</td><td>21.9</td></tr><tr><td>4</td><td>19.01.2019</td><td>17.1</td><td>21.9</td></tr><tr><td>5</td><td>20.01.2019</td><td>21.2</td><td>20.2</td></tr><tr><td>6</td><td>21.01.2019</td><td>21.2</td><td>17.4</td></tr><tr><td>7</td><td>22.01.2019</td><td>21.2</td><td>18.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We performed the same <code>merge_ordered()</code> operation as before. But this time, we added the “<code>suffixes</code>” parameter and assigned it a list with the strings “<code>_leftDF</code>” and “<code>_rightDF</code>“. The two price columns in the resulting data frame are now called “<code>Price_leftDF</code>” and “<code>Price_rightDF</code>“.</p> <p class="wp-block-paragraph">As the name of the parameter suggests, we only change the suffixes here, not the whole label. 
That’s why the column labels still say “<code>Price</code>” before the suffixes because the initial column label said “<code>Price</code>” and we only added the suffixes after that label.</p> <h2 class="wp-block-heading">The different kinds of joins</h2> <p class="wp-block-paragraph">As mentioned in the introduction, by default the <code>merge_ordered()</code> function performs an outer <a href="https://blog.finxter.com/python-join-list-of-dataframes/" data-type="post" data-id="9780" target="_blank" rel="noreferrer noopener">join</a>. That means we take the union of keys from both data frames.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">But we can change that by using the “<code>how</code>” parameter.</p> <p class="wp-block-paragraph">Another type of join is the “inner” join which uses the intersection of keys from both data frames:</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='inner')</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_x</strong></td><td><strong>Price_y</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>14.6</td></tr><tr><td>1</td><td>17.01.2019</td><td>20.0</td><td>19.8</td></tr><tr><td>2</td><td>18.01.2019</td><td>19.3</td><td>21.9</td></tr><tr><td>3</td><td>20.01.2019</td><td>21.2</td><td>20.2</td></tr></tbody></table></figure> <p class="wp-block-paragraph">That means, we only get the dates that are found in both data frames.</p> <p class="wp-block-paragraph">The remaining two options the “<code>how</code>” parameter provides us with are the “<code>left</code>” 
join and the “<code>right</code>” join. The left join uses only the keys from the left data frame.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='left')</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_x</strong></td><td><strong>Price_y</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>14.6</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td><td>14.6</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td><td>19.8</td></tr><tr><td>3</td><td>18.01.2019</td><td>19.3</td><td>21.9</td></tr><tr><td>4</td><td>19.01.2019</td><td>17.1</td><td>21.9</td></tr><tr><td>5</td><td>20.01.2019</td><td>21.2</td><td>20.2</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Whereas the right join only uses the keys from the right data frame.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='right')</pre> <figure class="wp-block-table 
is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_x</strong></td><td><strong>Price_y</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>14.6</td></tr><tr><td>1</td><td>17.01.2019</td><td>20.0</td><td>19.8</td></tr><tr><td>2</td><td>18.01.2019</td><td>19.3</td><td>21.9</td></tr><tr><td>3</td><td>20.01.2019</td><td>21.2</td><td>20.2</td></tr><tr><td>4</td><td>21.01.2019</td><td>21.2</td><td>17.4</td></tr><tr><td>5</td><td>22.01.2019</td><td>21.2</td><td>18.0</td></tr></tbody></table></figure> <h2 class="wp-block-heading">Grouping by group columns</h2> <p class="wp-block-paragraph">For this section, we will modify “<code>df1</code>” a little bit:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df1['Category'] = ['A', 'A', 'A', 'B', 'B', 'B'] print(df1)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price</strong></td><td><strong>Category</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>A</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td><td>A</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td><td>A</td></tr><tr><td>3</td><td>18.01.2019</td><td>19.3</td><td>B</td></tr><tr><td>4</td><td>19.01.2019</td><td>17.1</td><td>B</td></tr><tr><td>5</td><td>20.01.2019</td><td>21.2</td><td>B</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We added a column called “<code>Category</code>” and assigned the categories “<code>A</code>” or “<code>B</code>” to each row.</p> <p class="wp-block-paragraph">“<code>df2</code>” remains unchanged:</p> <figure class="wp-block-table 
is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>14.6</td></tr><tr><td>1</td><td>17.01.2019</td><td>19.8</td></tr><tr><td>2</td><td>18.01.2019</td><td>21.9</td></tr><tr><td>3</td><td>20.01.2019</td><td>20.2</td></tr><tr><td>4</td><td>21.01.2019</td><td>17.4</td></tr><tr><td>5</td><td>22.01.2019</td><td>18.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">Now, we apply the “<code>left_by</code>” parameter and assign it the column label “<code>Category</code>”:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge_ordered(df1, df2, on='Date', left_by="Category")</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Date</strong></td><td><strong>Price_x</strong></td><td><strong>Category</strong></td><td><strong>Price_y</strong></td></tr><tr><td>0</td><td>15.01.2019</td><td>16.7</td><td>A</td><td>14.6</td></tr><tr><td>1</td><td>16.01.2019</td><td>18.4</td><td>A</td><td>NaN</td></tr><tr><td>2</td><td>17.01.2019</td><td>20.0</td><td>A</td><td>19.8</td></tr><tr><td>3</td><td>18.01.2019</td><td>NaN</td><td>A</td><td>21.9</td></tr><tr><td>4</td><td>20.01.2019</td><td>NaN</td><td>A</td><td>20.2</td></tr><tr><td>5</td><td>21.01.2019</td><td>NaN</td><td>A</td><td>17.4</td></tr><tr><td>6</td><td>22.01.2019</td><td>NaN</td><td>A</td><td>18.0</td></tr><tr><td>7</td><td>15.01.2019</td><td>NaN</td><td>B</td><td>14.6</td></tr><tr><td>8</td><td>17.01.2019</td><td>NaN</td><td>B</td><td>19.8</td></tr><tr><td>9</td><td>18.01.2019</td><td>19.3</td><td>B</td><td>21.9</td></tr><tr><td>10</td><td>19.01.2019</td><td>17.1</td><td>B</td><td>NaN</td></tr><tr><td>11</td><td>20.01.2019</td><td>21.2</td><td>B</td><td>20.2</td></tr><tr><td>12</td><td>21.01.2019</td>
<td>NaN</td><td>B</td><td>17.4</td></tr><tr><td>13</td><td>22.01.2019</td><td>NaN</td><td>B</td><td>18.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">This way, we group the left data frame by the “<code>Category</code>” column and merge that piece by piece with the right data frame.</p> <p class="wp-block-paragraph">When we look at the resulting data frame, we can observe that, for example, the “<code>Price_x</code>” entry for the date <code>"18.01.2019"</code> in row 3 is “<code>NaN</code>” although there is an entry for that date in “<code>df1</code>“. However, in “<code>df1</code>“, the date is assigned to the category “<code>B</code>“. So, in the merged data frame, the “<code>Price_x</code>” value for the date <code>"18.01.2019"</code> is found in row 9 with category “<code>B</code>“.</p> <p class="wp-block-paragraph">If we had a group column in the right data frame, we could do the same with the “<code>right_by</code>” parameter.</p> <h2 class="wp-block-heading">Summary</h2> <p class="wp-block-paragraph">All in all, we learned how to use the Pandas function <code>merge_ordered()</code>. We saw how to apply the various parameters, how to use the different types of joins, and how to group by group columns.</p> <p class="wp-block-paragraph">For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.</p> <p class="wp-block-paragraph">Happy Coding!</p> <p>The post <a href="https://blog.finxter.com/pandas-merge_ordered-a-simple-guide-with-video/">Pandas merge_ordered() – A Simple Guide with Video</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p> </article> <article> <h1>Pandas merge() – A Simple Illustrated Guide with Video</h1> <p>Luis Bruemmer — Sat, 18 Dec 2021 19:05:31 +0000</p> <p class="wp-block-paragraph">In this tutorial, we will learn about the Pandas <code>merge()</code> function. 
Described in one sentence, the <code>merge()</code> function is used to <strong>combine datasets in various ways</strong>.</p> <p class="wp-block-paragraph">As you go through the tutorial, you can watch the following video guide for ease of understanding:</p> <figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper"> </div></figure> <h2 class="wp-block-heading">Syntax and Parameters</h2> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)</pre> <p class="wp-block-paragraph">Here are the parameters from the <a href="https://pandas.pydata.org/docs/reference/api/pandas.merge.html#pandas.merge" target="_blank" rel="noreferrer noopener">official documentation</a>:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><strong>Parameter</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>left</code></td><td>DataFrame</td><td></td></tr><tr><td><code>right</code></td><td>DataFrame or Series</td><td>Data frame to merge with</td></tr><tr><td><code>how</code></td><td><code>{'left', 'right', 'outer',<br>'inner', 'cross'}</code>, default <code>'inner'</code></td><td>Merge type to perform:<br><strong>left</strong>: only use the keys from the left DataFrame<br><strong>right</strong>: only use the keys from the right DataFrame<br><strong>outer</strong>: use the union of keys from both DataFrames<br><strong>inner</strong>: use the intersection of keys from both DataFrames<br><strong>cross</strong>: Cartesian product from both
DataFrames</td></tr><tr><td><code>on</code></td><td>label or list</td><td>Column or index level names to join on.<br>Must be contained in both DataFrames.</td></tr><tr><td><code>left_on</code></td><td>label, or list, or<br>array-like</td><td>Column or index level names to join on in the left DataFrame.</td></tr><tr><td><code>right_on</code></td><td>label, or <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a>, or<br>array-like</td><td>Column or index level names to join on in the right DataFrame.</td></tr><tr><td><code>left_index</code></td><td><code>bool</code>, default <code>False</code> </td><td>Use index from the left DataFrame as join key(s).</td></tr><tr><td><code>right_index</code></td><td><code>bool</code>, default <code>False</code> </td><td>Use index from the right DataFrame as join key(s).</td></tr><tr><td><code>sort</code></td><td><code>bool</code>, default <code>False</code></td><td>Sort the join keys lexicographically in the resulting DataFrame. If set to <code>False</code>, the order of the join keys depends on the join type.</td></tr><tr><td><code>suffixes</code></td><td>list-like, default is<br>(<code>"_x"</code>, <code>"_y"</code>)</td><td>A length-2 sequence where each element is<br>optionally a string indicating the suffix to add to overlapping column names in left and right respectively. At least one of the values must not be <code>None</code>.</td></tr><tr><td><code>copy</code></td><td><code>bool</code>, default <code>True</code></td><td>If <code>False</code>, avoid a copy if possible.</td></tr><tr><td><code>indicator</code></td><td><code>bool</code> or <code>str</code>, default<br><code>False</code></td><td>If set to <code>True</code>, adds a column to the output DataFrame called <code>"_merge"</code> containing information of the source of each row. 
The column can be given a different name by providing a string as the argument.</td></tr><tr><td><code>validate</code></td><td><code>str</code>, optional</td><td>If used, checks if merge is of a specified type.<br>“one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets.<br>“one_to_many” or “1:m”: checks if merge keys are unique in left dataset.<br>“many_to_one” or “m:1”: checks if merge keys are unique in right dataset.<br>“many_to_many” or “m:m”: allowed, but it does not result in checks.</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The <strong>return value</strong> of the <code>merge()</code> function is a DataFrame consisting of the two merged objects. </p> <h2 class="wp-block-heading">Basic Example</h2> <p class="wp-block-paragraph">To get started, we will first create two data frames that we will be merging in several ways throughout this tutorial:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

# First data frame: every player on the team with their age
df1 = pd.DataFrame({'Player': ['Jeremy', 'Alice', 'Bob', 'John', 'Mary'],
                    'Age': [31, 25, 27, 28, 21]})

# Second data frame: position and throwing speed for some of the players
df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})

print(df1)
print()
print(df2)
</pre> <p class="wp-block-paragraph">Output:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td></tr><tr><td>0</td><td>Jeremy</td><td>31</td></tr><tr><td>1</td><td>Alice</td><td>25</td></tr><tr><td>2</td><td>Bob</td><td>27</td></tr><tr><td>3</td><td>John</td><td>28</td></tr><tr><td>4</td><td>Mary</td><td>21</td></tr></tbody></table></figure> <p class="wp-block-paragraph">… and …</p> <figure class="wp-block-table 
is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>Center Field</td><td>81</td></tr></tbody></table></figure> <p class="wp-block-paragraph">First, we import the Pandas library. Then we create the two data frames “<code>df1</code>” and “<code data-enlighter-language="generic" class="EnlighterJSRAW">df2</code>”. The first data frame contains the names of the players on a baseball team, as well as each player’s age.</p> <p class="wp-block-paragraph">The second data frame contains some of the players from the first data frame, along with their positions and throwing speeds.</p> <p class="wp-block-paragraph">Finally, we print both data frames to see this information in compact form.</p> <p class="wp-block-paragraph">Now, we apply the <code>merge()</code> function:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, on="Player")</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>25</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>28</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>21</td><td>Center Field</td><td>81</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The first two arguments are the data frames that we want to merge. The third argument is the “<code>on</code>” parameter. 
The “<code>on</code>” parameter expects the column names to join on, and we set it to “Player”.</p> <p class="wp-block-paragraph">Thus, Pandas merges these data frames on the “Player” column. The merged data frame only contains the players “Alice”, “John”, and “Mary” because these are the only players contained in both data frames. So, “Jeremy” and “Bob” from the first data frame are dropped.</p> <h2 class="wp-block-heading">The “left_on” and “right_on” Parameters</h2> <p class="wp-block-paragraph">For this section, we will modify the data frame “<code>df2</code>” a little bit:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Same data as before, but the key column is now labeled 'Name'
df2 = pd.DataFrame({'Name': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})

print(df2)
</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Name</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>Center Field</td><td>81</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The only difference is that we changed the label of the “Player” column to “Name”.</p> <p class="wp-block-paragraph">Now, we want to merge the data frames “<code>df1</code>” and “<code>df2</code>” again. However, we cannot do so by setting the “<code>on</code>” parameter to “Player” since “<code>df2</code>” does not have a “Player” column anymore.</p> <p class="wp-block-paragraph">Therefore, we use the two parameters “<code>left_on</code>” and “<code>right_on</code>”. 
We set the “<code>left_on</code>” parameter to the label of the column we want to merge on in the first data frame, and we do the same with the “<code>right_on</code>” parameter for the second data frame:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, left_on="Player", right_on="Name")</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Name</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>25</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>28</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>21</td><td>Mary</td><td>Center Field</td><td>81</td></tr></tbody></table></figure> <p class="wp-block-paragraph">This way, we can merge data frames by columns with different column labels.</p> <p class="wp-block-paragraph">Since the “Player” column and the “Name” column contain the same information, we might want to get rid of one of them:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Merge on differently labeled key columns
df = pd.merge(df1, df2, left_on="Player", right_on="Name")
# Drop the redundant 'Name' column (axis=1 selects columns)
df = df.drop("Name", axis=1)
print(df)</pre> <p class="wp-block-paragraph"><strong>Output:</strong></p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Position</strong></td><td><strong>Throwing 
Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>25</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>28</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>21</td><td>Center Field</td><td>81</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We assign the merged data frame to a new variable called “<code>df</code>”. Then we call the <code>drop()</code> method and pass it the “Name” column label. The “<code>axis</code>” parameter is set to “1” to state that we want to drop a column and not a row.</p> <p class="wp-block-paragraph">The resulting data frame no longer contains the “Name” column.</p> <h2 class="wp-block-heading">Merge Using Different Joins</h2> <p class="wp-block-paragraph">In this next step, we will learn about the different types of merges and how to apply them using the “<code>how</code>” parameter.</p> <p class="wp-block-paragraph">To do so, we change “<code>df2</code>” again. We rename the “Name” column back to “Player”. Also, we add two new players, “Jane” and “Mick”:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary', 'Jane', 'Mick'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field', 'Pitcher', 'Catcher'],
                    'Throwing Speed': [71, 80, 81, 79, 75]})
</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>Center Field</td><td>81</td></tr><tr><td>3</td><td>Jane</td><td>Pitcher</td><td>79</td></tr><tr><td>4</td><td>Mick</td><td>Catcher</td><td>75</td></tr></tbody></table></figure> <p 
class="wp-block-paragraph">“<code>df1</code>” still looks like this:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(df1)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td></tr><tr><td>0</td><td>Jeremy</td><td>31</td></tr><tr><td>1</td><td>Alice</td><td>25</td></tr><tr><td>2</td><td>Bob</td><td>27</td></tr><tr><td>3</td><td>John</td><td>28</td></tr><tr><td>4</td><td>Mary</td><td>21</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We start with the so-called “inner” join.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">Here, we use the intersection of keys from both our data frames:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, how="inner", on="Player")</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>25</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>28</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>21</td><td>Center Field</td><td>81</td></tr></tbody></table></figure> <p class="wp-block-paragraph">As before, we assign the “<code>on</code>” parameter the value “Player” to specify what column we want to join on. 
We set the “<code>how</code>” parameter equal to <code>"inner"</code> to state that we want to perform an inner join.</p> <p class="wp-block-paragraph">The outputted data frame contains only the players that occur in both data frames. When we compare that merge to the merge we did in the first section, we can see that they are the same. That’s because <code>"inner"</code> is the default value for the “<code>how</code>” parameter.</p> <p class="wp-block-paragraph">The next type of merge we are looking at is the “outer” join.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">The outer join is the union of keys from both our data frames:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, how="outer", on="Player")</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Jeremy</td><td>31.0</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>Alice</td><td>25.0</td><td>Pitcher</td><td>71.0</td></tr><tr><td>2</td><td>Bob</td><td>27.0</td><td>NaN</td><td>NaN</td></tr><tr><td>3</td><td>John</td><td>28.0</td><td>Catcher</td><td>80.0</td></tr><tr><td>4</td><td>Mary</td><td>21.0</td><td>Center Field</td><td>81.0</td></tr><tr><td>5</td><td>Jane</td><td>NaN</td><td>Pitcher</td><td>79.0</td></tr><tr><td>6</td><td>Mick</td><td>NaN</td><td>Catcher</td><td>75.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">The data frame contains all players from both data frames. Bob, for example, has no value for position and throwing speed because he is only contained in “<code>df1</code>” where we don’t get position and throwing speed values. 
Similarly, Jane does not have an age value here since she is only found in “<code>df2</code>” which does not provide age information.</p> <p class="wp-block-paragraph">The next merge type is the <code>"left"</code> join.</p> <div class="wp-block-image"><figure class="aligncenter size-full"></figure></div> <p class="wp-block-paragraph">Here, we use the keys from the left data frame exclusively:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, how="left", on="Player")</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Jeremy</td><td>31</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>Alice</td><td>25</td><td>Pitcher</td><td>71.0</td></tr><tr><td>2</td><td>Bob</td><td>27</td><td>NaN</td><td>NaN</td></tr><tr><td>3</td><td>John</td><td>28</td><td>Catcher</td><td>80.0</td></tr><tr><td>4</td><td>Mary</td><td>21</td><td>Center Field</td><td>81.0</td></tr></tbody></table></figure> <p class="wp-block-paragraph">This data frame contains all the players from the left data frame which is “<code>df1</code>” in our case. 
Thus, Jeremy and Bob have no position and throwing speed values.</p> <p class="wp-block-paragraph">The <code>"right"</code> join is similar to the left join:</p> <p class="wp-block-paragraph">We are using the keys from the right data frame only:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, how="right", on="Player")</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Alice</td><td>25.0</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>John</td><td>28.0</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Mary</td><td>21.0</td><td>Center Field</td><td>81</td></tr><tr><td>3</td><td>Jane</td><td>NaN</td><td>Pitcher</td><td>79</td></tr><tr><td>4</td><td>Mick</td><td>NaN</td><td>Catcher</td><td>75</td></tr></tbody></table></figure> <p class="wp-block-paragraph">After merging, the data frame contains all the players from the right data frame, which is “<code>df2</code>”. That’s why Jane and Mick have no age values here.</p> <p class="wp-block-paragraph">The last join we will look at is a bit special. 
It is called “cross” join and it creates the cartesian product from both data frames while keeping the order of the keys from the left data frame:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, how="cross")</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player_x</strong></td><td><strong>Age</strong></td><td><strong>Player_y</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td></tr><tr><td>0</td><td>Jeremy</td><td>31</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>1</td><td>Jeremy</td><td>31</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>2</td><td>Jeremy</td><td>31</td><td>Mary</td><td>Center Field</td><td>81</td></tr><tr><td>3</td><td>Jeremy</td><td>31</td><td>Jane</td><td>Pitcher</td><td>79</td></tr><tr><td>4</td><td>Jeremy</td><td>31</td><td>Mick</td><td>Catcher</td><td>75</td></tr><tr><td>5</td><td>Alice</td><td>25</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>6</td><td>Alice</td><td>25</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>7</td><td>Alice</td><td>25</td><td>Mary</td><td>Center Field</td><td>81</td></tr><tr><td>8</td><td>Alice</td><td>25</td><td>Jane</td><td>Pitcher</td><td>79</td></tr><tr><td>9</td><td>Alice</td><td>25</td><td>Mick</td><td>Catcher</td><td>75</td></tr><tr><td>10</td><td>Bob</td><td>27</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>11</td><td>Bob</td><td>27</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>12</td><td>Bob</td><td>27</td><td>Mary</td><td>Center 
Field</td><td>81</td></tr><tr><td>13</td><td>Bob</td><td>27</td><td>Jane</td><td>Pitcher</td><td>79</td></tr><tr><td>14</td><td>Bob</td><td>27</td><td>Mick</td><td>Catcher</td><td>75</td></tr><tr><td>15</td><td>John</td><td>28</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>16</td><td>John</td><td>28</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>17</td><td>John</td><td>28</td><td>Mary</td><td>Center Field</td><td>81</td></tr><tr><td>18</td><td>John</td><td>28</td><td>Jane</td><td>Pitcher</td><td>79</td></tr><tr><td>19</td><td>John</td><td>28</td><td>Mick</td><td>Catcher</td><td>75</td></tr><tr><td>20</td><td>Mary</td><td>21</td><td>Alice</td><td>Pitcher</td><td>71</td></tr><tr><td>21</td><td>Mary</td><td>21</td><td>John</td><td>Catcher</td><td>80</td></tr><tr><td>22</td><td>Mary</td><td>21</td><td>Mary</td><td>Center Field</td><td>81</td></tr><tr><td>23</td><td>Mary</td><td>21</td><td>Jane</td><td>Pitcher</td><td>79</td></tr><tr><td>24</td><td>Mary</td><td>21</td><td>Mick</td><td>Catcher</td><td>75</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We can observe that we have two “Player” columns now, “Player_x” and “Player_y”. Each player of “<code>df2</code>” is assigned to each player of “<code>df1</code>“. Since both data frames contain five rows, the resulting data frame now has 25 rows (5×5).</p> <h2 class="wp-block-heading">The “indicator” Parameter</h2> <p class="wp-block-paragraph">When we merge two data frames, it might be useful to gain information about the source of the merge keys, whether they were observed only in the left data frame, only in the right data frame, or in both. 
For this purpose, we use the “<code>indicator</code>” parameter:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.merge(df1, df2, how="outer", on="Player", indicator=True)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Player</strong></td><td><strong>Age</strong></td><td><strong>Position</strong></td><td><strong>Throwing Speed</strong></td><td><strong>_merge</strong></td></tr><tr><td>0</td><td>Jeremy</td><td>31.0</td><td>NaN</td><td>NaN</td><td>left_only</td></tr><tr><td>1</td><td>Alice</td><td>25.0</td><td>Pitcher</td><td>71.0</td><td>both</td></tr><tr><td>2</td><td>Bob</td><td>27.0</td><td>NaN</td><td>NaN</td><td>left_only</td></tr><tr><td>3</td><td>John</td><td>28.0</td><td>Catcher</td><td>80.0</td><td>both</td></tr><tr><td>4</td><td>Mary</td><td>21.0</td><td>Center Field</td><td>81.0</td><td>both</td></tr><tr><td>5</td><td>Jane</td><td>NaN</td><td>Pitcher</td><td>79.0</td><td>right_only</td></tr><tr><td>6</td><td>Mick</td><td>NaN</td><td>Catcher</td><td>75.0</td><td>right_only</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We perform an outer join on the “Player” column and set the “<code>indicator</code>” parameter to “<code>True</code>”. This way, we get an additional column called “<code>_merge</code>” with the entries “<code>left_only</code>”, “<code>right_only</code>”, and “<code>both</code>”.</p> <p class="wp-block-paragraph">For example, Jeremy gets assigned the value “<code>left_only</code>” because he only appears in the left data frame “<code>df1</code>”. And Mary’s “<code>_merge</code>” value is set to “<code>both</code>” because she is found in both data frames.</p> <h2 class="wp-block-heading">Summary</h2> <p class="wp-block-paragraph">In this tutorial, we learned about the Pandas function <code>merge()</code>. 
We learned how to perform different kinds of merges using the function’s various parameters.</p> <p class="wp-block-paragraph">For more tutorials about <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas</a>, Python libraries, Python in general, or other computer science-related topics, check out the <a href="https://blog.finxter.com/blog/" data-type="URL" data-id="https://blog.finxter.com/blog/" target="_blank" rel="noreferrer noopener">Finxter Blog page</a>.</p> <p class="wp-block-paragraph">Happy Coding!</p> <p>The post <a href="https://blog.finxter.com/pandas-merge/">Pandas merge() – A Simple Illustrated Guide with Video</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p> </article> <article> <h1>Pandas qcut() – A Simple Guide with Video</h1> <p>Luis Bruemmer — Tue, 14 Dec 2021 08:53:21 +0000</p> <p class="wp-block-paragraph">In this tutorial, we learn about the Pandas function <code>qcut()</code>. 
This function creates unequal-sized bins with the same number of samples in each bin.</p> <figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper"> </div></figure> <p class="wp-block-paragraph">Here are the parameters from the <a href="https://pandas.pydata.org/docs/reference/api/pandas.qcut.html#pandas.qcut" target="_blank" rel="noreferrer noopener">official documentation</a>:</p> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><strong>Parameter</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>x</code></td><td>1d <code>ndarray</code> or Series</td><td></td></tr><tr><td><code>q</code></td><td><code>int</code> or <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a> of float values</td><td>Number of quantiles. Alternately: array of<br>quantiles.</td></tr><tr><td><code>labels</code></td><td>array or <code>False</code>, default: <code>None</code></td><td>Used as the labels for the resulting bins.<br>Must be of the same length as the resulting bins. If False: returns only integer indicators of the bins. 
If True: raises an error.</td></tr><tr><td><code>retbins</code></td><td><a href="https://blog.finxter.com/python-bool/" data-type="post" data-id="17841" target="_blank" rel="noreferrer noopener"><code>bool</code></a>, optional</td><td>Whether to return the bins/labels.</td></tr><tr><td><code>precision</code></td><td><code><a href="https://blog.finxter.com/python-int-function/" data-type="post" data-id="22715" target="_blank" rel="noreferrer noopener">int</a></code>, optional</td><td>The precision at which to store and display<br>the bin labels.</td></tr><tr><td><code>duplicates</code></td><td><code>{default 'raise', 'drop'}</code>,<br>optional</td><td>If the bin edges are not unique:<br>raise <code>ValueError</code> or drop the non-uniques.</td></tr><tr><td></td><td></td><td></td></tr><tr><td><strong>Returns</strong></td><td><strong>Type</strong></td><td><strong>Description</strong></td></tr><tr><td><code>out</code></td><td><code>Categorical</code> or <code>Series</code> or array of integers if labels is set to <code>False</code></td><td>The return type depends on the input:<br>a Series of type <code>Category</code> if input is a <code>Series</code>, else <code>Categorical</code>. 
Bins are represented as categories when categorical data is returned.</td></tr><tr><td><code>bins</code></td><td><code>ndarray</code> of floats</td><td>Only if <code>retbins</code> is set to <code>True</code>.</td></tr></tbody></table></figure> <h2 class="wp-block-heading">Basic Example</h2> <p class="wp-block-paragraph">Let’s create a data frame that we will be using throughout the tutorial:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

df = pd.DataFrame({'Competitor': ['Alice', 'Mary', 'John', 'Ann', 'Bob', 'Jane', 'Tom', 'Vincent', 'Ella'],
                   'Score': [1, 6, 11, 2, 9, 16, 5, 2, 19]})
print(df)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Competitor</strong></td><td><strong>Score</strong></td></tr><tr><td>0</td><td>Alice</td><td>1</td></tr><tr><td>1</td><td>Mary</td><td>6</td></tr><tr><td>2</td><td>John</td><td>11</td></tr><tr><td>3</td><td>Ann</td><td>2</td></tr><tr><td>4</td><td>Bob</td><td>9</td></tr><tr><td>5</td><td>Jane</td><td>16</td></tr><tr><td>6</td><td>Tom</td><td>5</td></tr><tr><td>7</td><td>Vincent</td><td>2</td></tr><tr><td>8</td><td>Ella</td><td>19</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We import the <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas library</a> and then we <a href="https://blog.finxter.com/how-to-create-a-dataframe-in-pandas/" data-type="post" data-id="16764" target="_blank" rel="noreferrer noopener">create a Pandas data frame</a> which we assign to the variable “<code>df</code>”. 
The resulting data frame lists several competitors and the score that each competitor reached.</p> <p class="wp-block-paragraph">Now, we apply the <code>qcut()</code> function:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = 3)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td>0</td><td>(0.999, 4.0]</td></tr><tr><td>1</td><td>(4.0, 9.667]</td></tr><tr><td>2</td><td>(9.667, 19.0]</td></tr><tr><td>3</td><td>(0.999, 4.0]</td></tr><tr><td>4</td><td>(4.0, 9.667]</td></tr><tr><td>5</td><td>(9.667, 19.0]</td></tr><tr><td>6</td><td>(4.0, 9.667]</td></tr><tr><td>7</td><td>(0.999, 4.0]</td></tr><tr><td>8</td><td>(9.667, 19.0]</td></tr></tbody></table></figure> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td><code>Name: Score, dtype: category</code></td></tr><tr><td><code>Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]</code></td></tr></tbody></table></figure> <p class="wp-block-paragraph">Inside the function, we pass “<code>df['Score']</code>” as the value for the parameter “<code>x</code>” to state that this is the column whose values we want to bin. The second argument is “3”, which we assign to the “<code>q</code>” parameter. This is the number of quantiles.</p> <p class="wp-block-paragraph">The output assigns each score to an interval. There are a few things to observe here.</p> <p class="wp-block-paragraph">First, we can see the intervals in order at the bottom of the output (“<code>(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]</code>”). Each interval starts with a parenthesis and ends with a square bracket. That means that the left value is not included in the interval, but the right one is. 
For example, “0.999” is not included, whereas “4.0” is included.</p> <p class="wp-block-paragraph">Additionally, we can see that the intervals do not have the same size. The first interval has a size of roughly 3, the second has a size of 5.667, and the third one has a size of 9.333. Why do the intervals have these particular sizes?</p> <p class="wp-block-paragraph">To answer that, we have to take a look at the number of values in each interval:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = 3).value_counts()</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td>(0.999, 4.0]</td><td>3</td></tr><tr><td>(4.0, 9.667]</td><td>3</td></tr><tr><td>(9.667, 19.0]</td><td>3</td></tr><tr><td><code>Name: Score, dtype: int64</code></td></tr></tbody></table></figure> <p class="wp-block-paragraph">We use the <code>value_counts()</code> function to achieve that. We can see that each bin contains the same number of values. By assigning “3” to the “<code>q</code>” parameter, we state that we want three intervals, each containing just as many values as the others. 
So, the interval sizes adjust accordingly.</p> <p class="wp-block-paragraph">To make it easier to see which interval each score belongs to, we create a new column in the data frame:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)</pre> <figure class="wp-block-table is-style-stripes"><table><tbody><tr><td></td><td><strong>Competitor</strong></td><td><strong>Score</strong></td><td><strong>Category</strong></td></tr><tr><td>0</td><td>Alice</td><td>1</td><td>(0.999, 4.0]</td></tr><tr><td>1</td><td>Mary</td><td>6</td><td>(4.0, 9.667]</td></tr><tr><td>2</td><td>John</td><td>11</td><td>(9.667, 19.0]</td></tr><tr><td>3</td><td>Ann</td><td>2</td><td>(0.999, 4.0]</td></tr><tr><td>4</td><td>Bob</td><td>9</td><td>(4.0, 9.667]</td></tr><tr><td>5</td><td>Jane</td><td>16</td><td>(9.667, 19.0]</td></tr><tr><td>6</td><td>Tom</td><td>5</td><td>(4.0, 9.667]</td></tr><tr><td>7</td><td>Vincent</td><td>2</td><td>(0.999, 4.0]</td></tr><tr><td>8</td><td>Ella</td><td>19</td><td>(9.667, 19.0]</td></tr></tbody></table></figure> <p class="wp-block-paragraph">We create a new column called “<code>Category</code>”, which contains the intervals, and add it to the existing data frame.</p> <h2 class="wp-block-heading">The “q” parameter</h2> <p class="wp-block-paragraph">In the previous example, we set the “<code>q</code>” parameter equal to “3”. Of course, we can also assign other values here. 
Apart from an integer value, we can assign this parameter a list of quantiles:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.])</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0    (0.999, 2.0]
1      (2.0, 6.0]
2     (6.0, 11.0]
3    (0.999, 2.0]
4     (6.0, 11.0]
5    (11.0, 19.0]
6      (2.0, 6.0]
7    (0.999, 2.0]
8    (11.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 2.0] < (2.0, 6.0] < (6.0, 11.0] < (11.0, 19.0]]</pre> <p class="wp-block-paragraph">This way, we directly determine what percentage of the values falls into each interval. For example, the first interval <em>(0.999, 2.0]</em> covers the lowest 25% of the score distribution. 
Since each of the intervals we created here spans 25% of the distribution, we should get an equal number of values in each interval.</p> <p class="wp-block-paragraph">Let’s see if that’s the case:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.]).value_counts()</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">(0.999, 2.0]    3
(2.0, 6.0]      2
(6.0, 11.0]     2
(11.0, 19.0]    2
Name: Score, dtype: int64</pre> <p class="wp-block-paragraph">We make use of the <code>value_counts()</code> method again. As we can see, the first interval contains one value more than the others. That’s because we have nine scores in total, and nine is not evenly divisible by four. 
Consequently, the number of values cannot be the same in all intervals.</p> <p class="wp-block-paragraph">The distances between the quantiles in the list do not have to be equal:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.])</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0    (0.999, 6.0]
1    (0.999, 6.0]
2    (10.2, 15.0]
3    (0.999, 6.0]
4     (6.0, 10.2]
5    (15.0, 19.0]
6    (0.999, 6.0]
7    (0.999, 6.0]
8    (15.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 6.0] < (6.0, 10.2] < (10.2, 15.0] < (15.0, 19.0]]</pre> <p class="wp-block-paragraph">The first interval spans 50% of the distribution, far more than the others. 
Thus, the values are not evenly distributed across the intervals:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.]).value_counts()</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">(0.999, 6.0]    5
(15.0, 19.0]    2
(6.0, 10.2]     1
(10.2, 15.0]    1
Name: Score, dtype: int64</pre> <p class="wp-block-paragraph">As we can observe, the first interval contains the most score values.</p> <h2 class="wp-block-heading">Determine the Interval Precision</h2> <p class="wp-block-paragraph">So far, all the intervals we created had the same precision:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = 3)</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0     (0.999, 4.0]
1     (4.0, 9.667]
2    (9.667, 19.0]
3     (0.999, 4.0]
4     (4.0, 9.667]
5    (9.667, 19.0]
6     (4.0, 9.667]
7     (0.999, 4.0]
8    (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]</pre> <p class="wp-block-paragraph">As we can see, the interval bounds are shown with up to three decimal places; whole-number bounds only show “<code>.0</code>”.</p> <p class="wp-block-paragraph">We can change that precision using the 
“<code>precision</code>” parameter. This parameter expects an integer that determines how many decimal places the interval bounds are stored and displayed with.</p> <p class="wp-block-paragraph">Let’s assign “5” here to get five decimal places:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = 3, precision=5)</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0     (0.99999, 4.0]
1     (4.0, 9.66667]
2    (9.66667, 19.0]
3     (0.99999, 4.0]
4     (4.0, 9.66667]
5    (9.66667, 19.0]
6     (4.0, 9.66667]
7     (0.99999, 4.0]
8    (9.66667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.99999, 4.0] < (4.0, 9.66667] < (9.66667, 19.0]]</pre> <p class="wp-block-paragraph">In this manner, we create more precise intervals. 
How precise they should be depends on the use case.</p> <h2 class="wp-block-heading">Print out the bins</h2> <p class="wp-block-paragraph">If we want to print out the bins that we created, we set the “<code>retbins</code>” parameter to “<code>True</code>”:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = 3, retbins=True)</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">(0     (0.999, 4.0]
 1     (4.0, 9.667]
 2    (9.667, 19.0]
 3     (0.999, 4.0]
 4     (4.0, 9.667]
 5    (9.667, 19.0]
 6     (4.0, 9.667]
 7     (0.999, 4.0]
 8    (9.667, 19.0]
 Name: Score, dtype: category
 Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]],
 array([1., 4., 9.66666667, 19.]))</pre> <p class="wp-block-paragraph">The only difference compared to the call without the “<code>retbins</code>” parameter is the additional “array” at the bottom of the output. 
Here, we get the edges of the resulting bins inside an array.</p> <p class="wp-block-paragraph">This is especially useful when we assign the “<code>q</code>” parameter an integer, as we did here, instead of a list, because the bin edges are then computed for us.</p> <h2 class="wp-block-heading">Define labels for the categories</h2> <p class="wp-block-paragraph">We already saw how to add a new column to our data frame to see which score belongs to which interval:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">   Competitor  Score       Category
0       Alice      1   (0.999, 4.0]
1        Mary      6   (4.0, 9.667]
2        John     11  (9.667, 19.0]
3         Ann      2   (0.999, 4.0]
4         Bob      9   (4.0, 9.667]
5        Jane     16  (9.667, 19.0]
6         Tom      5   (4.0, 9.667]
7     Vincent      2   (0.999, 4.0]
8        Ella     19  (9.667, 19.0]</pre> <p class="wp-block-paragraph">This way, we get a great overview of our data. However, the raw intervals can be a bit confusing, as they do not clearly show what a good score is and what isn’t.</p> <p class="wp-block-paragraph">This is where the “<code>labels</code>” parameter comes into play. 
We can give each interval a label to categorize our data:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df['Category'] = pd.qcut(x = df['Score'], q = 3, labels=['bad', 'good', 'exceptional'])
print(df)</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">   Competitor  Score     Category
0       Alice      1          bad
1        Mary      6         good
2        John     11  exceptional
3         Ann      2          bad
4         Bob      9         good
5        Jane     16  exceptional
6         Tom      5         good
7     Vincent      2          bad
8        Ella     19  exceptional</pre> <p class="wp-block-paragraph">The “<code>labels</code>” parameter expects a list of labels, one per interval. We choose the labels <code>"bad"</code>, <code>"good"</code>, and <code>"exceptional"</code>. 
So, the lowest interval is assigned the label <code>"bad"</code>, the middle interval is assigned the label <code>"good"</code>, and the highest interval is assigned the label <code>"exceptional"</code>.</p> <p class="wp-block-paragraph">Thus, we can categorize our data in a more user-friendly way.</p> <h2 class="wp-block-heading">Comparison with the cut() function</h2> <p class="wp-block-paragraph">If you work with the <code>qcut()</code> function, chances are you have come across the <code>cut()</code> function as well.</p> <p class="wp-block-paragraph">In this final section, we will see the difference between the <code>qcut()</code> and <code>cut()</code> functions.</p> <p class="wp-block-paragraph">Let’s return to our initial example of the <code>qcut()</code> function, where we assigned the “<code>q</code>” parameter the value “3”:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.qcut(x = df['Score'], q = 3)</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0     (0.999, 4.0]
1     (4.0, 9.667]
2    (9.667, 19.0]
3     (0.999, 4.0]
4     (4.0, 9.667]
5    (9.667, 19.0]
6     (4.0, 9.667]
7     (0.999, 4.0]
8    (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]</pre> <p class="wp-block-paragraph">We created three quantile-based intervals so that each one contains the same number of score values:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" 
data-enlighter-group="">pd.qcut(x = df['Score'], q = 3).value_counts()</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">(0.999, 4.0]     3
(4.0, 9.667]     3
(9.667, 19.0]    3
Name: Score, dtype: int64</pre> <p class="wp-block-paragraph">Now we do essentially the same with the <code>cut()</code> function:</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.cut(x = df['Score'], bins = 3)</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">0    (0.982, 7.0]
1    (0.982, 7.0]
2     (7.0, 13.0]
3    (0.982, 7.0]
4     (7.0, 13.0]
5    (13.0, 19.0]
6    (0.982, 7.0]
7    (0.982, 7.0]
8    (13.0, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.982, 7.0] < (7.0, 13.0] < (13.0, 19.0]]</pre> <p class="wp-block-paragraph">The <code>cut()</code> function does not provide a “<code>q</code>” parameter; instead, it has the “<code>bins</code>” parameter, to which we also assign the value “3” to create three bins.</p> <p class="wp-block-paragraph">As we can see, the intervals differ from the ones produced by the <code>qcut()</code> function. Unlike the <code>qcut()</code> intervals, these all have the same size. 
They are all six units long.</p> <p class="wp-block-paragraph">However, the number of values in each interval is different:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pd.cut(x = df['Score'], bins = 3).value_counts()</pre> <p class="wp-block-paragraph">Output:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">(0.982, 7.0]    5
(7.0, 13.0]     2
(13.0, 19.0]    2
Name: Score, dtype: int64</pre> <p class="wp-block-paragraph">Thus, <code>qcut()</code> creates intervals that are not equally long but contain, as nearly as possible, the same number of values, whereas the <code>cut()</code> function creates equal-sized intervals that don’t necessarily contain the same number of values.</p> <h2 class="wp-block-heading">Summary</h2> <p class="wp-block-paragraph">In this tutorial, we learned about the <code>qcut()</code> function. We saw how to create intervals in several ways, how to set the intervals’ precision, how to label our categories, and how <code>qcut()</code> differs from the <code>cut()</code> function.</p> <p class="wp-block-paragraph">For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.</p> <p class="wp-block-paragraph">Happy Coding!</p> <p>The post <a href="https://blog.finxter.com/pandas-qcut-a-simple-guide-with-video/">Pandas qcut() – A Simple Guide with Video</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p> </article> </main></body></html>