Problem Formulation and Solution Overview
Often, a subset of data needs to be extracted from a larger dataset. This subset could be a predetermined number of columns or rows. The examples below show how to extract this data.
Preparation
Before moving forward, please ensure the NumPy library is installed on the computer.
Then, add the following code to the top of each script. This snippet will allow the code in this article to run error-free.
import numpy as np
After importing the NumPy library, we can reference it through the alias np, as shown above.
Method 1: Use np.array() and slicing
This NumPy method uses slicing to extract a specific subset from a data set. The code below can be used in a production environment with an extensive data set.
data = np.array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
                 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
                 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])
subset = data[:, 1:6:2]
print(subset)
Above, the np.array() function is used to declare a 2D (two-dimensional) NumPy array containing a small sample of integers. The array is saved to data.
Next, a subset of data containing all rows and columns 1, 3, and 5 is extracted using slicing (data[:, 1:6:2]) as follows:
- All rows of data are selected by the first slice (:).
- A comma (,) separates the row slice (:) from the column slice (1:6:2).
- The column slice starts at column 1, stops before column 6 (so column 5 is the last one eligible), and uses a step of 2, which takes every second column. The result is saved to subset.
The results are output to the terminal.
[[ 1  3  5]
 [11 13 15]
 [21 23 25]
 [31 33 35]]
Comparing this output to the original array shows how easy slicing is! A truly Pythonic approach!
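As a small sketch of the start:stop:step rule described above, the snippet below compares the slice with an explicit list of column positions. The array is rebuilt with np.arange() purely for brevity, and the names by_slice and by_list are illustrative, not part of the original example.

```python
import numpy as np

# Same 4x10 layout as the sample data (rows 0-9, 10-19, 20-29, 30-39).
data = np.arange(40).reshape(4, 10)

# start:stop:step on the column axis; the stop value (6) is exclusive.
by_slice = data[:, 1:6:2]

# The same columns, named explicitly with an index list.
by_list = data[:, [1, 3, 5]]

print(np.array_equal(by_slice, by_list))

# Basic slicing returns a view of data; an index list returns a copy.
print(np.shares_memory(data, by_slice))
print(np.shares_memory(data, by_list))
```

Because the slice is a view, modifying by_slice would also modify data. Call .copy() on the slice if an independent array is needed.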
Method 2: Use np.array() and np.ix_()
This NumPy method uses the np.array() and np.ix_() functions to extract a subset of specific rows and columns from a data set. This option can also be used in a production environment with an extensive data set.
data = np.array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
                 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
                 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])
subset = data[np.ix_([2,3], [2,5])]
print(subset)
Above, the np.array() function is used to declare a 2D (two-dimensional) NumPy array containing a small sample of integers. The array is saved to data.
Next, np.ix_([2,3], [2,5]) builds an index that selects rows 2 and 3 and, within those rows, columns 2 and 5. The result is saved to subset and output to the terminal.
[[22 25]
 [32 35]]
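To illustrate why np.ix_() is useful here, the sketch below contrasts it with passing the two index lists directly. The array is rebuilt with np.arange() for brevity, and the names block and pairs are illustrative only.

```python
import numpy as np

# Same 4x10 layout as the sample data.
data = np.arange(40).reshape(4, 10)

# np.ix_ builds an open mesh: every listed row combined with every listed column.
block = data[np.ix_([2, 3], [2, 5])]

# Plain index lists are paired element-wise: positions (2, 2) and (3, 5) only.
pairs = data[[2, 3], [2, 5]]

print(block)   # 2x2 block: rows 2 and 3, columns 2 and 5
print(pairs)   # 1D array with two elements
```

The first form returns the full row-by-column block, while the second returns one element per index pair.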
Method 3: Use np.array() and np.arange()
This NumPy method uses np.array() and np.arange() to extract a subset from a data set. np.arange() generates evenly spaced index positions, which are then used to index the array. The example below uses a 1D (one-dimensional) NumPy array.
data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Index data with the generated positions 3, 6 and 9.
subset = data[np.arange(3, 10, 3)]
print(subset)
Above, the np.array() function is used to declare a 1D (one-dimensional) NumPy array containing a small sample of integers. The array is saved to data.
Next, np.arange() is passed the following arguments, and the positions it generates are used to index data:
- The start position of 3.
- The stop position of 10, which is exclusive, so 9 is the last position that can be generated.
- The step size of 3.
The results are output to the terminal.
[3 6 9]
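Since the sample values happen to equal their own positions, it can be hard to tell positions from values in the output. The sketch below uses hypothetical values (multiples of 10) to make the difference visible; the names idx, picked and sliced are illustrative only.

```python
import numpy as np

# Hypothetical values chosen so that positions and values differ.
data = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

# np.arange(start, stop, step) generates positions; stop is exclusive.
idx = np.arange(3, 10, 3)

# Indexing with an integer array picks the values at those positions.
picked = data[idx]

# The equivalent slice selects the same elements.
sliced = data[3:10:3]

print(idx)
print(picked)
print(np.array_equal(picked, sliced))
```

For a regular start/stop/step pattern the slice is the shorter spelling; an index array becomes useful when the positions are computed or irregular.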
Method 4: Use np.array(), np.reshape() and slicing
This NumPy method uses np.array(), np.reshape() and slicing to extract a subset from a data set. This option can also be used in a production environment with an extensive data set.
data = np.array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
                 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
                 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
                 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])
reworked = np.arange(25).reshape(5,5)
print(reworked)
subset = reworked[:,3]
print(subset)
Above, the np.array() function is used to declare a 2D (two-dimensional) NumPy array containing a small sample of integers. The array is saved to data.
Next, np.arange(25) generates the integers 0 through 24, and reshape(5,5) arranges them into 5 rows of 5 elements each (data itself is not used in this step). The result is saved to reworked and output to the terminal.
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
From the reshaped array, the element at index 3 of each row (that is, column 3) is extracted using slicing (reworked[:,3]), saved to the 1D array subset, and output to the terminal.
[ 3  8 13 18 23]
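A few details of reshape() are easy to miss, so here is a minimal sketch under the same assumptions as the example above (25 consecutive integers). The names flat, grid, column and keep2d are illustrative only.

```python
import numpy as np

flat = np.arange(25)

# -1 asks NumPy to infer that dimension from the total element count.
grid = flat.reshape(5, -1)

# An integer index drops the axis and yields a 1D array.
column = grid[:, 3]

# A one-element slice keeps the axis and yields a 5x1 array.
keep2d = grid[:, 3:4]

print(grid.shape, column.shape, keep2d.shape)

# The element count must match the new shape: 40 values cannot fill a 5x5 grid.
try:
    np.arange(40).reshape(5, 5)
except ValueError as err:
    print("cannot reshape:", err)
```

The last part shows why the 40-element data array from this method cannot itself be reshaped to 5x5.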
Bonus
We have a CSV file containing five (5) sample users from the Finxter Academy. The columns are an ID, the number of puzzles solved correctly, and the number solved incorrectly. Using NumPy, how could we extract a subset of this data?
Contents of scores.csv
30022145,1915,68
csv = np.loadtxt('scores.csv', delimiter=',', dtype=int)
csv = csv.reshape(3,5)
print(csv)
subset = csv[:, 4]
print(subset)
The code above calls np.loadtxt() and passes it the following arguments:
- The CSV file to read in. In this case, scores.csv.
- The field delimiter. In this case, a comma (,).
- The data type, set to integers (dtype=int).
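Next, csv.reshape() is called and passed two (2) arguments:
- The number of rows in the reshaped array (3).
- The number of columns in the reshaped array (5).
np.loadtxt() returns the file as 5 rows of 3 columns, so this reshape regroups the same 15 values into 3 rows of 5, and a row of the reshaped array no longer corresponds to a single CSV record.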
Then, the reshaped NumPy array is output to the terminal.
[[30022145 1915 68 30022192 1001]
To extract a subset of this data, csv[:, 4] uses slicing to select column index 4 (the last column of the reshaped array) from every row and saves it to subset.
[ 1001 30022345 47]
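If the goal is to pull out a single CSV column, such as the incorrect counts, one option is to skip the reshape and slice the loaded array directly. The sketch below is only an illustration: it reads from an in-memory string instead of scores.csv, and the IDs and counts are made-up placeholder values, not the article's data.

```python
import io
import numpy as np

# Hypothetical stand-in for scores.csv: ID, correct, incorrect (made-up values).
sample = io.StringIO(
    "1001,10,2\n"
    "1002,25,5\n"
    "1003,7,1\n"
    "1004,40,9\n"
    "1005,3,0\n"
)

# np.loadtxt accepts a file-like object as well as a filename.
scores = np.loadtxt(sample, delimiter=",", dtype=int)
print(scores.shape)          # one row per user, one column per field

# Slice a whole CSV column without reshaping.
ids = scores[:, 0]
incorrect = scores[:, 2]
print(ids)
print(incorrect)

# For comparison: reshape(3, 5) regroups values across record boundaries.
regrouped = scores.reshape(3, 5)
print(regrouped[:, 4])
```

With the original (5, 3) layout, each column slice lines up with one CSV field, whereas the last column of the regrouped array mixes IDs and counts.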