Prerequisites
- Python Fundamentals
- NumPy basics
Learning Outcomes from This Tutorial
- How structured data can be formed
- NumPy Structured Arrays: creating them, assigning data, and performing operations
- Creating a Structured Datatype (dtype)
- Memory allocation of Structured Arrays
- Record Arrays and how they relate to Structured Arrays
- Understanding when the Pandas package is needed
Structured arrays are special forms of NumPy arrays. They store compound and heterogeneous data, unlike normal NumPy arrays that store homogeneous data. The layout of a structured array is described by a structured data type, which you can create, for example, with the following command: np.dtype({'names':('person_names', 'person_ages', 'is_python_programmer'), 'formats': ('U9', 'i8', 'bool')}). An array created with this data type has three columns (fields) with three different data types, as defined in the tuples.
We will discuss NumPy Structured Arrays in full detail. They are a useful conceptual stepping stone to the Pandas DataFrame, which also stores named columns of different types, so this article gives you a solid foundation for the Pandas package.
Why Structured Arrays?
Let us imagine a scenario where we have a planet in which only 4 people exist now. The information we know about them is their names, ages, and whether they’re Python programmers. The naive way of storing these data is by using lists.
>>> person_names = ['Alice', 'Chris', 'Bob', 'Priyatham']
>>> person_ages = [42, 29, 42, 25]
>>> is_python_programmer = [False, True, False, True]
Alice and Bob are characters introduced in a 1978 research paper on cryptography, and they have since become famous in the cryptography and computer science communities. Chris is the founder of Finxter, and Priyatham is me, the author.
However, nothing in these three lists tells us that they are related to one another. Thinking about this a little longer, you might arrive at a list of lists as a solution.
Let us put the information of each person in a separate list and then bind all of these lists together in one outer list:
>>> Alice_info = ['Alice', 42, False]
>>> Chris_info = ['Chris', 29, True]
>>> Bob_info = ['Bob', 42, False]
>>> Priyatham_info = ['Priyatham', 25, True]
>>> planet_info = [Alice_info, Chris_info, Bob_info, Priyatham_info]
>>> planet_info
[['Alice', 42, False], ['Chris', 29, True], ['Bob', 42, False], ['Priyatham', 25, True]]
The above list assignment can be visualized as follows,
You can see that the individual inner lists are stored at different locations in memory. If we want to access the names of all the people on our planet, we have to loop through all the lists. This is costly because we need to hop between different memory locations.
A NumPy Structured Array can store the same data and make it accessible very efficiently. It does so by storing all the records in a single contiguous block of memory. NumPy's core is implemented in C, which makes these operations very fast.
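As a minimal sketch of what "stored contiguously" means here, the snippet below builds the same four records in one step and inspects the buffer. The variable name `people` is only an illustrative choice, and the article constructs the array step by step in the next section.

```python
import numpy as np

# Illustrative array holding the four people from the example above.
people = np.array(
    [('Alice', 42, False), ('Chris', 29, True),
     ('Bob', 42, False), ('Priyatham', 25, True)],
    dtype=[('person_names', 'U9'), ('person_ages', 'i8'),
           ('is_python_programmer', 'bool')],
)

# All four records live in one buffer: 4 records x 45 bytes each.
print(people.itemsize, people.nbytes)   # 45 180
print(people.flags['C_CONTIGUOUS'])     # True

# Selecting a field returns a view into that same buffer, not a copy.
ages = people['person_ages']
print(np.shares_memory(ages, people))   # True
print(ages.strides)                     # (45,) -> one step per record
```

Because a field is only a strided view over the shared buffer, reading a whole column does not require visiting separately allocated Python objects.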
How to Construct and Assign Data to Numpy Structured Arrays
Let us first construct normal NumPy arrays like the naive lists and investigate them.
>>> import numpy as np
>>> person_names_arr = np.array(person_names)
>>> person_ages_arr = np.array(person_ages)
>>> is_python_prog_arr = np.array(is_python_programmer)
>>> person_names_arr
array(['Alice', 'Chris', 'Bob', 'Priyatham'], dtype='<U9')
>>> person_ages_arr
array([42, 29, 42, 25])
>>> is_python_prog_arr
array([False, True, False, True])
NumPy arrays are mainly characterized by their data types. We can access the data type using the dtype attribute of the NumPy array object.
>>> person_names_arr.dtype
dtype('<U9')
>>> person_ages_arr.dtype
dtype('int64')
>>> is_python_prog_arr.dtype
dtype('bool')
You can see above that each array knows its explicit type information and holds only a single type.
A NumPy Structured Array is created using a special data type (dtype) called a Structured data type. A Structured data type can hold multiple types, each with a name assigned to it.
Let us create a NumPy Structured Array using a Structured data type. We can reuse the types shown above for its fields.
>>> struct_arr = np.zeros(4, dtype=[('person_names', 'U9'), ('person_ages', 'i8'), ('is_python_programmer', 'bool')])
>>> struct_arr
array([('', 0, False), ('', 0, False), ('', 0, False), ('', 0, False)],
      dtype=[('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', '?')])
The zero-initialized Structured Array created above can be interpreted and visualized as,
We can use either row indexes or column (field) names to assign the information of our people to the above Structured Array.
1. Assigning using column indexes:
>>> struct_arr['person_names'] = person_names
>>> struct_arr['person_ages'] = person_ages
>>> struct_arr['is_python_programmer'] = is_python_programmer
>>> struct_arr
array([('Alice', 42, False), ('Chris', 29, True), ('Bob', 42, False), ('Priyatham', 25, True)],
      dtype=[('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', '?')])
2. Assigning using the row indexes:
>>> struct_arr[0] = tuple(Alice_info)
>>> struct_arr[1] = tuple(Chris_info)
>>> struct_arr[2] = tuple(Bob_info)
>>> struct_arr[3] = tuple(Priyatham_info)
>>> struct_arr
array([('Alice', 42, False), ('Chris', 29, True), ('Bob', 42, False), ('Priyatham', 25, True)],
      dtype=[('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', '?')])
With either of the two ways of assignment, the Structured Array gets filled with our information. This can be interpreted and visualized as,
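A third option, sketched here as a possible shortcut rather than as the approach used in this article, is to pass a list of tuples together with the structured data type straight to np.array. The names `person_dtype` and `rows` are illustrative. The last lines also show a caveat of fixed-width string fields: values longer than the declared width are truncated silently.

```python
import numpy as np

person_dtype = np.dtype([('person_names', 'U9'),
                         ('person_ages', 'i8'),
                         ('is_python_programmer', 'bool')])

# Each record is given as a tuple, one tuple per person.
rows = [('Alice', 42, False), ('Chris', 29, True),
        ('Bob', 42, False), ('Priyatham', 25, True)]
struct_arr = np.array(rows, dtype=person_dtype)

print(struct_arr['person_names'].tolist())
# ['Alice', 'Chris', 'Bob', 'Priyatham']

# Caveat: 'U9' keeps at most 9 characters; longer names are cut off.
struct_arr[1] = ('Christopher', 29, True)
print(struct_arr[1]['person_names'])   # Christoph
```

If longer names are expected, a wider field such as 'U20' would be needed, at the cost of more bytes per record.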
Data Accessing & Operations on Structured Arrays
Now we can access any element anywhere in the array very efficiently. On top of the Structured data type, we keep normal NumPy array features such as aggregations and broadcasting. The same column and row indexes that we used to assign data can be used to access the elements of the array.
To get the names of all the people on our planet,
>>> struct_arr['person_names']
array(['Alice', 'Chris', 'Bob', 'Priyatham'], dtype='<U9')
To get the information in the first and second rows of the array,
>>> struct_arr[0]
('Alice', 42, False)
>>> struct_arr[1]
('Chris', 29, True)
To get the same information, we can also use the numpy.where() function. To do so, we need to know the exact name of the person whose information we want to retrieve. The comparison produces a boolean mask, and numpy.where() converts that mask into the indexes of the matching rows.
>>> struct_arr[np.where(struct_arr['person_names'] == 'Alice')]
array([('Alice', 42, False)],
      dtype=[('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', '?')])
>>> struct_arr[np.where(struct_arr['person_names'] == 'Chris')]
array([('Chris', 29, True)],
      dtype=[('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', '?')])
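The following sketch, which rebuilds the example array so that it runs on its own, compares the numpy.where() route with indexing by the boolean mask directly. The names `mask`, `via_where` and `via_mask` are illustrative.

```python
import numpy as np

struct_arr = np.array(
    [('Alice', 42, False), ('Chris', 29, True),
     ('Bob', 42, False), ('Priyatham', 25, True)],
    dtype=[('person_names', 'U9'), ('person_ages', 'i8'),
           ('is_python_programmer', 'bool')],
)

# Comparing a field with a value gives one boolean per record.
mask = struct_arr['person_names'] == 'Alice'
print(mask.tolist())                  # [True, False, False, False]

# np.where turns the mask into a tuple of index arrays.
print(np.where(mask)[0].tolist())     # [0]

# Both forms of indexing select the same records.
via_where = struct_arr[np.where(mask)]
via_mask = struct_arr[mask]
print(via_where.tolist() == via_mask.tolist())   # True
```

For a simple selection like this one, indexing with the mask alone is usually enough; numpy.where() is mainly useful when the row positions themselves are needed.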
To get the names of the last 2 persons, Python's negative index slicing can be combined with the Structured Array's field selection.
>>> struct_arr[-2:]['person_names']
array(['Bob', 'Priyatham'], dtype='<U9')
To get the names of the Python programmers on our planet, we again use boolean masking,
>>> struct_arr[struct_arr['is_python_programmer']]['person_names']
array(['Chris', 'Priyatham'], dtype='<U9')
We can see from the above that the Python programmers are younger than the others on our planet. So, let's get the maximum age of the Python programmers and the minimum age of the non-Python programmers. The midpoint of these two ages gives us a separation age, which we can use to comment on the evolution of the Python programming language on our planet.
>>> python_prog_max_age = np.max(struct_arr[struct_arr['is_python_programmer']]['person_ages'])
>>> non_python_prog_min_age = np.min(struct_arr[~struct_arr['is_python_programmer']]['person_ages'])
>>> python_prog_max_age
29
>>> non_python_prog_min_age
42
>>> separation_age = int((python_prog_max_age + non_python_prog_min_age)/2)
>>> separation_age
35
Let us say there are some other people on our planet whom we don't know about. Based on the data we have, 35 years ago there were no, or very few, Python programmers on our planet. The Python programming language became popular among young people only recently.
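Two more operations in the same spirit are sketched below: sorting records by a field and taking a per-group mean. The array is rebuilt so that the snippet is self-contained. Listing both fields in `order` is a choice made here so that the tie between the two 42-year-olds is resolved by name.

```python
import numpy as np

struct_arr = np.array(
    [('Alice', 42, False), ('Chris', 29, True),
     ('Bob', 42, False), ('Priyatham', 25, True)],
    dtype=[('person_names', 'U9'), ('person_ages', 'i8'),
           ('is_python_programmer', 'bool')],
)

# Sort by age first, then by name to break ties.
by_age = np.sort(struct_arr, order=['person_ages', 'person_names'])
print(by_age['person_names'].tolist())
# ['Priyatham', 'Chris', 'Alice', 'Bob']

# Aggregate one field over a boolean mask built from another field.
is_prog = struct_arr['is_python_programmer']
print(struct_arr['person_ages'][is_prog].mean())    # 27.0
print(struct_arr['person_ages'][~is_prog].mean())   # 42.0
```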
If you would like to do more tricky and complicated operations on such data, consider graduating to the Pandas package.
Structured Data Types – Structured Arrays
Have a look at the Array-protocol type strings ('U9', 'i8', '?') in the above Structured Array. The first character refers to the kind of data, and the number that follows specifies the number of bytes per item of that type. Unicode ('U9') and boolean ('?') are exceptions. For the Unicode string type, the number specifies the maximum number of characters, not bytes. The boolean type takes no number at all, as it always occupies one byte. Boolean values (True and False) are the possible outcomes of yes/no questions. As it's a question, the NumPy core developers might have chosen '?' as the type string for boolean values (just my thought).
All the possible type strings used to create NumPy arrays, as provided by the documentation, are:
Character | Description | Example |
'?' | Boolean | np.dtype('?') |
'b' | Signed byte | np.dtype('b') |
'B' | Unsigned byte | np.dtype('B') |
'i' | Signed integer | np.dtype('i8') |
'u' | Unsigned integer | np.dtype('u4') |
'f' | Floating point | np.dtype('f2') |
'c' | Complex floating point | np.dtype('c16') |
'm' | Timedelta | np.dtype('m8') |
'M' | Datetime | np.dtype('M') |
'O' | Python objects | np.dtype('O') |
'S', 'a' | String (zero-terminated) | np.dtype('S5') |
'U' | Unicode string | np.dtype('U') |
'V' | Raw data (void) | np.dtype('V') |
For other ways of constructing data type objects besides Array-protocol type strings, please refer to the NumPy documentation on data type objects.
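To check how a type string maps to a kind and a size, the short sketch below inspects a few of the strings from the table. The selection of codes and the dictionary name `sizes` are arbitrary. Note that 'U9' reports 36 bytes because each of its 9 characters takes 4 bytes.

```python
import numpy as np

sizes = {}
for code in ('?', 'i8', 'u4', 'f2', 'c16', 'S5', 'U9'):
    dt = np.dtype(code)
    sizes[code] = dt.itemsize
    # kind is the one-letter category, itemsize is the size in bytes.
    print(f"{code!r:>6} -> kind={dt.kind!r}, itemsize={dt.itemsize}")

# Sizes in bytes: '?' 1, 'i8' 8, 'u4' 4, 'f2' 2, 'c16' 16, 'S5' 5, 'U9' 36
```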
Three Major Ways to Create Structured Data Types
NumPy provides the numpy.dtype constructor to create data type objects. We can refer to the above type strings to create data types. There are 3 major ways of creating Structured data types:
1. Using a dictionary with names and formats as keys
>>> dt_dict = np.dtype({'names':('person_names', 'person_ages', 'is_python_programmer'),
...                     'formats': ('U9', 'i8', 'bool')})
>>> dt_dict
dtype([('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', '?')])
The value of the names key is a tuple of column indexes we use in Structured Array. The value of the formats key is a tuple of type strings for the columns respectively.
>>> dt_dict.names
('person_names', 'person_ages', 'is_python_programmer')
>>> dt_dict.fields
mappingproxy({'person_names': (dtype('<U9'), 0),
              'person_ages': (dtype('int64'), 36),
              'is_python_programmer': (dtype('bool'), 44)})
>>> dt_dict.itemsize
45
>>> struct_arr.itemsize
45
An item in our Structured Array is the information about a single person on our planet. The memory allocated for a single item is 45 bytes, as reported by the itemsize attribute.
If you observe the result of dt_dict.fields, you can see each field's data type and its byte offset within an item. We know the '<U9' type string refers to a Unicode string of 9 characters. Each character takes 4 bytes, so the person_names field consumes 36 of the 45 bytes. The person_ages field consumes 8 bytes and the is_python_programmer field consumes 1 byte.
All of this explanation can be visualized using the below figure.
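The byte layout described above can also be checked in code. The sketch below prints each field's offset for the default (packed) layout and, as an aside that goes beyond this article, builds the same data type with align=True, which lets NumPy add padding the way a C compiler would. The exact aligned size depends on the platform, so it is printed but not stated here.

```python
import numpy as np

fields = [('person_names', 'U9'), ('person_ages', 'i8'),
          ('is_python_programmer', 'bool')]

packed = np.dtype(fields)               # default: no padding between fields
aligned = np.dtype(fields, align=True)  # padding added for C-style alignment

offsets = []
for name in packed.names:
    field_dtype, offset = packed.fields[name]
    offsets.append(offset)
    print(name, 'starts at byte', offset, 'and takes', field_dtype.itemsize, 'bytes')

print(offsets)                  # [0, 36, 44]
print(packed.itemsize)          # 45
print(aligned.isalignedstruct)  # True
print(aligned.itemsize)         # platform dependent, never smaller than 45
```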
2. Using the list of tuples
>>> dt_tupl = np.dtype([('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', 'bool')])
>>> dt_tupl
dtype([('person_names', '<U9'), ('person_ages', '<i8'), ('is_python_programmer', '?')])
>>> dt_tupl.names
('person_names', 'person_ages', 'is_python_programmer')
In this method, a Structured data type is created using a list of tuples. Each tuple consists of a field name and its type.
The result of dt_tupl.names shows that the field names are taken from the tuples automatically.
3. Using a string of comma-separated types
>>> dt_str = np.dtype('U9, i8, bool')
>>> dt_str
dtype([('f0', '<U9'), ('f1', '<i8'), ('f2', '?')])
>>> dt_str.names
('f0', 'f1', 'f2')
When we don't care about the field names, we can use this kind of Structured data type. NumPy automatically assigns the field names 'f0', 'f1', 'f2', and so on, based on the number of types present.
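To see how the three styles relate, the sketch below builds the data type all three ways and compares the results. The variable `named` is an illustrative extra that reuses the formats of the comma-separated version with explicit field names.

```python
import numpy as np

names = ('person_names', 'person_ages', 'is_python_programmer')

dt_dict = np.dtype({'names': names, 'formats': ('U9', 'i8', 'bool')})
dt_tupl = np.dtype([('person_names', 'U9'), ('person_ages', 'i8'),
                    ('is_python_programmer', 'bool')])
dt_str = np.dtype('U9, i8, bool')

# The dictionary and list-of-tuples forms describe the same data type.
print(dt_dict == dt_tupl)   # True

# The string form has the same layout but different field names.
print(dt_str == dt_tupl)    # False
print(dt_str.itemsize == dt_tupl.itemsize)   # True

# Reusing the formats of dt_str with explicit names gives an equal type.
named = np.dtype({'names': names,
                  'formats': [dt_str[field] for field in dt_str.names]})
print(named == dt_tupl)     # True
```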
Record Arrays
Record Arrays are basically Structured Arrays with one additional feature: a named field can be accessed as an attribute as well as by its dictionary-style key.
>>> rec_arr = np.rec.array(struct_arr)
>>> rec_arr['person_names']
array(['Alice', 'Chris', 'Bob', 'Priyatham'], dtype='<U9')
>>> rec_arr.person_names
array(['Alice', 'Chris', 'Bob', 'Priyatham'], dtype='<U9')
>>> rec_arr is struct_arr
False
>>> rec_arr == struct_arr
rec.array([ True, True, True, True], dtype=bool)
The easiest way of creating Record Arrays is by using the numpy.rec.array() function. Above, the person_names field is accessed both as an attribute and through the dictionary-style key. numpy.rec.array() takes in the Structured Array and creates a new, separate object from it. The result of rec_arr == struct_arr shows that both hold the same values; the Record Array simply adds attribute access.
The disadvantage of the Record Array is that field access is slower than with the Structured Array because of the extra attribute lookup.
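If a separate copy is not wanted, one alternative worth knowing is to view the existing Structured Array as a Record Array. The sketch below, with the illustrative names `rec_copy` and `rec_view`, contrasts the two: numpy.rec.array() copies the data by default, while a view shares memory with the original.

```python
import numpy as np

struct_arr = np.array(
    [('Alice', 42, False), ('Chris', 29, True),
     ('Bob', 42, False), ('Priyatham', 25, True)],
    dtype=[('person_names', 'U9'), ('person_ages', 'i8'),
           ('is_python_programmer', 'bool')],
)

rec_copy = np.rec.array(struct_arr)       # copies the data by default
rec_view = struct_arr.view(np.recarray)   # same memory, attribute access added

print(np.shares_memory(rec_copy, struct_arr))   # False
print(np.shares_memory(rec_view, struct_arr))   # True

# A change made through the view is visible in the original array,
# while the independent copy keeps the old value.
rec_view.person_ages[0] = 43
print(struct_arr['person_ages'][0])   # 43
print(rec_copy.person_ages[0])        # 42
```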
Next Steps: Graduating to Pandas
Structured Arrays are the NumPy developers' effort to provide a built-in capability for dealing with structured data. But when dealing with structured data in the form of tables, a world of extra operations is possible. Pandas is a very mature tool for all such operations. Please consider making the leap to the Pandas package if you are dealing with structured data of the kind discussed in this article.
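As a first step in that direction, a Structured Array can be handed to Pandas directly. The sketch below assumes the pandas package is installed and uses DataFrame.from_records, which accepts structured arrays; the variable names are illustrative.

```python
import numpy as np
import pandas as pd  # assumption: pandas is installed alongside NumPy

struct_arr = np.array(
    [('Alice', 42, False), ('Chris', 29, True),
     ('Bob', 42, False), ('Priyatham', 25, True)],
    dtype=[('person_names', 'U9'), ('person_ages', 'i8'),
           ('is_python_programmer', 'bool')],
)

# The field names of the structured dtype become the column names.
df = pd.DataFrame.from_records(struct_arr)
print(list(df.columns))
# ['person_names', 'person_ages', 'is_python_programmer']

# The grouped mean from earlier becomes a one-liner.
mean_ages = df.groupby('is_python_programmer')['person_ages'].mean()
print(mean_ages[True], mean_ages[False])   # 27.0 42.0
```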