Jonathan Boland, Author at Be on the Right Side of Change

Python shutil: High-Level File Operations Demystified

Jonathan Boland — Wed, 30 Sep 2020 17:49:47 +0000

Are you looking to copy, move, delete, or archive data with your Python programs? If so, you’re in the right place because this article is all about the module that’s been specially designed for the job. It’s called shutil (short for shell utilities) and we’ll be demystifying its key features by way of a few simple examples. We’ll also see how to use shutil in combination with some other standard library modules, and cover a few limitations that could cause you a bit of a headache depending on your priorities, the operating system you use and your version of Python.

A Word About File Paths

Before we start, it’s worth mentioning that paths are constructed differently depending on your operating system. On Mac and Linux they’re separated by forward slashes (known as Posix style) and on Windows by backslashes.

For the purposes of this article I will be using Windows-style paths to illustrate shutil’s features, but this could just as easily have been done with Posix paths.

The fact that Windows paths use backslashes also leads to another complication because they have a special meaning in Python. They are used as part of special characters and for escaping purposes, which you can read all about in this Finxter backslash article.

You will therefore notice the letter ‘r’ prior to strings in the code snippets – this prefix signifies a raw string in which backslashes are treated as literal rather than special characters. The other way to handle this issue is by using a second backslash to escape the first, which is the format Python uses to display the Windows path of a new file that’s been created.
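As a quick check of the two notations, the short sketch below compares a raw string with its escaped equivalent. The first path is the one used throughout this article; new_folder is a hypothetical name chosen only because it starts with the letter n.

```python
# A raw string and an escaped string describe the same Windows path.
raw_path = r'C:\src_folder\blueprint.jpg'
escaped_path = 'C:\\src_folder\\blueprint.jpg'
same_path = raw_path == escaped_path  # True: both contain single backslashes

# Without either fix, '\n' is read as a newline character and the path is corrupted.
# (new_folder is a hypothetical name chosen to trigger the problem.)
broken_path = 'C:\new_folder'
has_newline = '\n' in broken_path  # True

print(same_path, has_newline)
```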

As an aside, when using paths in your real-world programs I would highly recommend defining them with pathlib.Path(). If done correctly, this has the effect of normalizing paths so they work regardless of the operating system the program is running on.
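A minimal sketch of that idea is shown below. It assumes the same folder and file names used elsewhere in this article, and it uses PureWindowsPath only so the Windows rendering can be displayed on any operating system; in a real program Path alone would normally be enough.

```python
from pathlib import Path, PureWindowsPath

# Build a path from its parts; the / operator inserts the right separator
# for the operating system the program is running on.
source = Path('src_folder') / 'blueprint.jpg'

# PureWindowsPath is used here only to show the Windows rendering on any OS.
windows_source = PureWindowsPath('C:/') / 'src_folder' / 'blueprint.jpg'

print(source.name)          # blueprint.jpg
print(str(windows_source))  # C:\src_folder\blueprint.jpg
```

Path objects can be passed directly to the shutil functions covered below in place of strings.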

shutil Directory and File Operations

shutil copy

So, let’s kick things off with a simple example of how to copy a single file from one folder to another.

There’s no need to pip install anything because shutil is in Python’s standard library; just import the module and you’re ready to go:

 >>> import shutil
 >>> source = r'C:\src_folder\blueprint.jpg'
 >>> destination = r'C:\dst_folder'
 >>> shutil.copy(source, destination)
 
 'C:\\dst_folder\\blueprint.jpg'

shutil.copy() places a duplicate of the specified source file in the destination folder you have defined, and Python confirms the path to the file. The file’s permissions are copied along with the data.

Another option is to specify a destination file instead of a destination folder:

 ...
 >>> source = r'C:\src_folder\blueprint.jpg'
 >>> destination = r'C:\dst_folder\plan.jpg'
 >>> shutil.copy(source, destination)
 
 'C:\\dst_folder\\plan.jpg'

In this instance, a copy of the source file will still be placed in the destination folder but its name will be changed to the one that’s been provided.

WARNING: Regardless of whether you copy a file directly to a folder preserving its existing name or provide a destination file name, if a file already exists in the destination folder with that name copy() will permanently overwrite it without warning you first.

This could be useful if you’re intentionally looking to update or replace a file, but might cause major problems if you forget there’s another file in the location with that name that you want to keep!
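If silent overwriting is a concern, one option is to wrap copy() in a small guard. The sketch below is only an illustration: safe_copy is a hypothetical helper name, not part of shutil, and the check it performs is not atomic, so another process could still create the file between the check and the copy.

```python
import shutil
from pathlib import Path


def safe_copy(source, destination):
    """Copy source to destination, refusing to overwrite an existing file."""
    source = Path(source)
    destination = Path(destination)
    # Mirror shutil.copy(): a directory destination keeps the source file name.
    target = destination / source.name if destination.is_dir() else destination
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    return shutil.copy(source, target)
```

Calling safe_copy(source, destination) with the paths from the examples above would then raise FileExistsError on the second attempt instead of replacing the file.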

shutil copy2

copy2() works in the same way as copy() except that in addition to file permissions it also attempts to preserve metadata such as the last time the file was modified.

There are a few limitations to this, which you can read about in the Missing File Metadata section later in this article.
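To see the difference between the two functions, you could compare modification times after copying the same file both ways. The helper below is a hypothetical sketch written for this comparison (copied_mtimes is not a shutil function), and the exact timestamps you get will depend on your file system.

```python
import os
import shutil


def copied_mtimes(source, plain_target, metadata_target):
    """Copy source twice and return (source, copy(), copy2()) modification times."""
    shutil.copy(source, plain_target)      # data and permission bits only
    shutil.copy2(source, metadata_target)  # also tries to keep timestamps
    return (
        os.stat(source).st_mtime,
        os.stat(plain_target).st_mtime,
        os.stat(metadata_target).st_mtime,
    )
```

For an old source file, the copy() result should show a recent modification time, while the copy2() result should match the original.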

shutil copytree

If copying files one-by-one isn’t going to cut it, copytree() is the way to go.

 ...
 >>> source = r'C:\src_folder\directory'
 >>> destination = r'C:\dst_folder\directory_copy'
 >>> shutil.copytree(source, destination)
 
 'C:\\dst_folder\\directory_copy'

copytree() creates a duplicate of the entire source directory and gives it the name you specify in the destination path. It uses copy2() to copy files by default so will attempt to preserve metadata, but this can be overridden by setting the copy_function parameter.

Unlike when copying individual files, if a directory with the same name already exists in that destination (in this case directory_copy), an error will be raised and the directory tree will not be copied. So, when attempting to complete the same copytree operation for a second time this is an abridged version of what we see:

 ...
 FileExistsError: [WinError 183] Cannot create a file when that file already  
 exists: 'C:\\dst_folder\\directory_copy'

Accidentally overwriting an entire directory could be pretty catastrophic, and this safeguard has no doubt prevented many such incidents over the years. It’s also caused a fair amount of frustration though, because until very recently there was no straightforward way to override it.

If copying into an existing directory IS what you want to do, a new option introduced in Python 3.8 makes this possible:

 ...
 >>> shutil.copytree(source, destination, dirs_exist_ok=True)
 
 'C:\\dst_folder\\directory_copy'

The dirs_exist_ok parameter is set to False by default, but changing it to True overrides the usual behavior and allows us to complete our copytree() operation for a second time even though directory_copy already exists in the specified location. Files already in directory_copy are overwritten by source files with the same name, and any other files there are left in place.

Another handy feature is the ignore parameter:

 >>> from shutil import copytree, ignore_patterns
 >>> src = r'C:\src_folder\another_directory'
 >>> dst = r'C:\dst_folder\another_directory_copy'
 >>> copytree(src, dst, ignore=ignore_patterns('*.txt', 'discard*'))
 
 'C:\\dst_folder\\another_directory_copy'

ignore allows you to specify files and folders to leave out when a directory is copied.

The simplest way to achieve this is by importing shutil’s ignore_patterns helper function, which can then be passed to copytree’s ignore parameter.

ignore_patterns takes one or more patterns in string format, and any files or folders matching them will be passed over when copytree() creates the new version of the directory.

For example, in the above code snippet we have passed two arguments to ignore_patterns: '*.txt' and 'discard*'. The asterisk (* symbol) acts as a wildcard that matches zero or more characters, so these patterns will ensure that copytree() duplicates everything except files that end with .txt and files or folders that start with discard.

This can be seen by viewing the file structure of another_directory:

 C:\src_folder>tree /F
 ...
 C:.
 └───another_directory
     ├───discard_this_folder
     ├───include_this_folder
     │       discard_this_file.docx
     │       include_this_file.docx
     │       include_this_file_too.docx
     │       this_file_will_be_discarded.txt
     │       this_file_will_not_be_discarded.pdf
     │
     └───include_this_folder_too

And then looking at the file structure of another_directory_copy once it’s been created by shutil:

C:\dst_folder>tree /F
 ...
 C:.
 └───another_directory_copy
     ├───include_this_folder
     │       include_this_file.docx
     │       include_this_file_too.docx
     │       this_file_will_not_be_discarded.pdf
     │
     └───include_this_folder_too
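The ignore parameter is not limited to ignore_patterns: it accepts any callable that takes the directory being visited and a list of the names inside it, and returns the names to skip. The sketch below uses that to leave out large files; ignore_larger_than and its max_bytes argument are hypothetical names invented for this example.

```python
import os
import shutil


def ignore_larger_than(max_bytes):
    """Build an ignore callable for copytree() that skips files above max_bytes."""
    def _ignore(directory, names):
        ignored = set()
        for name in names:
            full_path = os.path.join(directory, name)
            # Only files are filtered; folders are always visited.
            if os.path.isfile(full_path) and os.path.getsize(full_path) > max_bytes:
                ignored.add(name)
        return ignored
    return _ignore


# Example call (paths as in the snippets above):
# shutil.copytree(src, dst, ignore=ignore_larger_than(10_000_000))
```

Because the callable receives each directory in turn, the same rule is applied at every level of the tree.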

shutil move

move() works in a similar way to copy2() but lets you transfer a file to another location instead of copying it.

You can also move an entire directory by specifying a folder for it to be placed in:

 import shutil
 
 
 >>> source = r'C:\src_folder\diagrams'
 >>> destination = r'C:\dst_folder'
 >>> shutil.move(source, destination)
 
 'C:\\dst_folder\\diagrams'

Alternatively, you can provide a new name for the directory as part of the process:

 ...
 >>> source = r'C:\src_folder\diagrams'
 >>> destination = r'C:\dst_folder\layouts'
 >>> shutil.move(source, destination)
 
 'C:\\dst_folder\\layouts'

Unlike copy() and copy2(), move() will raise an exception if a file with the same name already exists in the given folder (unless it’s not on the current file system). This behavior can also be observed when moving directories. Having moved our diagrams directory and renamed it layouts, if we now try to move another directory called layouts into the same location we will see the following:

...
 >>> source = r'C:\src_folder\layouts'
 >>> destination = r'C:\dst_folder'
 >>> shutil.move(source, destination) 
 ...
 shutil.Error: Destination path 'C:\dst_folder\layouts' already exists

WARNING: However, as with the copy functions, when moving individual files, if you include a destination file name and a file with that name already exists in the destination folder, move() will permanently overwrite it without warning you first:

...
 >>> source = r'C:\src_folder\sketch.jpg'
 >>> destination = r'C:\dst_folder\design.jpg'
 >>> shutil.move(source, destination)
 
 'C:\\dst_folder\\design.jpg'
 
 >>> source = r'C:\src_folder\different_sketch.jpg'
 >>> destination = r'C:\dst_folder\design.jpg'
 >>> shutil.move(source, destination)
 
 'C:\\dst_folder\\design.jpg'

There is another subtle gotcha to look out for when using move() that has the potential to cause problems too:

...
 >>> source = r'C:\src_folder\blueprint.jpg'
 >>> destination = r'C:\dst_folder\plan'
 >>> shutil.move(source, destination)
 
 'C:\\dst_folder\\plan'

On this occasion we have tried to transfer a file into a folder that doesn’t exist. Instead of raising an exception, move() has completed the operation and given the file the name of the non-existent directory (plan) without a file extension. The file is still in JPEG format, but it won’t be called what we expect, and without the .jpg extension Windows will no longer recognize it as an image!

The same kind of problem could occur if we accidentally missed off the file extension from a destination file name as well.

This issue might also crop up when using the copy functions if you’re not careful. In that case you would at least have the original file for reference, but it could still lead to significant confusion.
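One way to guard against this is to confirm that the destination folder really exists before moving anything into it. The following is a minimal sketch under that assumption; move_into_folder is a hypothetical helper, not something shutil provides.

```python
import shutil
from pathlib import Path


def move_into_folder(source, folder):
    """Move source into folder, but only if folder is an existing directory."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not an existing directory")
    # The source keeps its own name, so the file extension cannot be lost.
    return shutil.move(str(source), str(folder))
```

With this wrapper, the mistaken call in the example above would raise NotADirectoryError and leave blueprint.jpg where it was.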

shutil rmtree

If you want to delete an entire directory instead of moving or copying it, you can do this with rmtree():

 
 import shutil
 >>> shutil.rmtree(r'C:\dst_folder\directory_copy')

By default, rmtree() will raise an exception and halt the process if an error is encountered when attempting to remove files. You can see an example of one of these error messages below:

 ...
 PermissionError: [WinError 32] The process cannot access the file because 
 it is being used by another process: 
 'C:\\dst_folder\\directory_copy\\blueprint.pdf'

However, this behavior can be overridden:

 ...
 >>> shutil.rmtree(r'C:\dst_folder\directory_copy', ignore_errors=True)

If you set the ignore_errors parameter to True, rmtree() will skip anything it fails to remove and carry on deleting the rest of the directory tree instead of raising an exception.

WARNING: Directory trees removed by rmtree() are permanently deleted, so you need to be very careful about how you use it. If you’re concerned by the potential risks (and I wouldn’t blame you if you were!), you might want to consider using a safer alternative such as Send2Trash.
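If you do use rmtree() directly, a simple precaution is to refuse to delete anything outside a folder you have explicitly nominated. The sketch below shows one way this might look; remove_tree_within and allowed_root are hypothetical names, and the check is only as good as the root you pass in.

```python
import shutil
from pathlib import Path


def remove_tree_within(path, allowed_root):
    """Delete the directory at path only if it lies strictly inside allowed_root."""
    target = Path(path).resolve()
    root = Path(allowed_root).resolve()
    # target.parents excludes target itself, so the root cannot be deleted either.
    if root not in target.parents:
        raise ValueError(f"{target} is not inside {root}")
    shutil.rmtree(target)
```

A mistyped path such as the drive root would then be rejected before anything is removed.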

shutil archive

You can use shutil to create directory archives as well:

 ...
 >>> shutil.make_archive(
         r'C:\dst_folder\zipped_designs', 
         'zip', 
         r'C:\src_folder\designs',
         )
 
 'C:\\dst_folder\\zipped_designs.zip'

As shown above, a simple way to do this is by passing three arguments to the make_archive() function:

The path where the new archive should be created, including its name but without the file extension.
The archive format to use when creating it.
The path of the directory to be archived.

The directory will remain unaltered in its original place, and the archive will be created in the specified location.

make_archive() can also create archives in other formats by passing 'tar', 'gztar', 'bztar' or 'xztar' as the format argument.

For operations more sophisticated than archiving an entire directory, like zipping selected files from a directory based on filters, you can use the zipfile module instead.
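As a rough illustration of that zipfile approach, the sketch below archives only the files in a directory tree that match a glob pattern. The function name zip_matching and the default '*.jpg' pattern are assumptions made for this example.

```python
import zipfile
from pathlib import Path


def zip_matching(directory, archive_path, pattern="*.jpg"):
    """Write every file under directory that matches pattern into a zip archive."""
    directory = Path(directory)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in sorted(directory.rglob(pattern)):
            if file_path.is_file():
                # Store paths relative to the directory so the archive is portable.
                relative_name = file_path.relative_to(directory).as_posix()
                archive.write(file_path, arcname=relative_name)
    return Path(archive_path)
```

The archive should be written somewhere outside the directory being searched, or at least given a name the pattern cannot match.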

shutil Limitations

You can achieve a great deal with the shutil module, but, as mentioned at the start of this article, it does have a few limitations that you should know about.

Missing File Metadata

copy2() preserves as much metadata as possible and is used by copytree() and move() so by default these methods will do the same. It’s not able to capture everything though.

On Windows: file owners, access control lists (ACLs) and alternative data streams are not copied.

File owners and ACLs are also lost on Linux and Mac, along with groups.

On Mac OS the resource fork and other metadata are not used either, resulting in the loss of resource data and incorrect creator and file type codes.

Speed

A complaint often levelled at shutil in the past was that it could be very slow to use when working with large amounts of data, particularly on Windows.

Fortunately, this has been addressed in Python 3.8 with the introduction of the snappily titled platform-dependent efficient copy operations.

This “fast-copy” enhancement means that shutil’s copy and move operations are now optimized to occur within the relevant operating system kernel instead of Python’s userspace buffers whenever possible.

Therefore, if you’re running into speed issues on an earlier version of Python and using 3.8 instead is an option, it’s likely to improve matters greatly.

You could also look into third-party packages such as pyfastcopy.

Combining Shutil With Other Standard Library Modules

In the copytree() section of this article we saw how to exert greater control over shutil’s behavior by using the ignore parameter to exclude files with a particular name or type.

But what if you want to carry out more complex tasks such as accessing other file-related data so you can check it to determine which operations should be completed?

Using shutil in combination with some of Python’s other standard library modules is the answer.

This section is intended to provide an example of one use case for this kind of approach.

We will create a simple program that can spring clean a file directory by storing away old subdirectories if they haven’t been modified for a long time.

To do this we’ll use shutil.move() along with several other handy modules including: pathlib (which I mentioned at the start), os and time.

The Modules

As well as making it much simpler to define cross platform compatible paths, pathlib’s Path class contains methods that really help with handling file paths efficiently.

We’ll also be using the os module’s walk function, which has no equivalent in pathlib. This will enable us to traverse our subdirectories to identify all the files they contain and extract their paths.

We will take advantage of the time module too, so we can calculate how long it’s been since the files in each subdirectory were last modified.

Preparing for the Move

Having imported our modules:

 import os
 import pathlib
 import shutil
 import time

The first thing we need to do is assign the number of seconds in a standard (non-leap) year to a constant:

SECONDS = 365 * 24 * 60 * 60

This will help us to determine how long it’s been since the files in our subfolders were last modified (more on that later).

Next, we define our first function which will prepare the file operations that are necessary to complete the move:

 ...
 def prepare_move(number, path, storage_folder):
     pass

Our function takes three arguments:

number – the number of years since any file in a subfolder was last modified (this could also be a float such as 1.5).
path – the file path of the main directory that contains the subdirectories we want to tidy up.
storage_folder – the name of the folder where we want the old directories to be placed. Once the operation is complete, this storage folder will be put in the main directory alongside the subdirectories that haven’t been moved.

We now need to assign some objects to variables that will play important roles in the preparation process:

 ...
 def prepare_move(number, path, storage_folder):
     length = SECONDS * number
     now = time.time()
     my_directory = pathlib.Path(path)
     my_subdirectories = (item for item in my_directory.iterdir() if item.is_dir())

length – is the result of multiplying the SECONDS constant we previously defined by the number of years passed into the function.
now – is the current time in seconds provided by the time module. This is calculated based on what’s known as the epoch.
my_directory – stores the main directory path we passed to the function as a pathlib.Path object.
my_subdirectories – is a generator containing the paths of our subdirectories produced by iterating through my_directory.

Our next step is to create a for loop to iterate through the subdirectories yielded by our generator and append the details of any that have not been modified during the period we specified to a list of file operations:

 ...
 def prepare_move(number, path, storage_folder):
     length = SECONDS * number
     now = time.time()
     my_directory = pathlib.Path(path)
     my_subdirectories = (item for item in my_directory.iterdir() if item.is_dir())
     file_operations = []
     for subdirectory in my_subdirectories:
         time_stats = _get_stats(subdirectory)

The first task carried out by the loop is to create a list of all the file modified times in a subdirectory.

This is handled by a separate function which uses the os.walk() function mentioned earlier and the last modified value in seconds (st_mtime) available via the Path.stat() method:

 ...
 def _get_stats(subdirectory):
     time_stats = []
     for folder, _, files in os.walk(subdirectory):
         for file in files:
             file_path = pathlib.Path(folder) / file
             time_stat = file_path.stat().st_mtime
             time_stats.append(time_stat)
     return time_stats

The loop then checks these file modified stats to see whether they all precede the specified point in time (with the calculation being done in seconds).

If so, the necessary source and destination paths are constructed and appended to the file_operations list.

Once the loop has iterated through all our subdirectories, the function returns the list of file operations that need to be completed:

 ...
 def prepare_move(number, path, storage_folder):
     length = SECONDS * number
     now = time.time()
     my_directory = pathlib.Path(path)
     my_subdirectories = (item for item in my_directory.iterdir() if item.is_dir())
     file_operations = []
     for subdirectory in my_subdirectories:
         time_stats = _get_stats(subdirectory)
         if all(time_stat < (now - length) for time_stat in time_stats):
             *_, subdirectory_name = subdirectory.parts
             source = subdirectory
             destination = my_directory / storage_folder / subdirectory_name
             file_operations.append((source, destination))
     return file_operations
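The age check at the heart of this function can be tried out in isolation. The sketch below restates it with a fixed, made-up value for the current time so the results are repeatable; all_older_than and SECONDS_PER_YEAR are hypothetical names used only here. It also shows a detail worth being aware of: all() returns True for an empty sequence, so a subdirectory containing no files at all is treated as old.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # same value as the SECONDS constant above


def all_older_than(modified_times, years, now):
    """Return True if every timestamp is earlier than `years` years before now."""
    cutoff = now - SECONDS_PER_YEAR * years
    return all(modified < cutoff for modified in modified_times)


now = 1_700_000_000  # a fixed, made-up "current time" in seconds since the epoch
two_years_ago = now - 2 * SECONDS_PER_YEAR
last_week = now - 7 * 24 * 60 * 60

print(all_older_than([two_years_ago], 1, now))             # True
print(all_older_than([two_years_ago, last_week], 1, now))  # False
print(all_older_than([], 1, now))                          # True: no files at all
```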

Moving the Subdirectories

Now we need to define the function that will actually move the subdirectories:

 ...
 def move_files(file_operations):
     for operation in file_operations:
         source, destination = operation
         shutil.move(source, destination)

Because all the preparation work has already been done, this function simply accepts the file operations and passes them to shutil.move() via a for loop so each old subdirectory can be placed in the specified storage_folder.
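Keeping the preparation separate from the move also makes it easy to inspect what is about to happen before any files are touched. As a small sketch of that idea, the hypothetical describe_operations() function below turns the same list of (source, destination) pairs into readable lines instead of passing them to shutil.move().

```python
def describe_operations(file_operations):
    """Return one readable line per planned move, without touching any files."""
    return [
        f"{source} -> {destination}"
        for source, destination in file_operations
    ]


# Example with made-up paths:
planned = [("my_directory/old_files_1", "my_directory/old_stuff/old_files_1")]
for line in describe_operations(planned):
    print(line)  # my_directory/old_files_1 -> my_directory/old_stuff/old_files_1
```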

Executing the Program

Lastly, we define a main() function to execute the program and call it with our arguments:

 ...
 def main(number, path, storage_folder):
     file_operations = prepare_move(number, path, storage_folder)
     move_files(file_operations)
 
 main(1, r"F:\my_directory", "old_stuff")

Here’s the whole program:

 
 import os
 import pathlib
 import shutil
 import time
 
 
 SECONDS = 365 * 24 * 60 * 60
 
 
 def prepare_move(number, path, storage_folder):
     length = SECONDS * number
     now = time.time()
     my_directory = pathlib.Path(path)
     my_subdirectories = (item for item in my_directory.iterdir() if item.is_dir())
     file_operations = []
     for subdirectory in my_subdirectories:
         time_stats = _get_stats(subdirectory)
         if all(time_stat < (now - length) for time_stat in time_stats):
             *_, subdirectory_name = subdirectory.parts
             source = subdirectory
             destination = my_directory / storage_folder / subdirectory_name
             file_operations.append((source, destination))
     return file_operations
 
 
 def _get_stats(subdirectory):
     time_stats = []
     for folder, _, files in os.walk(subdirectory):
         for file in files:
             file_path = pathlib.Path(folder) / file
             time_stat = file_path.stat().st_mtime
             time_stats.append(time_stat)
     return time_stats
 
 
 def move_files(file_operations):
     for operation in file_operations:
         source, destination = operation
         shutil.move(source, destination)
 
 
 def main(number, path, storage_folder):
     file_operations = prepare_move(number, path, storage_folder)
     move_files(file_operations)
 
 main(1, r"F:\my_directory", "old_stuff")

You can see how the directory structure looked before running the program below:

 F:\my_directory>tree /F
 ...
 F:.
 ├───new_files_1
 │   │   new_file.jpg
 │   │
 │   ├───second_level_folder_1
 │   │       really_new_file.txt
 │   │
 │   └───second_level_folder_2
 │           very_new_file.txt
 │
 ├───new_files_2
 │       fairly_new_file.txt
 │
 ├───old_files_1
 │   │   old_file.txt
 │   │
 │   └───second_level_folder_1
 │       │   old_file_as_well.txt
 │       │
 │       └───third_level_folder
 │               really_old_file.jpg
 │
 └───old_files_2
     │   another_old_file.txt
     │
     └───old_second_level_folder
             oldest_file.jpg
             old_file_2.txt

And this is what it looks like afterwards:

 
 F:\my_directory>tree /F
 ...
 F:.
  ├───new_files_1
  │   │   new_file.jpg
  │   │
  │   ├───second_level_folder_1
  │   │       really_new_file.txt
  │   │
  │   └───second_level_folder_2
  │           very_new_file.txt
  │
  ├───new_files_2
  │       fairly_new_file.txt
  │
  └───old_stuff
      ├───old_files_1
      │   │   old_file.txt
      │   │
      │   └───second_level_folder_1
      │       │   old_file_as_well.txt
      │       │
      │       └───third_level_folder
      │               really_old_file.jpg
      │
      └───old_files_2
          │   another_old_file.txt
          │
          └───old_second_level_folder
                  oldest_file.jpg
                  old_file_2.txt

Obviously, if you had a directory this small or one where all the subdirectories were labelled as either old or new already, you would be unlikely to need such a program! But hopefully this basic example helps to illustrate how the process would work with a larger, less intuitive directory.

The program shown in this section has been greatly simplified for demonstration purposes. If you would like to see a more complete version, structured as a command line application that summarizes changes before you decide whether to apply them, and enables you to tidy files based on creation and last accessed times as well, you can view it here.

Final Thoughts

As we’ve seen, the shutil module provides some excellent utilities for working with files and directories, and you can greatly enhance their power and precision by combining them with other tools from the standard library and beyond.

Care should be taken to avoid permanently overwriting or deleting existing files and directories by accident though, so please check out the warnings included in the relevant sections of this article if you haven’t already.

The example program described above is just one of many uses to which shutil’s tools could be put. Here’s hoping you find some ingenious ways to apply them in your own projects soon.


Python List: Remove Duplicates and Keep the Order

Jonathan Boland — Wed, 09 Sep 2020 18:57:19 +0000

Removing duplicates from a list is pretty simple. You can do it with a Python one-liner:

>>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
>>> result = list(set(initial))
>>> result
[1, 7, 9, 6]

Python set elements have to be unique so converting a list into a set and back again achieves the desired result.

What if the original order of the list is important though? That makes things a bit more complicated because sets are unordered, so once you’ve finished the conversion the order of the list will be lost.

Fortunately, there are several ways to overcome this issue. In this article we’ll look at a range of different solutions to the problem and consider their relative merits.

Method 1 – For Loop

A basic way to achieve the required result is with a for loop:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = []
 >>> for item in initial:
         if item not in result:
             result.append(item)
 >>> result
 
 [1, 9, 6, 7]

This approach does at least have the advantage of being easy to read and understand. It’s quite inefficient though, because each not in check scans the result list from the start, and that check is repeated for every element of the initial list.

That might not be a problem with this simple example, but the time overhead will become increasingly evident if the list gets very large.
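A common refinement, sketched below, is to keep the readable loop but track the items already encountered in a set, where membership checks do not require scanning a list. This only works when the items are hashable, and the function name unique_in_order is just a label chosen for this example.

```python
def unique_in_order(items):
    """Remove duplicates from items while keeping first-seen order."""
    seen = set()   # fast membership checks
    result = []    # preserves the original order
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(unique_in_order([1, 1, 9, 1, 9, 6, 9, 7]))  # [1, 9, 6, 7]
```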

Method 2 – List Comprehension

One alternative is to use a list comprehension:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = []
 >>> [result.append(item) for item in initial if item not in result]
 [None, None, None, None]
 >>> result
 
 [1, 9, 6, 7]

List comprehensions are handy and very powerful Python tools that enable you to combine variables, for loops and if statements. They make it possible to create a list with a single line of code (but you can split them into multiple lines to improve readability too!).

Although shorter and still fairly clear, using a list comprehension in this instance is not a very good idea.

That’s because it takes the same inefficient approach to membership testing that we saw in Method 1. It also relies on the side effects of the comprehension to build the result list, which many consider to be bad practice.

To explain further, even if it’s not assigned to a variable for later use, a list comprehension still creates a list object. So, in the process of appending items from the initial list to the result list, our code is also creating a third list containing the return value of each result.append(item) call.

Python functions return the value None if no other return value is specified, meaning that (as you can see above) the output from the third list is:

[None, None, None, None]

A for loop is clearer and does not rely on side effects so is the better method of the two on this occasion.

Method 3 – Sorted Set

We can’t simply convert our list to a set to remove duplicates if we want to preserve order. However, using this approach in conjunction with the sorted function is another potential way forward:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = sorted(set(initial), key=initial.index)
 >>> result
 
 [1, 9, 6, 7]

As you can see, this method uses the index of the initial list to sort the set of unique values in the correct order.

The problem is that although it’s pretty easy to understand it’s not much faster than the basic for loop shown in Method 1.

Method 4 – Dictionary fromkeys()

A seriously quick approach is to use a dictionary:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = list(dict.fromkeys(initial))
 >>> result
 
 [1, 9, 6, 7]

Like sets, dictionaries use hash tables, which means they are extremely fast.

Python dictionary keys are unique by default so converting our list into a dictionary will remove duplicates automatically.

The dict.fromkeys() method creates a new dictionary using the elements from an iterable as the keys.

Once this has been done with our initial list, converting the dictionary back to a list gives the result we’re looking for.

Dictionaries were only guaranteed to preserve insertion order in all Python implementations from Python 3.7 onwards (in CPython 3.6 this was an implementation detail).

So, if you’re using an older version of Python, you will need to import the OrderedDict class from the collections module in the standard library instead:

 >>> from collections import OrderedDict
 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = list(OrderedDict.fromkeys(initial))
 >>> result
 
 [1, 9, 6, 7]

This approach might not be as fast as using a standard dictionary, but it’s still very speedy!
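If you would like to measure the difference on your own machine, the standard library’s timeit module can help. The sketch below times the for loop from Method 1 against dict.fromkeys() on some made-up data; the function names and the size of the test list are arbitrary choices for this example, and the timings will vary from one computer to another.

```python
import timeit


def dedupe_with_loop(items):
    """Method 1: for loop with a list membership check."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


def dedupe_with_dict(items):
    """Method 4: dict.fromkeys()."""
    return list(dict.fromkeys(items))


# Made-up test data: 10,000 numbers containing 1,000 distinct values.
data = [n % 1000 for n in range(10000)]

loop_seconds = timeit.timeit(lambda: dedupe_with_loop(data), number=1)
dict_seconds = timeit.timeit(lambda: dedupe_with_dict(data), number=1)
print(f"for loop:      {loop_seconds:.4f}s")
print(f"dict.fromkeys: {dict_seconds:.4f}s")
```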


Method 5 – more-itertools

Up to this point, we’ve only looked at lists containing immutable items. But what if your list contains mutable data types such as lists, sets or dictionaries?

It’s still possible to use the basic for loop shown in Method 1, but that won’t cut the mustard if speed is of the essence.

Also, if we try to use dict.fromkeys() we’ll receive a TypeError because dictionary keys must be hashable.

A great answer to this conundrum comes in the form of a library called more-itertools. It’s not part of the Python standard library so you’ll need to pip install it.

With that done, you can import and use its unique_everseen() function like so:

 >>> from more_itertools import unique_everseen
 >>> mutables = [[1, 2, 3], [2, 3, 4], [1, 2, 3]]
 >>> result = list(unique_everseen(mutables))
 >>> result
 
 [[1, 2, 3], [2, 3, 4]]

The library more-itertools is designed specifically for working with Python’s iterable data types in efficient ways (it complements itertools which IS part of the standard library).

The function unique_everseen() yields unique elements while preserving order and crucially it can handle mutable data types, so it’s exactly what we’re looking for.

The function also provides a way to remove duplicates even more quickly from a list of lists:

 ...
 >>> result = list(unique_everseen(mutables, key=tuple))
 >>> result
 
 [[1, 2, 3], [2, 3, 4]]

This works well because it converts the unhashable lists into hashable tuples to speed things up further.

If you want to apply this trick to a list of sets, you can use frozenset as the key:

 ...
 >>> mutables = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
 >>> result = list(unique_everseen(mutables, key=frozenset))
 >>> result
 
 [{1, 2, 3}, {2, 3, 4}]

Specifying a key with a list of dictionaries is a little more complicated, but can still be achieved with the help of a lambda function:

 ...
 >>> mutables = [{'one': 1}, {'two': 2}, {'one': 1}]
 >>> result = list(
     unique_everseen(mutables, key=lambda x: frozenset(x.items()))
     )
 >>> result
 
 [{'one': 1}, {'two': 2}]

The function unique_everseen() can also be used with lists containing a mix of iterable and non-iterable items (think integers and floats), which is a real bonus. Attempting to provide a key in this instance will result in a TypeError though.
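To give a feel for how a function like this can cope with both hashable and unhashable items, here is a simplified standard-library sketch of the general idea. It is not the more-itertools implementation, and the name unique_preserving_order is made up for this example: items that can be hashed are tracked in a set, and anything else falls back to a slower list.

```python
def unique_preserving_order(iterable, key=None):
    """Yield unique elements in first-seen order, hashable or not."""
    seen_hashable = set()
    seen_unhashable = []
    for element in iterable:
        marker = element if key is None else key(element)
        try:
            is_new = marker not in seen_hashable
            if is_new:
                seen_hashable.add(marker)
        except TypeError:
            # Unhashable marker (such as a list): fall back to a list search.
            is_new = marker not in seen_unhashable
            if is_new:
                seen_unhashable.append(marker)
        if is_new:
            yield element


mutables = [[1, 2, 3], [2, 3, 4], [1, 2, 3]]
print(list(unique_preserving_order(mutables)))             # [[1, 2, 3], [2, 3, 4]]
print(list(unique_preserving_order(mutables, key=tuple)))  # [[1, 2, 3], [2, 3, 4]]
```

Supplying a key that produces hashable values keeps everything in the set, which is why the key=tuple version of the real function is the quicker of the two.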

Method 6 – NumPy unique()

If you’re working with numerical data, the third-party library numpy is an option too:

 >>> import numpy as np
 >>> initial = np.array([1, 1, 9, 1, 9, 6, 9, 7])
 >>> _, idx = np.unique(initial, return_index=True)
 >>> result = initial[np.sort(idx)]
 >>> result
 
 [1 9 6 7]

The index values of the unique items can be stored by using the np.unique() function with the return_index parameter set to True.

These can then be passed to np.sort() to produce a correctly ordered slice with duplicates removed.

Technically this method could be applied to a standard list by first converting it into a numpy array and then converting it back to list format at the end. However, this would be an overcomplicated and inefficient way of achieving the result.

Using these kinds of techniques only really makes sense if you are also utilizing some of numpy’s powerful features for other reasons.

Method 7 – pandas unique()

Another third-party library we could use is pandas:

 >>> import pandas as pd
 >>> initial = pd.Series([1, 1, 9, 1, 9, 6, 9, 7])
 >>> result = pd.unique(initial)
 >>> result
 array([1, 9, 6, 7])

pandas is better suited to the task because it preserves order by default and pd.unique() is significantly faster than np.unique().

As with the numpy method, it would be perfectly possible to convert the result to a standard list at the end.

Again though, unless you’re employing the amazing data analysis tools provided by pandas for another purpose, there is no obvious reason to choose this approach over the even faster option utilizing Python’s built-in dictionary data type (Method 4).

Summary

As we’ve seen, there is a wide range of ways to solve this problem and the decision about which one to select should be driven by your particular circumstances.

If you’re writing a quick script and your list isn’t huge, you may opt to use a simple for loop for the sake of clarity.

However, if efficiency is a factor and your lists don’t contain mutable items then going with dict.fromkeys() is an excellent option. It’s great that this method uses one of Python’s built-in data types and retains a good level of readability while massively improving on the for loop’s speed.

Alternatively, if you’re using an older version of Python, OrderedDict.fromkeys() is a really good choice as it’s still very fast.
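As a quick reminder of what these two options look like side by side, here is a short sketch that reuses the sample values from the numpy and pandas examples:

```python
from collections import OrderedDict

initial = [1, 1, 9, 1, 9, 6, 9, 7]

# Plain dictionaries are guaranteed to keep insertion order from Python 3.7
result = list(dict.fromkeys(initial))

# OrderedDict offers the same ordering guarantee on older versions
result_old = list(OrderedDict.fromkeys(initial))

print(result)      # [1, 9, 6, 7]
print(result_old)  # [1, 9, 6, 7]
```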

If you need to work with lists that contain mutable items, importing more-itertools so you can take advantage of the brilliant unique_everseen() function makes a lot of sense.

Lastly, if you’re doing some serious number crunching with numpy or manipulating data with pandas, it would probably be wise to go with the methods built into those tools for this purpose.

The choice is of course yours, and I hope this article has provided some useful insights that will help you pick the right approach for the job at hand.

The post Python List: Remove Duplicates and Keep the Order appeared first on Be on the Right Side of Change.

Python String Formatting: How to Become a String Wizard with the Format Specification Mini-Language

Jonathan Boland — Mon, 31 Aug 2020 10:41:04 +0000

Python provides fantastic string formatting options, but what if you need greater control over how values are presented? That’s where format specifiers come in.

This article starts with a brief overview of the different string formatting approaches. We’ll then dive straight into some examples to whet your appetite for using Python’s Format Specification Mini-Language in your own projects.

But before all that, try some string formatting yourself in an interactive Python shell:

Exercise: Define a variable income, then create another variable tax and calculate the tax to be paid on that income (30%). Now add both values, income and tax, to a string by using the format specifier %s!

Don’t worry if you struggle with this exercise. After reading this tutorial, you won’t! Let’s learn everything you need to know to get started with string formatting in Python.

String Formatting Options

Python’s string formatting tools have evolved considerably over the years.

The oldest approach is to use the % operator:

>>> number = 1 + 2
>>> 'The magic number is %s' % number
'The magic number is 3'

(The above code snippet already includes a kind of format specifier. More on that later…)

The str.format() method was then added:

>>> 'The magic number is {}'.format(number)
'The magic number is 3'

Most recently, formatted string literals (otherwise known as f-strings) were introduced. F-strings are easier to use and lead to cleaner code, because their syntax enables the value of an expression to be placed directly inside a string:

>>> f'The magic number is {number}'
'The magic number is 3'

Other options include creating template strings by importing the Template class from Python’s string module, or manually formatting strings (which we’ll touch on in the next section).
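For completeness, a template string version of the same example might look something like this (a minimal sketch; the placeholder name number is simply my choice):

```python
from string import Template

number = 1 + 2

# Placeholders are written as $name and filled in with substitute()
template = Template('The magic number is $number')
message = template.substitute(number=number)

print(message)  # The magic number is 3
```

Note that template strings only substitute values. They do not support the format specifiers covered in the rest of this article.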

If this is all fairly new to you and some more detail would be helpful before moving on, an in-depth explanation of the main string formatting approaches can be found here.

Format Specifiers

With that quick summary out of the way, let’s move on to the real focus of this post – explaining how format specifiers can help you control the presentation of values in strings.

F-strings are the clearest and fastest approach to string formatting, so I will be using them to illustrate the use of format specifiers throughout the rest of this article. Please bear in mind, though, that specifiers can also be used with the str.format() method. Also, strings using the old % operator actually require a kind of format specification. For example, in the %s example shown in the previous section the letter s is known as a conversion type, and it indicates that the standard string representation of the object should be used.
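To illustrate that point, the short sketch below applies the same two-decimal-place specification through each route (the value and variable names are just placeholders):

```python
value = 3.14159

with_fstring = f'{value:.2f}'                # f-string
with_format_method = '{:.2f}'.format(value)  # str.format() method
with_builtin = format(value, '.2f')          # built-in format() function
with_percent = '%.2f' % value                # old-style % operator

print(with_fstring, with_format_method, with_builtin, with_percent)
# 3.14 3.14 3.14 3.14
```

The first three share exactly the same mini-language. The % operator uses an older, similar-looking syntax of its own.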

So, what exactly are format specifiers and what options do they provide?

Simply put, format specifiers allow you to tell Python how you would like expressions embedded in strings to be displayed.

Percentage Format and Other Types

For example, if you want a value to be displayed as a percentage you can specify that in the following way:

>>> asia_population = 4_647_000_000
>>> world_population = 7_807_000_000
>>> percent = asia_population / world_population
>>> f'Proportion of global population living in Asia: {percent:.0%}'
'Proportion of global population living in Asia: 60%'

What’s going on here? How has this formatting been achieved?

Well the first thing to note is the colon : directly after the variable percent embedded in the f-string. This colon tells Python that what follows is a format specifier which should be applied to that expression’s value.

The % symbol defines that the value should be treated as a percentage, and the .0 indicates the level of precision which should be used to display it. In this case the percentage has been rounded to the nearest whole number, but if .1 had been specified instead the value would have been rounded to one decimal place and displayed as 59.5%; using .2 would have resulted in 59.52% and so on.

If no format specifier had been included with the expression at all the value would have been displayed as 0.5952350454720123, which is far too precise!

(The % symbol applied in this context should not be confused with the % operator used in old-style string formatting syntax.)
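If you would like to check those precision values for yourself, a small loop along these lines will do it. The sketch relies on the fact that a replacement field can be nested inside a format specifier, which is how the precision is supplied here:

```python
percent = 4_647_000_000 / 7_807_000_000

# The inner {precision} is evaluated first, giving .0%, .1% and .2%
results = [f'{percent:.{precision}%}' for precision in range(3)]

for formatted in results:
    print(formatted)
# 60%
# 59.5%
# 59.52%
```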

Percentage is just the tip of the iceberg as far as type values are concerned. There is a range of other types that can be applied to integer and float values.

For example, you can display integers in binary, octal or hex formats using the b, o and x type values respectively:

>>> binary, octal, hexadecimal = [90, 90, 90]
>>> f'{binary:b} - {octal:o} - {hexadecimal:x}'
'1011010 - 132 - 5a'

For a full list of options see the link to the relevant area of the official Python documentation in the Further Reading section at the end of the article.
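In the meantime, here is a brief sketch of a few more type values and the # option, which adds a prefix to binary, octal and hex output (the variable names are my own):

```python
value = 90

prefixed_binary = f'{value:#b}'  # '#' adds the 0b prefix
prefixed_octal = f'{value:#o}'   # '#' adds the 0o prefix
upper_hex = f'{value:X}'         # uppercase hex digits
as_character = f'{value:c}'      # the character with code point 90
scientific = f'{1234.5678:e}'    # exponent notation, six decimal places

print(prefixed_binary, prefixed_octal, upper_hex, as_character, scientific)
# 0b1011010 0o132 5A Z 1.234568e+03
```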

Width Format, Alignment and Fill

Another handy format specification feature is the ability to define the minimum width that values should take up when they’re displayed in strings.

To illustrate how this works, if you were to print the elements of the list shown below in columns without format specification, you would get the following result:

>>> python, java, p_num, j_num = ["Python Users", "Java Users", 8.2, 7.5]
>>> print(f"|{python}|{java}|\n|{p_num}|{j_num}|")
|Python Users|Java Users|
|8.2|7.5|

Not great, but with the inclusion of some width values matters start to improve:

>>> print(f"|{python:16}|{java:16}|\n|{p_num:16}|{j_num:16}|")
|Python Users    |Java Users      |
|             8.2|             7.5|

As you can see, width is specified by adding a number after the colon.

The new output is better, but it seems a bit strange that the titles are aligned to the left while the numbers are aligned to the right. What could be causing this?

Well, it’s actually to do with Python’s default approach for different data types. String values are aligned to the left as standard, while numeric values are aligned to the right. (This might seem slightly odd, but it’s consistent with the approach taken by Microsoft Excel and other spreadsheet packages.)

Fortunately, you don’t have to settle for the default settings. If you want to change this behavior you can use one of the alignment options. For example, focusing on the first column only now for the sake of simplicity, if we want to align the number to the left this can be done by adding the < symbol before the p_num variable’s width value:

>>> print(f"|{python:16}|\n|{p_num:<16}|")
|Python Users    |
|8.2             |

And the reverse can just as easily be achieved by adding a > symbol in front of the width specifier associated with the title value:

>>> print(f"|{python:>16}|\n|{p_num:16}|")
|    Python Users|
|             8.2|

But what if you want the rows to be centered? Luckily, Python’s got you covered on that front too. All you need to do is use the ^ symbol instead:

>>> print(f"|{python:^16}|\n|{p_num:^16}|")
|  Python Users  |
|      8.2       |

Python’s default fill character is a space, and that’s what has so far been used when expanding the width of our values. We can use almost any character we like though. It just needs to be placed in front of the alignment option. For example, this is what the output looks like when an underscore is used to fill the additional space in the title row of our column:

>>> print(f"|{python:_^16}|\n|{p_num:^16}|")
|__Python Users__|
|      8.2       |

It’s worth noting that the same output can be achieved manually by using the str() function along with the appropriate string method (in this case str.center()):

>>> print("|", python.center(16, "_"), "|\n|", str(p_num).center(16), "|", sep="")
|__Python Users__|
|      8.2       |

But the f-string approach is much more succinct and typically faster to evaluate at run time.

Of course, outputting data formatted into rows and columns is just one example of how specifying width, alignment and fill characters can be used.

Also, in reality if you are looking to output a table of information you aren’t likely to be using a single print() statement. You will probably have several rows and columns to display, which may be constructed with a loop or comprehension, perhaps using str.join() to insert separators etc.

However, regardless of the application, in most instances using f-strings with format specifiers instead of taking a manual approach will result in more readable and efficient code.
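As a rough idea of what that might look like, the sketch below builds the same two-column table with generator expressions and str.join(). The variable names and the shared width value are assumptions made for the example:

```python
headers = ["Python Users", "Java Users"]
values = [8.2, 7.5]
width = 16  # minimum width of every column

# Center each cell, then join the cells with a pipe separator
header_row = "|" + "|".join(f"{header:^{width}}" for header in headers) + "|"
value_row = "|" + "|".join(f"{value:^{width}.1f}" for value in values) + "|"

table = "\n".join([header_row, value_row])
print(table)
# |  Python Users  |   Java Users   |
# |      8.2       |      7.5       |
```

Because the width is held in a single variable, changing the column size or adding more columns only requires a change to the data at the top.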

24-Hour Clock Display

As another example, let’s say we want to calculate what the time of day will be after a given number of hours and minutes has elapsed (starting at midnight):

>>> hours = 54
>>> minutes = 128
>>> quotient, minute = divmod(minutes, 60)
>>> hour = (hours + quotient) % 24
>>> f'{hour}:{minute}'
'8:8'

So far so good. Our program is correctly telling us that after 54 hours and 128 minutes the time of day will be 8 minutes past 8 in the morning, but the problem is that it’s not very easy to read. Confusion could arise about whether it’s actually 8 o’clock in the morning or evening and having a single digit to represent the number of minutes just looks odd.

To fix this we need to insert leading zeros when the hour or minute value is a single digit, which can be achieved using something called sign-aware zero padding. This sounds pretty complicated, but in essence we just need to use a 0 instead of one of the alignment values we saw earlier when defining the f-string, along with a width value of 2:

>>> f'{hour:02}:{minute:02}'
'08:08'

Hey presto! The time is now in a clear 24-hour clock format. This approach will work perfectly for times with double-digit hours and minutes as well, because the width value is a minimum and the zero padding will not be used if the value of either expression occupies the entire space:

>>> hours = 47
>>> minutes = 59
...
>>> f'{hour:02}:{minute:02}'
'23:59'
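The steps above could be wrapped in a small helper along the following lines. This is only a sketch and the function name clock_after is hypothetical. The last two lines show that the width never truncates longer values, and what "sign-aware" means in practice:

```python
def clock_after(hours, minutes):
    """Return the 24-hour clock time reached this long after midnight."""
    extra_hours, minute = divmod(minutes, 60)
    hour = (hours + extra_hours) % 24
    return f'{hour:02}:{minute:02}'


print(clock_after(54, 128))  # 08:08
print(clock_after(47, 59))   # 23:59

# Values wider than the specified width are displayed in full
print(f'{123:02}')  # 123

# The zeros are placed between the sign and the digits
print(f'{-5:03}')   # -05
```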

Grouping Options

The longer numbers get the harder they can be to read without thousand separators, and if you need to insert them this can be done using a grouping option:

>>> proxima_centauri = 40208000000000
>>> f'The closest star to our own is {proxima_centauri:,} km away.'
'The closest star to our own is 40,208,000,000,000 km away.'

You can also use an underscore as the separator if you prefer:

>>> f'The closest star to our own is {proxima_centauri:_} km away.'
'The closest star to our own is 40_208_000_000_000 km away.'
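Grouping options can be combined with the precision and type values we saw earlier. The sketch below uses some arbitrary sample numbers to show this, including the fact that the underscore separates binary and hex digits into groups of four:

```python
with_precision = f'{1234567.891:,.2f}'  # comma grouping, two decimal places
grouped_binary = f'{255:_b}'            # groups of four binary digits
grouped_hex = f'{0xDEADBEEF:_x}'        # groups of four hex digits

print(with_precision)  # 1,234,567.89
print(grouped_binary)  # 1111_1111
print(grouped_hex)     # dead_beef
```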

Putting It All Together

You probably won’t need to use a wide variety of format specification values with a single expression that often, but if you do want to put several together the order is important.

Staying with the astronomical theme, for demonstration purposes we’ll now show the distance between the Sun and Neptune in millions of kilometers:

>>> neptune = "Neptune"
>>> n_dist = 4_498_252_900 / 1_000_000
>>> print(f"|{neptune:^15}|\n|{n_dist:~^15,.1f}|")
|    Neptune    |
|~~~~4,498.3~~~~|

As you can see, reading from right to left we need to place the n_dist format specification values in the following order:

Type – f defines that the value should be displayed using fixed-point notation
Precision – .1 indicates that a single decimal place should be used
Grouping – , denotes that a comma should be used as the thousand separator
Width – 15 is set as the minimum number of characters
Align – ^ defines that the value should be centered
Fill – ~ indicates that a tilde should occupy any unused space

In general, format values that are not required can simply be omitted. However, if a fill value is specified without a corresponding alignment option a ValueError will be raised.
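To see the ordering and the fill rule in action, the following sketch assembles the Neptune specification from its individual parts and then tries the same thing with the alignment option left out. The variable names are my own:

```python
n_dist = 4_498_252_900 / 1_000_000

# Build the specification in the required order:
# fill, align, width, grouping, precision, type
fill, align, width, precision = '~', '^', 15, 1
spec = f'{fill}{align}{width},.{precision}f'
assembled = format(n_dist, spec)
print(spec)       # ~^15,.1f
print(assembled)  # ~~~~4,498.3~~~~

# Leaving out the alignment option while keeping the fill character fails
try:
    format(n_dist, '~15,.1f')
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)  # ValueError
```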

Final Thoughts and Further Reading

The examples shown in this article have been greatly simplified to demonstrate features in a straightforward way, but I hope they have provided some food for thought, enabling you to envisage ways that the Format Specification Mini-Language could be applied in real world projects.

Basic columns have been used to demonstrate aspects of format specification, and displaying tabular information as part of a command-line application is one example of the ways this kind of formatting could be employed.

If you want to work with and display larger volumes of data in table format though, you would do well to check out the excellent tools provided by the pandas library, which you can read about in these Finxter articles.

Also, if you would like to see the full list of available format specification values they can be found in this section of the official Python documentation.

The best way to really get the hang of how format specifiers work is to do some experimenting with them yourself. Give it a try – I’m sure you’ll have some fun along the way!

The post Python String Formatting: How to Become a String Wizard with the Format Specification Mini-Language appeared first on Be on the Right Side of Change.