How to Write Huge Amounts of Generated Data to a File in Python?

Problem formulation

Sometimes we need to generate massive amounts of data, for example, to perform bootstrapping or jackknifing of our actual data.

Other times we need lots of parameterized dummy data to learn how to use a new library, to adjust a model’s hyperparameters, to benchmark different solutions, or to debug and optimize our code.

random.seed(42)
my_bag_of_samples = [random.gauss(mu_sample, sigma_sample) for _ in range(LIST_SIZE)]

Generating this data is expensive, and we have to use a random seed to guarantee reproducibility.

Wouldn’t it be wiser to create this data once and store it for later consultation?

Like using a scratch draft that we won’t need to file but keep on hand until we discard it for good?

For our scenario, we will dynamically generate and save a single list of a million random floats (LIST_SIZE = 10 ** 6).

Since it is an easy data structure to represent in human-readable text, we can quickly dump it into a file. We will skip error checking, directory creation and deletion, and many other best practices for clarity. You’ll have to take my word for it or, better yet, get the code and recreate it locally. Play with it!
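As a taste of the housekeeping we are skipping, here is a minimal sketch of what basic error handling and directory creation could look like (the directory and file names are made up for the example):

```python
from pathlib import Path

# Hypothetical scratch directory for our throwaway files.
scratch_dir = Path("scratch")
scratch_dir.mkdir(exist_ok=True)  # create it only if it does not exist yet

target = scratch_dir / "dummy_data.txt"
try:
    with open(target, "w") as f:
        f.write("0.0,1.0,2.0,")
except OSError as e:
    # Disk full, missing permissions, invalid path... all end up here.
    print(f"Could not write {target}: {e}")
```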

For every following example, we’ll implicitly assume these imports and constants:

import random
import os

from finxter_tools import timeit

LIST_SIZE = 10 ** 6

Follow me, and we will see how this can be done simply and easily without resorting to third-party libraries. From here, you will have the basics to organize your workflow as it best suits your needs. 

First naive solution

We have never thought of dumping data to disk from our programs, and we have no idea how to do it.

And we like to complicate our lives instead of visiting finxter.com. We get down to work and find the built-in open() function, plus context managers and the with keyword to handle the hassle of opening and closing files. It’s a piece of cake!

@timeit
def generate_huge_list_naive_1():
    random.seed(42)
    # Smelly one-liner to take care of a possible existing file :D
    with open("huge_list_naive_1.txt", "w") as f: f.write("")
    for _ in range(LIST_SIZE):
        with open("huge_list_naive_1.txt", "a") as huge_list_naive_1:
        huge_list_naive_1.write(f"{random.gauss(0, 1)},")


generate_huge_list_naive_1()

It works! But it’s a little slow, isn’t it? A little over two minutes…

Execution time of generate_huge_list_naive_1: 132390 ms

Let’s check the file size:

print(f"{os.path.getsize('huge_list_naive_1.txt') / 2 ** 10:.2f} KB")
# 19172.63 KB

When we need to retrieve our data, we will need something like this to convert the string back into a list of floats:

with open("huge_list_naive_1.txt", "r") as f:
    loaded_huge_list = [float(i) for i in f.read()[:-1].split(',')]

print(loaded_huge_list[42])
# 0.11441985746092122

Not bad; it works. But we know we can do better.

Second naive solution

Okay. We have realized that we are opening and closing the file exactly 1,000,001 times.

The good thing is no one was around to see it. So we will take the context manager out of the for loop and open the file exactly once. And we don’t need to take care of a preexisting file because we’re opening it in write mode instead of append. Much better!

@timeit
def generate_huge_list_naive_2():
    random.seed(42)
    with open("huge_list_naive_2.txt", "w") as huge_list_naive_2:
        for _ in range(LIST_SIZE):
            huge_list_naive_2.write(f"{random.gauss(0, 1)},")


generate_huge_list_naive_2()

Much better: from over two minutes down to a little shy of 2 seconds!

Execution time of generate_huge_list_naive_2: 1766 ms

And the file size:

print(f"{os.path.getsize('huge_list_naive_2.txt') / 2 ** 10:.2f} KB")
# 19172.63 KB

To recover our data, we have to do the same as in our former solution:

with open("huge_list_naive_2.txt", "r") as f:
    loaded_huge_list = [float(i) for i in f.read()[:-1].split(',')]

print(loaded_huge_list[42])
# 0.11441985746092122

We know there is room for improvement. We want to generate a list of numbers and save it.

But we are writing to the file one small string at a time, appending a new number on each iteration. We’re calling write() a million times.
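The core idea of the fix is to build the whole text in memory first and call write() once. A minimal sketch of this batching, with a hypothetical filename and a small sample size:

```python
import random

random.seed(42)
values = [random.gauss(0, 1) for _ in range(1000)]  # small sample for illustration

# One join and one write() call instead of a thousand.
with open("joined_sample.txt", "w") as f:
    f.write(",".join(repr(v) for v in values))

# Reading back is a plain split, with no brackets to strip.
with open("joined_sample.txt") as f:
    restored = [float(x) for x in f.read().split(",")]

assert restored == values  # repr() round-trips floats exactly
```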

Third naive solution

With what we have learned, the time has come to get more Pythonic and optimize our code.

We are going to build our list with a list comprehension.

After converting it into a string (we prefer repr() over str() here, since the text is meant to be read back by the computer rather than by a human), we’ll save it to our file in a single operation:

@timeit
def generate_huge_list_naive_3():
    random.seed(42)
    with open("huge_list_naive_3.txt", "w") as huge_list_naive_3:
        huge_list_naive_3.write(repr([random.gauss(0, 1) for _ in range(LIST_SIZE)]))


generate_huge_list_naive_3()

Nailed it! We cut almost a third off the time:

Execution time of generate_huge_list_naive_3: 1312 ms

Reading the file now needs a slight change to get rid of the brackets (and the file is a little larger because of the list formatting: the spaces after the commas and the enclosing brackets):

print(f"{os.path.getsize('huge_list_naive_3.txt') / 2 ** 10:.2f} KB")
# 20149.20 KB

with open("huge_list_naive_3.txt", "r") as f:
    loaded_huge_list = [float(i) for i in f.read()[1:-1].split(',')]

print(loaded_huge_list[42])
# 0.11441985746092122

As far as we know, this can’t get any better.

Still, it makes sense to refactor our code and generate the list before opening the file.

If we turn the creation of the required type of list into a function, we will be able to adapt it to our needs and dump it to disk while minimizing the risk of introducing bugs.

Thus, we create a function to generate the data and another to save it to disk.

Refactored solution

This is our final solution – or is it?

@timeit
def generate_huge_list(size=LIST_SIZE, seed=42):
    random.seed(seed)
    return [random.gauss(0, 1) for _ in range(size)]


@timeit
def write_huge_list_plain(huge_list):
    with open("huge_list_plain.txt", "w") as f:
        f.write(repr(huge_list))


my_huge_list = generate_huge_list(LIST_SIZE)
write_huge_list_plain(my_huge_list)

It makes sense: the sum of the partial times is close enough to that of the former solution, and the rest stays the same:

Execution time of generate_huge_list: 563 ms
Execution time of write_huge_list_plain: 750 ms

print(f"{os.path.getsize('huge_list_plain.txt') / 2 ** 10:.2f} KB")
# 20149.20 KB

with open("huge_list_plain.txt", "r") as f:
    loaded_huge_list = [float(i) for i in f.read()[1:-1].split(',')]

print(loaded_huge_list[42])
# 0.11441985746092122

This works very well for our need to store a massive list of numbers, and we can set it up without much difficulty for any type of list with a homogeneous data type.

But what if we need to store a dictionary? Or a series of nested structures? Or instances of classes? Or generators that are already half exhausted?

Python must have some way to achieve this, and it must be much simpler than adjusting how we parse the string for each case. A bit of research turns up a couple of straightforward ways to store more complex objects.

Depending on our needs, we will choose one or the other. We’ll reuse the generate_huge_list function.

JSON solution

First, the json module. It allows us to save and load most of our data in a human-readable text format that is safe from malicious code and easily interchangeable between programming languages:

import json


@timeit
def generate_huge_list(size=LIST_SIZE, seed=42):
    random.seed(seed)
    return [random.gauss(0, 1) for _ in range(size)]


@timeit
def write_huge_list_json(huge_list):
    with open("huge_list.json", "w") as f:
        json.dump(huge_list, f)


my_huge_list = generate_huge_list(LIST_SIZE)
write_huge_list_json(my_huge_list)

We reuse the same number generator function we used before. This way, we can compare the performance of the dump-to-disk code objectively between the different implementations. Neat!

Execution time of generate_huge_list: 563 ms
Execution time of write_huge_list_json: 1765 ms

print(f"{os.path.getsize('huge_list.json') / 2 ** 10:.2f} KB")
# 20149.20 KB

The JSON execution time looks pretty high compared to the direct text dump… Did we do something wrong?

The answer is “no.”

Simplifying, the extended runtime is due to the serialization process of the objects we want to store in the file.

Internally, the json module has to convert the binary structures used by the computer into readable text, an expensive process with numerous options and sanity checks.

In this case, the resulting file is character-by-character identical to the text dump of the plain-text implementation, so we could use JSON to retrieve a list recorded with that implementation without further trouble.

But, unlike the basic text implementation, JSON would allow us to record much more complex objects and retrieve them directly, without the need for manual tinkering with the retrieved text string.

Reimporting the data into memory is direct; this is where clarity and speed are gained:

with open("huge_list.json", "r") as f:
    loaded_huge_list = json.load(f)

print(loaded_huge_list[42])
# 0.11441985746092122
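As an illustration of those more complex objects, a nested structure round-trips through JSON with no manual parsing at all (the record below is invented for the example):

```python
import json

# A hypothetical nested record mixing dictionaries, lists, and scalars.
experiment = {
    "name": "bootstrap-demo",
    "params": {"mu": 0.0, "sigma": 1.0},
    "samples": [0.25, -1.3, 0.7],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f)

with open("experiment.json") as f:
    restored = json.load(f)

assert restored == experiment  # same structure and values, no tinkering
```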

We still have an ace up our sleeve: pickle

Pickle solution

Python wouldn’t be Python if there weren’t even more ways to do things correctly. Oversimplifying again: why not dump the content we want directly from memory to a file?

We just need to serialize it (convert it from its in-memory representation into a stream of bytes). That’s what the pickle module does.

  • It has the great advantage of storing virtually any object, no matter how exotic, quickly and compactly.
  • It has disadvantages: it is not compatible with other formats, it can execute malicious code from untrusted sources, and it is unreadable by humans.

import pickle


@timeit
def generate_huge_list(size=LIST_SIZE, seed=42):
    random.seed(seed)
    return [random.gauss(0, 1) for _ in range(size)]


@timeit
def write_huge_list_pickle(huge_list):
    with open("huge_list.pickle", "wb") as f:
        pickle.dump(huge_list, f, protocol=-1)


my_huge_list = generate_huge_list(LIST_SIZE)
write_huge_list_pickle(my_huge_list)

As we did with JSON, we reuse the generator function.

Execution time of generate_huge_list: 563 ms
Execution time of write_huge_list_pickle: 16 ms

What? Less than two-hundredths of a second? Let’s look at the file.

print(f"{os.path.getsize('huge_list.pickle') / 2 ** 10:.2f} KB")
# 8792.23 KB

Less than half the size… Let’s check if we can recover the information.

with open("huge_list.pickle", "rb") as f:
    loaded_huge_list = pickle.load(f)

print(loaded_huge_list[42])
# 0.11441985746092122

Surprising. Versatile, lightning-fast, compact, and straightforward to use – what more could we want?
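To see that versatility in action, here is a sketch that pickles an instance of a throwaway class, something neither our plain-text dump nor JSON could handle directly (the class is made up for the example):

```python
import pickle


class Sample:
    """A toy container that json.dump would reject out of the box."""

    def __init__(self, mu, sigma, data):
        self.mu = mu
        self.sigma = sigma
        self.data = data


original = Sample(0.0, 1.0, [0.5, -0.5])

with open("sample.pickle", "wb") as f:
    pickle.dump(original, f)

with open("sample.pickle", "rb") as f:
    restored = pickle.load(f)

# The instance comes back with its attributes intact.
assert restored.mu == original.mu and restored.data == original.data
```

Note that unpickling needs the class definition to be importable; within a single script, as here, that is automatic.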

Conclusions

We have seen three basic ways to save our generated data to files on disk for temporary use.

There are other more suitable ways to save more complex information (arrays, dataframes, databases, etc.) in a persistent and organized manner.

Still, the three we have seen today require only the Python standard library and are perfectly suitable for saving our data to disk without any frills.

Plain text is perfect for storing text strings that we will use, such as word lists, email bodies, literary texts, etc.
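For that kind of string data, a natural layout is one item per line; a small sketch with an invented word list:

```python
words = ["alpha", "beta", "gamma"]  # invented sample data

# One item per line is trivial to write...
with open("word_list.txt", "w") as f:
    f.write("\n".join(words))

# ...and just as trivial to read back.
with open("word_list.txt") as f:
    restored = f.read().splitlines()

assert restored == words
```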

JSON is the ideal solution for storing standardized structures such as lists and dictionaries in a universal and interoperable language. We can view its content without any problem in a web browser and use it with any programming language.

If necessary, we can even modify it in a simple text editor. And third-party JSON implementations exist that are much faster, more resilient to somewhat heterodox representations, and more powerful.

Although it is common to come across negative comments about pickle (in my opinion, completely disproportionate), it is the perfect solution for dumping and reusing data that we generate and reuse locally. There is no more straightforward and efficient solution than pickle. Period.

Here is what the official Python documentation says about the comparison between JSON and pickle:


There are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation):

  • JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
  • JSON is human-readable, while pickle is not;
  • JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
  • JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
  • Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
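The fourth point has an escape hatch worth knowing: json.dump and json.dumps accept a default callable that converts otherwise unsupported objects into something JSON understands. A minimal sketch (the class and the conversion are made up):

```python
import json


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def encode(obj):
    # Called by json only for objects it cannot serialize natively.
    if isinstance(obj, Point):
        return {"x": obj.x, "y": obj.y}  # a plain dict JSON understands
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


# Without default=, json.dumps would raise TypeError on the Point.
text = json.dumps({"origin": Point(0, 0)}, default=encode)
print(text)  # {"origin": {"x": 0, "y": 0}}
```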

Here is the complete code. You can run it, and you will get an exciting output.

import random
import os
import json
import pickle

from finxter_tools import timeit

LIST_SIZE = 10 ** 6


@timeit
def generate_huge_list_naive_1():
    random.seed(42)
    # Smelly one-liner to erase the existing file :D
    with open("huge_list_naive_1.txt", "w") as f:
        f.write("")
    for _ in range(LIST_SIZE):
        with open("huge_list_naive_1.txt", "a") as f:
            f.write(f"{random.gauss(0, 1)},")


@timeit
def generate_huge_list_naive_2():
    random.seed(42)
    with open("huge_list_naive_2.txt", "w") as f:
        for _ in range(LIST_SIZE):
            f.write(f"{random.gauss(0, 1)},")


@timeit
def generate_huge_list_naive_3():
    random.seed(42)
    with open("huge_list_naive_3.txt", "w") as f:
        f.write(repr([random.gauss(0, 1) for _ in range(LIST_SIZE)]))


@timeit
def generate_huge_list(size=LIST_SIZE, seed=42):
    random.seed(seed)
    return [random.gauss(0, 1) for _ in range(size)]


@timeit
def write_huge_list_plain(huge_list):
    with open("huge_list_plain.txt", "w") as f:
        f.write(repr(huge_list))


@timeit
def write_huge_list_json(huge_list):
    with open("huge_list.json", "w") as f:
        json.dump(huge_list, f)


@timeit
def write_huge_list_pickle(huge_list):
    with open("huge_list.pickle", "wb") as f:
        pickle.dump(huge_list, f, protocol=-1)


# Generate the files.
print("\nExecuting alternatives:")
generate_huge_list_naive_1()
generate_huge_list_naive_2()
generate_huge_list_naive_3()
my_huge_list = generate_huge_list(LIST_SIZE)
write_huge_list_plain(my_huge_list)
write_huge_list_json(my_huge_list)
write_huge_list_pickle(my_huge_list)

# Print computed times and file sizes.
print("\nResulting file sizes:")
print(f" · Naive (1):  {os.path.getsize('huge_list_naive_1.txt') / 2 ** 10:.2f} KB")
print(f" · Naive (2):  {os.path.getsize('huge_list_naive_2.txt') / 2 ** 10:.2f} KB")
print(f" · Naive (3):  {os.path.getsize('huge_list_naive_3.txt') / 2 ** 10:.2f} KB")
print(f" · Plain text: {os.path.getsize('huge_list_plain.txt') / 2 ** 10:.2f} KB")
print(f" · JSON:       {os.path.getsize('huge_list.json') / 2 ** 10:.2f} KB")
print(f" · pickle:     {os.path.getsize('huge_list.pickle') / 2 ** 10:.2f} KB")

# Check if the contents are the same
files = [
    ("huge_list_naive_1.txt", "r", "[float(i) for i in f.read()[:-1].split(',')]"),
    ("huge_list_naive_2.txt", "r", "[float(i) for i in f.read()[:-1].split(',')]"),
    ("huge_list_naive_3.txt", "r", "[float(i) for i in f.read()[1:-1].split(',')]"),
    ("huge_list_plain.txt", "r", "[float(i) for i in f.read()[1:-1].split(',')]"),
    ("huge_list.json", "r", "json.load(f)"),
    ("huge_list.pickle", "rb", "pickle.load(f)"),
]
print("\nChecking if randomly selected contents are equal:")
index = random.randint(0, LIST_SIZE - 2)
for file, mode, command in files:
    with open(file, mode) as f:
        huge_list = eval(command)
        print(f"{file:>24}: {huge_list[index: index + 2]}")
print()

# For benchmarking, not truly testing :)
# Remove the triple quotes around the following block to enable the benchmarks.
# LIST_SIZE <= 10 ** 6 recommended.
# pytest-benchmark needed: https://pypi.org/project/pytest-benchmark/
#
# $> pytest filename.py

"""
def test_generate_huge_list_naive_1(benchmark):
    benchmark(generate_huge_list_naive_1)


def test_generate_huge_list_naive_2(benchmark):
    benchmark(generate_huge_list_naive_2)


def test_generate_huge_list_naive_3(benchmark):
    benchmark(generate_huge_list_naive_3)


def test_generate_huge_list(benchmark):
    benchmark(generate_huge_list, LIST_SIZE)


def test_write_huge_list_plain(benchmark):
    benchmark(write_huge_list_plain, generate_huge_list(LIST_SIZE))


def test_write_huge_list_json(benchmark):
    benchmark(write_huge_list_json, generate_huge_list(LIST_SIZE))


def test_write_huge_list_pickle(benchmark):
    benchmark(write_huge_list_pickle, generate_huge_list(LIST_SIZE))
"""

And the auxiliary finxter_tools.py with the timeit decorator:

from functools import wraps
from time import process_time


def timeit(func):
    @wraps(func)
    def chronometer(*args, **kwargs):
        start = int(round(process_time() * 1000))
        try:
            return func(*args, **kwargs)
        finally:
            stop = int(round(process_time() * 1000)) - start
            print(f"  Execution time of {func.__name__}: {max(stop, 0)} ms")

    return chronometer

Benchmarking results

Here are the results of benchmarking with pytest-benchmark.

The code and instructions to run the benchmarks are already in place above.