Python cProfile

Your Python app is slow? It’s time for a speed booster! Learn how in this tutorial.


Performance Tuning Concepts 101

I could have started this tutorial with a list of tools you can use to speed up your app. But I feel that this would do more harm than good because you’d spend a lot of time setting up the tools and very little time optimizing your performance.

Instead, I’ll take a different approach and address the critical concepts of performance tuning first.

So, what’s more important than any one tool for performance optimization?

You must understand the universal concepts of performance tuning first.

The good thing is that you’ll be able to apply those concepts in any language and in any application.

The bad thing is that you must change your expectations a bit: I won’t provide you with a magic tool that speeds up your program at the push of a button.

Let’s start with the following list of the most important things to consider when you think you need to optimize your app’s performance:

Premature Optimization Is The Root Of All Evil

Premature optimization is one of the main problems of badly written code. But what is it anyway?

Definition: Premature optimization is the act of spending valuable resources (time, effort, lines of code, simplicity) to optimize code that doesn’t need to get optimized.

There’s no problem with optimized code per se. The problem is just that there’s no such thing as a free lunch. When you optimize a code snippet, what you’re really doing is trading one variable (e.g. complexity) against another variable (e.g. performance). An example of such an optimization is adding a cache to avoid computing things repeatedly.

The problem is that if you’re doing it blindly, you may not even realize the harm you’re doing. For example, adding 50% more lines of code just to improve execution speed by 0.1% would be a trade-off that will screw up your whole software development process when done repeatedly.

But don’t take my word for it. This is what one of the most famous computer scientists of all time, Donald Knuth, says about premature optimization:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97 % of the time: premature optimization is the root of all evil.
Donald Knuth

A good heuristic is to write the most readable code by default. If this leads to an interactive application that’s already fast enough, good. If users of your application start complaining about speed, then take a structured approach to performance optimization, as described in this tutorial.

Action steps:

Make your code as readable and concise as you can.
Use comments and follow the coding standards (e.g. PEP8 in Python).
Ship your application and do user testing.
Is your application too slow? Really? Okay, then do the following:
Jot down the current performance of your app in seconds if you want to optimize for speed or bytes if you want to optimize for memory.
Do not cross this line until you’ve checked off the previous point.
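To make the "jot down the current performance" step concrete, here is a minimal sketch that records both seconds and peak bytes for one call. The helper measure() and the workload build_squares() are hypothetical names used only for illustration; swap in the entry point of your own app. Note that tracemalloc itself slows the code down, so for precise timings you may prefer to measure time and memory in separate runs.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args) once and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak_bytes


def build_squares(n):
    # Hypothetical workload standing in for "your app".
    return [i * i for i in range(n)]


result, seconds, peak_bytes = measure(build_squares, 100_000)
print(f"time: {seconds:.4f} s, peak memory: {peak_bytes} bytes")
```

Write these two numbers down before you change anything; they are the baseline every later change is compared against.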

Measure First, Improve Second

What you measure gets improved. The contrary also holds: what you don’t measure doesn’t get improved.

This principle is a direct consequence of the first principle: “premature optimization is the root of all evil”. Why? Because if you do premature optimization, you optimize before you measure. But you should only optimize after you have started your measurements. There’s no point in “improving” runtime if you don’t know the level you’re improving from. Maybe your optimization actually increased runtime? Maybe it had no effect at all? You cannot know unless you start every optimization attempt with a clear benchmark.

The consequence is to start with the most straightforward, naive (“dumb”) code that’s also easy to read. This is your benchmark. Any optimization or improvement idea must improve upon this benchmark. As soon as you’ve proven—by rigorous measurement—that your optimization improves your benchmark by X% in performance (memory footprint or speed), this becomes your new benchmark.

This way, you’re guaranteed to improve the performance of your code over time. And you can document, prove, and defend any optimization to your boss, your peer group, or even the scientific community.

Action steps:

You start with the naive solution that’s easy to read. Mostly, the naive solution is very easy to read.
You take the naive solution as benchmark by measuring its performance rigorously.
You document your measurements in a Google Spreadsheet (okay, you can also use Excel).
You come up with alternative code and measure its performance against the benchmark.
If the new code is better (faster, more memory efficient) than the old benchmark, the new code becomes the new benchmark. All subsequent improvements have to beat the new benchmark (otherwise, you throw them away).
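The benchmark loop described in these steps can be sketched with the standard timeit module. The two functions join_naive() and join_candidate() are made-up examples, not code from this tutorial; the point is only the workflow: check that both versions return the same result, time both, and keep the candidate only if it wins.

```python
import timeit


def join_naive(words):
    # Benchmark: the readable, straightforward version.
    text = ""
    for w in words:
        text += w
    return text


def join_candidate(words):
    # Candidate: has to beat the benchmark to replace it.
    return "".join(words)


words = ["spam"] * 10_000
assert join_naive(words) == join_candidate(words)  # same result first

# Taking the best of several repeats reduces noise from other processes.
t_bench = min(timeit.repeat(lambda: join_naive(words), number=20, repeat=5))
t_cand = min(timeit.repeat(lambda: join_candidate(words), number=20, repeat=5))

print(f"benchmark: {t_bench:.4f} s, candidate: {t_cand:.4f} s")
if t_cand < t_bench:
    print("candidate becomes the new benchmark")
else:
    print("discard the candidate")
```

The printed numbers are what you would copy into your spreadsheet.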

Pareto Is King

I know it’s not big news: the 80/20 Pareto principle—named after Italian economist Vilfredo Pareto—is alive and well in performance optimization.

To exemplify this, have a look at my current CPU usage as I’m writing this:

If you plot this in Python, you see the following Pareto-like distribution:

Here’s the code that produces this output:

import matplotlib.pyplot as plt

labels = ['Cortana', 'Search', 'Explorer', 'System',
          'Desktop', 'Runtime', 'Snipping', 'Firefox',
          'Task', 'Dienst', 'Kaspersky', 'Dienst2', 'CTF', 'Dienst3']

cpu = [8.3, 6.1, 4.6, 3.8, 2.2, 1.5, 1.4, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.3]

plt.barh(labels, cpu)
plt.xlabel('Percentage')
plt.savefig('screenshot_performance.jpg')
plt.show()

20% of the code requires 80% of the CPU usage (okay, I haven’t really checked if the numbers match but you get the point).
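If you do want to check the numbers, a few lines are enough. This is a small sketch using the cpu list from the listing above; it asks how much of the measured load the top 20% of tasks cause. For this data the share is roughly 60% rather than 80%, so the distribution is skewed but not a textbook 80/20 split.

```python
cpu = [8.3, 6.1, 4.6, 3.8, 2.2, 1.5, 1.4, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.3]

ranked = sorted(cpu, reverse=True)
top_n = round(0.2 * len(ranked))  # top 20% of 14 tasks, rounded to 3
share = sum(ranked[:top_n]) / sum(ranked)

print(f"top {top_n} of {len(ranked)} tasks cause {share:.0%} of the measured CPU load")
```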

If I wanted to reduce CPU usage on my computer, I would just need to close Cortana and Search, and a significant portion of the CPU load would be gone.

The interesting observation is that even after removing the two most expensive tasks, the plot looks just the same. Now there are two new most expensive tasks: Explorer and System.

This leads us to the ABC of performance tuning:

Performance optimization is fractal. As soon as you’re done removing the bottleneck, there’s a new bottleneck lurking around. You “just” need to repeatedly remove the bottleneck to get maximal “bang for your buck”.

Action Steps:

Follow the algorithm.
Identify the bottleneck (= the function with highest negative impact on your performance).
Fix the bottleneck.
Repeat.

Algorithmic Optimization Wins

At this point, you’ve already figured out that you need to optimize your code. You have direct user feedback that your application is too slow. Or you have a strong signal (e.g. through Google Analytics) that your slow web app causes a higher than usual bounce rate etc.

You also know where you are now (in seconds or bytes) and where you want to go (in seconds or bytes).

You also know the bottleneck. (This is where the performance profiling tools discussed below come into play.)

Now, you need to figure out how to overcome the bottleneck. The best leverage point for you as a coder is to tune the algorithms and data structures.

Say you’re working on a financial application. You know your bottleneck is the function calculate_ROI() that goes over all combinations of potential buying and selling points to calculate the maximum profit (the naive solution). As this is the bottleneck of the whole application, your first task is to find a better algorithm. Fortunately, you find the maximum profit algorithm. The computational complexity drops from O(n**2) to O(n log n).

(If this particular topic interests you, start reading this SO article.)
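To illustrate the kind of gain an algorithmic change brings, here is a hedged sketch of the maximum profit problem for a single buy followed by a single sell. The function names are hypothetical and this is not the calculate_ROI() of any real application. The naive version tries all pairs in O(n**2); the second version is one well-known single-pass variant that tracks the lowest price seen so far and runs in O(n), which is even better than the O(n log n) mentioned above.

```python
def max_profit_naive(prices):
    """O(n**2): try every buy/sell pair with the buy before the sell."""
    best = 0
    for i in range(len(prices)):
        for j in range(i + 1, len(prices)):
            best = max(best, prices[j] - prices[i])
    return best


def max_profit_single_pass(prices):
    """O(n): remember the lowest price seen so far."""
    best = 0
    lowest = float("inf")
    for price in prices:
        lowest = min(lowest, price)
        best = max(best, price - lowest)
    return best


prices = [7, 1, 5, 3, 6, 4]
print(max_profit_naive(prices), max_profit_single_pass(prices))  # 5 5
```

Both functions return 5 here (buy at 1, sell at 6), but only the second one stays fast when the price list grows to millions of entries.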

Action steps:

Look at your current bottleneck function.
Can you improve its data structures? Often, there’s low-hanging fruit: use sets instead of lists (e.g., checking membership is much faster for sets than for lists), or dictionaries instead of collections of tuples.
Can you find better algorithms that are already proven? Can you tweak existing algorithms for your specific problem at hand?
Spend a lot of time researching these questions. It pays off. You’ll become a better computer scientist in the process. And it’s your bottleneck after all—so it’s a huge leverage point for your application.
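The set-versus-list claim is easy to verify yourself. The following sketch times a membership test for the worst case of a list (the element sits at the very end); the variable names are arbitrary.

```python
import timeit

items_list = list(range(100_000))
items_set = set(items_list)
needle = 99_999  # worst case for the list: the last element

# A list is scanned element by element; a set uses a hash lookup.
t_list = timeit.timeit(lambda: needle in items_list, number=100)
t_set = timeit.timeit(lambda: needle in items_set, number=100)

print(f"list: {t_list:.5f} s, set: {t_set:.5f} s")
```

On typical hardware the set lookup is faster by several orders of magnitude, and the gap widens as the collection grows.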

All Hail to the Cache

Have you checked off all previous boxes? You know exactly where you are and where you want to go. You know what bottleneck to optimize. You know about alternative algorithms and data structures.

Here’s a quick and dirty trick that works surprisingly well for a large variety of applications. Improving performance often means removing unnecessary computations. One low-hanging fruit is to store the results of computations you have already performed in a cache.

How can you create a cache in practice? In Python, it’s as simple as creating a dictionary where you associate each function input (e.g. as an input string) with the function output.

You can then ask the cache to give you the computations you’ve already performed.

A simple example of an effective use of caching (sometimes called memoization) is the Fibonacci algorithm:

def fib2(n):
    # Naive recursion without a cache
    if n < 2:
        return n
    return fib2(n-1) + fib2(n-2)


print(fib2(20))
# 6765

The problem is that the function calls fib2(n-1) and fib2(n-2) calculate largely the same things. For instance, both separately calculate the Fibonacci value fib2(n-3). This adds up!

But with caching, you can simply remember the results of previous computations so that the result for fib2(n-3) is calculated only once. All other times, you can pull the result from the cache and get an instant result.

Here’s the caching variant of Python Fibonacci:

cache = dict()

def fib(n):
    # Return the stored result if this input was computed before
    if n in cache:
        return cache[n]
    if n < 2:
        return n
    fib_n = fib(n-1) + fib(n-2)
    cache[n] = fib_n
    return fib_n


print(fib(100))
# 354224848179261915075

You store the result of the computation fib(n-1) + fib(n-2) in the cache. If you already have the result of the n-th Fibonacci number, you simply pull it from the cache rather than recalculating it again and again.

Here’s the surprising speed improvement—just by using a simple cache:

import time

t1 = time.time()
print(fib2(40))
t2 = time.time()
print(fib(40))
t3 = time.time()

print("Fibonacci without cache: " + str(t2-t1))
print("Fibonacci with cache: " + str(t3-t2))


''' OUTPUT:
102334155
102334155
Fibonacci without cache: 31.577041387557983
Fibonacci with cache: 0.015461206436157227
'''

There are two basic strategies you can use:

Perform computations in advance (“offline”) and store their results in the cache. This is a great strategy for web applications where you can fill up a large cache once (or once a day) and then simply serve the result of your precomputations to the users. For them, your calculations “feel” blazingly fast. But in reality, you just serve them precalculated values. Google Maps heavily uses this trick to speed up shortest path computations.
Perform computations as they appear (“online”) and store their results in the cache. This reactive form is the most basic and simplest form of caching where you don’t need to decide which computations to perform in advance.

In both cases, the more computations you store, the higher the likelihood of “cache hits” where the computation can be returned immediately. But as you usually have a memory limit (e.g. 100,000 cache entries), you need to decide about a sensible cache replacement policy.
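You don’t have to implement a replacement policy by hand. As one possible approach, Python’s standard library ships functools.lru_cache, which caches results “online” and evicts the least recently used entry once maxsize is reached. The limit of 128 below is an arbitrary choice for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # keep at most 128 results, evict the least recently used
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(100))
# 354224848179261915075
print(fib.cache_info())  # reports hits, misses, maxsize, and current size
```

The cache_info() output tells you how many calls were served from the cache, which helps you judge whether the cache size is sensible.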

Action steps:

Think: How can you reduce redundant computations? Would caching be a sensible approach?
What type of data / computations do you cache?
What’s the size of your cache?
Which entries to remove if the cache is full?
If you have a web application, can you reuse computations of previous users to compute the result of your current user?

Less is More

Your problem is too hard? Make it easier!

Yes, it’s obvious. But then again, so many coders are too perfectionistic about their code. They accept huge complexity and computational overhead—just for this small additional feature that often doesn’t even get recognized by users.

A powerful “trick” for performance optimization is to seek out easier problems. Instead of spending your effort optimizing, it’s often much better to get rid of complexity, unnecessary features, computations, and data. Use heuristics rather than optimal algorithms wherever possible. You often pay for perfect results with a 10x slowdown in performance.

So ask yourself this: what is your current bottleneck function really doing? Is it really worth the effort? Can you remove the feature or offer a down-sized version? If the feature is used by 1% of your users but 100% perceive the increased latency, it may be time for some minimalism!

Action steps:

Can you remove your current bottleneck altogether by just skipping the feature?
Can you simplify the problem?
Think 80/20: get rid of one expensive feature to add 10 non-expensive ones.
Think opportunity costs: omit one important feature so that you can pursue a very important feature.

Know When to Stop

It’s easy to do but it’s also easy not to do: stop!

Performance optimization can be one of the most time-intensive things to do as a coder. There’s always room for improvement. You can always tweak and improve. But the effort needed to improve your performance by X grows superlinearly or even exponentially with X. At some point, further improvement is just a waste of your time.

Action step:

Ask yourself constantly: is it really worth the effort to keep optimizing?

Python Profilers

Python comes with different profilers. If you’re new to performance optimization, you may ask: what’s a profiler anyway?

A performance profiler allows you to monitor your application more closely. If you just run a Python script in your shell, you see nothing but the output produced by your program. But you don’t see how many bytes were consumed by your program. You don’t see how long each function runs. You don’t see the data structures that caused most memory overhead.

Without those things, you cannot know what the bottleneck of your application is. And, as you’ve already learned above, you cannot possibly start optimizing your code. Why? Because otherwise you’d be guilty of “premature optimization”, one of the deadly sins in programming.

Instrumenting profilers insert special code at the beginning and end of each routine to record when the routine starts and when it exits. With this information, the profiler aims to measure the actual time taken by the routine on each call. This type of profiler may also record which other routines are called from a routine. It can then display the time for the entire routine and also break it down into time spent locally and time spent on each call to another routine.
Fundamentals Profiling
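To make the idea of instrumentation concrete, here is a toy sketch that records the entry and exit of a routine with a decorator. The names instrument, stats, and slow_sum are invented for this example; real profilers such as cProfile hook into the interpreter instead of wrapping each function by hand, but the bookkeeping (call counts and elapsed time per routine) follows the same principle.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function bookkeeping: number of calls and total elapsed seconds.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def instrument(func):
    """Record call count and elapsed time around every call of func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # the routine starts
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]  # the routine exits
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@instrument
def slow_sum(n):
    return sum(i * i for i in range(n))


for _ in range(3):
    slow_sum(10_000)

print(dict(stats))
```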

Fortunately, there are a lot of profilers. In the remainder of this article, I’ll give you an overview of the most important profilers in Python and how to use them. Each comes with a reference for further reading.

The most popular Python profiler is called cProfile. You can import it much like any other library by using the statement:

import cProfile

A simple statement but nonetheless a powerful tool in your toolbox.

Let’s write a Python script that you can profile. Say you come up with this (very) raw Python script, which finds 100 random prime numbers between 2 and 1000 and which you want to optimize:

import random


def guess():
    ''' Returns a random number '''
    return random.randint(2, 1000)


def is_prime(x):
    ''' Checks whether x is prime '''
    for i in range(x):
        for j in range(x):
            if i * j == x:
                return False
    return True


def find_primes(num):
    primes = []
    for i in range(num):
        p = guess()
        while not is_prime(p):
            p = guess()
        primes += [p]
    return primes


print(find_primes(100))
'''
[733, 379, 97, 557, 773, 257, 3, 443, 13, 547, 839, 881, 997,
431, 7, 397, 911, 911, 563, 443, 877, 269, 947, 347, 431, 673,
467, 853, 163, 443, 541, 137, 229, 941, 739, 709, 251, 673, 613,
23, 307, 61, 647, 191, 887, 827, 277, 389, 613, 877, 109, 227,
701, 647, 599, 787, 139, 937, 311, 617, 233, 71, 929, 857, 599,
2, 139, 761, 389, 2, 523, 199, 653, 577, 211, 601, 617, 419, 241,
179, 233, 443, 271, 193, 839, 401, 673, 389, 433, 607, 2, 389,
571, 593, 877, 967, 131, 47, 97, 443]
'''

The program is slow (and you sense that there are many possible optimizations). But where to start?

As you’ve already learned, you need to know the bottleneck of your script. Let’s use the cProfile module to find it! The only thing you need to do is to add the following two lines to your script:

import cProfile
cProfile.run('print(find_primes(100))')

It’s really that simple. First, you write your script. Second, you call the cProfile.run() function to analyze its performance. Of course, you need to replace the statement with the specific code you want to analyze. For example, if you want to profile a function f42(), you pass the statement as a string: cProfile.run('f42()').

Here’s the output of the previous code snippet (don’t panic yet):

[157, 773, 457, 317, 251, 719, 227, 311, 167, 313, 521, 307, 367, 827, 317, 443, 359, 443, 887, 241, 419, 103, 281, 151, 397, 433, 733, 401, 881, 491, 19, 401, 661, 151, 467, 677, 719, 337, 673, 367, 53, 383, 83, 463, 269, 499, 149, 619, 101, 743, 181, 269, 691, 193, 7, 883, 449, 131, 311, 547, 809, 619, 97, 997, 73, 13, 571, 331, 37, 7, 229, 277, 829, 571, 797, 101, 337, 5, 17, 283, 449, 31, 709, 449, 521, 821, 547, 739, 113, 599, 139, 283, 317, 373, 719, 977, 373, 991, 137, 797]
         3908 function calls in 1.614 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.614    1.614 <string>:1(<module>)
      535    1.540    0.003    1.540    0.003 code.py:10(is_prime)
        1    0.000    0.000    1.542    1.542 code.py:19(find_primes)
      535    0.000    0.000    0.001    0.000 code.py:5(guess)
      535    0.000    0.000    0.001    0.000 random.py:174(randrange)
      535    0.000    0.000    0.001    0.000 random.py:218(randint)
      535    0.000    0.000    0.001    0.000 random.py:224(_randbelow)
       21    0.000    0.000    0.000    0.000 rpc.py:154(debug)
        3    0.000    0.000    0.072    0.024 rpc.py:217(remotecall)
        3    0.000    0.000    0.000    0.000 rpc.py:227(asynccall)
        3    0.000    0.000    0.072    0.024 rpc.py:247(asyncreturn)
        3    0.000    0.000    0.000    0.000 rpc.py:253(decoderesponse)
        3    0.000    0.000    0.072    0.024 rpc.py:291(getresponse)
        3    0.000    0.000    0.000    0.000 rpc.py:299(_proxify)
        3    0.000    0.000    0.072    0.024 rpc.py:307(_getresponse)
        3    0.000    0.000    0.000    0.000 rpc.py:329(newseq)
        3    0.000    0.000    0.000    0.000 rpc.py:333(putmessage)
        2    0.000    0.000    0.047    0.023 rpc.py:560(__getattr__)
        3    0.000    0.000    0.000    0.000 rpc.py:57(dumps)
        1    0.000    0.000    0.047    0.047 rpc.py:578(__getmethods)
        2    0.000    0.000    0.000    0.000 rpc.py:602(__init__)
        2    0.000    0.000    0.026    0.013 rpc.py:607(__call__)
        2    0.000    0.000    0.072    0.036 run.py:354(write)
        6    0.000    0.000    0.000    0.000 threading.py:1206(current_thread)
        3    0.000    0.000    0.000    0.000 threading.py:216(__init__)
        3    0.000    0.000    0.072    0.024 threading.py:264(wait)
        3    0.000    0.000    0.000    0.000 threading.py:75(RLock)
        3    0.000    0.000    0.000    0.000 {built-in method _struct.pack}
        3    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
        6    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    1.614    1.614 {built-in method builtins.exec}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        9    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.072    0.072 {built-in method builtins.print}
        3    0.000    0.000    0.000    0.000 {built-in method select.select}
        3    0.000    0.000    0.000    0.000 {method '_acquire_restore' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method '_is_owned' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method '_release_save' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.RLock' objects}
        6    0.071    0.012    0.071    0.012 {method 'acquire' of '_thread.lock' objects}
        3    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
      535    0.000    0.000    0.000    0.000 {method 'bit_length' of 'int' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        3    0.000    0.000    0.000    0.000 {method 'dump' of '_pickle.Pickler' objects}
        2    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
      553    0.000    0.000    0.000    0.000 {method 'getrandbits' of '_random.Random' objects}
        3    0.000    0.000    0.000    0.000 {method 'getvalue' of '_io.BytesIO' objects}
        3    0.000    0.000    0.000    0.000 {method 'release' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method 'send' of '_socket.socket' objects}

Let’s deconstruct it to properly understand the meaning of the output. The filename of your script is ‘code.py’. Here’s the first part:

>>> import cProfile
>>> cProfile.run('print(find_primes(100))')
[157, 773, 457, 317, 251, 719, 227, 311, 167, 313, 521, 307, 367, 827, 317, 443, 359, 443, 887, 241, 419, 103, 281, 151, 397, 433, 733, 401, 881, 491, 19, 401, 661, 151, 467, 677, 719, 337, 673, 367, 53, 383, 83, 463, 269, 499, 149, 619, 101, 743, 181, 269, 691, 193, 7, 883, 449, 131, 311, 547, 809, 619, 97, 997, 73, 13, 571, 331, 37, 7, 229, 277, 829, 571, 797, 101, 337, 5, 17, 283, 449, 31, 709, 449, 521, 821, 547, 739, 113, 599, 139, 283, 317, 373, 719, 977, 373, 991, 137, 797]
...

The output still goes to the shell: you didn’t execute the code directly, but the cProfile.run() function did. You can see the list of the 100 random prime numbers here.

The next part prints some statistics to the shell:

         3908 function calls in 1.614 seconds

Okay, this is interesting: the whole program took 1.614 seconds to execute. In total, 3908 function calls have been executed. Can you figure out which?

The print() function once.
The find_primes(100) function once.
The find_primes() function executes the for loop 100 times.
For the loop, the program calls range() once. Inside the loop, it executes the guess() and is_prime() functions multiple times per iteration until it has guessed the next prime number.
The guess() function executes randint(2, 1000) once per call.

The next part of the output shows detailed statistics for each function, ordered by function name (not by performance):

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.614    1.614 <string>:1(<module>)
      535    1.540    0.003    1.540    0.003 code.py:10(is_prime)
        1    0.000    0.000    1.542    1.542 code.py:19(find_primes)
 ...

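Each line stands for one function. For example, the second line stands for the function is_prime. The ncalls column counts the calls, tottime is the time spent in the function itself (excluding calls to other functions), and cumtime is the time including all sub-calls; each percall column divides the preceding time by the number of calls. You can see that is_prime() was called 535 times with a total time of 1.54 seconds. Entries such as rpc.py and threading.py are not part of your script; they most likely stem from the shell that runs it (the module names suggest IDLE).

Wow! You’ve just found the bottleneck of the whole program: is_prime(). Again, the total execution time was 1.614 seconds, and this one function accounts for about 95% of it!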

So, you need to ask yourself the following questions: Do you need to optimize the code at all? If you do, how can you mitigate the bottleneck?

There are two basic ideas:

call the function is_prime() less frequently, and
optimize performance of the function itself.

You know that the best way to optimize code is to look for more efficient algorithms. A quick search reveals a much more efficient algorithm (see function is_prime2()).

import random


def guess():
    ''' Returns a random number '''
    return random.randint(2, 1000)


def is_prime(x):
    ''' Checks whether x is prime '''
    for i in range(x):
        for j in range(x):
            if i * j == x:
                return False
    return True


def is_prime2(x):
    ''' Checks whether x is prime '''
    for i in range(2,int(x**0.5)+1):
        if x % i == 0:
            return False
    return True


def find_primes(num):
    primes = []
    for i in range(num):
        p = guess()
        while not is_prime2(p):
            p = guess()
        primes += [p]
    return primes


import cProfile
cProfile.run('print(find_primes(100))')

What do you think: is our new prime checker faster? Let’s study the output of our code snippet:

[887, 347, 397, 743, 751, 19, 337, 983, 269, 547, 823, 239, 97, 137, 563, 757, 941, 331, 449, 883, 107, 271, 709, 337, 439, 443, 383, 563, 127, 541, 227, 929, 127, 173, 383, 23, 859, 593, 19, 647, 487, 827, 311, 101, 113, 139, 643, 829, 359, 983, 59, 23, 463, 787, 653, 257, 797, 53, 421, 37, 659, 857, 769, 331, 197, 443, 439, 467, 223, 769, 313, 431, 179, 157, 523, 733, 641, 61, 797, 691, 41, 751, 37, 569, 751, 613, 839, 821, 193, 557, 457, 563, 881, 337, 421, 461, 461, 691, 839, 599]
         4428 function calls in 0.074 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.073    0.073 <string>:1(<module>)
      610    0.002    0.000    0.002    0.000 code.py:19(is_prime2)
        1    0.001    0.001    0.007    0.007 code.py:27(find_primes)
      610    0.001    0.000    0.004    0.000 code.py:5(guess)
      610    0.001    0.000    0.003    0.000 random.py:174(randrange)
      610    0.001    0.000    0.004    0.000 random.py:218(randint)
      610    0.001    0.000    0.001    0.000 random.py:224(_randbelow)
       21    0.000    0.000    0.000    0.000 rpc.py:154(debug)
        3    0.000    0.000    0.066    0.022 rpc.py:217(remotecall)

Crazy, what a performance improvement! With the old bottleneck, the code takes 1.6 seconds. Now, it takes only 0.074 seconds, a runtime reduction of about 95%!

That’s the power of bottleneck analysis.
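The first of the two basic ideas, calling the prime check less frequently, can be sketched as well. One possible approach (an assumption of this sketch, not part of the original script) is to compute all primes up to 1000 once with a sieve of Eratosthenes and then draw from that list, so no prime check runs inside the loop at all. The function names primes_up_to() and find_primes_precomputed() are hypothetical.

```python
import random


def primes_up_to(limit):
    """Sieve of Eratosthenes: return all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]  # 0 and 1 are not prime
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            for multiple in range(i * i, limit + 1, i):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


def find_primes_precomputed(num, limit=1000):
    candidates = primes_up_to(limit)  # computed once, not per guess
    return [random.choice(candidates) for _ in range(num)]


print(find_primes_precomputed(100))
```

Drawing uniformly from the precomputed list gives the same distribution as guessing until a prime turns up, because every prime between 2 and 1000 is equally likely in both cases.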

The cProfile module has many more functions and parameters, but the simple cProfile.run() function is already enough to resolve many performance bottlenecks.
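As one example of what else the module offers, you can profile only a region of your program with a cProfile.Profile object and print just the most expensive rows with pstats. This is a minimal sketch; the function work() is a made-up workload standing in for your own code.

```python
import cProfile
import io
import pstats


def work():
    # Hypothetical workload standing in for find_primes(100).
    return sorted(str(i) for i in range(50_000))


profiler = cProfile.Profile()
profiler.enable()   # start collecting
work()
profiler.disable()  # stop collecting

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)  # top 5 rows only
report = buffer.getvalue()
print(report)
```

Limiting the report to a handful of rows keeps the focus on the bottleneck instead of dozens of lines with zero runtime.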

How to Sort the Output of the cProfile.run() Method?

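To sort the output, you can pass the sort argument to the cProfile.run() function. It accepts a sort key such as 'calls', 'tottime', or 'cumulative'; the integer codes -1, 0, 1, and 2 are also understood and stand for standard name, call count, internal time, and cumulative time, respectively. Here’s the help output: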

>>> import cProfile
>>> help(cProfile.run)
Help on function run in module cProfile:

run(statement, filename=None, sort=-1)
    Run statement under profiler optionally saving results in filename

    This function takes a single argument that can be passed to the
    "exec" statement, and an optional file name.  In all cases this
    routine attempts to "exec" its first argument and gather profiling
    statistics from the execution. If no file name is present, then this
    function automatically prints a simple profiling report, sorted by the
    standard name string (file/line/function-name) that is presented in
    each line.

And here’s a minimal example profiling the above find_primes() function:

import cProfile
cProfile.run('print(find_primes(100))', sort=0)

The output is sorted by the number of function calls (first column):

[607, 61, 271, 167, 101, 983, 3, 541, 149, 619, 593, 433, 263, 823, 751, 149, 373, 563, 599, 607, 61, 439, 31, 773, 991, 953, 211, 263, 839, 683, 53, 853, 569, 547, 991, 313, 191, 881, 317, 967, 569, 71, 73, 383, 41, 17, 67, 673, 137, 457, 967, 331, 809, 983, 271, 631, 557, 149, 577, 251, 103, 337, 353, 401, 13, 887, 571, 29, 743, 701, 257, 701, 569, 241, 199, 719, 3, 907, 281, 727, 163, 317, 73, 467, 179, 443, 883, 997, 197, 587, 701, 919, 431, 827, 167, 769, 491, 127, 241, 41]
         5374 function calls in 0.021 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      759    0.000    0.000    0.000    0.000 {method 'getrandbits' of '_random.Random' objects}
      745    0.000    0.000    0.001    0.000 random.py:174(randrange)
      745    0.000    0.000    0.001    0.000 random.py:218(randint)
      745    0.000    0.000    0.000    0.000 random.py:224(_randbelow)
      745    0.001    0.000    0.001    0.000 code.py:18(is_prime2)
      745    0.000    0.000    0.001    0.000 code.py:4(guess)
      745    0.000    0.000    0.000    0.000 {method 'bit_length' of 'int' objects}
       21    0.000    0.000    0.000    0.000 rpc.py:154(debug)
        9    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        6    0.000    0.000    0.000    0.000 threading.py:1206(current_thread)
        6    0.018    0.003    0.018    0.003 {method 'acquire' of '_thread.lock' objects}
        6    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        3    0.000    0.000    0.000    0.000 threading.py:75(RLock)
        3    0.000    0.000    0.000    0.000 threading.py:216(__init__)
        3    0.000    0.000    0.018    0.006 threading.py:264(wait)
        3    0.000    0.000    0.000    0.000 rpc.py:57(dumps)
        3    0.000    0.000    0.019    0.006 rpc.py:217(remotecall)
        3    0.000    0.000    0.000    0.000 rpc.py:227(asynccall)
        3    0.000    0.000    0.018    0.006 rpc.py:247(asyncreturn)
        3    0.000    0.000    0.000    0.000 rpc.py:253(decoderesponse)
        3    0.000    0.000    0.018    0.006 rpc.py:291(getresponse)
        3    0.000    0.000    0.000    0.000 rpc.py:299(_proxify)
        3    0.000    0.000    0.018    0.006 rpc.py:307(_getresponse)
        3    0.000    0.000    0.000    0.000 rpc.py:333(putmessage)
        3    0.000    0.000    0.000    0.000 rpc.py:329(newseq)
        3    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        3    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method 'release' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method '_is_owned' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method '_acquire_restore' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method '_release_save' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
        3    0.000    0.000    0.000    0.000 {method 'getvalue' of '_io.BytesIO' objects}
        3    0.000    0.000    0.000    0.000 {method 'dump' of '_pickle.Pickler' objects}
        3    0.000    0.000    0.000    0.000 {built-in method _struct.pack}
        3    0.000    0.000    0.000    0.000 {method 'send' of '_socket.socket' objects}
        3    0.000    0.000    0.000    0.000 {built-in method select.select}
        2    0.000    0.000    0.019    0.009 run.py:354(write)
        2    0.000    0.000    0.000    0.000 rpc.py:602(__init__)
        2    0.000    0.000    0.018    0.009 rpc.py:607(__call__)
        2    0.000    0.000    0.001    0.000 rpc.py:560(__getattr__)
        2    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.001    0.001 rpc.py:578(__getmethods)
        1    0.000    0.000    0.002    0.002 code.py:26(find_primes)
        1    0.000    0.000    0.021    0.021 <string>:1(<module>)
        1    0.000    0.000    0.021    0.021 {built-in method builtins.exec}
        1    0.000    0.000    0.019    0.019 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

If you want to learn more, study the official documentation.
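If the full table is too noisy for you, a minimal sketch like the following may help. It uses the standard-library pstats module to sort the collected data and print only the top rows. The function count_primes is a hypothetical stand-in workload, not the find_primes function from this article.

```python
import cProfile
import io
import pstats


def count_primes(limit):
    """Hypothetical stand-in workload: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


# Profile only the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = count_primes(1000)
profiler.disable()

# Write the report to a string buffer instead of stdout so you can trim or log it.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)  # top 5 rows only
report = buffer.getvalue()
print(report)
```

Sorting by "cumulative" shows which top-level functions are responsible for most of the runtime, while "tottime" highlights the functions that burn the time themselves.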

How to Profile a Flask App?

If you’re running a Flask application on a server, you often want to improve its performance. But remember: you must focus on the bottlenecks of your whole application, not only on the performance of the Flask app running on your server. There are many other possible performance bottlenecks such as database access, heavy use of images, wrong file formats, videos, embedded scripts, etc.

Before you start optimizing the Flask app itself, you should first check out speed analysis tools that measure the end-to-end latency as perceived by the user.

These online tools are free and easy to use: you just copy and paste the URL of your website and press a button. They will then point you to the potential bottlenecks of your app. Run all of them and collect the results in a spreadsheet. Then spend some time thinking about the possible bottlenecks until you’re pretty confident that you’ve found the main one.

Here’s an example of a Google Page Speed run for the wealth creation Flask app www.wealthdashboard.app:

It’s clear that in this case, the performance bottleneck is the work performed by the application itself. This is no surprise, as the app comes with a rich, interactive user interface.

So in this case, it makes sense to dive into the Python Flask app itself which, in turn, uses the Dash framework for its user interface.

So let’s start with a minimal example of a Dash app. Note that the Dash app internally runs a Flask server:

import dash
import dash_core_components as dcc
import dash_html_components as html

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

app.layout = html.Div(children=[
    html.H1(children='Hello Dash'),

    html.Div(children='''
        Dash: A web application framework for Python.
    '''),

    dcc.Graph(
        id='example-graph',
        figure={
            'data': [
                {'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
                {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': u'Montréal'},
            ],
            'layout': {
                'title': 'Dash Data Visualization'
            }
        }
    )
])

if __name__ == '__main__':
    #app.run_server(debug=True)
    import cProfile
    cProfile.run('app.run_server(debug=True)', sort=1)

Don’t worry, you don’t need to understand what’s going on in the layout code. Only one thing is important: rather than calling app.run_server(debug=True) directly (the commented-out third-last line), you pass the same call as a string to the cProfile.run(...) wrapper. The argument sort=1 orders the output by decreasing internal time (tottime, the second column). The result of executing and terminating the Flask app looks as follows:

        6031 function calls (5967 primitive calls) in 3.309 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    3.288    1.644    3.288    1.644 {built-in method _winapi.WaitForSingleObject}
        1    0.005    0.005    0.005    0.005 {built-in method _winapi.CreateProcess}
        7    0.003    0.000    0.003    0.000 _winconsole.py:152(write)
        4    0.002    0.001    0.002    0.001 win32.py:109(SetConsoleTextAttribute)
       26    0.002    0.000    0.002    0.000 {built-in method nt.stat}
        9    0.001    0.000    0.004    0.000 {method 'write' of '_io.TextIOWrapper' objects}
        6    0.001    0.000    0.003    0.000 <frozen importlib._bootstrap>:882(_find_spec)
        1    0.001    0.001    0.001    0.001 win32.py:92(_winapi_test)
        5    0.000    0.000    0.000    0.000 {built-in method marshal.loads}
        5    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:914(get_data)
        5    0.000    0.000    0.000    0.000 {method 'read' of '_io.FileIO' objects}
        4    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.lock' objects}
      390    0.000    0.000    0.000    0.000 os.py:673(__getitem__)
        7    0.000    0.000    0.000    0.000 _winconsole.py:88(get_buffer)
...

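So there have been 6031 function calls, but the runtime was dominated by WaitForSingleObject(), a Windows API call that simply waits, as you can see in the first row of the output table. This makes sense as I only ran the server and shut it down again: it didn’t process any requests. Most likely, the main process spent that time waiting on the child process that the debug reloader spawns (note the CreateProcess call in the second row).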

But if you sent many requests to your server while profiling, you’d quickly find the bottleneck functions.
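To make this concrete, here is a minimal sketch of profiling a batch of requests. To stay free of dependencies, it uses a tiny hand-written WSGI app as a stand-in for your Flask or Dash server (a Flask app is a WSGI callable, too). The names demo_app, slow_view, and fake_request are hypothetical helpers invented for this illustration.

```python
import cProfile
import io
import pstats
from wsgiref.util import setup_testing_defaults


def slow_view():
    """Hypothetical endpoint logic that does noticeable work."""
    return str(sum(i * i for i in range(20_000)))


def demo_app(environ, start_response):
    """Tiny WSGI app standing in for a Flask/Dash server (hypothetical)."""
    body = slow_view().encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


def fake_request(app, path="/"):
    """Call the WSGI app directly, without opening a network socket."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fill in the remaining required WSGI keys
    statuses = []
    body = b"".join(app(environ, lambda status, headers: statuses.append(status)))
    return statuses[0], body


# Profile 50 simulated requests in one go.
profiler = cProfile.Profile()
profiler.enable()
responses = [fake_request(demo_app) for _ in range(50)]
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).strip_dirs().sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```

With real work in the request path, the top rows are no longer dominated by idle waiting but by the functions that actually serve the requests.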

There are some specific profilers for Flask applications. I’d recommend that you start looking here:

You can set up the profiler in just a few lines of code. However, this Flask profiler focuses on the performance of multiple endpoints (“URLs”). If you want to explore the function calls of a single endpoint/URL, you should still use the cProfile module for fine-grained analysis.

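An easy way to use the cProfile module in your Flask application is the profiler middleware that ships with Werkzeug, the WSGI library Flask is built on. Using it is as simple as wrapping the Flask app’s WSGI callable like this (older Werkzeug releases exposed the same class as werkzeug.contrib.profiler, which has since been removed):

from werkzeug.middleware.profiler import ProfilerMiddleware
app.wsgi_app = ProfilerMiddleware(app.wsgi_app)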

By default, the profile data is printed to standard output, so where it shows up depends on how you serve your Flask application.
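If you’re curious what such a middleware does under the hood, the following sketch shows the basic idea using only the standard library. It is a simplified illustration and not Werkzeug’s actual implementation; the class SimpleProfilerMiddleware and the app hello_app are hypothetical names.

```python
import cProfile
import io
import pstats
from wsgiref.util import setup_testing_defaults


class SimpleProfilerMiddleware:
    """Hypothetical, simplified profiling middleware for any WSGI app."""

    def __init__(self, app, stream, limit=5):
        self.app = app
        self.stream = stream  # file-like object that receives the reports
        self.limit = limit    # number of rows to print per request

    def __call__(self, environ, start_response):
        profiler = cProfile.Profile()
        # runcall() profiles exactly one call; list() forces the response body
        # to be produced while the profiler is still active.
        body = profiler.runcall(lambda: list(self.app(environ, start_response)))
        self.stream.write(f"PATH: {environ.get('PATH_INFO', '')}\n")
        stats = pstats.Stats(profiler, stream=self.stream)
        stats.strip_dirs().sort_stats("tottime").print_stats(self.limit)
        return body


def hello_app(environ, start_response):
    """Tiny WSGI app used only to demonstrate the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


log = io.StringIO()
wrapped = SimpleProfilerMiddleware(hello_app, stream=log)

# Simulate one request to /hello without starting a server.
environ = {"PATH_INFO": "/hello"}
setup_testing_defaults(environ)
body = wrapped(environ, lambda status, headers: None)
print(log.getvalue())
```

Because every request gets its own profiler, you receive one small report per request rather than one big report for the whole server run.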

Pandas Profiling Example

To profile your pandas application, you should divide your overall script into many small functions and use Python’s cProfile module (see above). Because cProfile reports time per function, this quickly points you towards potential bottlenecks.

However, if you want to inspect the contents of a specific pandas DataFrame, you could use one of the following two methods:

Install the pandas-profiling tool: https://github.com/pandas-profiling/pandas-profiling
Use the built-in pandas dataframe describe() method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
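As a rough sketch of both ideas, the following example splits a small pandas script into functions, profiles the whole pipeline with cProfile, and summarizes the resulting DataFrame with describe(). It assumes pandas is installed; the column names and the helper functions build_frame, add_revenue, summarize, and pipeline are made up for this illustration.

```python
import cProfile
import io
import pstats

import pandas as pd


def build_frame():
    """Create a tiny example DataFrame (made-up data)."""
    return pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0], "quantity": [1, 2, 3, 4]})


def add_revenue(df):
    """Add a derived column without modifying the input frame."""
    df = df.copy()
    df["revenue"] = df["price"] * df["quantity"]
    return df


def summarize(df):
    """Summary statistics (count, mean, std, min, quartiles, max) per numeric column."""
    return df.describe()


def pipeline():
    return summarize(add_revenue(build_frame()))


# One function per step means one row per step in the profiler report.
profiler = cProfile.Profile()
summary = profiler.runcall(pipeline)

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).strip_dirs().sort_stats("cumulative").print_stats(8)
report = buffer.getvalue()
print(report)
print(summary)
```

In the report, the rows for build_frame, add_revenue, and summarize tell you which step of the pipeline deserves your attention first.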

Summary

You’ve learned how to approach the problem of performance optimization conceptually:

Premature Optimization Is The Root Of All Evil
Measure First, Improve Second
Pareto Is King
Algorithmic Optimization Wins
All Hail to the Cache
Less is More
Know When to Stop

These concepts are vital for your coding productivity—they can save you weeks, if not months of mindless optimization.

The most important principle is to always focus on resolving the next bottleneck.

You’ve also learned about Python’s powerful cProfile module that helps you spot performance bottlenecks quickly. For the vast majority of Python applications, including Flask and Pandas, this will help you figure out the most critical bottlenecks.

Most of the time, there’s no need to optimize, say, beyond the first three bottlenecks (exception: scientific computing).

If you like the article, check out my free Python email course where I’ll send you a daily Python email for continuous improvement.