Python cProfile – A Helpful Guide with Prime Example

Python comes with different profilers. If you’re new to performance optimization, you may ask: what’s a profiler anyway?

A performance profiler allows you to monitor your application more closely. If you just run a Python script in your shell, you see nothing but the output produced by your program. But you don’t see how much bytes were consumed by your program. You don’t see how long each function runs. You don’t see the data structures that caused most memory overhead.

Without those things, you cannot know what’s the bottleneck of your application. And, as you’ve already learned above, you cannot possibly start optimizing your code. Why? Because else you were complicit in “premature optimization”—one of the deadly sins in programming.

Instrumenting profilers insert special code at the beginning and end of each routine to record when the routine starts and when it exits. With this information, the profiler aims to measure the actual time taken by the routine on each call. This type of profiler may also record which other routines are called from a routine. It can then display the time for the entire routine and also break it down into time spent locally and time spent on each call to another routine.

Fundamentals Profiling

Fortunately, there are a lot of profilers. In the remaining article, I’ll give you an overview of the most important profilers in Python and how to use them. Each comes with a reference for further reading.

Python cProfile

The most popular Python profiler is called cProfile. You can import it much like any other library by using the statement:

import cProfile

A simple statement but nonetheless a powerful tool in your toolbox.

Let’s write a Python script which you can profile. Say, you come up with this (very) raw Python script to find 100 random prime numbers between 2 and 1000 which you want to optimize:

import random


def guess():
    ''' Returns a random number '''
    return random.randint(2, 1000)


def is_prime(x):
    ''' Checks whether x is prime '''
    for i in range(x):
        for j in range(x):
            if i * j == x:
                return False
    return True


def find_primes(num):
    primes = []
    for i in range(num):
        p = guess()
        while not is_prime(p):
            p = guess()
        primes += [p]
    return primes


print(find_primes(100))
'''
[733, 379, 97, 557, 773, 257, 3, 443, 13, 547, 839, 881, 997,
431, 7, 397, 911, 911, 563, 443, 877, 269, 947, 347, 431, 673,
467, 853, 163, 443, 541, 137, 229, 941, 739, 709, 251, 673, 613,
23, 307, 61, 647, 191, 887, 827, 277, 389, 613, 877, 109, 227,
701, 647, 599, 787, 139, 937, 311, 617, 233, 71, 929, 857, 599,
2, 139, 761, 389, 2, 523, 199, 653, 577, 211, 601, 617, 419, 241,
179, 233, 443, 271, 193, 839, 401, 673, 389, 433, 607, 2, 389,
571, 593, 877, 967, 131, 47, 97, 443]
'''

The program is slow (and you sense that there are many optimizations). But where to start?

As you’ve already learned, you need to know the bottleneck of your script. Let’s use the cProfile module to find it! The only thing you need to do is to add the following two lines to your script:

import cProfile
cProfile.run('print(find_primes(100))')

It’s really that simple. First, you write your script. Second, you call the cProfile.run() method to analyze its performance. Of course, you need to replace the execution command with your specific code you want to analyze. For example, if you want to test function f42(), you need to type in cProfile.run('f42()').

Here’s the output of the previous code snippet (don’t panic yet):

[157, 773, 457, 317, 251, 719, 227, 311, 167, 313, 521, 307, 367, 827, 317, 443, 359, 443, 887, 241, 419, 103, 281, 151, 397, 433, 733, 401, 881, 491, 19, 401, 661, 151, 467, 677, 719, 337, 673, 367, 53, 383, 83, 463, 269, 499, 149, 619, 101, 743, 181, 269, 691, 193, 7, 883, 449, 131, 311, 547, 809, 619, 97, 997, 73, 13, 571, 331, 37, 7, 229, 277, 829, 571, 797, 101, 337, 5, 17, 283, 449, 31, 709, 449, 521, 821, 547, 739, 113, 599, 139, 283, 317, 373, 719, 977, 373, 991, 137, 797]
         3908 function calls in 1.614 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.614    1.614 <string>:1(<module>)
      535    1.540    0.003    1.540    0.003 code.py:10(is_prime)
        1    0.000    0.000    1.542    1.542 code.py:19(find_primes)
      535    0.000    0.000    0.001    0.000 code.py:5(guess)
      535    0.000    0.000    0.001    0.000 random.py:174(randrange)
      535    0.000    0.000    0.001    0.000 random.py:218(randint)
      535    0.000    0.000    0.001    0.000 random.py:224(_randbelow)
       21    0.000    0.000    0.000    0.000 rpc.py:154(debug)
        3    0.000    0.000    0.072    0.024 rpc.py:217(remotecall)
        3    0.000    0.000    0.000    0.000 rpc.py:227(asynccall)
        3    0.000    0.000    0.072    0.024 rpc.py:247(asyncreturn)
        3    0.000    0.000    0.000    0.000 rpc.py:253(decoderesponse)
        3    0.000    0.000    0.072    0.024 rpc.py:291(getresponse)
        3    0.000    0.000    0.000    0.000 rpc.py:299(_proxify)
        3    0.000    0.000    0.072    0.024 rpc.py:307(_getresponse)
        3    0.000    0.000    0.000    0.000 rpc.py:329(newseq)
        3    0.000    0.000    0.000    0.000 rpc.py:333(putmessage)
        2    0.000    0.000    0.047    0.023 rpc.py:560(__getattr__)
        3    0.000    0.000    0.000    0.000 rpc.py:57(dumps)
        1    0.000    0.000    0.047    0.047 rpc.py:578(__getmethods)
        2    0.000    0.000    0.000    0.000 rpc.py:602(__init__)
        2    0.000    0.000    0.026    0.013 rpc.py:607(__call__)
        2    0.000    0.000    0.072    0.036 run.py:354(write)
        6    0.000    0.000    0.000    0.000 threading.py:1206(current_thread)
        3    0.000    0.000    0.000    0.000 threading.py:216(__init__)
        3    0.000    0.000    0.072    0.024 threading.py:264(wait)
        3    0.000    0.000    0.000    0.000 threading.py:75(RLock)
        3    0.000    0.000    0.000    0.000 {built-in method _struct.pack}
        3    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
        6    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    1.614    1.614 {built-in method builtins.exec}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        9    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.072    0.072 {built-in method builtins.print}
        3    0.000    0.000    0.000    0.000 {built-in method select.select}
        3    0.000    0.000    0.000    0.000 {method '_acquire_restore' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method '_is_owned' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method '_release_save' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.RLock' objects}
        6    0.071    0.012    0.071    0.012 {method 'acquire' of '_thread.lock' objects}
        3    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
      535    0.000    0.000    0.000    0.000 {method 'bit_length' of 'int' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        3    0.000    0.000    0.000    0.000 {method 'dump' of '_pickle.Pickler' objects}
        2    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
      553    0.000    0.000    0.000    0.000 {method 'getrandbits' of '_random.Random' objects}
        3    0.000    0.000    0.000    0.000 {method 'getvalue' of '_io.BytesIO' objects}
        3    0.000    0.000    0.000    0.000 {method 'release' of '_thread.RLock' objects}
        3    0.000    0.000    0.000    0.000 {method 'send' of '_socket.socket' objects}

Let’s deconstruct it to properly understand the meaning of the output. The filename of your script is ‘code.py’. Here’s the first part:

>>>import cProfile
>>>cProfile.run('print(find_primes(100))')
[157, 773, 457, 317, 251, 719, 227, 311, 167, 313, 521, 307, 367, 827, 317, 443, 359, 443, 887, 241, 419, 103, 281, 151, 397, 433, 733, 401, 881, 491, 19, 401, 661, 151, 467, 677, 719, 337, 673, 367, 53, 383, 83, 463, 269, 499, 149, 619, 101, 743, 181, 269, 691, 193, 7, 883, 449, 131, 311, 547, 809, 619, 97, 997, 73, 13, 571, 331, 37, 7, 229, 277, 829, 571, 797, 101, 337, 5, 17, 283, 449, 31, 709, 449, 521, 821, 547, 739, 113, 599, 139, 283, 317, 373, 719, 977, 373, 991, 137, 797]
...

It still gives you the output to the shell—even if you didn’t execute the code directly, the cProfile.run() function did. You can see the list of the 100 random prime numbers here.

The next part prints some statistics to the shell:

         3908 function calls in 1.614 seconds

Okay, this is interesting: the whole program took 1.614 seconds to execute. In total, 3908 function calls have been executed. Can you figure out which?

  • The print() function once.
  • The find_primes(100) function once.
  • The find_primes() function executes the for loop 100 times.
  • In the for loop, we execute the range(), guess(), and is_prime() functions. The program executes the guess() and is_prime() functions multiple times per loop iteration until it correctly guessed the next prime number.
  • The guess() function executes the randint(2,1000) method once.

The next part of the output shows you the detailed stats of the function names ordered by the function name (not its performance):

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.614    1.614 <string>:1(<module>)
      535    1.540    0.003    1.540    0.003 code.py:10(is_prime)
        1    0.000    0.000    1.542    1.542 code.py:19(find_primes)
 ...

Each line stands for one function. For example the second line stands for the function is_prime. You can see that is_prime() had 535 executions with a total time of 1.54 seconds.

Wow! You’ve just found the bottleneck of the whole program: is_prime(). Again, the total execution time was 1.614 seconds and this one function dominates 95% of the total execution time!

So, you need to ask yourself the following questions: Do you need to optimize the code at all? If you do, how can you mitigate the bottleneck?

There are two basic ideas:

  • call the function is_prime() less frequently, and
  • optimize performance of the function itself.

You know that the best way to optimize code is to look for more efficient algorithms. A quick search reveals a much more efficient algorithm (see function is_prime2()).

import random


def guess():
    ''' Returns a random number '''
    return random.randint(2, 1000)


def is_prime(x):
    ''' Checks whether x is prime '''
    for i in range(x):
        for j in range(x):
            if i * j == x:
                return False
    return True


def is_prime2(x):
    ''' Checks whether x is prime '''
    for i in range(2,int(x**0.5)+1):
        if x % i == 0:
            return False
    return True


def find_primes(num):
    primes = []
    for i in range(num):
        p = guess()
        while not is_prime2(p):
            p = guess()
        primes += [p]
    return primes


import cProfile
cProfile.run('print(find_primes(100))')

What do you think: is our new prime checker faster? Let’s study the output of our code snippet:

[887, 347, 397, 743, 751, 19, 337, 983, 269, 547, 823, 239, 97, 137, 563, 757, 941, 331, 449, 883, 107, 271, 709, 337, 439, 443, 383, 563, 127, 541, 227, 929, 127, 173, 383, 23, 859, 593, 19, 647, 487, 827, 311, 101, 113, 139, 643, 829, 359, 983, 59, 23, 463, 787, 653, 257, 797, 53, 421, 37, 659, 857, 769, 331, 197, 443, 439, 467, 223, 769, 313, 431, 179, 157, 523, 733, 641, 61, 797, 691, 41, 751, 37, 569, 751, 613, 839, 821, 193, 557, 457, 563, 881, 337, 421, 461, 461, 691, 839, 599]
         4428 function calls in 0.074 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.073    0.073 <string>:1(<module>)
      610    0.002    0.000    0.002    0.000 code.py:19(is_prime2)
        1    0.001    0.001    0.007    0.007 code.py:27(find_primes)
      610    0.001    0.000    0.004    0.000 code.py:5(guess)
      610    0.001    0.000    0.003    0.000 random.py:174(randrange)
      610    0.001    0.000    0.004    0.000 random.py:218(randint)
      610    0.001    0.000    0.001    0.000 random.py:224(_randbelow)
       21    0.000    0.000    0.000    0.000 rpc.py:154(debug)
        3    0.000    0.000    0.066    0.022 rpc.py:217(remotecall)

Crazy – what a performance improvement! With the old bottleneck, the code takes 1.6 seconds. Now, it takes only 0.074 seconds—a 95% runtime performance improvement!

That’s the power of bottleneck analysis.

The cProfile method has many more functions and parameters but this simple method cProfile.run() is already enough to resolve many performance bottlenecks. If you want to learn more, study the official documentation.

Where to Go From Here?

You’ve learned how to use the cProfile module in Python to find the bottleneck of your application.

If you’re already optimizing performance of your Python apps, chances are that you can already earn six figures by selling your Python skills. Would you like to learn how?

Join the free webinar that shows you how to become a thriving coding business owner online!

[Webinar] Are You a Six-Figure Freelance Developer?

Join us. It’s fun! ?