Parallel programming in Python

Computational Biology 2025

Benjamin Rombaut

Ghent University

Simon Van de Vyver

Ghent University

2025-03-07

Python environment managers

Some tips

Use uvx for quick temporary environments:

uvx --python=3.12 --with numpy,pandas,matplotlib ipython # launches ipython REPL

List details of a Conda package:

pixi search numpy
Using channels: conda-forge

numpy-2.2.3-py313h991d4a7_0 (+ 4 builds)
----------------------------------------

Name                numpy              
Version             2.2.3
Build               py313h991d4a7_0    
Size                8058967

Find reverse dependencies of a project dependency:

pixi tree --invert numpy
numpy 2.1.3 
├── numba 0.61.0 
└── pandas 2.2.3

When to parallelize?

First, make sure your code is correct. Then, consider parallelizing if:

  1. The code is slow
  2. The code is CPU-bound
  3. The code is parallelizable

I/O-bound code can be parallelized with threads, but that mainly improves responsiveness on the main thread. Async functions can improve throughput, for example when serving web requests, but they do not automatically speed up CPU-bound computation.
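
For example, a minimal sketch of I/O-bound work in a thread pool, where time.sleep stands in for a network or disk wait (fetch and the URLs are illustrative):

from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    time.sleep(1)  # stand-in for waiting on I/O; the wait releases the GIL
    return url

urls = [f"https://example.com/{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fetch, urls))  # ~1 s total instead of ~8 s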

What is the difference between concurrency and parallelism?

Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant. For example, multitasking on a single-core machine.

Parallelism is when tasks literally run at the same time, e.g., on a multicore processor.
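
A minimal asyncio sketch of concurrency without parallelism: three tasks overlap in time on a single thread (the task names are illustrative):

import asyncio

async def task(name):
    await asyncio.sleep(1)  # yields control so the other tasks can run meanwhile
    return name

async def main():
    return await asyncio.gather(task("a"), task("b"), task("c"))

print(asyncio.run(main()))  # ['a', 'b', 'c'] after ~1 s, not ~3 s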

Embarrassingly parallel

Some problems are “embarrassingly parallel”, meaning they can be easily parallelized. For example, processing multiple files, or running the same function with different parameters.

More complex problems require more complex parallelization strategies, and may not be worth the effort. The overhead of parallelization can be significant, and the speedup is not always linear; see Amdahl’s law.
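
Amdahl’s law bounds the achievable speedup: if a fraction p of the runtime is parallelizable, n workers give at most a speedup of 1 / ((1 - p) + p / n). A quick sketch:

def amdahl_speedup(p, n):
    # p: parallelizable fraction of the runtime, n: number of workers
    return 1 / ((1 - p) + p / n)

# Even if 95% of the work parallelizes, 8 workers give less than a 6x speedup
print(amdahl_speedup(0.95, 8))  # ~5.93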

How to parallelize?

Read the docs and use the highest-level API that fits your needs. Start with the simplest solution; the overhead of a more complex solution may not be worth the speedup.

Ordered from simple to complex (concurrent.futures and joblib are the most broadly useful):

  1. concurrent.futures.ProcessPoolExecutor and ThreadPoolExecutor
  2. joblib.Parallel
  3. multiprocessing.Pool (lower-level than concurrent.futures)
  4. ipyparallel.Client
  5. numba.jit(parallel=True)
  6. dask.distributed.Client
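
As a starting point, a minimal sketch of the first option, assuming a CPU-bound function (square is an illustrative placeholder):

from concurrent.futures import ProcessPoolExecutor

def square(x):  # must be defined at module top level so it can be pickled
    return x**2

if __name__ == "__main__":  # guard required when worker processes are spawned
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(square, range(10)))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]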

Tips for parallelization

  • Note the difference between threads and processes: threads share memory, processes do not. Threads are faster to start but can be limited by the Global Interpreter Lock (GIL); processes are slower to start but do not have the GIL problem.
  • Use optional imports to avoid importing the parallelization library when not needed
  • Prefer a hybrid approach with a mix of serial code for small tasks and parallel code for large tasks, as in the joblib example below

Joblib example

import importlib.util

def process(x):
    import time
    time.sleep(1)  # Simulate a long computation
    return x**2

def main(xs):
    # Only parallelize when the input is large enough to amortize the overhead
    input_is_large = len(xs) > 5
    # Optional import: fall back to serial code when joblib is not installed
    joblib_available = importlib.util.find_spec("joblib") is not None
    if input_is_large and joblib_available:
        from joblib import Parallel, delayed, parallel_config
        with parallel_config(backend='threading'):
            return Parallel(n_jobs=-1)(delayed(process)(x) for x in xs)
    else:
        return [process(x) for x in xs]

In [5]: %timeit main(range(5))
5.01 s ± 4.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit main(range(6))
1.01 s ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With five inputs the serial branch runs and takes about five seconds; with six inputs the threading backend runs all calls concurrently (time.sleep releases the GIL), so they finish in about one second.

Tips for assignments

  • Test the code for correctness!
  • Use the largest input size to test scalability; don’t optimize the overhead on small inputs
  • Compare different strategies in your own benchmark
  • Use a (line) profiler to understand the behaviour of the code and where to focus optimization. See previous slides on profiling.
  • Pickling error?
    • Define the function at the top level of the module, before any parallel code (see the sketch below)
  • Remember to think about the access pattern of your data, and how to minimize data transfer
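
For example, a minimal sketch of the pickling pitfall with a process pool (double and broken are illustrative names):

from concurrent.futures import ProcessPoolExecutor

def double(x):  # top-level function: picklable, safe to send to worker processes
    return 2 * x

if __name__ == "__main__":
    broken = lambda x: 2 * x  # lambdas and locally defined functions cannot be pickled
    with ProcessPoolExecutor() as executor:
        print(list(executor.map(double, range(4))))  # works: [0, 2, 4, 6]
        # executor.map(broken, range(4))  # would raise a pickling error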