Parallel programming in Python

Computational Biology 2024

Benjamin Rombaut

Ghent University

2024-03-05

When to parallelize?

First, make sure your code is correct. Then, consider parallelizing if:

  1. The code is slow
  2. The code is CPU-bound
  3. The code is parallelizable

I/O-bound code can be parallelized with threads, which mainly keeps the main thread responsive while it waits on I/O. Async functions can improve throughput, for example when serving web requests, but neither approach automatically speeds up CPU-bound computation.
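A minimal sketch of the threaded I/O case, with a `fake_download` stand-in (an invented name) for a real network call: because the threads spend their time waiting, the waits overlap and total wall time stays close to a single call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url: str) -> str:
    """Simulated I/O-bound task: the thread just waits, as if on the network."""
    time.sleep(0.1)
    return f"contents of {url}"

urls = [f"https://example.com/page{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fake_download, urls))
elapsed = time.perf_counter() - start
# The eight 0.1 s waits overlap, so elapsed is roughly 0.1 s, not 0.8 s.
```

If `fake_download` did heavy computation instead of sleeping, the GIL would serialize the threads and the speedup would disappear.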

What is the difference between concurrency and parallelism?

Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant. For example, multitasking on a single-core machine.

Parallelism is when tasks literally run at the same time, e.g., on a multicore processor.

Embarrassingly parallel

Some problems are “embarrassingly parallel”, meaning they can be easily parallelized. For example, processing multiple files, or running the same function with different parameters.

More complex problems require more complex parallelization strategies, and may not be worth the effort. The overhead of parallelization can be significant, and the speedup is not always linear; see Amdahl's law.
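Amdahl's law makes this concrete: if a fraction p of the runtime is parallelizable over n workers, the theoretical speedup is 1 / ((1 - p) + p / n), capped at 1 / (1 - p) no matter how many workers you add. A small illustration (`amdahl_speedup` is a name chosen here, not a library function):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical speedup with n workers when a fraction p of the runtime is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% of the work parallelizable, speedup saturates well below n:
print(round(amdahl_speedup(0.9, 4), 2))     # 3.08
print(round(amdahl_speedup(0.9, 1000), 2))  # 9.91 -- approaching the 10x ceiling
```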

How to parallelize?

Read the docs and use the highest-level API that fits your needs. Start with the simplest solution, as the overhead of the complex solutions may not be worth the speedup. Ordered from simple to complex:

  1. concurrent.futures.ProcessPoolExecutor
  2. multiprocessing.Pool
  3. joblib.Parallel
  4. numpy vectorization (best avoided for this assignment)
  5. ipyparallel.Client
  6. numba.jit(parallel=True)
  7. dask.distributed.Client

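A minimal sketch of the first option, using a toy `gc_fraction` task (a name invented here for illustration): `pool.map` distributes the inputs over worker processes and returns results in input order.

```python
from concurrent.futures import ProcessPoolExecutor

def gc_fraction(seq: str) -> float:
    """CPU-bound toy task: fraction of G/C bases in a DNA sequence."""
    return sum(base in "GC" for base in seq) / len(seq)

if __name__ == "__main__":
    sequences = ["ATGC" * 1000, "GGCC" * 1000, "ATAT" * 1000]
    # Each sequence is handled by a worker process; results keep input order.
    with ProcessPoolExecutor() as pool:
        fractions = list(pool.map(gc_fraction, sequences))
    print(fractions)  # [0.5, 1.0, 0.0]
```

The `if __name__ == "__main__":` guard matters: on platforms where workers are started with spawn, each worker re-imports the module, and the guard keeps them from launching pools of their own.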
Note also the difference between threads and processes. Threads share memory; processes do not. Threads are faster to start, but CPU-bound Python threads are limited by the Global Interpreter Lock (GIL). Processes are slower to start, but each runs its own interpreter and so avoids the GIL.
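A small sketch of the shared-memory difference (the counter and function names are invented for illustration): the same function mutates a module-level dict, and the mutation is visible to the parent only in the threaded case.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

counts = {"calls": 0}
lock = threading.Lock()

def record(_):
    """Increment a module-level counter (guarded by a lock for the threaded case)."""
    with lock:
        counts["calls"] += 1
    return counts["calls"]

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(record, range(4)))
    print(counts["calls"])  # 4: threads mutate the parent's dict

    counts["calls"] = 0
    with ProcessPoolExecutor(max_workers=2) as pool:
        list(pool.map(record, range(4)))
    print(counts["calls"])  # 0: each worker process changed only its own copy
```

This is why process-based parallelism must send data back explicitly (via return values), and why the cost of transferring large inputs to workers can dominate the runtime.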

Tips for assignments

  • Test the code for correctness!
  • Use the largest input size to test scalability; don't optimize the overhead on small inputs
  • Compare different strategies in your own benchmark
  • Use a (line) profiler to understand the behaviour of the code and where to focus optimization
    • https://github.com/pyutils/line_profiler
    • IDE extensions
  • Pickling error?
    • Define the function at the top level of the module, before any parallel code
  • Remember to also think about the access pattern of the data, and how to minimize data transfer
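A minimal sketch of the pickling rule from the tips above (`square` is an illustrative name): the task function must live at the top level of the module so that worker processes can unpickle a reference to it.

```python
from concurrent.futures import ProcessPoolExecutor

# Picklable: defined at the top level of the module.
def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    # A lambda or a function nested inside another function would raise a
    # pickling error here, because tasks are sent to worker processes by
    # pickling a reference to the function.
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(square, range(5))))  # [0, 1, 4, 9, 16]
```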

HPC

  • See HPC introduction presentation
  • Use VS Code Remote to connect your IDE to the HPC
  • Use the runner.pbs script to manage environments and submit jobs

Using conda on the HPC

  • Install conda or mamba
  • Install nb_conda_kernels in base environment
  • Create a new environment with conda create -n myenv python=3.8
  • Install the runner.pbs script at e.g. ~/bin/runner.pbs
  • Using Open OnDemand at login.hpc.ugent.be, create a Jupyter Lab interactive app session
  • Method 1: in the Custom Code section, activate the environment with modules or the runner.pbs script
  • Method 2: install nb_conda_kernels in the base environment for auto-discovery

    conda activate base
    conda install nb_conda_kernels

  • Method 3: manually make the kernel available

    conda activate myenv
    python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

Benchmarking on the HPC

Example code is available at https://github.com/saeyslab/hydra_hpc_example. frequencies_hydra is the highest-level example and the easiest to use. dask_jobqueue can be used to run a distributed Dask cluster on the HPC, but is more complex.

  • src/sleep_pbs/README.md is an example used to explain interactive and job-based scheduling with PBS and SLURM. The example sleep script is benchmarked for runtime and memory usage with timeit and memray.
  • src/sleep_hydra/README.md is the same sleep example and benchmarking, but executed with the Hydra framework. More powerful and flexible, but also more complex.
  • src/dask_jobqueue/README.md is an example of how to use Hydra and submitit to launch a Dask jobqueue through the SLURM scheduler.
  • src/frequencies_hydra/README.md is the counting frequencies example with benchmarking. It uses a Python-only configuration, based on hydra-zen.