Computational Biology 2025
Ghent University
2025-03-07
Use uvx for quick temporary environments:
List details of a Conda package:
Using channels: conda-forge
numpy-2.2.3-py313h991d4a7_0 (+ 4 builds)
----------------------------------------
Name numpy
Version 2.2.3
Build py313h991d4a7_0
Size 8058967
Find reverse dependencies of a project dependency:
numpy 2.1.3
├── numba 0.61.0
└── pandas 2.2.3
First, make sure your code is correct. Then, consider parallelizing if:
I/O-bound code can run in threads, which overlaps waiting time and keeps the main thread responsive. Async functions can improve throughput, for example when serving web requests. Neither approach automatically speeds up CPU-bound computation.
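As a minimal sketch of the async case, assuming the I/O wait is simulated with `asyncio.sleep`: the names `fetch` and `fetch_all` are hypothetical and only illustrate how waits can overlap on a single thread.

```python
import asyncio
import time


async def fetch(item, delay=0.1):
    # Hypothetical stand-in for an I/O call (network request, disk read).
    await asyncio.sleep(delay)
    return item * 2


async def fetch_all(items):
    # gather() runs the coroutines concurrently on one thread, so their
    # waiting periods overlap; results come back in input order.
    return await asyncio.gather(*(fetch(i) for i in items))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(fetch_all(range(5)))
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")
```

Replacing `asyncio.sleep` with a CPU-heavy loop would remove the benefit, since the event loop only switches tasks while a coroutine is awaiting.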
Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant, as with multitasking on a single-core machine.
Parallelism is when tasks literally run at the same time, e.g., on a multicore processor.
Some problems are “embarrassingly parallel”, meaning they can be easily parallelized. For example, processing multiple files, or running the same function with different parameters.
More complex problems require more complex parallelization strategies, and may not be worth the effort. The overhead of parallelization can be significant, and the speedup is not always linear. See Amdahl’s law.
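Amdahl’s law can be sketched in a few lines. If a fraction p of the runtime can be parallelized over n workers, the theoretical speedup is 1 / ((1 - p) + p / n). The function name `amdahl_speedup` is an illustrative choice, and the formula ignores the parallelization overhead mentioned above.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Theoretical speedup when part of the runtime is parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 64):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the runtime parallelizable, the speedup can never exceed 10, no matter how many workers are added.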
Read the docs and use the highest-level API that fits your needs. Start with the simplest solution, as the overhead of the complex solutions may not be worth the speedup.
Ordered from simple to complex, in bold are the most useful libraries:
1. concurrent.futures.[ProcessPoolExecutor, ThreadPoolExecutor]
2. joblib.Parallel
3. multiprocessing.Pool (lower-level than concurrent.futures)
4. ipyparallel.Client
5. numba.jit(parallel=True)
6. dask.distributed.Client
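As a minimal sketch of the first option, the example below runs the same function over several parameters with `concurrent.futures`. The names `simulate` and `run_all` are hypothetical, and the task is a stand-in that only sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulate(param):
    # Hypothetical task: pretend to wait on I/O, then return a result.
    time.sleep(0.05)
    return param**2


def run_all(params, executor_cls=ThreadPoolExecutor, max_workers=4):
    # The executor is a context manager; workers are cleaned up on exit.
    with executor_cls(max_workers=max_workers) as executor:
        # map() returns results in the same order as the inputs.
        return list(executor.map(simulate, params))


if __name__ == "__main__":
    print(run_all([1, 2, 3, 4]))
```

`ProcessPoolExecutor` exposes the same interface, so it could be passed as `executor_cls` for CPU-bound work. In that case the task function must be picklable, and the call should sit under the `if __name__ == "__main__":` guard.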
import importlib.util
import time


def process(x):
    time.sleep(1)  # Simulate a long computation
    return x**2


def main(xs):
    input_is_large = len(xs) > 5
    # find_spec returns None when joblib is not installed
    joblib_available = importlib.util.find_spec("joblib") is not None
    if input_is_large and joblib_available:
        from joblib import Parallel, delayed, parallel_config

        # Threads are enough here because time.sleep releases the GIL
        with parallel_config(backend="threading"):
            return Parallel(n_jobs=-1)(delayed(process)(x) for x in xs)
    else:
        # Small input or no joblib: fall back to a plain sequential loop
        return [process(x) for x in xs]