Computational Biology 2024
Ghent University
2024-03-05
First, make sure your code is correct. Then, consider whether parallelization is worth the effort:
I/O-bound code can be parallelized with threads: the GIL is released while waiting for I/O, so threads overlap the waits and keep the main thread responsive. Async functions can be used to improve throughput, for example when serving web requests, but neither threads nor async automatically speed up CPU-bound computation.
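As a minimal sketch of overlapping I/O waits with threads: the `fetch` function and the URLs below are hypothetical stand-ins for any I/O-bound call (network request, file read); `time.sleep` simulates the wait.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    # Hypothetical I/O-bound task: the GIL is released while
    # sleeping/waiting, so the threads overlap their waits.
    time.sleep(0.1)
    return f"response from {url}"

urls = [f"https://example.org/{i}" for i in range(8)]

# With 8 workers, all 8 simulated waits run at the same time,
# so the whole batch takes roughly 0.1 s instead of 0.8 s.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
```

The same pattern would give no speedup for a CPU-bound `fetch`, since only one thread can hold the GIL at a time.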
Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant; multitasking on a single-core machine is one example.
Parallelism is when tasks literally run at the same time, e.g., on a multicore processor.
Some problems are “embarrassingly parallel”, meaning they can be easily parallelized. For example, processing multiple files, or running the same function with different parameters.
More complex problems require more complex parallelization strategies, and may not be worth the effort. The overhead of parallelization can be significant, and the speedup is not always linear; see Amdahl’s law.
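Amdahl’s law makes the non-linear speedup concrete: if a fraction p of the runtime is parallelizable, the best possible speedup with n workers is 1 / ((1 - p) + p / n). A small illustration:

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup with n workers when a fraction p
    of the runtime is parallelizable (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelizable, 16 workers
# yield only about a 9x speedup, not 16x:
# amdahl_speedup(0.95, 16) ≈ 9.14
```

The serial fraction dominates quickly: past a certain worker count, adding more hardware barely helps.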
Read the docs and use the highest-level API that fits your needs. Start with the simplest solution, as the overhead of the complex solutions may not be worth the speedup. Ordered from simple to complex:
Note also the difference between threads and processes. Threads share memory, processes do not. Threads are faster to start, but can be limited by the Global Interpreter Lock (GIL). Processes are slower to start, but do not have the GIL problem.
conda create -n myenv python=3.8
~/bin/runner.pbs
Method 1)
In the Custom Code section, activate the environment with modules or the runner.pbs script. Example code is available at https://github.com/saeyslab/hydra_hpc_example. frequencies_hydra is the most high-level example and the easiest to use. dask_jobqueue can be used to run a distributed Dask cluster on the HPC, but is more complex.