Introduction to HPC and benchmarking

Computational Biology 2025

Benjamin Rombaut

Ghent University

Simon Van de Vyver

Ghent University

2025-02-21

(Premature) Optimization

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

– Donald Knuth in Structured Programming with go to Statements (1974)

The Rules of Optimization Club:

  1. You do not optimize.
  2. You do not optimize without profiling your code first.
  3. Think about external factors like ease-of-use and maintainability.
  4. Only optimize code that already has full unit test coverage.
  5. Optimize one code change or factor at a time.

Tips for optimizing speed

  • Improve implementation: remove unneeded code, reduce work inside loops…
  • Improve data structure: e.g. a dict instead of a list
  • Improve algorithm: e.g. lower time complexity
  • More speed at the expense of simplicity:
    • Improve usage of the CPU, e.g. multiprocessing
      • the Global Interpreter Lock (GIL) prevents true parallelism across Python threads, which is why multiprocessing is used
      • staying single-threaded and running multiple jobs with xargs or GNU parallel can be faster
    • Improve usage of accelerators: use better compiled libraries, e.g. BLAS, numba, cython, pycuda
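The data-structure tip above can be illustrated with a minimal, hypothetical lookup benchmark (the names and sizes are ours): membership tests in a list are O(n), while in a set (hash-based, like dict keys) they are O(1) on average.

```python
import timeit

# Hypothetical data: 10,000 integers stored two ways.
items_list = list(range(10_000))
items_set = set(items_list)

# Time a worst-case membership test (last element) in each structure.
t_list = timeit.timeit(lambda: 9_999 in items_list, number=1_000)
t_set = timeit.timeit(lambda: 9_999 in items_set, number=1_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

On typical hardware the set lookup wins by several orders of magnitude, at the cost of building the set once up front.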

Tips for optimizing memory

  • Improve data structure
  • Reduce copies of the data structure: e.g. numpy views, pandas Copy-on-Write
  • Avoid materializing the data structure, e.g. generators
    • memory efficient, no barriers
    • elegant and clean functions
    • hard to debug, less familiar
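The generator point can be sketched in a few lines (a toy example of ours): a materialized list holds every element in memory at once, while a generator produces them lazily.

```python
import sys

# Materialized: all 100,000 results exist in memory simultaneously.
squares_list = [n * n for n in range(100_000)]
# Lazy: elements are computed one at a time as they are consumed.
squares_gen = (n * n for n in range(100_000))

print(sys.getsizeof(squares_list))  # grows with the number of elements
print(sys.getsizeof(squares_gen))   # small and constant

# Both yield the same values when consumed.
assert list(squares_gen) == squares_list
```

The trade-off from the slide applies: the generator is memory efficient and composes cleanly, but once consumed it is exhausted, which can make debugging harder.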

Benchmarking and profiling

  • doctest, pytest, unit tests… come first!
  • Read the profile docs
  • A profiler tries to understand the performance of a program
    • Profiling is a type of runtime analysis that operates on large amounts of runtime data and gives you a view of what is happening inside a process.
    • can be as simple as difference of time.perf_counter_ns before and after a function
      • but limited, no insight into effects of caching, variation, statistical significance…
    • don’t manually write down metrics from Windows Task Manager
      • not reproducible or reliable…
  • A benchmark tries to compare different code implementations
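The "difference of time.perf_counter_ns" approach mentioned above can be sketched as follows (a minimal example, with its limitations noted in comments):

```python
import time

def work():
    return sum(range(1_000_000))

# The simplest possible measurement: wall-clock time around one call.
start = time.perf_counter_ns()
result = work()
elapsed_ns = time.perf_counter_ns() - start
print(f"work() took {elapsed_ns / 1e6:.2f} ms")
# Limitations: a single run, so no insight into caching or warm-up
# effects, no variance estimate, no statistical significance.
```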

Prefer timeit for small code snippets

python -m timeit '"-".join(str(n) for n in range(100))'
10000 loops, best of 5: 30.2 usec per loop
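The same measurement can be done from Python with the timeit API; a sketch using the snippet from the shell example above:

```python
import timeit

# repeat() returns one total time per repetition; reporting the best
# (minimum) is conventional, since it is least affected by background noise.
times = timeit.repeat(
    '"-".join(str(n) for n in range(100))',
    repeat=5,
    number=10_000,
)
print(f"{len(times)} repeats, best: {min(times) / 10_000 * 1e6:.1f} usec per loop")
```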

A profiler tries to understand the performance of a program

  • Time profile
  • Memory profile
  • Time tracing profile


The way you visualize the profile is important

Visualize snakeviz

python -m cProfile -o results.prof myscript.py
snakeviz results.prof

Icicle example graph of profiled function calls. A box's width along X is proportional to the cumulative time spent in that function (boxes at the same depth are ordered alphabetically); Y is depth in the call stack, with the top-level caller at the top and the deepest callees at the bottom.
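Instead of the command line, cProfile can also be driven programmatically and inspected with the standard-library pstats module; a sketch (the profiled function is a placeholder of ours):

```python
import cProfile
import io
import pstats

def work():
    # placeholder workload: format and sort 10,000 numbers
    return sorted(str(n) for n in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time and show the 5 most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```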

memory-profiler is still useful, but no longer actively maintained

python -m memory_profiler example.py

Line #    Mem usage    Increment  Occurrences   Line Contents
============================================================
     3   38.816 MiB   38.816 MiB           1   @profile
     4                                         def my_func():
     5   46.492 MiB    7.676 MiB           1       a = [1] * (10 ** 6)
     6  199.117 MiB  152.625 MiB           1       b = [2] * (2 * 10 ** 7)
     7   46.629 MiB -152.488 MiB           1       del b
     8   46.629 MiB    0.000 MiB           1       return a

Visualize in Excel, matplotlib, or KCachegrind (Linux)
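Since memory-profiler is no longer actively maintained, the standard library's tracemalloc covers similar ground; a minimal sketch mirroring the allocations in the example above:

```python
import tracemalloc

tracemalloc.start()
a = [1] * (10 ** 6)
b = [2] * (2 * 10 ** 7)
current, peak = tracemalloc.get_traced_memory()  # bytes
del b
after_del, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"before del b: {current / 2**20:.1f} MiB (peak {peak / 2**20:.1f} MiB)")
print(f"after del b:  {after_del / 2**20:.1f} MiB")
```

Unlike memory-profiler, tracemalloc reports Python-level allocations rather than process RSS, so the numbers are comparable but not identical.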

Tracking forked child processes over time

mprof run --multiprocess example.py
mprof plot

memray

🕵️‍♀️ Traces every function call so it can accurately represent the call stack, unlike sampling profilers.

ℭ Also handles native calls in C/C++ libraries so the entire call stack is present in the results.

🏎 Blazing fast! Profiling slows the application only slightly. Tracking native code is somewhat slower, but this can be enabled or disabled on demand.

📈 It can generate various reports about the collected memory usage data, like flame graphs.

🧵 Works with Python threads.

👽🧵 Works with native-threads (e.g. C++ threads in C extensions).

Profiling multithreaded code with yappi becomes more complex

import yappi
import time
import threading

_NTHREAD = 3


def _work(n):
    time.sleep(n * 0.1)


yappi.start()

threads = []
# generate _NTHREAD threads
for i in range(_NTHREAD):
    t = threading.Thread(target=_work, args=(i + 1, ))
    t.start()
    threads.append(t)
# wait for all threads to finish
for t in threads:
    t.join()

yappi.stop()

# retrieve thread stats by their thread id (given by yappi)
threads = yappi.get_thread_stats()
for thread in threads:
    print(
        "Function stats for (%s) (%d)" % (thread.name, thread.id)
    )  # it is the Thread.__class__.__name__
    yappi.get_func_stats(ctx_id=thread.id).print_all()

Function stats for (Thread) (3)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000017  0.000062  0.000062
doc3.py:8 _work                       1      0.000012  0.000045  0.000045

Function stats for (Thread) (2)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000017  0.000065  0.000065
doc3.py:8 _work                       1      0.000010  0.000048  0.000048


Function stats for (Thread) (1)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000010  0.000043  0.000043
doc3.py:8 _work                       1      0.000006  0.000033  0.000033

Your IDE can help you with profiling and visualization

  • VS Code
  • PyCharm
    • Time profiling
    • Flame Graph
    • Call Tree/Graph
    • Method List

Benchmarking and profiling

  • doctest, pytest, unit tests… come first!
  • Read the profile docs
  • A profiler tries to understand the performance of a program
    • can be as simple as difference of time.perf_counter_ns before and after a function
      • but limited, no insight into effects of caching, variation, statistical significance…
    • don’t manually write down metrics from Windows Task Manager
      • not reproducible or reliable…
  • A benchmark tries to compare different code implementations

A benchmark is much more than just runtime and memory usage

Example of a funkyheatmap. Other metrics are also taken into account, such as scalability, stability, usability…

Open Problems in Single-Cell Analysis

Some general benchmark tools

Some examples

Using the time builtin is very limited

time sleep 0.01
sleep 0.01  0.00s user 0.00s system 11% cpu 0.020 total

Keeping track of time in a bash script has overhead and limited precision

start=$(date +%s.%N)
sleep 0.01
end=$(date +%s.%N)

runtime=$(echo "$end - $start" | bc -l)
echo "$runtime"
.025655000

Good benchmark frameworks run the code multiple times and report statistics

hyperfine --runs 5 'sleep 0.01'
Benchmark 1: sleep 0.01
  Time (mean ± σ):      16.0 ms ±   1.8 ms    [User: 0.3 ms, System: 0.8 ms]
  Range (min … max):    13.7 ms …  18.3 ms    5 runs

Tips for assignments

  • Test the code for correctness
  • Create different input sizes to test scalability
  • Compare with a simple baseline (brute-force algorithm, a no-op to estimate overhead…)
  • Set up a benchmark to compare different implementations
  • Use a profiler to understand the performance of the code and where to focus optimization
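The tips above can be combined into a small, hypothetical benchmark (the task and function names are ours, echoing the frequency-counting example in the repository below): test correctness first, then compare a brute-force baseline against an optimized implementation.

```python
import timeit
from collections import Counter

def freq_baseline(text):
    # brute-force baseline: one pass over the text per distinct character
    return {c: text.count(c) for c in set(text)}

def freq_counter(text):
    # optimized: a single pass using collections.Counter
    return dict(Counter(text))

text = "abracadabra" * 10_000

# Correctness first: both implementations must agree.
assert freq_baseline(text) == freq_counter(text)

# Then compare runtimes over repeated runs, reporting the best.
for fn in (freq_baseline, freq_counter):
    best = min(timeit.repeat(lambda: fn(text), repeat=3, number=10))
    print(f"{fn.__name__}: {best:.4f}s for 10 runs")
```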

HPC-UGent

Benchmarking on the HPC

Some older example code is available at https://github.com/saeyslab/hydra_hpc_example:

  • src/sleep_pbs/README.md is an example used to explain interactive and job-based scheduling with PBS and SLURM. The example sleep script is benchmarked for runtime and memory usage with timeit and memray.
  • src/sleep_hydra/README.md is the same sleep example and benchmarking, but executed with the Hydra framework. More powerful and flexible, but also more complex.
  • src/dask_jobqueue/README.md is an example of how to use Hydra and submitit to launch a Dask jobqueue through the SLURM scheduler.
  • src/frequencies_hydra/README.md is the counting frequencies example with benchmarking. It uses a Python-only configuration, based on hydra-zen.
  • src/dask_batchrunner/README.md is an example of how to use the SLURMRunner from dask-jobqueue to launch a Dask cluster on SLURM. It uses either Pixi or vsc-venv to manage the environment. It currently does not use MPI.
  • src/dask_mympi/README.md is an example of how to use the vsc-mympirun to launch a Dask cluster on SLURM. It uses vsc-venv to manage the environment.