Introduction to HPC and benchmarking

Computational Biology 2024

Benjamin Rombaut

Ghent University

2024-02-21

(Premature) Optimization

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

– Donald Knuth in Structured Programming with go to Statements (1974)

The Rules of Optimization Club:

  1. You do not optimize.
  2. You do not optimize, without measuring first.
  3. When the performance is not bound by the code, but by external factors, the optimization is over.
  4. Only optimize code that already has full unit test coverage.
  5. One factor at a time.

Tips for optimizing speed

  • Improve implementation: remove unneeded code, reduce work inside for loops…
  • Improve data structure: e.g. a dict instead of a list for repeated lookups (see the sketch below)
  • Improve algorithm: e.g. lower time complexity
  • More speed at the expense of simplicity
    • Improve usage of the CPU, e.g. multiprocessing
      • the Global Interpreter Lock (GIL) prevents true thread-level parallelism in CPython
      • staying single-threaded and running jobs with xargs or GNU parallel can be faster
    • Improve usage of accelerators: use better compiled libraries, e.g. BLAS, numba, cython, pycuda
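
A minimal sketch of the data-structure tip above, assuming the task is repeated membership lookups (the names haystack_list and haystack_dict are only illustrative): a dict (or set) answers "x in ..." in roughly constant time, while a list scans all elements.

import timeit

# hypothetical example: look up 1000 items in a list vs. a dict
haystack_list = list(range(100_000))
haystack_dict = dict.fromkeys(haystack_list)
needles = range(0, 100_000, 100)

# O(n) scan per lookup in a list
list_time = timeit.timeit(lambda: [n in haystack_list for n in needles], number=10)
# O(1) average hash lookup in a dict
dict_time = timeit.timeit(lambda: [n in haystack_dict for n in needles], number=10)

print(f"list: {list_time:.3f} s  dict: {dict_time:.3f} s")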

Tips for optimizing memory

  • Improve data structure
  • Reduce copies of data structures: e.g. numpy views, pandas Copy-on-Write
  • Avoid materializing data structures, e.g. generators (see the sketch below)
    • memory efficient, no barriers: data streams through instead of being built up in full
    • elegant and clean functions
    • hard to debug, less familiar
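
A minimal sketch of the generator tip above: a list comprehension materializes all values in memory at once, while a generator expression produces them lazily, one at a time.

import sys

n = 10**6

# materialized: all one million results live in memory at the same time
squares_list = [i * i for i in range(n)]
print(sys.getsizeof(squares_list))  # several MB for the list object alone

# generator: values are produced lazily, one at a time
squares_gen = (i * i for i in range(n))
print(sys.getsizeof(squares_gen))   # small, constant size

# both can be consumed by streaming functions such as sum()
total = sum(squares_gen)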

Benchmarking and profiling

  • doctest, pytest, unit tests… come first!
  • Read the profile docs
  • A profile tries to understand the performance of a program
    • can be as simple as taking the difference of time.perf_counter_ns before and after a function call (see the sketch below)
      • but limited: no insight into effects of caching, variation, statistical significance…
    • don’t manually write down metrics from Windows Task Manager
      • not reproducible or reliable…
  • A benchmark tries to compare different code implementations
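
A minimal sketch of that simplest form of profiling, assuming work() stands in for the code under test:

import time

def work():
    # stand-in for the code under test
    return sum(i * i for i in range(1_000_000))

start = time.perf_counter_ns()
work()
elapsed_ns = time.perf_counter_ns() - start
print(f"elapsed: {elapsed_ns / 1e6:.2f} ms")

# a single measurement like this ignores warm-up, caching and run-to-run variation;
# prefer timeit or a benchmarking framework for anything beyond a quick check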

Prefer timeit for small code snippets

python -m timeit '"-".join(str(n) for n in range(100))'
10000 loops, best of 5: 30.2 usec per loop
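
The same snippet can also be timed from Python code instead of the command line; a small sketch using the timeit module directly:

import timeit

# total time for 10,000 executions of the snippet
seconds = timeit.timeit('"-".join(str(n) for n in range(100))', number=10_000)
print(f"{seconds / 10_000 * 1e6:.1f} usec per loop")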

A benchmark is much more than just runtime and memory usage

Example of a funkyheatmap. Other metrics are also taken into account, such as scalability, stability, usability…

A profile is much more than just runtime and memory usage

Visualize with snakeviz

python -m cProfile -o results.prof myscript.py
snakeviz results.prof

Example icicle graph of profiled function calls. Functions are sorted alphabetically along X; a wider block means the function was sampled more often and uses more resources. Y is the depth in the call stack: the top-level calling function is at the top and the deepest calls are at the bottom.
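
The same results.prof file can also be inspected without a GUI; a small sketch using the standard library pstats module:

import pstats

# load the cProfile output and print the 10 most expensive calls by cumulative time
stats = pstats.Stats("results.prof")
stats.sort_stats("cumulative").print_stats(10)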

Benchmark tools for comparing different code implementations

Some examples

Using the time builtin is very limited

time sleep 0.01
sleep 0.01  0.00s user 0.00s system 11% cpu 0.020 total

Keeping track of time in a bash script has overhead and limited precision

start=`date +%s.%N`
sleep 0.01
end=`date +%s.%N`

runtime=$( echo "$end - $start" | bc -l )
echo $runtime
.025655000

Good profiling frameworks run the code multiple times and report statistics

hyperfine --runs 5 'sleep 0.01'
Benchmark 1: sleep 0.01
  Time (mean ± σ):      16.0 ms ±   1.8 ms    [User: 0.3 ms, System: 0.8 ms]
  Range (min … max):    13.7 ms …  18.3 ms    5 runs
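
A comparable run-it-several-times-and-report-statistics approach is possible in pure Python; a sketch using timeit.repeat (the sleep command mirrors the hyperfine example above):

import statistics
import subprocess
import timeit

# run the command 5 times and report mean ± standard deviation
runs = timeit.repeat(lambda: subprocess.run(["sleep", "0.01"]), repeat=5, number=1)
print(f"mean: {statistics.mean(runs) * 1e3:.1f} ms ± {statistics.stdev(runs) * 1e3:.1f} ms")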

Profiling multithreaded code with yappi becomes more complex

import yappi
import time
import threading

_NTHREAD = 3


def _work(n):
    time.sleep(n * 0.1)


yappi.start()

threads = []
# generate _NTHREAD threads
for i in range(_NTHREAD):
    t = threading.Thread(target=_work, args=(i + 1, ))
    t.start()
    threads.append(t)
# wait all threads to finish
for t in threads:
    t.join()

yappi.stop()

# retrieve thread stats by their thread id (given by yappi)
threads = yappi.get_thread_stats()
for thread in threads:
    print(
        "Function stats for (%s) (%d)" % (thread.name, thread.id)
    )  # it is the Thread.__class__.__name__
    yappi.get_func_stats(ctx_id=thread.id).print_all()
Function stats for (Thread) (3)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000017  0.000062  0.000062
doc3.py:8 _work                       1      0.000012  0.000045  0.000045

Function stats for (Thread) (2)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000017  0.000065  0.000065
doc3.py:8 _work                       1      0.000010  0.000048  0.000048


Function stats for (Thread) (1)

name                                  ncall  tsub      ttot      tavg
..hon3.7/threading.py:859 Thread.run  1      0.000010  0.000043  0.000043
doc3.py:8 _work                       1      0.000006  0.000033  0.000033

Memory profilers for Python

memory-profiler is still useful, but no longer actively maintained

python -m memory_profiler example.py

Line #    Mem usage    Increment  Occurrences   Line Contents
============================================================
     3   38.816 MiB   38.816 MiB           1   @profile
     4                                         def my_func():
     5   46.492 MiB    7.676 MiB           1       a = [1] * (10 ** 6)
     6  199.117 MiB  152.625 MiB           1       b = [2] * (2 * 10 ** 7)
     7   46.629 MiB -152.488 MiB           1       del b
     8   46.629 MiB    0.000 MiB           1       return a
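
For reference, an example.py that matches the line numbers in the output above could look like this (a sketch; @profile marks the function to be traced line by line):

from memory_profiler import profile

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == "__main__":
    my_func()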

Visualize the results in Excel, matplotlib or kcachegrind (Linux)

Tracking forked child processes over time

mprof run --multiprocess example.py
mprof plot

Plot of memory usage over time, with separate lines for the main process and each forked child process.
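
A hypothetical example.py whose forked children would show up as separate lines in such a plot (the function name and allocation sizes are only illustrative):

import multiprocessing


def allocate(n_mb):
    # each child process allocates roughly n_mb megabytes
    data = bytearray(n_mb * 1024 * 1024)
    return len(data)


if __name__ == "__main__":
    with multiprocessing.Pool(processes=3) as pool:
        pool.map(allocate, [100, 200, 300])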

memray

🕵️‍♀️ Traces every function call so it can accurately represent the call stack, unlike sampling profilers.

ℭ Also handles native calls in C/C++ libraries so the entire call stack is present in the results.

🏎 Blazing fast! Profiling slows the application only slightly. Tracking native code is somewhat slower, but this can be enabled or disabled on demand.

📈 It can generate various reports about the collected memory usage data, like flame graphs.

🧵 Works with Python threads.

👽🧵 Works with native-threads (e.g. C++ threads in C extensions).

# Do regression testing with the pytest-memray plugin
import pytest

@pytest.mark.limit_memory("24 MB")
def test_foobar():
    # do some stuff that allocates memory
    data = [0] * 10_000
    assert data

Tips for assignments

  • Test the code for correctness
  • Create different input sizes to test scalability
  • Compare with a simple baseline (brute-force algorithm, a no-op to estimate overhead…)
  • Set up a benchmark to compare different implementations (see the sketch below)
  • Use a profiler to understand the performance of the code and where to focus optimization
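
A minimal sketch of such a benchmark, assuming two hypothetical implementations slow_sum and fast_sum of the same task:

import timeit


def slow_sum(n):
    # brute-force baseline
    total = 0
    for i in range(n):
        total += i
    return total


def fast_sum(n):
    # closed-form implementation with lower time complexity
    return n * (n - 1) // 2


for n in (10**3, 10**5, 10**7):
    assert slow_sum(n) == fast_sum(n)  # correctness first
    for impl in (slow_sum, fast_sum):
        seconds = min(timeit.repeat(lambda: impl(n), repeat=3, number=1))
        print(f"{impl.__name__} n={n:>9}: {seconds:.6f} s")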

HPC

  • See HPC introduction presentation
  • Use VS Code Remote to connect your IDE to the HPC
  • Use the runner.pbs script to manage environments and submit jobs

Using conda on the HPC

  • Install conda or mamba
  • Install nb_conda_kernels in base environment
  • Create a new environment with conda create -n myenv python=3.8
  • Install the runner.pbs script at e.g. ~/bin/runner.pbs
  • Using Open OnDemand at login.hpc.ugent.be, create a Jupyter Lab interactive app session
  • Method 1) in the Custom Code section, activate the environment with modules or the runner.pbs script
# Method 2) install nb_conda_kernels in base environment for auto-discovery
conda activate base
conda install nb_conda_kernels
# Method 3) manually make the kernel available
conda activate myenv
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

Benchmarking on the HPC

Example code is available at https://github.com/saeyslab/hydra_hpc_example. frequencies_hydra is the highest-level example and the easiest to use.

  • src/sleep_pbs/README.md is an example used to explain interactive and job-based scheduling with PBS and SLURM. The example sleep script is benchmarked for runtime and memory usage with timeit and memray.
  • src/sleep_hydra/README.md is the same sleep example and benchmarking, but executed with the Hydra framework. More powerful and flexible, but also more complex.
  • src/dask_jobqueue/README.md is an example of how to use Hydra and submitit to launch a Dask jobqueue through the SLURM scheduler.
  • src/frequencies_hydra/README.md is the counting frequencies example with benchmarking. It uses a Python-only configuration, based on hydra-zen.