Intro to bioinformatics and Python

Computational Biology 2024

Benjamin Rombaut

Ghent University

2024-02-14

Bioinformatics is data science

Handling complex data from domain outside computer science
- bio, medicine, finance, physics, geospatial…
Large data and memory requirements
- Can’t load all the data into disk or memory at once
- Index data structures and algorithm complexity are important
Handle big and long-running computations
- workstation or HPC

Reasons to use Python for bioinformatics

similar NUMFOCUS stack as in other data sciences
- numpy (ndarray), pandas (DataFrame), matplotlib (plots)…
Scientific Python
- development guide for scientists and research software engineers
The Turing Way
- reproducible, ethical and collaborative data science
Python ecosystem of packages

Python as the programming glue for a large and diverse ecosystem

‘batteries included’ standard library
numerical and scientific computing
- numpy, scipy, pandas…
out-of-core, parallel and distributed processing
- dask, pyspark…
machine learning and deep learning
- scikit-learn, tensorflow, pytorch…
visualization
- matplotlib, seaborn, plotly…
stay high-level and generate performance code with query plan optimizer or compiler
- numba, polars, JAX, torch.compile…
pipelines and workflow managers
- nextflow, snakemake, kubeflow, airflow…

Use the Python standard library

Read to official Python documentation
Googling, Stack Overflow, ChatGPT… is not always ideal
Official documentation is supports different software versions, works offline…
Search via IDE or e.g. devdocs.io

Standard library modules

itertools
- Iterators for efficient looping.
collections
- Container datatypes providing alternatives to Python’s general purpose built-in containers, dict, list, set, and tuple.
functools
- Higher-order functions and operations on callable objects.
contextlib
- This module provides utilities for common tasks involving the with statement.
pathlib
- Object-oriented filesystem paths.

iterators for tree traversal (01)

iterators can be overly complex

class MyCounter:
    # my_counter = MyCounter()
    # next(my_counter) --> 0
    # next(my_counter) --> 1
    def __init__(self):
        self.current = 0
    def __iter__(self):
        return self
    def __next__(self):
        self.current += 1
        return current

generators are sometimes more elegant

def count(start=0, step=1):
    # count(0) --> 0 1 2 3 4 ...
    # count(2.5, 0.5) --> 2.5 3.0 3.5 ...
    n = start
    while True:
        yield n
        n += step

generators for tree traversal (02)

Trees ❤️ recursion; Python 💔 recursion
Subgenerator for tree traversal with yield from

class Tree:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
    def __iter__(self):
        return TreePath(self)

class TreePath:
    def __init__(self, tree):
        self.path = [tree]
    def __iter__(self):
        return self
    def __next__(self):
        current = None
        while current is None:
            if not self.path: raise StopIteration
            current = self.path.pop()
        self.path.append(current.right)
        self.path.append(current.left)
        return current

class A(Tree):
    def __init__(self, item, left=None, right=None):
        super().__init__(left=left, right=right)
        self.item = item

class Tree:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
    def __iter__(self):
        yield self
        if self.left:
            yield from self.left
        if self.right:
            yield from self.right

class A(Tree):
    def __init__(self, item, left=None, right=None):
        super().__init__(left=left, right=right)
        self.item = item

chain for tree traversal (03)

class Tree:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
    def __iter__(self):
        yield self
        if self.left:
            yield from self.left
        if self.right:
            yield from self.right

class A(Tree):
    def __init__(self, item, left=None, right=None):
        super().__init__(left=left, right=right)
        self.item = item

import itertools

class Tree:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
    def __iter__(self):
        return itertools.chain([self], self.left, self.right)

class A(Tree):
    def __init__(self, item, left=None, right=None):
        super().__init__(left=left, right=right)
        self.item = item

generating primes (04)

itertools.filterfalse itertools.takewhile

from itertools import takewhile, filterfalse, dropwhile

list(takewhile(lambda x: x<5, [1,4,6,4,1]))

[1, 4]

list(filterfalse(lambda x: x%2, range(10)))

[0, 2, 4, 6, 8]

list(dropwhile(lambda x: x<5, [1,4,6,4,1]))

[6, 4, 1]

You can go heavy on the functional programming in Python…

from itertools import filterfalse, count, takewhile
from operator import ge
from functools import partial

def div_by(m, n):
    return n % m == 0

def primes():
    l = count(2)
    while True:
        first = next(l)
        yield first
        l = iter(filterfalse(partial(div_by, first), l))

def smallprimes():
    print(list(takewhile(partial(ge, 100), primes())))

Don’t use mutable objects as default arguments (05)

def add_one(x=[]):
    x.append(1)
    return x
print(add_one())  # [1]
print(add_one())  # [1, 1]

Everything is a reference, so beware of pointer aliasing.

>>> xs = 5 * [[]]
>>> xs[0].append('a')
>>> xs
[['a'], ['a'], ['a'], ['a'], ['a']]

>>> x = [1, 2, 3]
>>> x.append(x)
>>> x
[1, 2, 3, [...]]

decorators (06)

class A:
    ...
    @staticmethod      @classmethod
    def f():           def f(cls):
        ...                ...

create your own decorator

def add_some(i=1):
    def h(f):
        def g(*args, **kwargs):
            return f(*args, **kwargs) + i
        return g
    return h

built-in decorators (07)

A least-recently-used cache makes this recursive function faster.

from functools import lru_cache, total_ordering

@lru_cache
def fibo(n):
    if n <= 1:
        return n
    return fibo(n - 1) + fibo(n - 2)

Total ordering ensures that all comparison methods are defined.

@total_ordering
class Country:
    ...
    def __lt__(self, other):
        return self.name < other.name
    def __eq__(self, other):
        return self.name == other.name

Magic methods allow for special behaviour of a class

Also called dunder methods (Double leading and trailing UNDERscore).

Private library methods only have one leading underscore e.g. _private_function.

# iterators
__iter__
__next__

# operator overloading
a < b       a.__lt__(b)
a <= b      a.__le__(b)
a > b       a.__gt__(b)
a >= b      a.__ge__(b)
a == b      a.__eq__(b)
a != b      a.__ne__(b)

__add__, __sub__, __mul__,
__truediv__, __floordiv__, __mod__,
__divmod__, __pow__, __lshift__,
__rshift__, __and__, __xor__,
__or__, __neg__, __pos__, __abs__,

# other magic methods
__doc__, __name__, __module__, 
__code__, __defaults__, 
__kwdefaults__, __globals__, 
__dict__, __closure__, __bases__, 

hash(x)          x.__hash__()
bytes(x)         x.__bytes__()
“{}”.format(x)   x.__format__(“{}”)
if x: ...        x.__bool__()
x()              x.__call__()
len(x)           x.__len__()

More magic methods

__init__(self, ...)
__new__(cls, ...)
__del__(self)

int(a)         a.__int__()
long(a)        a.__long__()
float(a)       a.__float__()
complex(a)     a.__complex__()
oct(a)         a.__oct__()
hex(a)         a.__hex__()
str(a)         a.__str__()
repr(a)        a.__repr__()

a[key]            a.__getitem__(key)
a[key] = value    a.__setitem__(key, value)
del a[key]        a.__delitem__(key)
item in a         a.__contains__(item)
reversed(a)       a.__reversed__()
a[non_key]        a.__missing__(non_key)

managed attributes (08)

important when creating libraries
check open-source codes of your favourite library to see how they use it

class A:
    def __getattribute__(self, name):
        print('Ik wil %s weten' % name)
        try:
            return super().__getattribute__(name)
        except AttributeError:
            return -1

    def __setattribute__(self, name, value):
        print('Ik wil %s instellen op %s' % (name, value))
        super().__setattr__(name, value)

    def __delattribute__(self, name):
        print('Ik wil %s verwijderen' % name)
        super().__delattr__(name)

# Note that this is not a magic method, but a guard using the __name__ attribute of the executed module
if __name__ == '__main__':
    a = A()
    print(a.x)

Ik wil x weten
-1

descriptors (09)

class A:
    __slots__ = ['x', 'y']

__annotations__
__reduce__
__reduce_ex__

context managers (10)

Database access

from contextlib import contextmanager

@contextmanager
def managed_resource(*args, **kwds):
    # Code to acquire resource, e.g.:
    resource = acquire_resource(*args, **kwds)
    try:
        yield resource
    finally:
        # Code to release resource, e.g.:
        release_resource(resource)

>>> with managed_resource(timeout=3600) as resource:
...     # Resource is released at the end of this block,
...     # even if code in the block raises an exception

Parallel IO
- filelock: Platform-independent file lock
- Distributed lock

creating data containers (11)

tuple
- downside is that you need to know the order of the elements
namedtuple
- now you can use the names of the elements and don’t need to remember the order
@dataclass
- more complicated, but allows inheritance and default values
@dataclass with slots (since 3.10)
- faster and smaller dataclass, less object overhead
TypedDict
- dictionary with fixed keys and types
numpy, pandas…
- more complex, but also more powerful

numpy

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

a.ndim  == 2
a.shape == (2, 3)
a.size  == 6

a.dtype    == 'int64'
a.itemsize == 8

np.zeros((3, 4))
np.ones((2, 2, 2))
np.empty((7,)) == np.empty(7)

np.eye(3, 3)
np.arange(0, 16, 1).reshape(4, 4)

Use broadcasting instead of for loops

a = np.arange(6).reshape(2, 3)
b = np.random.rand(2, 3)
c = np.arange(8).reshape(2, 2, 3)

a + 2      a + b     a + c
a - 2      a - b     a - c
a * 2      a * b     a * b
a ** 2     a ** b    a ** c
a < 2      a < b     a < c
           a @ b.T

a.sum() == 15
a.sum(axis=0)    == array([3, 5, 7])
a.sum(axis=1)    == array([3, 12])
a.cumsum(axis=1) == array([[0, 1, 3],
                           [3, 7, 12]])

many other functions like min, max, mean…

Slices are all views of the same array

a = np.arange(16).reshape(4, 4)
a[1, 1]     == 5
a[1]        == array([4, 5, 6, 7])     
a[:, 1]     == array([1, 5, 9, 13])   
a[:2, :2]   == array([[0, 1], [4, 5]])
a[1, [1,2]] == array([5, 6])
a[a%3 == 0] == array([ 0,  3,  6,  9, 12, 15])

Numpy docs: indexing on ndarrays Numpy docs: copies and views

Some extra information on good numpy practices

def stochastic_function(high=10, size=5, rng=None, *args, **kwargs):
    rng = np.random.default_rng(rng)
    return rng.integers(high, *args, size=size, **kwargs)

# seed created with:
#   import secrets
#   secrets.randbits(128)
seed = 319929794527176038403653493598663843656
# create the RNG that you want to pass around
rng = np.random.default_rng(seed)
random_ints = stochastic_function(rng=rng)

numba

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.

Can be difficult to use and debug, so use only if you want to optimize even further!

from numba import njit
import random

@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

numba can simplify parallelization

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def ident_parallel(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

@njit(parallel=True, fastmath=True)
def do_sum_parallel(A):
    # fasthmath=True allows for some unsafe optimizations
    # each thread can accumulate its own partial sum, and then a cross
    # thread reduction is performed to obtain the result to return
    n = len(A)
    acc = 0.
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(n):
        acc += np.sqrt(A[i])
    return acc

calculating frequencies (11)

Just Python is already fast, don’t optimize prematurely for small datasets
Fast methods only work if you use them correctly

import random
import timeit
import numpy as np
from collections import Counter, defaultdict

# create datasets, note that this does not take different frequency patters into account
smalllist = random.choices(range(100), k=1_000)
biglist = random.choices(range(10_000), k=1_000_000)
arr_biglist = np.array(biglist)

def if_contains_else(biglist):
    # using if-else is as the 'slowest' method
    frequencies = dict()
    for i in biglist:
        if i in frequencies:
            frequencies[i] += 1
        else:
            frequencies[i] = 1

def for_counter(biglist):
    # this can be the least efficient method because of the for loop
    frequencies = Counter()
    for i in biglist:
        frequencies[i] += 1

def counter(biglist):
    # using the Counter class directly is the fastest method
    frequencies = Counter(biglist)

def get_or_default(biglist):
    frequencies = dict()
    for i in biglist:
        frequencies[i] = frequencies.get(i, 0) + 1

def default_dict(biglist):
    # more elegant than get_or_default
    frequencies = defaultdict(int)
    for i in biglist:
        frequencies[i] += 1

def numpy_freq(biglist):
    # numpy array creation overhead make this slow
    unique, counts = np.unique(biglist, return_counts=True)

def numpy_freq_precreated(biglist):
    # without overhead, still slower than Counter for this size
    unique, counts = np.unique(arr_biglist, return_counts=True)

def try_except(biglist):
    # using a try clause is surprisingly good
    frequences = dict()
    for i in biglist:
        try:
            frequences[i] += 1
        except:
            frequences[i] = 1

                                     Time (s)
Function         Dataset                     
if_contains_else smalllist k=1_000      0.001
counter          smalllist k=1_000      0.000
for_counter      smalllist k=1_000      0.001
get_or_default   smalllist k=1_000      0.001
default_dict     smalllist k=1_000      0.001
try_except       smalllist k=1_000      0.001
numpy_freq       smalllist k=1_000      0.001
if_contains_else biglist k=1_000_000    0.972
counter          biglist k=1_000_000    0.484
for_counter      biglist k=1_000_000    1.385
get_or_default   biglist k=1_000_000    0.887
default_dict     biglist k=1_000_000    0.703
try_except       biglist k=1_000_000    0.761
numpy_freq       biglist k=1_000_000    1.038

Python code style

PEP 8: Style Guide for Python Code
Google Python Style Guide
Try to use latest features and best-practices (Python 3.12)
- Official docs, blogs posts…
Managing larger codebases and your project:
- Use templates like Scientific Python, Hypermodern Python, scverse-cookiecutter
- Use cruft to keep your project up-to-date with the template
- Use pre-commit to enforce code style

Type hinting (docs)

Use mypy or pyright for static type checking
- strike a balance between dynamic types and full type safety
Other uses
- pyspark (optimization of user-defined functions)
- generate API (FastAPI)
- generate configuration (hydra-zen)
- generate GUI (magicgui)

def greeting(name: str) -> str:
    return 'Hello ' + name

greeting(3)         # Argument 1 to "greeting" has incompatible type "int"; expected "str"
greeting(b'Alice')  # Argument 1 to "greeting" has incompatible type "bytes"; expected "str"
greeting("World!")  # No error

def bad_greeting(name: str) -> str:
    return 'Hello ' * name  # Unsupported operand types for * ("str" and "str")

Use TypeAlias to define complex types

# Python 3.11
# from typing import TypeAlias
# Vector = TypeAlias(list[float])
# Python 3.12 (not on the HPC VSC yet)
type Vector = list[float]

def scale(scalar: float, vector: Vector) -> Vector:
    return [scalar * num for num in vector]

# passes type checking; a list of floats qualifies as a Vector.
new_vector = scale(2.0, [1.0, -4.2, 5.4])

Use TypedDict to many arguments like kwargs

PEP 692: Using TypedDict for more precise **kwargs typing
Blogpost on usage of *args and **kwargs

from typing import TypedDict, Unpack, Required, NotRequired

class KWArgs(TypedDict):
    a: Required[int]
    b: NotRequired[str]

def consume(**kwargs: Unpack[KWArgs]) -> None:
    a = kwargs["a"]
    b = kwargs.get("b", "default value")

consume()              # Not allowed.
consume(a=1)           # Allowed.
consume(a=1, b="abc")  # Allowed.

Use doctest for user documentation of functions

Doctest documentation

def average(values):
    """Computes the arithmetic mean of a list of numbers.

    >>> print(average([20, 30, 70]))
    40.0
    """
    return sum(values) / len(values)

$ python -m doctest -v example.py
Trying:
    print(average([20, 30, 70]))
Expecting:
    40.0
ok

Python code style checklist for assignments grading

Use ruff for linter and code formatter
- Replaces flake8, isort, black, pydocstyle…
- PEP 8 and pylint feedback of Dodona
- PEP 257 docstring conventions
Use type hints and do a static type check with mypy
Use enough functions to avoid code duplication
- Turn the assignment into a set of smaller problems
Have clear documentation for every function, used algorithms, complexity…
Validate input and output, raise exceptions when necessary
Use the standard library as much as possible when appropriate
Be Pythonic

The Zen of Python (PEP 20)

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

BioPython for handling DNA and protein sequences

Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
Easier for long DNA/Protein strings than Python or pandas.
Alternatives are
- Python generators and iterators
- quickdna
- pandas.StringDtype
- dask.DataFrame.StringAccessor

Seq is a better sequence of letters

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT')
>>> print(my_seq)
AGTACACTGGT
>>> my_seq
Seq('AGTACACTGGT')
>>> my_seq.complement()
Seq('TCATGTGACCA')
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT')

SeqRecord is a better container for sequence metadata

>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)

>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print(simple_seq_r.description)
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC')

SeqIO is a better way to read and write sequences

$ cat ls_orchid.fasta
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...

>>> from Bio import SeqIO
>>> for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
...     print(seq_record.id)
...     print(repr(seq_record.seq))
...     print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
...

Sequences act like strings

>>> from Bio.Seq import Seq
>>> my_seq = Seq("GATCG")
>>> for index, letter in enumerate(my_seq):
...     print("%i %s" % (index, letter))
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5

Seq also has non-verlapping count like str

>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2

Calculate GC content

>>> from Bio.Seq import Seq
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
>>> GC(my_seq)
46.875

Reverse FASTA file

>>> from Bio import SeqIO
>>> records = (
        rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement") 
        for rec in SeqIO.parse("file.fasta", "fasta")
        # optional guard clause:
        # if len(rec)<700
    )
>>> SeqIO.write(records, "rev_file.fasta", "fasta")

Notice the generator usage!

Work with unix-pipes

# solexa2sanger_fq.py
import sys
from Bio import SeqIO

SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")

gunzip -c some_solexa.fastq.gz | python solexa2sanger_fq.py
python solexa2sanger_fq.py < some_solexa.fastq > some_phred.fastq

Great for xargs or GNU parallel!

ls *.fastq | parallel -j 4 "python solexa2sanger_fq.py < {} > {.}_phred.fastq"

Or use joblib or Dask task submissions for parallel processing in Python.