Algorithmics for biology

Gunnar Klau and Tobias Marschall – A Guided Tour to Computational Haplotyping

Human genomes come in pairs: every individual inherits one version of the genome from the mother and another version from the father. Hence, every chromosome exists in two similar yet distinct \copies”, called haplotypes. The problem of determining the full sequences of both haplotypes is known as phasing or haplotyping. In this paper, we review dierent approaches for haplotyping and point out how they are formalized as optimization problems. We survey dierent technologies and, in this way, provide guidance on the characteristics of problem instances resulting from present day technologies. Furthermore, we highlight open algorithmic challenges.

Fabio Vandin – Algorithms for Cancer Mutation Networks

DNA sequencing technologies have allowed the measurement of somatic aberrations in thousands of cancer genomes. One of the main goals in the analysis of such datasets is the identification of driver mutations important for cancer, distinguishing them from random, passenger mutations. This goal is hindered by the extensive genetic heterogeneity in cancer, with different patients presenting different collections of mutated genes. This heterogeneity arises because genes and mutations act in the context of networks, with groups of interacting genes (pathways) that perform different cellular functions, and most functions can be altered by mutating any of the genes in the associated pathway.

I will discuss two computational problems that arise in the analysis of cancer mutations and their associated networks. The first one is the problem of finding connected subnetworks of a large gene-gene interaction network that are mutated in a large number of patients, that we prove is NP-hard. I will present a polynomial time approximation algorithm that we designed to identify subnetworks with provable guarantees on their quality. I will also present a recent ILP formulation that identifies solutions of better quality compared to the approximation algorithm and is much faster on data from large cancer studies.

The second problem is the problem of identifying subnetworks of a large gene-gene interaction network that have mutations associated with survival from genome-wide mutation data of a large cohort of cancer patients. We formally define the associated computational problem by using a score for subnetworks based on the test statistic of the log-rank test, a widely used statistical test for comparing the survival of two given populations. We show that the computational problem is NP-hard in general and that it remains NP-hard even when restricted to graphs with at least one node of large degree, the case of interest for gene- gene interaction networks. I will present Network of Mutations Associated with Survival (NoMAS), a novel color-coding based algorithm that we have designed to find subnetworks of a large interaction network whose mutations are associ- ated with survival time. I will present results of the application of NoMAS on large cancer datasets.

Gregory Kucherov – Modern algorithmic techniques for biosequence search

Due to present-day high-throughput sequencing technologies, genomics is becoming a major big data generator. This evolution causes a change of the type of algorithmic techniques for searching and comparing sequencing data: traditional alignment-based methods are being replaced by alignment-free techniques. In this talk, we briefly overview main algorithmic ideas applied to biosequence analysis. We then focus on most recent developments and show examples of solving large-scale sequence processing tasks involving multi-GB datasets. Finally, we briefly mention the metagenomic classification project that we are currently carrying out in our group.

Gianluca Della Vedova, Murray Patterson, Raffaella Rizzi and Mauricio Soto – Character-based Phylogeny Construction and its Application to Tumor Evolution

Character-based Phylogeny Construction is a well-known combinatorial problem whose input is a matrix $M$ and we want to compute a phylogeny that is compatible with the actual species encoded by $M$.
In this paper we survey some of the known formulations and algorithms for these problems. Finally, we present the connections between these problems and tumor evolution, and we discuss some of the most important open problems.