Bayesian methods for small molecule identification

Ludwig, Marcus

Confident identification of small molecules remains a major challenge in untargeted metabolomics, natural product research, and related fields. Liquid chromatography-tandem mass spectrometry is a predominant technique for the high-throughput analysis of small molecules and can detect thousands of different compounds in a biological sample. The automated interpretation of the resulting tandem mass spectra is highly non-trivial, and many studies are limited to re-discovering known compounds by searching mass spectra in spectral reference libraries. However, these libraries are vastly incomplete, and a large portion of measured compounds remains unidentified. This constitutes a major bottleneck in the comprehensive, high-throughput analysis of metabolomics data. In this thesis, we present two computational methods that address different steps in the identification of small molecules from tandem mass spectra. ZODIAC is a novel method for de novo, that is, database-independent, molecular formula annotation in complete datasets. It exploits similarities of compounds co-occurring in a sample to find the most likely molecular formula for each individual compound. ZODIAC improves on the currently best-performing method, SIRIUS; on one dataset, the improvement is 16.5-fold. We show that de novo molecular formula annotation is not just a theoretical advantage: we discover multiple novel molecular formulas absent from PubChem, one of the biggest structure databases. Furthermore, we introduce a novel scoring for CSI:FingerID, a state-of-the-art method for searching tandem mass spectra in a structure database. This scoring models dependencies between different molecular properties in a predicted molecular fingerprint via Bayesian networks. This problem has the unusual property that the marginal probabilities differ for each predicted query fingerprint. Thus, we need to apply Bayesian networks in a novel, non-standard fashion. Modeling dependencies improves on the currently best scoring.




