Confident metabolite structure annotation with COSMIC

Small molecules are key to biomarker discovery, drug development, toxicity screening of ecosystems such as rivers and lakes, and many other important research areas across the life sciences. Elucidating the exact structure of these metabolites is often crucial for determining their function; however, confident annotation of these structures remains a major challenge. Mass spectrometry is currently the predominant technique for analysing samples of small molecules occurring in nature. While mass spectrometry measures the mass of a compound, tandem mass spectrometry additionally measures the masses of its fragments. The resulting spectral data, however, are highly non-trivial to interpret. This bottleneck has driven the development of computational tools that annotate metabolite structures from mass spectrometry data, enabling rapid, large-scale structure annotation independent of spectral libraries. However, these tools return some proportion of incorrect annotations, which can vastly outnumber the correct ones, so scientists using them need to be able to differentiate correct from incorrect annotations. We develop an E-value computation based on proxy decoys drawn from the PubChem database and show that this E-value score outperforms the current CSI:FingerID hit score at separating correct from incorrect annotations. To improve on this further, we develop a Percolator-inspired machine learning approach in which we train linear support vector machines for this separation task. The resulting confidence score outperforms the original CSI:FingerID hit score, the E-value score, and all other tools that participated in the CASMI 2016 contest by a wide margin. Arguably, our confidence score enables, for the first time, confident structure annotation for a relevant portion of a dataset. We then demonstrate the power of this COSMIC workflow by annotating previously unreported bile acid conjugate structures in a mouse fecal dataset.
