Investigations on chemometric approaches for diagnostic applications utilizing various combinations of spectral and image data types
In the presented work, several data fusion and machine learning approaches were explored for combining data from various measurement techniques in biomedical applications. For each measurement technique used in this work, the data was analyzed by means of machine learning. Before these machine learning algorithms could be applied, a specific preprocessing pipeline had to be established for each type of data. These pipelines made it possible to standardize the data and to decrease sample-to-sample variations that originate from device instability or from small deviations in the sample preparation or measurement routine. The preprocessed data sets were used for various analyses of biological samples. Separate data analyses were performed for microscopic images, Raman spectra, and SERS data. However, this work mainly focused on the application of data fusion methods for the analysis of biological tissues and cells. To this end, a different data fusion pipeline was constructed for each task, depending on the data structure. Both low-level (centralized) and high-level (distributed) data fusion approaches were tested and investigated in this work. To demonstrate centralized and distributed data fusion, two examples were implemented for tissue investigation. In both examples, a combination of Raman spectroscopic and MALDI spectrometric data was analyzed. One example demonstrated centralized data fusion for the analysis of the chemical composition of a mouse brain section, and the other employed distributed data fusion for liver cancer detection. Further data fusion examples were demonstrated for cell-based analysis. It was shown that leukocyte cell subtype identification can be improved by a centralized data fusion of Raman spectroscopic data and morphological features obtained from microscopic images of stained cells.
The last example presented in this work demonstrated a sepsis diagnostic pipeline based on the combination of Raman spectroscopic data and biomarkers. Besides the measured values, the demographic information of the patient was included in the analysis to account for non-disease-related variations.
During the construction of the data fusion pipelines, issues such as unbalanced data contribution, missing values, and variations unrelated to the investigated responses were encountered. To resolve these issues, data weighting, missing data imputation, and the introduction of additional responses were employed. To further improve the reliability of the analysis, the data fusion pipelines and data processing routines were adjusted for each study in this work. As a result, the most suitable data fusion approach was found for every example, and the combination of machine learning methods with data fusion approaches was demonstrated to be a powerful tool for data analysis in biomedical applications.
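
The difference between centralized (low-level) and distributed (high-level) data fusion, together with the data weighting and missing data imputation mentioned in this summary, can be illustrated with a minimal sketch. It is not the pipeline used in this work. All function names are hypothetical. Column-mean imputation, autoscaling, the 1/sqrt(number of variables) block weight, and score averaging are assumed here as common, simple choices, and the numbers are toy values rather than measured data.

```python
from statistics import mean, pstdev


def impute_column_means(block):
    """Replace missing entries (None) by the mean of the observed values in that column."""
    columns = list(zip(*block))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in block]


def autoscale(block):
    """Center every column and scale it to unit variance (constant columns are only centered)."""
    columns = list(zip(*block))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - mus[j]) / sds[j] for j, v in enumerate(row)] for row in block]


def block_weight(block):
    """Assumed weighting: divide by sqrt(number of variables) so that a block with
    many variables (e.g. a spectrum) does not dominate a block with few."""
    w = len(block[0]) ** 0.5
    return [[v / w for v in row] for row in block]


def low_level_fusion(blocks):
    """Centralized fusion: prepare each block, then concatenate the variables per sample."""
    prepared = [block_weight(autoscale(impute_column_means(b))) for b in blocks]
    return [sum(rows, []) for rows in zip(*prepared)]


def high_level_fusion(score_sets, weights=None):
    """Distributed fusion: combine the per-sample scores of separately trained
    models (one per data type) by a weighted average."""
    weights = weights or [1.0] * len(score_sets)
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, sample_scores)) / total
            for sample_scores in zip(*score_sets)]


# Toy data: 3 samples, a 3-variable "spectral" block with one missing value
# and a 1-variable "biomarker" block (illustrative values only).
spectral_block = [[1.0, 2.0, 3.0], [2.0, None, 5.0], [3.0, 4.0, 7.0]]
biomarker_block = [[10.0], [20.0], [30.0]]
fused_matrix = low_level_fusion([spectral_block, biomarker_block])

# Toy class scores from two hypothetical per-modality classifiers.
fused_scores = high_level_fusion([[0.9, 0.2], [0.7, 0.4]])
```

In the centralized variant a single model would be trained on `fused_matrix`, whereas in the distributed variant each data type keeps its own model and only the model outputs are combined. With the assumed weighting, both blocks contribute the same total sum of squares to the fused matrix regardless of how many variables they contain.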