Statistical integration of multiple omics datasets using OmicsPLS

Abstract

In many epidemiological and clinical studies, multiple omics datasets are available, e.g. transcriptomics, glycomics, methylation. Because these datasets are assumed to reflect the same complex biological mechanisms, data integration methods are typically applied. Two main challenges arise in omics data integration: (i) high-dimensional and highly correlated features, and (ii) heterogeneity among omics datasets. Partial least squares (PLS) and its extension, two-way orthogonal PLS (O2PLS), address these challenges. These methods extract linear components that represent the relationship between the datasets (dimension reduction) and identify the features most relevant to this relationship. To facilitate interpretation of the relevant features, a sparse group O2PLS approach (GO2PLS) can be used. These methods are implemented in the open-source ‘OmicsPLS’ R package.

This half-day course introduces the fundamental concepts behind omics data integration based on latent variable models and joint principal components. We will explain how to deal with heterogeneity between omics datasets and how to perform feature selection. Emphasis is placed on applications, and the OmicsPLS package will be introduced and used during a practical R session. At the end of the course, participants should be able to perform and interpret omics data integration with OmicsPLS.

Course outline

The course consists of two parts. The first session is a mix of theory and practice. You will be introduced to the basics of data integration approaches and their sparse variants. We will also discuss how to incorporate group structures, such as the CpG sites of a gene or the genes of a pathway. You will practise these approaches using OmicsPLS and try out different visualisation tools to interpret the results. The second session is a hands-on breakout, in which you apply what you’ve learned to two omics datasets. There will be a brief wrap-up at the end of the session to discuss the output of the exercises.

  1. Dimension reduction with PCA
  2. Essentials of data integration: Partial least squares (PLS) approach
  3. Multi-omics data integration with O2PLS
  4. Methods to incorporate external biological information: Sparse group O2PLS (GO2PLS)
  5. Post-hoc analyses of the results using external bioinformatics databases

By the end of this course, you should:

  • Have a deeper understanding of data integration methods (PLS, O2PLS, GO2PLS)
  • Understand how feature selection works
  • Be able to implement O2PLS/GO2PLS on your own data with the “OmicsPLS” R package
  • Be able to interpret and visualise O2PLS/GO2PLS results

Target audience

The course is aimed at biostatisticians and medical researchers who work with (high-dimensional) biological data, including multi-omics data, and want to learn how to combine these data. Participants are expected to be familiar with linear regression modelling and principal component analysis, and to have basic knowledge of R.

Presenters: Said el Bouhaddani, UMC Utrecht, NL and Jeanine Houwing-Duistermaat, University of Bologna, IT