OmicsPLS is an R package to integrate two (omics) datasets. There were two major limitations until now.

  1. The whole cross-validation grid was not allowed to exceed the limitations of the numbers of components. This led to awkward situations: Suppose ncol(X)=ncol(Y)=5. Then you may want to cross-validate with n=5,nx=0,ny=0 but also with n=1,nx=4,ny=4. Specifying crossval_o2m(X,Y,1:5,1:4,1:4,...) led to errors, since 5+4=9 exceed the data dimensions.
  2. There was no support for datasets with missing values. So any imputation should be done prior to running OmicsPLS. In practice this means that you would have to find out a suitable imputation method and implementation yourself. While this is good practice, it may not be time/effort efficient in exploratory data analysis.

A new version is out there to address these limitations!

Install the new version - OmicsPLS 1.3.0

At the moment, OmicsPLS 1.3.0 is in its own Dev branch, not in the master branch. I’d like to give it some time to find out whether there are hidden bugs in the new code. At a later time, the changes will be merged with master.

There are three (equivalent) ways to install the new OmicsPLS.

  1. Run devtools::install_github("selbouhaddani/OmicsPLS@Dev") using the devtools package
  2. Run remotes::install_github("selbouhaddani/OmicsPLS@Dev") using the remotes package
  3. Run BiocManager::install("selbouhaddani/OmicsPLS@Dev") using the BiocManager package

The last option is, in my opinion, a good practice, since BiocManager is usually installed anyway when working with omics data (and it’s the shortest line of code). It’s always a good idea to update R and other packages, just to avoid obscure error messages.

Addressing limitation 1 - flexible cross-validation

The first limitation was caused by the stop-if condition ncol(X) < max(a) + max(ax, ay) in the crossval_o2m* functions, which in turn was enabled by the stop-if statement ncol(X) < n + max(nx, ny) in the o2m function. Note that a,ax,ay are vectors, and n,nx,ny are integers. Even if the first stop-if condition was TRUE, there can be many combinations that are still admissible in o2m.

The solution that we’ve implemented is to issue a warning rather than an error in the first stop-if condition. Then, we added a check in the cross-validation functions on admissibility of the numbers of components. If this fails, the cross-validation for that combination returns an NA. In the print functions for crossval_o2m*, these NA’s are removed and the best MSE is taken among the remaining valid combinations.

An example is given here. Note that crossval_o2m_adjR2 is more sensitive, since it needs valid numbers of components for every choice in a. Otherwise, it will fail with a “subscript out of bounds” error.

## Set a seed
set.seed(786251)

## Load the OmicsPLS package
library(OmicsPLS)
## 
## Attaching package: 'OmicsPLS'
## The following object is masked from 'package:stats':
## 
##     loadings
## Check the version number, 1.3.0
print(utils::packageVersion("OmicsPLS"))
## [1] '1.3.0'
## Generate some random data
X <- scale(matrix(rnorm(100*5), nrow=100))
Y <- scale(matrix(rnorm(100*5), nrow=100))

## Do the cross-validation
crossval_o2m(X, Y, 1:5, 0:4, 0:4, nr_folds = 2)
## Warning in crossval_o2m(X, Y, 1:5, 0:4, 0:4, nr_folds = 2): Some combinations of # components exceed data dimensions, these combinations are not considered
## *******************
## Elapsed time: 0.91 sec
## *******
## Minimal 2-CV error is at ax=4 ay=0 a=1 
## *******
## Minimum MSE is 1.994971 
## *******************
crossval_o2m_adjR2(X, Y, 1:5, 0:4, 0:4, nr_folds = 2)
## Warning in crossval_o2m_adjR2(X, Y, 1:5, 0:4, 0:4, nr_folds = 2): Some combinations of # components exceed data dimensions, these combinations are not considered
## Minimum is at n = 1
## Elapsed time: 0.12 sec
##        MSE n nx ny
## 1 2.054871 1  2  2
## 2 2.130012 2  2  1
## 3 2.114823 3  1  1
## 4 2.120295 4  0  1
## 5 2.100264 5  0  0
## This will give an error!
try(o2m(X, Y, 5, 4, 4))
## Error in o2m(X, Y, 5, 4, 4) : 
##   n + max(nx, ny) = 9 exceeds #columns in X or Y

With this improvement, cross-validation in OmicsPLS is made slightly easier when data dimensions are not very large.

Addressing limitation 2 - missing data imputation

The second limitation involved missing data, caused by the stop-if condition any(is.na(X)). This is a standard requirement in many software, since algorithms define arithmetic operations on actual numbers and return NA otherwise. To still enable data analysis with missing data in OmicsPLS, a new function is added: impute_matrix(X, ...). This matrix is fully powered by softImpute from the softImpute package.

The statistical engine behind the impute_matrix function is imputation by penalized SVD completion. See the article by Mazumder, Hastie and Tibshirani for all details.

Let’s consider an example

## Set a seed
set.seed(164378)

## Load the OmicsPLS package
library(OmicsPLS)
## Check the version number, 1.3.0
print(utils::packageVersion("OmicsPLS"))
## [1] '1.3.0'
## Generate some random data
X <- scale(matrix(rnorm(100*50), nrow=100))
Y <- scale(matrix(rnorm(100*50), nrow=100))

## Fit OmicsPLS on original data
fit1 <- o2m(X, Y, 3, 3, 3) 

## Generate randomly 5% missings, quick and dirty
X[sample(length(X), length(X)/20)] <- NA

## Let's impute and fit again
X_imputed <- scale(impute_matrix(X))
fit2 <- o2m(X_imputed, Y, 3, 3, 3) 

Now compare the two summary outputs to see how different the results are.

summary(fit1)
## 
## *** Summary of the O2PLS fit *** 
## 
## -  Call: o2m(X = X, Y = Y, n = 3, nx = 3, ny = 3) 
## 
## -  Modeled variation
## -- Total variation:
## in X: 4950 
## in Y: 4950 
## 
## -- Joint, Orthogonal and Noise as proportions:
## 
##            data X data Y
## Joint       0.102  0.104
## Orthogonal  0.125  0.121
## Noise       0.774  0.775
## 
## -- Predictable variation in Y-joint part by X-joint part:
## Variation in T*B_T relative to U: 0.701 
## -- Predictable variation in X-joint part by Y-joint part:
## Variation in U*B_U relative to T: 0.702 
## 
## -- Variances per component:
## 
##          Comp 1  Comp 2  Comp 3
## X joint 158.850 162.439 182.438
## Y joint 202.338 160.845 150.682
## 
##         Comp 1  Comp 2  Comp 3
## X Orth 223.647 204.768 189.944
## 
##         Comp 1  Comp 2  Comp 3
## Y Orth 200.265 211.396 191.646
## 
## 
## -  Coefficient in 'U = T B_T + H_U' model:
## -- Diagonal elements of B_T =
##  0.961 0.845 0.737
summary(fit2)
## 
## *** Summary of the O2PLS fit *** 
## 
## -  Call: o2m(X = X_imputed, Y = Y, n = 3, nx = 3, ny = 3) 
## 
## -  Modeled variation
## -- Total variation:
## in X: 4950 
## in Y: 4950 
## 
## -- Joint, Orthogonal and Noise as proportions:
## 
##            data X data Y
## Joint       0.104  0.104
## Orthogonal  0.120  0.118
## Noise       0.776  0.778
## 
## -- Predictable variation in Y-joint part by X-joint part:
## Variation in T*B_T relative to U: 0.69 
## -- Predictable variation in X-joint part by Y-joint part:
## Variation in U*B_U relative to T: 0.696 
## 
## -- Variances per component:
## 
##          Comp 1  Comp 2  Comp 3
## X joint 169.108 153.002 193.018
## Y joint 200.414 179.735 136.997
## 
##         Comp 1  Comp 2  Comp 3
## X Orth 219.594 190.945 192.234
## 
##         Comp 1  Comp 2  Comp 3
## Y Orth 212.557 186.524 187.125
## 
## 
## -  Coefficient in 'U = T B_T + H_U' model:
## -- Diagonal elements of B_T =
##  0.897 0.91 0.702

Note that imputation remains an important step in data analyis, especially when you want to draw hard conclusions about the results. Therefore, it’s still recommended to investigate which imputation method is the best for your data.

Final remarks

With the new version, I hope that OmicsPLS will be a bit easier to use in more diverse applications with small datasets and missing values. And, as a preview, there is more to come soon! The sparse group O2PLS (GO2PLS) method will be added in OmicsPLS, making it a more complete analysis framework for omics data.

Finally, we are giving a workshop at the online Leeds Annual Statistical Research Workshop demonstrating OmicsPLS using real omics data. Registration and abstract submission is free!