OmicsPLS is an R package to integrate two (omics) datasets. There were two major limitations until now.

**The whole cross-validation grid was not allowed to exceed the limitations of the numbers of components.**This led to awkward situations: Suppose`ncol(X)=ncol(Y)=5`

. Then you may want to cross-validate with`n=5,nx=0,ny=0`

but also with`n=1,nx=4,ny=4`

. Specifying`crossval_o2m(X,Y,1:5,1:4,1:4,...)`

led to errors, since 5+4=9 exceed the data dimensions.**There was no support for datasets with missing values.**So any imputation should be done prior to running OmicsPLS. In practice this means that you would have to find out a suitable imputation method and implementation yourself. While this is good practice, it may not be time/effort efficient in exploratory data analysis.

A new version is out there to address these limitations!

## Install the new version - OmicsPLS 1.3.0

At the moment, OmicsPLS 1.3.0 is in its own `Dev`

branch, not in the `master`

branch. I’d like to give it some time to find out whether there are hidden bugs in the new code. At a later time, the changes will be merged with `master`

.

There are three (equivalent) ways to install the new `OmicsPLS`

.

- Run
`devtools::install_github("selbouhaddani/OmicsPLS@Dev")`

using the`devtools`

package - Run
`remotes::install_github("selbouhaddani/OmicsPLS@Dev")`

using the`remotes`

package - Run
`BiocManager::install("selbouhaddani/OmicsPLS@Dev")`

using the`BiocManager`

package

The last option is, in my opinion, a good practice, since `BiocManager`

is usually installed anyway when working with omics data (and it’s the shortest line of code). It’s always a good idea to update R and other packages, just to avoid obscure error messages.

## Addressing limitation 1 - flexible cross-validation

The first limitation was caused by the stop-if condition `ncol(X) < max(a) + max(ax, ay)`

in the `crossval_o2m*`

functions, which in turn was enabled by the stop-if statement `ncol(X) < n + max(nx, ny)`

in the `o2m`

function. Note that `a,ax,ay`

are vectors, and `n,nx,ny`

are integers. Even if the first stop-if condition was `TRUE`

, there can be many combinations that are still admissible in `o2m`

.

The solution that we’ve implemented is to issue a warning rather than an error in the first stop-if condition. Then, we added a check in the cross-validation functions on admissibility of the numbers of components. If this fails, the cross-validation for that combination returns an `NA`

. In the print functions for `crossval_o2m*`

, these NA’s are removed and the best MSE is taken among the remaining valid combinations.

An example is given here. Note that `crossval_o2m_adjR2`

is more sensitive, since it needs valid numbers of components *for every* choice in `a`

. Otherwise, it will fail with a “subscript out of bounds” error.

```
## Set a seed
set.seed(786251)
## Load the OmicsPLS package
library(OmicsPLS)
```

```
##
## Attaching package: 'OmicsPLS'
```

```
## The following object is masked from 'package:stats':
##
## loadings
```

```
## Check the version number, 1.3.0
print(utils::packageVersion("OmicsPLS"))
```

```
## [1] '1.3.0'
```

```
## Generate some random data
X <- scale(matrix(rnorm(100*5), nrow=100))
Y <- scale(matrix(rnorm(100*5), nrow=100))
## Do the cross-validation
crossval_o2m(X, Y, 1:5, 0:4, 0:4, nr_folds = 2)
```

```
## Warning in crossval_o2m(X, Y, 1:5, 0:4, 0:4, nr_folds = 2): Some combinations of # components exceed data dimensions, these combinations are not considered
```

```
## *******************
## Elapsed time: 0.91 sec
## *******
## Minimal 2-CV error is at ax=4 ay=0 a=1
## *******
## Minimum MSE is 1.994971
## *******************
```

```
crossval_o2m_adjR2(X, Y, 1:5, 0:4, 0:4, nr_folds = 2)
```

```
## Warning in crossval_o2m_adjR2(X, Y, 1:5, 0:4, 0:4, nr_folds = 2): Some combinations of # components exceed data dimensions, these combinations are not considered
```

```
## Minimum is at n = 1
```

```
## Elapsed time: 0.12 sec
```

```
## MSE n nx ny
## 1 2.054871 1 2 2
## 2 2.130012 2 2 1
## 3 2.114823 3 1 1
## 4 2.120295 4 0 1
## 5 2.100264 5 0 0
```

```
## This will give an error!
try(o2m(X, Y, 5, 4, 4))
```

```
## Error in o2m(X, Y, 5, 4, 4) :
## n + max(nx, ny) = 9 exceeds #columns in X or Y
```

With this improvement, cross-validation in OmicsPLS is made slightly easier when data dimensions are not very large.

## Addressing limitation 2 - missing data imputation

The second limitation involved missing data, caused by the stop-if condition `any(is.na(X))`

. This is a standard requirement in many software, since algorithms define arithmetic operations on actual numbers and return `NA`

otherwise. To still enable data analysis with missing data in OmicsPLS, a new function is added: `impute_matrix(X, ...)`

. This matrix is fully powered by `softImpute`

from the `softImpute`

package.

The statistical engine behind the `impute_matrix`

function is imputation by penalized SVD completion. See the article by Mazumder, Hastie and Tibshirani for all details.

Let’s consider an example

```
## Set a seed
set.seed(164378)
## Load the OmicsPLS package
library(OmicsPLS)
## Check the version number, 1.3.0
print(utils::packageVersion("OmicsPLS"))
```

```
## [1] '1.3.0'
```

```
## Generate some random data
X <- scale(matrix(rnorm(100*50), nrow=100))
Y <- scale(matrix(rnorm(100*50), nrow=100))
## Fit OmicsPLS on original data
fit1 <- o2m(X, Y, 3, 3, 3)
## Generate randomly 5% missings, quick and dirty
X[sample(length(X), length(X)/20)] <- NA
## Let's impute and fit again
X_imputed <- scale(impute_matrix(X))
fit2 <- o2m(X_imputed, Y, 3, 3, 3)
```

Now compare the two summary outputs to see how different the results are.

```
summary(fit1)
```

```
##
## *** Summary of the O2PLS fit ***
##
## - Call: o2m(X = X, Y = Y, n = 3, nx = 3, ny = 3)
##
## - Modeled variation
## -- Total variation:
## in X: 4950
## in Y: 4950
##
## -- Joint, Orthogonal and Noise as proportions:
##
## data X data Y
## Joint 0.102 0.104
## Orthogonal 0.125 0.121
## Noise 0.774 0.775
##
## -- Predictable variation in Y-joint part by X-joint part:
## Variation in T*B_T relative to U: 0.701
## -- Predictable variation in X-joint part by Y-joint part:
## Variation in U*B_U relative to T: 0.702
##
## -- Variances per component:
##
## Comp 1 Comp 2 Comp 3
## X joint 158.850 162.439 182.438
## Y joint 202.338 160.845 150.682
##
## Comp 1 Comp 2 Comp 3
## X Orth 223.647 204.768 189.944
##
## Comp 1 Comp 2 Comp 3
## Y Orth 200.265 211.396 191.646
##
##
## - Coefficient in 'U = T B_T + H_U' model:
## -- Diagonal elements of B_T =
## 0.961 0.845 0.737
```

```
summary(fit2)
```

```
##
## *** Summary of the O2PLS fit ***
##
## - Call: o2m(X = X_imputed, Y = Y, n = 3, nx = 3, ny = 3)
##
## - Modeled variation
## -- Total variation:
## in X: 4950
## in Y: 4950
##
## -- Joint, Orthogonal and Noise as proportions:
##
## data X data Y
## Joint 0.104 0.104
## Orthogonal 0.120 0.118
## Noise 0.776 0.778
##
## -- Predictable variation in Y-joint part by X-joint part:
## Variation in T*B_T relative to U: 0.69
## -- Predictable variation in X-joint part by Y-joint part:
## Variation in U*B_U relative to T: 0.696
##
## -- Variances per component:
##
## Comp 1 Comp 2 Comp 3
## X joint 169.108 153.002 193.018
## Y joint 200.414 179.735 136.997
##
## Comp 1 Comp 2 Comp 3
## X Orth 219.594 190.945 192.234
##
## Comp 1 Comp 2 Comp 3
## Y Orth 212.557 186.524 187.125
##
##
## - Coefficient in 'U = T B_T + H_U' model:
## -- Diagonal elements of B_T =
## 0.897 0.91 0.702
```

Note that imputation remains an important step in data analyis, especially when you want to draw hard conclusions about the results. Therefore, it’s still recommended to investigate which imputation method is the best for your data.

## Final remarks

With the new version, I hope that OmicsPLS will be a bit easier to use in more diverse applications with small datasets and missing values. And, as a preview, there is more to come soon! The sparse group O2PLS (GO2PLS) method will be added in OmicsPLS, making it a more complete analysis framework for omics data.

Finally, we are giving a workshop at the online Leeds Annual Statistical Research Workshop demonstrating OmicsPLS using real omics data. Registration and abstract submission is free!