Train an isoTWAS model • isotwas

Introduction

This module will cover how to use functions from isotwas to train an multivariate isotwas model for isoform-level expression. Along the way, we’ll be covering some practical considerations for computation and feature selection.

Train predictive model

To train a multivariate model, we need:

a matrix of dosages of SNPs, and
a matrix of isoform-level expressions for a given gene.

Isoform expression estimated from pseudo-alignment methods like salmon and kallisto provide bootstrapped estimates expression across a number of iterations. This information can be included into model training, as illustrated below.

Practical consideration: Ideally, to reduce your memory usage, we recommend storing your genetic data as PLINK format data (ideally as .bed/bim/fam or .pgen/.pvar/.psam). For a given gene, use PLINK to select only SNPs on the same chromosome and within 1 Megabase of the gene body. Then, you can use the bigsnpr package read in this “condensed” file set. Make sure your isoform expression matrix is log-transformed and residualized by covariates that are normally used in a QTL analysis (i.e., age, sex, principal components of the genotype matrix, PEER factors or hidden covariates with prior for the full expression matrix). In order to ensure maximal overlap between the model and the GWAS we will eventually use for trait mapping, we suggest restricting the SNPs in the QTL panel to only those present in the LD reference panel or to those annotated by the HapMap3 Project. This will ensure proper standard error calculations.

For this vignette about training an model, we’ll use some toy genetic and transcriptomic data provided.

train_bed = system.file("extdata", "train.bed", package = "isotwas")
snps = bigsnpr::snp_attach(bigsnpr::snp_readBed2(train_bed,
                                                 backingfile = tempfile()))
dim(snps$genotypes[])
#> [1]  300 4542

isoform_mat = readRDS(system.file("extdata", "isoform_exp.RDS", package = "isotwas"))
dim(isoform_mat)
#> [1] 300   3

This reads in a SNP matrix with dimensions \(300 \times 4542\), with samples along the rows and SNPs along the columns and a isoform-level expression matrix of dimension \(300 \times 3\), with isoforms along the columns.

Practical consideration: Though this is not a requirement for the isotwas pipeline, we recommend a feature selection step based on cis-heritability of isoform-level expression. Prior of model training for isoforms of a given gene, we recommend that users compute the heritability of all expressed isoforms of a gene using GCTA (ideally the gcta-nr-robust executable in FUSION). We recommend users only train models for isoforms that have positive heritability at nominal \(P < 0.05\) or \(0.01\).

Practical consideration: Prior to heritability analysis and model training, it is imperative that your expression data is residualized by relevant covariates (i.e. principal components of genotype matrix, hidden covariates or PEER factors of the expression matrix, clinical/demographic covariates). A general rule of thumb is if you would include the covariate in an eQTL/isoform-level eQTL analysis, you should include it in the residualization process.

Let’s jitter our expression matrix a little to generate 10 bootstrapped estimates of the matrix. Keep in mind that these replicates, theoretically, should not affect the effect size estimates but will help in seeding the estimation process of the correlation between isoforms/error in the multivariate model.

boot_list = lapply(1:10,function(i){jitter(isoform_mat)}) 
isoform_mat_rep = rlist::list.rbind(boot_list) 
rownames(isoform_mat_rep) = rep(paste0('Sample',1:nrow(snps$fam)),10) 
colnames(isoform_mat_rep) = paste0('Isoform',1:ncol(isoform_mat_rep)) 
dim(isoform_mat_rep)
#> [1] 3000    3

Now, we can train the model using the compute_isotwas() function. Make sure the SNP matrix has clear labels for the SNP identifiers that are relevant to your study. This is a large SNP matrix which will take a few minutes to run on your local machine.

snp_mat = as.matrix(snps$genotypes[])
colnames(snp_mat) = snps$map$marker.ID
rownames(snp_mat) = snps$fam$sample.ID
isotwas_model = compute_isotwas(X = snp_mat, 
                                Y = isoform_mat, 
                                Y.rep = isoform_mat_rep,
                                R = 10, 
                                id = rownames(isoform_mat_rep), 
                                omega_est = 'replicates', 
                                omega_nlambda = 5, 
                                method = c('mrce_lasso', 
                                           'multi_enet', 
                                           'univariate',
                                           'joinet',
                                           'spls'),
                                predict_nlambda = 10, 
                                family = 'gaussian', 
                                scale = FALSE, 
                                alpha = 0.5, 
                                nfolds = 5, 
                                verbose = TRUE, 
                                tx_names = paste0('Isoform',1:3), 
                                seed = 1789, 
                                run_all = FALSE, 
                                return_all = TRUE)

A few notes on the options in the function:

By selecting omega_est = 'replicates' , we are used the bootstrapped replicates to estimate the correlation between isoforms to seed any methods that require this matrix. The alternative is omega_est = 'mean' , which will use whatever matrix you submit for Y.
omega_nlambda selects the number of steps between fully dense to fully sparse in the correlation matrix estimation. We recommend 5-10 as the default here.
There are multiple methods to predict expression. However, through simulations and real data examples, we recommend the five shown here as the default in the method option.
predict_nlambda is the number of penalty variables that are considered from the mrce_lasso method. We recommend 10-20 for this variable.
family and alpha are for multi_enet and univariate to determine how elastic net regression is run. The default is that the response follows a Normal distribution and the mixing parameter is set to 0.5.
If run_all = TRUE, then all methods will be run.
If return_all = TRUE, then predicted values for each isoform and predictive performance for all methods and all isoforms will be outputted.

We can convert the model from a list (the output from compute_isotwas()) to a tibble or data.frame using convert_model(isotwas_model).

Let’s take a look at what a sample model looks like:

model_isotwas = readRDS(system.file("extdata", "model_example.RDS", package = "isotwas"))
class(model_isotwas)
#> [1] "list"
length(model_isotwas)
#> [1] 2
names(model_isotwas)
#> [1] "Model" "R2"
model_isotwas$Model[[1]]
#> $Transcript
#> [1] "Isoform1"
#> 
#> $Model
#> # A tibble: 187 × 2
#>    SNP        Weight
#>    <chr>       <dbl>
#>  1 SNP10  -0.0000957
#>  2 SNP19  -0.00158  
#>  3 SNP80  -0.00375  
#>  4 SNP89  -0.000884 
#>  5 SNP94  -0.00504  
#>  6 SNP124  0.00586  
#>  7 SNP195 -0.0201   
#>  8 SNP232 -0.00627  
#>  9 SNP233  0.000960 
#> 10 SNP261 -0.00688  
#> # ℹ 177 more rows
#> 
#> $R2
#> [1] 0.3174829
#> 
#> $P
#> value 
#>     0 
#> 
#> $Pred
#>       IND190       IND380       IND130       IND314       IND462       IND137 
#> -0.395051620 -0.405521454 -0.405642239 -0.513143425 -0.219109235 -0.472395293 
#>       IND191       IND228       IND253       IND358       IND318         IND3 
#> -0.186011347 -0.047232532 -0.039820719 -0.203023847 -0.279393749 -0.177154706 
#>        IND28       IND182        IND91       IND476       IND255       IND209 
#> -0.261468441 -0.572125420 -0.351933854 -0.493579401 -0.417614473 -0.059488250 
#>       IND135       IND223       IND340       IND120       IND115       IND391 
#> -0.318086976 -0.319871820 -0.135863999  0.083039071 -0.469286056 -0.281497318 
#>       IND304       IND163       IND386        IND69       IND121       IND269 
#> -0.342756842 -0.169790486 -0.552650084 -0.339994140  0.018921426 -0.359762240 
#>        IND24        IND31       IND349        IND32       IND332       IND495 
#> -0.171493417 -0.450912106 -0.411439831 -0.220513041 -0.445241556 -0.302148360 
#>       IND328       IND371       IND272        IND23        IND55       IND463 
#> -0.572640195 -0.090248595 -0.172602006 -0.107964558 -0.450873553 -0.297293819 
#>       IND194       IND441       IND113        IND47       IND502       IND409 
#>  0.002500033 -0.082561510 -0.053926154  0.014315858 -0.263294983 -0.424747059 
#>        IND82        IND48       IND303       IND176       IND393        IND52 
#> -0.215259827 -0.180472021 -0.566428240 -0.154369420 -0.292469824 -0.125122866 
#>       IND458       IND426        IND15        IND10       IND211       IND244 
#> -0.297137749 -0.571241703 -0.271153164 -0.372562537 -0.139310663  0.058872696 
#>        IND65        IND50       IND442       IND213       IND157       IND300 
#> -0.353977257 -0.370280430 -0.369661968 -0.308463656 -0.015128680 -0.372327867 
#>        IND25       IND254        IND49       IND199       IND189       IND295 
#> -0.294685131  0.207142198 -0.265891894 -0.210070398 -0.125493031 -0.208084001 
#>        IND88       IND459       IND339        IND46       IND325       IND419 
#> -0.188952067 -0.341369214 -0.323699038 -0.194042453 -0.177038660  0.040164417 
#>       IND302       IND216       IND178       IND282       IND186       IND239 
#> -0.388742683 -0.144205644 -0.215211141 -0.347436381  0.070054061 -0.016246983 
#>        IND89        IND71        IND33       IND494       IND369       IND109 
#> -0.558938422 -0.009956788 -0.518804964 -0.397442503 -0.462158482 -0.620882241 
#>        IND64       IND364         IND1       IND323        IND60       IND151 
#> -0.470768744 -0.505031962 -0.474205183 -0.369106906 -0.183303208 -0.474395370 
#>       IND293       IND437        IND85       IND324       IND375       IND423 
#> -0.216436666 -0.452684526 -0.437603184 -0.588708106 -0.588703304 -0.443682464 
#>       IND367       IND516        IND95       IND461       IND451        IND76 
#> -0.601498157 -0.233798606 -0.410275308 -0.511844894 -0.326213821 -0.370696978 
#>       IND298       IND310       IND222       IND102       IND491       IND422 
#> -0.502436692 -0.408828852 -0.111287912 -0.226551260 -0.223498530  0.158679573 
#>       IND351       IND309       IND277       IND214         IND0       IND438 
#> -0.344041151 -0.424716100 -0.612379239 -0.231148880 -0.417714624 -0.438855224 
#>       IND453       IND400       IND443       IND227       IND331       IND169 
#> -0.594149200 -0.385144435 -0.456708590 -0.532575545 -0.315295504 -0.235696550 
#>       IND197       IND142       IND416        IND51       IND346        IND27 
#> -0.298466128  0.034104976 -0.526523901  0.508010660 -0.126042297 -0.128398444 
#>       IND297       IND479       IND488       IND290        IND44       IND308 
#> -0.259539348 -0.454747553 -0.504412520 -0.510470021 -0.142340184  0.072124794 
#>       IND486         IND7       IND284       IND444       IND220       IND374 
#> -0.329956853 -0.350941227 -0.163496962 -0.413414477 -0.593246582 -0.818095517 
#>       IND500        IND11       IND354       IND246       IND439       IND392 
#> -0.501920608 -0.214219476  0.007729149 -0.275404004 -0.644399750 -0.285348795 
#>       IND448       IND506       IND192       IND484       IND203       IND452 
#> -0.492513383 -0.310190245 -0.024709405 -0.435482280 -0.269321957 -0.552898172 
#>        IND92       IND134       IND287       IND424       IND307       IND122 
#> -0.101055241  0.008905316 -0.306358542 -0.461001763 -0.356879362  0.058371001 
#>       IND414       IND212        IND20       IND127       IND236       IND258 
#> -0.311646715 -0.111162671 -0.193471243  0.028880043 -0.407672480  0.142610577 
#>       IND305       IND268       IND513       IND496       IND147       IND322 
#> -0.340565157 -0.471679317 -0.404866473 -0.395120760 -0.072070907 -0.277843970 
#>        IND45        IND68       IND201       IND348        IND43        IND75 
#>  0.035081966 -0.363338384 -0.652974372 -0.412158936 -0.071163309 -0.080312324 
#>        IND90        IND41       IND387       IND276       IND465       IND224 
#>  0.025514564 -0.323200418 -0.477190145 -0.154720858 -0.341964318 -0.196673915 
#>       IND481       IND143       IND226       IND480       IND126       IND498 
#> -0.663275058 -0.056703898 -0.082764588 -0.210979906  0.048488465 -0.283945815 
#>       IND233       IND195        IND61       IND173        IND97       IND478 
#> -0.266472867 -0.217675991  0.014263476 -0.454493334 -0.550484287 -0.041911623 
#>        IND18       IND397       IND180       IND315       IND338       IND359 
#> -0.330665666 -0.436392508 -0.609082779 -0.569036586 -0.212491654 -0.343991921 
#>       IND508       IND101       IND232       IND454       IND145       IND370 
#> -0.477263014 -0.464338378 -0.072071562 -0.227516541 -0.389293368 -0.173894914 
#>       IND159        IND12        IND77       IND421       IND112       IND289 
#>  0.119232022 -0.094999753 -0.481360824 -0.433878747 -0.100941994 -0.581080319 
#>       IND353       IND235        IND29       IND221       IND335       IND249 
#> -0.332866777 -0.078403530 -0.340509458 -0.130391230 -0.223907121 -0.298415252 
#>        IND84       IND296       IND183       IND344       IND259       IND384 
#> -0.199496411 -0.410031721 -0.377059571 -0.693998245 -0.370623474 -0.246649423 
#>       IND148       IND485       IND264        IND54       IND119       IND146 
#> -0.271335029 -0.225853454 -0.261874967  0.087807798 -0.158416463 -0.214176185 
#>        IND93        IND80        IND16       IND350       IND503       IND490 
#> -0.420587615 -0.521984344 -0.070014457 -0.477095206 -0.366749255 -0.412342395 
#>       IND270        IND70       IND385       IND433       IND428       IND280 
#> -0.502675568 -0.087020068 -0.312957603 -0.141636588 -0.630821691 -0.190763136 
#>       IND234       IND104       IND155       IND417       IND207        IND67 
#> -0.145463209 -0.065338172 -0.102717325 -0.406053957 -0.487929907 -0.542512235 
#>       IND471         IND6       IND413       IND412       IND245        IND96 
#>  0.034250790 -0.327396727 -0.497126708 -0.185326538 -0.331678057 -0.078888704 
#>        IND63       IND301       IND401       IND263       IND352       IND316 
#> -0.238519359 -0.297359129 -0.424067348 -0.282617895 -0.050625075 -0.255118778 
#>       IND341       IND489       IND388       IND457       IND181       IND431 
#> -0.459988827 -0.313308038 -0.311893938  0.040361232 -0.454678575 -0.575772676 
#>       IND140       IND512       IND330       IND420       IND377       IND139 
#> -0.218079182 -0.400390897 -0.392403815 -0.258866066 -0.353999148  0.032783885 
#>       IND171       IND165       IND395       IND436       IND475       IND167 
#>  0.002110163 -0.083110045 -0.540800075 -0.680934271 -0.444409304 -0.160315170 
#>       IND111       IND473       IND379       IND501       IND164       IND360 
#> -0.189364791 -0.415996781 -0.534266922 -0.437482227 -0.269667607 -0.244362503 
#>       IND390       IND376       IND425        IND66       IND243        IND39 
#> -0.016687981 -0.300365083 -0.559132052 -0.397677303 -0.253962997 -0.157969029 
#>       IND125        IND59       IND218       IND361       IND469       IND321 
#>  0.198755812 -0.270404134 -0.157473118 -0.435929952 -0.198513588 -0.471754809

This model object is a list with a Model and a R2 slot. The Model slot stores a list of models for each isoform.

We can convert this model from a list to a tibble using the convert_model() function. Make sure you have an annotation object to have position and REF/ALT allele information.

model_tsv = convert_model(model_isotwas,
                          snp_annot = snps$map,
                          snp_var = 'marker.ID')
model_tsv
#> # A tibble: 700 × 10
#>    SNP        Weight Transcript    R2  R2.P chromosome genetic.dist physical.pos
#>    <chr>       <dbl> <chr>      <dbl> <dbl>      <int>        <int>        <int>
#>  1 SNP10    -9.57e-5 Isoform1   0.317     0          1            0        10099
#>  2 SNP100   -2.42e-3 Isoform3   0.425     0          1            0       100107
#>  3 SNP1009   2.15e-3 Isoform3   0.425     0          1            0      1009108
#>  4 SNP101    8.68e-3 Isoform2   0.463     0          1            0       101086
#>  5 SNP1012  -1.99e-2 Isoform1   0.317     0          1            0      1012093
#>  6 SNP1017  -7.31e-3 Isoform2   0.463     0          1            0      1017093
#>  7 SNP1020   4.89e-3 Isoform2   0.463     0          1            0      1020103
#>  8 SNP1025   3.36e-2 Isoform1   0.317     0          1            0      1025113
#>  9 SNP103   -5.00e-2 Isoform2   0.463     0          1            0       103078
#> 10 SNP1042  -1.09e-2 Isoform2   0.463     0          1            0      1042088
#> # ℹ 690 more rows
#> # ℹ 2 more variables: allele1 <chr>, allele2 <chr>