Train an isoTWAS model
pipeline.Rmd
Introduction
This module will cover how to use functions from isotwas
to train an multivariate isotwas
model for isoform-level
expression. Along the way, we’ll be covering some practical
considerations for computation and feature selection.
Train predictive model
To train a multivariate model, we need:
a matrix of dosages of SNPs, and
a matrix of isoform-level expressions for a given gene.
Isoform expression estimated from pseudo-alignment methods like salmon
and
kallisto
provide bootstrapped estimates expression across a number of iterations.
This information can be included into model training, as illustrated
below.
Practical consideration: Ideally, to reduce
your memory usage, we recommend storing your genetic data as
PLINK
format data (ideally as
.bed
/bim
/fam
or
.pgen
/.pvar
/.psam
). For a given
gene, use PLINK
to select only
SNPs on the same chromosome and within 1 Megabase of the gene body.
Then, you can use the bigsnpr
package read in this “condensed” file set. Make sure your isoform
expression matrix is log-transformed and residualized by covariates that
are normally used in a QTL analysis (i.e., age, sex, principal
components of the genotype matrix, PEER
factors or hidden covariates with
prior for the full expression matrix). In order to ensure maximal
overlap between the model and the GWAS we will eventually use for trait
mapping, we suggest restricting the SNPs in the QTL panel to only those
present in the LD reference panel or to those annotated by the HapMap3 Project.
This will ensure proper standard error calculations.
For this vignette about training an model, we’ll use some toy genetic and transcriptomic data provided.
train_bed = system.file("extdata", "train.bed", package = "isotwas")
snps = bigsnpr::snp_attach(bigsnpr::snp_readBed2(train_bed,
backingfile = tempfile()))
dim(snps$genotypes[])
#> [1] 300 4542
isoform_mat = readRDS(system.file("extdata", "isoform_exp.RDS", package = "isotwas"))
dim(isoform_mat)
#> [1] 300 3
This reads in a SNP matrix with dimensions \(300 \times 4542\), with samples along the rows and SNPs along the columns and a isoform-level expression matrix of dimension \(300 \times 3\), with isoforms along the columns.
Practical consideration: Though this is not a
requirement for the isotwas pipeline, we recommend a feature selection
step based on cis-heritability of isoform-level expression. Prior of
model training for isoforms of a given gene, we recommend that users
compute the heritability of all expressed isoforms of a gene using GCTA
(ideally the gcta-nr-robust
executable in FUSION
). We recommend users only train
models for isoforms that have positive heritability at nominal \(P < 0.05\) or \(0.01\).
Practical consideration: Prior to heritability analysis and model training, it is imperative that your expression data is residualized by relevant covariates (i.e. principal components of genotype matrix, hidden covariates or PEER factors of the expression matrix, clinical/demographic covariates). A general rule of thumb is if you would include the covariate in an eQTL/isoform-level eQTL analysis, you should include it in the residualization process.
Let’s jitter our expression matrix a little to generate 10 bootstrapped estimates of the matrix. Keep in mind that these replicates, theoretically, should not affect the effect size estimates but will help in seeding the estimation process of the correlation between isoforms/error in the multivariate model.
boot_list = lapply(1:10,function(i){jitter(isoform_mat)})
isoform_mat_rep = rlist::list.rbind(boot_list)
rownames(isoform_mat_rep) = rep(paste0('Sample',1:nrow(snps$fam)),10)
colnames(isoform_mat_rep) = paste0('Isoform',1:ncol(isoform_mat_rep))
dim(isoform_mat_rep)
#> [1] 3000 3
Now, we can train the model using the compute_isotwas()
function. Make sure the SNP matrix has clear labels for the SNP
identifiers that are relevant to your study. This is a large SNP matrix
which will take a few minutes to run on your local machine.
snp_mat = as.matrix(snps$genotypes[])
colnames(snp_mat) = snps$map$marker.ID
rownames(snp_mat) = snps$fam$sample.ID
isotwas_model = compute_isotwas(X = snp_mat,
Y = isoform_mat,
Y.rep = isoform_mat_rep,
R = 10,
id = rownames(isoform_mat_rep),
omega_est = 'replicates',
omega_nlambda = 5,
method = c('mrce_lasso',
'multi_enet',
'univariate',
'joinet',
'spls'),
predict_nlambda = 10,
family = 'gaussian',
scale = FALSE,
alpha = 0.5,
nfolds = 5,
verbose = TRUE,
tx_names = paste0('Isoform',1:3),
seed = 1789,
run_all = FALSE,
return_all = TRUE)
A few notes on the options in the function:
By selecting
omega_est = 'replicates'
, we are used the bootstrapped replicates to estimate the correlation between isoforms to seed any methods that require this matrix. The alternative isomega_est = 'mean'
, which will use whatever matrix you submit forY
.omega_nlambda
selects the number of steps between fully dense to fully sparse in the correlation matrix estimation. We recommend 5-10 as the default here.There are multiple methods to predict expression. However, through simulations and real data examples, we recommend the five shown here as the default in the
method
option.predict_nlambda
is the number of penalty variables that are considered from themrce_lasso
method. We recommend 10-20 for this variable.family
andalpha
are formulti_enet
andunivariate
to determine how elastic net regression is run. The default is that the response follows a Normal distribution and the mixing parameter is set to 0.5.If
run_all = TRUE
, then all methods will be run.If
return_all = TRUE
, then predicted values for each isoform and predictive performance for all methods and all isoforms will be outputted.
We can convert the model from a list (the output from
compute_isotwas()
) to a tibble or data.frame using
convert_model(isotwas_model)
.
Let’s take a look at what a sample model looks like:
model_isotwas = readRDS(system.file("extdata", "model_example.RDS", package = "isotwas"))
class(model_isotwas)
#> [1] "list"
length(model_isotwas)
#> [1] 2
names(model_isotwas)
#> [1] "Model" "R2"
model_isotwas$Model[[1]]
#> $Transcript
#> [1] "Isoform1"
#>
#> $Model
#> # A tibble: 187 × 2
#> SNP Weight
#> <chr> <dbl>
#> 1 SNP10 -0.0000957
#> 2 SNP19 -0.00158
#> 3 SNP80 -0.00375
#> 4 SNP89 -0.000884
#> 5 SNP94 -0.00504
#> 6 SNP124 0.00586
#> 7 SNP195 -0.0201
#> 8 SNP232 -0.00627
#> 9 SNP233 0.000960
#> 10 SNP261 -0.00688
#> # ℹ 177 more rows
#>
#> $R2
#> [1] 0.3174829
#>
#> $P
#> value
#> 0
#>
#> $Pred
#> IND190 IND380 IND130 IND314 IND462 IND137
#> -0.395051620 -0.405521454 -0.405642239 -0.513143425 -0.219109235 -0.472395293
#> IND191 IND228 IND253 IND358 IND318 IND3
#> -0.186011347 -0.047232532 -0.039820719 -0.203023847 -0.279393749 -0.177154706
#> IND28 IND182 IND91 IND476 IND255 IND209
#> -0.261468441 -0.572125420 -0.351933854 -0.493579401 -0.417614473 -0.059488250
#> IND135 IND223 IND340 IND120 IND115 IND391
#> -0.318086976 -0.319871820 -0.135863999 0.083039071 -0.469286056 -0.281497318
#> IND304 IND163 IND386 IND69 IND121 IND269
#> -0.342756842 -0.169790486 -0.552650084 -0.339994140 0.018921426 -0.359762240
#> IND24 IND31 IND349 IND32 IND332 IND495
#> -0.171493417 -0.450912106 -0.411439831 -0.220513041 -0.445241556 -0.302148360
#> IND328 IND371 IND272 IND23 IND55 IND463
#> -0.572640195 -0.090248595 -0.172602006 -0.107964558 -0.450873553 -0.297293819
#> IND194 IND441 IND113 IND47 IND502 IND409
#> 0.002500033 -0.082561510 -0.053926154 0.014315858 -0.263294983 -0.424747059
#> IND82 IND48 IND303 IND176 IND393 IND52
#> -0.215259827 -0.180472021 -0.566428240 -0.154369420 -0.292469824 -0.125122866
#> IND458 IND426 IND15 IND10 IND211 IND244
#> -0.297137749 -0.571241703 -0.271153164 -0.372562537 -0.139310663 0.058872696
#> IND65 IND50 IND442 IND213 IND157 IND300
#> -0.353977257 -0.370280430 -0.369661968 -0.308463656 -0.015128680 -0.372327867
#> IND25 IND254 IND49 IND199 IND189 IND295
#> -0.294685131 0.207142198 -0.265891894 -0.210070398 -0.125493031 -0.208084001
#> IND88 IND459 IND339 IND46 IND325 IND419
#> -0.188952067 -0.341369214 -0.323699038 -0.194042453 -0.177038660 0.040164417
#> IND302 IND216 IND178 IND282 IND186 IND239
#> -0.388742683 -0.144205644 -0.215211141 -0.347436381 0.070054061 -0.016246983
#> IND89 IND71 IND33 IND494 IND369 IND109
#> -0.558938422 -0.009956788 -0.518804964 -0.397442503 -0.462158482 -0.620882241
#> IND64 IND364 IND1 IND323 IND60 IND151
#> -0.470768744 -0.505031962 -0.474205183 -0.369106906 -0.183303208 -0.474395370
#> IND293 IND437 IND85 IND324 IND375 IND423
#> -0.216436666 -0.452684526 -0.437603184 -0.588708106 -0.588703304 -0.443682464
#> IND367 IND516 IND95 IND461 IND451 IND76
#> -0.601498157 -0.233798606 -0.410275308 -0.511844894 -0.326213821 -0.370696978
#> IND298 IND310 IND222 IND102 IND491 IND422
#> -0.502436692 -0.408828852 -0.111287912 -0.226551260 -0.223498530 0.158679573
#> IND351 IND309 IND277 IND214 IND0 IND438
#> -0.344041151 -0.424716100 -0.612379239 -0.231148880 -0.417714624 -0.438855224
#> IND453 IND400 IND443 IND227 IND331 IND169
#> -0.594149200 -0.385144435 -0.456708590 -0.532575545 -0.315295504 -0.235696550
#> IND197 IND142 IND416 IND51 IND346 IND27
#> -0.298466128 0.034104976 -0.526523901 0.508010660 -0.126042297 -0.128398444
#> IND297 IND479 IND488 IND290 IND44 IND308
#> -0.259539348 -0.454747553 -0.504412520 -0.510470021 -0.142340184 0.072124794
#> IND486 IND7 IND284 IND444 IND220 IND374
#> -0.329956853 -0.350941227 -0.163496962 -0.413414477 -0.593246582 -0.818095517
#> IND500 IND11 IND354 IND246 IND439 IND392
#> -0.501920608 -0.214219476 0.007729149 -0.275404004 -0.644399750 -0.285348795
#> IND448 IND506 IND192 IND484 IND203 IND452
#> -0.492513383 -0.310190245 -0.024709405 -0.435482280 -0.269321957 -0.552898172
#> IND92 IND134 IND287 IND424 IND307 IND122
#> -0.101055241 0.008905316 -0.306358542 -0.461001763 -0.356879362 0.058371001
#> IND414 IND212 IND20 IND127 IND236 IND258
#> -0.311646715 -0.111162671 -0.193471243 0.028880043 -0.407672480 0.142610577
#> IND305 IND268 IND513 IND496 IND147 IND322
#> -0.340565157 -0.471679317 -0.404866473 -0.395120760 -0.072070907 -0.277843970
#> IND45 IND68 IND201 IND348 IND43 IND75
#> 0.035081966 -0.363338384 -0.652974372 -0.412158936 -0.071163309 -0.080312324
#> IND90 IND41 IND387 IND276 IND465 IND224
#> 0.025514564 -0.323200418 -0.477190145 -0.154720858 -0.341964318 -0.196673915
#> IND481 IND143 IND226 IND480 IND126 IND498
#> -0.663275058 -0.056703898 -0.082764588 -0.210979906 0.048488465 -0.283945815
#> IND233 IND195 IND61 IND173 IND97 IND478
#> -0.266472867 -0.217675991 0.014263476 -0.454493334 -0.550484287 -0.041911623
#> IND18 IND397 IND180 IND315 IND338 IND359
#> -0.330665666 -0.436392508 -0.609082779 -0.569036586 -0.212491654 -0.343991921
#> IND508 IND101 IND232 IND454 IND145 IND370
#> -0.477263014 -0.464338378 -0.072071562 -0.227516541 -0.389293368 -0.173894914
#> IND159 IND12 IND77 IND421 IND112 IND289
#> 0.119232022 -0.094999753 -0.481360824 -0.433878747 -0.100941994 -0.581080319
#> IND353 IND235 IND29 IND221 IND335 IND249
#> -0.332866777 -0.078403530 -0.340509458 -0.130391230 -0.223907121 -0.298415252
#> IND84 IND296 IND183 IND344 IND259 IND384
#> -0.199496411 -0.410031721 -0.377059571 -0.693998245 -0.370623474 -0.246649423
#> IND148 IND485 IND264 IND54 IND119 IND146
#> -0.271335029 -0.225853454 -0.261874967 0.087807798 -0.158416463 -0.214176185
#> IND93 IND80 IND16 IND350 IND503 IND490
#> -0.420587615 -0.521984344 -0.070014457 -0.477095206 -0.366749255 -0.412342395
#> IND270 IND70 IND385 IND433 IND428 IND280
#> -0.502675568 -0.087020068 -0.312957603 -0.141636588 -0.630821691 -0.190763136
#> IND234 IND104 IND155 IND417 IND207 IND67
#> -0.145463209 -0.065338172 -0.102717325 -0.406053957 -0.487929907 -0.542512235
#> IND471 IND6 IND413 IND412 IND245 IND96
#> 0.034250790 -0.327396727 -0.497126708 -0.185326538 -0.331678057 -0.078888704
#> IND63 IND301 IND401 IND263 IND352 IND316
#> -0.238519359 -0.297359129 -0.424067348 -0.282617895 -0.050625075 -0.255118778
#> IND341 IND489 IND388 IND457 IND181 IND431
#> -0.459988827 -0.313308038 -0.311893938 0.040361232 -0.454678575 -0.575772676
#> IND140 IND512 IND330 IND420 IND377 IND139
#> -0.218079182 -0.400390897 -0.392403815 -0.258866066 -0.353999148 0.032783885
#> IND171 IND165 IND395 IND436 IND475 IND167
#> 0.002110163 -0.083110045 -0.540800075 -0.680934271 -0.444409304 -0.160315170
#> IND111 IND473 IND379 IND501 IND164 IND360
#> -0.189364791 -0.415996781 -0.534266922 -0.437482227 -0.269667607 -0.244362503
#> IND390 IND376 IND425 IND66 IND243 IND39
#> -0.016687981 -0.300365083 -0.559132052 -0.397677303 -0.253962997 -0.157969029
#> IND125 IND59 IND218 IND361 IND469 IND321
#> 0.198755812 -0.270404134 -0.157473118 -0.435929952 -0.198513588 -0.471754809
This model object is a list with a Model
and a
R2
slot. The Model
slot stores a list of
models for each isoform.
We can convert this model from a list to a tibble
using
the convert_model()
function. Make sure you have an
annotation object to have position and REF/ALT allele information.
model_tsv = convert_model(model_isotwas,
snp_annot = snps$map,
snp_var = 'marker.ID')
model_tsv
#> # A tibble: 700 × 10
#> SNP Weight Transcript R2 R2.P chromosome genetic.dist physical.pos
#> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <int>
#> 1 SNP10 -9.57e-5 Isoform1 0.317 0 1 0 10099
#> 2 SNP100 -2.42e-3 Isoform3 0.425 0 1 0 100107
#> 3 SNP1009 2.15e-3 Isoform3 0.425 0 1 0 1009108
#> 4 SNP101 8.68e-3 Isoform2 0.463 0 1 0 101086
#> 5 SNP1012 -1.99e-2 Isoform1 0.317 0 1 0 1012093
#> 6 SNP1017 -7.31e-3 Isoform2 0.463 0 1 0 1017093
#> 7 SNP1020 4.89e-3 Isoform2 0.463 0 1 0 1020103
#> 8 SNP1025 3.36e-2 Isoform1 0.317 0 1 0 1025113
#> 9 SNP103 -5.00e-2 Isoform2 0.463 0 1 0 103078
#> 10 SNP1042 -1.09e-2 Isoform2 0.463 0 1 0 1042088
#> # ℹ 690 more rows
#> # ℹ 2 more variables: allele1 <chr>, allele2 <chr>