Train an isoTWAS model
pipeline.RmdIntroduction
This module will cover how to use functions from isotwas
to train an multivariate isotwas model for isoform-level
expression. Along the way, we’ll be covering some practical
considerations for computation and feature selection.
Train predictive model
To train a multivariate model, we need:
a matrix of dosages of SNPs, and
a matrix of isoform-level expressions for a given gene.
Isoform expression estimated from pseudo-alignment methods like salmon and
kallisto
provide bootstrapped estimates expression across a number of iterations.
This information can be included into model training, as illustrated
below.
Practical consideration: Ideally, to reduce
your memory usage, we recommend storing your genetic data as
PLINK format data (ideally as
.bed/bim/fam or
.pgen/.pvar/.psam). For a given
gene, use PLINK to select only
SNPs on the same chromosome and within 1 Megabase of the gene body.
Then, you can use the bigsnpr
package read in this “condensed” file set. Make sure your isoform
expression matrix is log-transformed and residualized by covariates that
are normally used in a QTL analysis (i.e., age, sex, principal
components of the genotype matrix, PEER
factors or hidden covariates with
prior for the full expression matrix). In order to ensure maximal
overlap between the model and the GWAS we will eventually use for trait
mapping, we suggest restricting the SNPs in the QTL panel to only those
present in the LD reference panel or to those annotated by the HapMap3 Project.
This will ensure proper standard error calculations.
For this vignette about training an model, we’ll use some toy genetic and transcriptomic data provided.
train_bed = system.file("extdata", "train.bed", package = "isotwas")
snps = bigsnpr::snp_attach(bigsnpr::snp_readBed2(train_bed,
backingfile = tempfile()))
dim(snps$genotypes[])
#> [1] 300 4542
isoform_mat = readRDS(system.file("extdata", "isoform_exp.RDS", package = "isotwas"))
dim(isoform_mat)
#> [1] 300 3This reads in a SNP matrix with dimensions , with samples along the rows and SNPs along the columns and a isoform-level expression matrix of dimension , with isoforms along the columns.
Practical consideration: Though this is not a
requirement for the isotwas pipeline, we recommend a feature selection
step based on cis-heritability of isoform-level expression. Prior of
model training for isoforms of a given gene, we recommend that users
compute the heritability of all expressed isoforms of a gene using GCTA
(ideally the gcta-nr-robust
executable in FUSION). We recommend users only train
models for isoforms that have positive heritability at nominal
or
.
Practical consideration: Prior to heritability analysis and model training, it is imperative that your expression data is residualized by relevant covariates (i.e. principal components of genotype matrix, hidden covariates or PEER factors of the expression matrix, clinical/demographic covariates). A general rule of thumb is if you would include the covariate in an eQTL/isoform-level eQTL analysis, you should include it in the residualization process.
Let’s jitter our expression matrix a little to generate 10 bootstrapped estimates of the matrix. Keep in mind that these replicates, theoretically, should not affect the effect size estimates but will help in seeding the estimation process of the correlation between isoforms/error in the multivariate model.
Available Prediction Methods
The compute_isotwas() function supports 9 multivariate
prediction methods:
| Method | Description |
|---|---|
mrce_lasso |
Multivariate Regression with Covariance Estimation (MRCE) - jointly estimates regression coefficients and error covariance |
multi_enet |
Multivariate elastic net - fits elastic net for each isoform with shared penalty |
joinet |
Joinet stacked elastic net - uses cross-predictions to improve multivariate prediction |
spls |
Sparse partial least squares - finds latent components that predict all isoforms |
sgl |
Sparse group lasso - encourages both within-group and across-group sparsity |
mtlasso |
Multi-task lasso (L21) - joint feature selection across all isoforms |
stacking |
Super learner stacking - combines predictions from multiple base methods |
graph_reg |
Graph-regularized regression - incorporates isoform similarity structure |
univariate |
Best of univariate methods (elastic net, BLUP, SuSiE) per isoform |
Training a Model
Now, we can train the model using the compute_isotwas()
function. Make sure the SNP matrix has clear labels for the SNP
identifiers that are relevant to your study.
The function now provides detailed progress output by default
(verbose = TRUE), showing:
- Data summary (samples, SNPs, transcripts)
- Method-by-method progress with timing
- Per-method R² performance
- Final results comparison table
snp_mat = as.matrix(snps$genotypes[])
colnames(snp_mat) = snps$map$marker.ID
rownames(snp_mat) = snps$fam$sample.ID
# Train with all methods (recommended for best model selection)
isotwas_model = compute_isotwas(
X = snp_mat,
Y = isoform_mat,
Y.rep = isoform_mat_rep,
R = 10,
id = rownames(isoform_mat_rep),
omega_est = 'replicates',
omega_nlambda = 5,
nfolds = 5,
verbose = TRUE,
seed = 1789,
run_all = TRUE, # Run all 9 methods
return_all = TRUE # Return results from all methods
)Running Specific Methods
If you want to run only specific methods (e.g., for faster
computation), set run_all = FALSE and specify the
methods:
# Run only a subset of methods
isotwas_model = compute_isotwas(
X = snp_mat,
Y = isoform_mat,
Y.rep = isoform_mat_rep,
R = 10,
id = rownames(isoform_mat_rep),
method = c('mrce_lasso', 'multi_enet', 'sgl', 'univariate'),
nfolds = 5,
verbose = TRUE,
seed = 1789,
run_all = FALSE
)Using Graph-Regularized Regression
The graph_reg method can incorporate prior knowledge
about isoform similarity (e.g., shared exon structure). You can create a
similarity matrix from a GTF annotation file:
# Create similarity matrix from GTF (based on shared exons)
sim_result = similarity_from_gtf(
gtf_path = "gencode.v45.annotation.gtf.gz",
gene = "BRCA1",
method = "jaccard" # Options: "jaccard", "overlap_coef", "binary"
)
# Use in model training
isotwas_model = compute_isotwas(
X = snp_mat,
Y = isoform_mat,
Y.rep = isoform_mat_rep,
R = 10,
id = rownames(isoform_mat_rep),
similarity_matrix = sim_result$similarity_matrix,
method = c('multi_enet', 'graph_reg', 'univariate'),
run_all = FALSE,
verbose = TRUE,
seed = 1789
)Function Parameters
A few notes on the key options:
omega_est: By selectingomega_est = 'replicates', we use the bootstrapped replicates to estimate the correlation between isoforms. The alternative isomega_est = 'mean', which will use whatever matrix you submit forY.omega_nlambda: Number of steps between fully dense to fully sparse in the correlation matrix estimation. We recommend 5-10 as the default.method: Vector of methods to use. By default withrun_all = TRUE, all 9 methods are run and the best model is selected.similarity_matrix: Optional matrix forgraph_regmethod. Can be created usingsimilarity_from_gtf()orcreate_similarity_from_exons().predict_nlambda: Number of penalty variables formrce_lasso. We recommend 10-20.alpha: Elastic net mixing parameter (0 = ridge, 1 = lasso, 0.5 = elastic net).verbose: IfTRUE(default), prints detailed progress including per-method timing and performance.run_all: IfTRUE, all methods will be run regardless ofmethodparameter.return_all: IfTRUE, results from all methods are returned (useful for method comparison).
Examining Results
The output from compute_isotwas() is an
isotwas_result object with several components:
# Print summary
print(isotwas_model)
# Access the comparison table (R² for each method)
isotwas_model$comparison
# Access the best model
isotwas_model$best_models
# Access individual transcript results
isotwas_model$best_models$Isoform1$r2
isotwas_model$best_models$Isoform1$weights
# Get the full weight matrix
weight_mat = get_weight_matrix(isotwas_model$best_models)Converting the Model
We can convert the model from a list to a tibble or data.frame using
convert_model():
Let’s take a look at what a sample model looks like:
model_isotwas = readRDS(system.file("extdata", "model_example.RDS", package = "isotwas"))
class(model_isotwas)
#> [1] "list"
length(model_isotwas)
#> [1] 2
names(model_isotwas)
#> [1] "Model" "R2"
model_isotwas$Model[[1]]
#> $Transcript
#> [1] "Isoform1"
#>
#> $Model
#> # A tibble: 187 × 2
#> SNP Weight
#> <chr> <dbl>
#> 1 SNP10 -0.0000957
#> 2 SNP19 -0.00158
#> 3 SNP80 -0.00375
#> 4 SNP89 -0.000884
#> 5 SNP94 -0.00504
#> 6 SNP124 0.00586
#> 7 SNP195 -0.0201
#> 8 SNP232 -0.00627
#> 9 SNP233 0.000960
#> 10 SNP261 -0.00688
#> # ℹ 177 more rows
#>
#> $R2
#> [1] 0.3174829
#>
#> $P
#> value
#> 0
#>
#> $Pred
#> IND190 IND380 IND130 IND314 IND462 IND137
#> -0.395051620 -0.405521454 -0.405642239 -0.513143425 -0.219109235 -0.472395293
#> IND191 IND228 IND253 IND358 IND318 IND3
#> -0.186011347 -0.047232532 -0.039820719 -0.203023847 -0.279393749 -0.177154706
#> IND28 IND182 IND91 IND476 IND255 IND209
#> -0.261468441 -0.572125420 -0.351933854 -0.493579401 -0.417614473 -0.059488250
#> IND135 IND223 IND340 IND120 IND115 IND391
#> -0.318086976 -0.319871820 -0.135863999 0.083039071 -0.469286056 -0.281497318
#> IND304 IND163 IND386 IND69 IND121 IND269
#> -0.342756842 -0.169790486 -0.552650084 -0.339994140 0.018921426 -0.359762240
#> IND24 IND31 IND349 IND32 IND332 IND495
#> -0.171493417 -0.450912106 -0.411439831 -0.220513041 -0.445241556 -0.302148360
#> IND328 IND371 IND272 IND23 IND55 IND463
#> -0.572640195 -0.090248595 -0.172602006 -0.107964558 -0.450873553 -0.297293819
#> IND194 IND441 IND113 IND47 IND502 IND409
#> 0.002500033 -0.082561510 -0.053926154 0.014315858 -0.263294983 -0.424747059
#> IND82 IND48 IND303 IND176 IND393 IND52
#> -0.215259827 -0.180472021 -0.566428240 -0.154369420 -0.292469824 -0.125122866
#> IND458 IND426 IND15 IND10 IND211 IND244
#> -0.297137749 -0.571241703 -0.271153164 -0.372562537 -0.139310663 0.058872696
#> IND65 IND50 IND442 IND213 IND157 IND300
#> -0.353977257 -0.370280430 -0.369661968 -0.308463656 -0.015128680 -0.372327867
#> IND25 IND254 IND49 IND199 IND189 IND295
#> -0.294685131 0.207142198 -0.265891894 -0.210070398 -0.125493031 -0.208084001
#> IND88 IND459 IND339 IND46 IND325 IND419
#> -0.188952067 -0.341369214 -0.323699038 -0.194042453 -0.177038660 0.040164417
#> IND302 IND216 IND178 IND282 IND186 IND239
#> -0.388742683 -0.144205644 -0.215211141 -0.347436381 0.070054061 -0.016246983
#> IND89 IND71 IND33 IND494 IND369 IND109
#> -0.558938422 -0.009956788 -0.518804964 -0.397442503 -0.462158482 -0.620882241
#> IND64 IND364 IND1 IND323 IND60 IND151
#> -0.470768744 -0.505031962 -0.474205183 -0.369106906 -0.183303208 -0.474395370
#> IND293 IND437 IND85 IND324 IND375 IND423
#> -0.216436666 -0.452684526 -0.437603184 -0.588708106 -0.588703304 -0.443682464
#> IND367 IND516 IND95 IND461 IND451 IND76
#> -0.601498157 -0.233798606 -0.410275308 -0.511844894 -0.326213821 -0.370696978
#> IND298 IND310 IND222 IND102 IND491 IND422
#> -0.502436692 -0.408828852 -0.111287912 -0.226551260 -0.223498530 0.158679573
#> IND351 IND309 IND277 IND214 IND0 IND438
#> -0.344041151 -0.424716100 -0.612379239 -0.231148880 -0.417714624 -0.438855224
#> IND453 IND400 IND443 IND227 IND331 IND169
#> -0.594149200 -0.385144435 -0.456708590 -0.532575545 -0.315295504 -0.235696550
#> IND197 IND142 IND416 IND51 IND346 IND27
#> -0.298466128 0.034104976 -0.526523901 0.508010660 -0.126042297 -0.128398444
#> IND297 IND479 IND488 IND290 IND44 IND308
#> -0.259539348 -0.454747553 -0.504412520 -0.510470021 -0.142340184 0.072124794
#> IND486 IND7 IND284 IND444 IND220 IND374
#> -0.329956853 -0.350941227 -0.163496962 -0.413414477 -0.593246582 -0.818095517
#> IND500 IND11 IND354 IND246 IND439 IND392
#> -0.501920608 -0.214219476 0.007729149 -0.275404004 -0.644399750 -0.285348795
#> IND448 IND506 IND192 IND484 IND203 IND452
#> -0.492513383 -0.310190245 -0.024709405 -0.435482280 -0.269321957 -0.552898172
#> IND92 IND134 IND287 IND424 IND307 IND122
#> -0.101055241 0.008905316 -0.306358542 -0.461001763 -0.356879362 0.058371001
#> IND414 IND212 IND20 IND127 IND236 IND258
#> -0.311646715 -0.111162671 -0.193471243 0.028880043 -0.407672480 0.142610577
#> IND305 IND268 IND513 IND496 IND147 IND322
#> -0.340565157 -0.471679317 -0.404866473 -0.395120760 -0.072070907 -0.277843970
#> IND45 IND68 IND201 IND348 IND43 IND75
#> 0.035081966 -0.363338384 -0.652974372 -0.412158936 -0.071163309 -0.080312324
#> IND90 IND41 IND387 IND276 IND465 IND224
#> 0.025514564 -0.323200418 -0.477190145 -0.154720858 -0.341964318 -0.196673915
#> IND481 IND143 IND226 IND480 IND126 IND498
#> -0.663275058 -0.056703898 -0.082764588 -0.210979906 0.048488465 -0.283945815
#> IND233 IND195 IND61 IND173 IND97 IND478
#> -0.266472867 -0.217675991 0.014263476 -0.454493334 -0.550484287 -0.041911623
#> IND18 IND397 IND180 IND315 IND338 IND359
#> -0.330665666 -0.436392508 -0.609082779 -0.569036586 -0.212491654 -0.343991921
#> IND508 IND101 IND232 IND454 IND145 IND370
#> -0.477263014 -0.464338378 -0.072071562 -0.227516541 -0.389293368 -0.173894914
#> IND159 IND12 IND77 IND421 IND112 IND289
#> 0.119232022 -0.094999753 -0.481360824 -0.433878747 -0.100941994 -0.581080319
#> IND353 IND235 IND29 IND221 IND335 IND249
#> -0.332866777 -0.078403530 -0.340509458 -0.130391230 -0.223907121 -0.298415252
#> IND84 IND296 IND183 IND344 IND259 IND384
#> -0.199496411 -0.410031721 -0.377059571 -0.693998245 -0.370623474 -0.246649423
#> IND148 IND485 IND264 IND54 IND119 IND146
#> -0.271335029 -0.225853454 -0.261874967 0.087807798 -0.158416463 -0.214176185
#> IND93 IND80 IND16 IND350 IND503 IND490
#> -0.420587615 -0.521984344 -0.070014457 -0.477095206 -0.366749255 -0.412342395
#> IND270 IND70 IND385 IND433 IND428 IND280
#> -0.502675568 -0.087020068 -0.312957603 -0.141636588 -0.630821691 -0.190763136
#> IND234 IND104 IND155 IND417 IND207 IND67
#> -0.145463209 -0.065338172 -0.102717325 -0.406053957 -0.487929907 -0.542512235
#> IND471 IND6 IND413 IND412 IND245 IND96
#> 0.034250790 -0.327396727 -0.497126708 -0.185326538 -0.331678057 -0.078888704
#> IND63 IND301 IND401 IND263 IND352 IND316
#> -0.238519359 -0.297359129 -0.424067348 -0.282617895 -0.050625075 -0.255118778
#> IND341 IND489 IND388 IND457 IND181 IND431
#> -0.459988827 -0.313308038 -0.311893938 0.040361232 -0.454678575 -0.575772676
#> IND140 IND512 IND330 IND420 IND377 IND139
#> -0.218079182 -0.400390897 -0.392403815 -0.258866066 -0.353999148 0.032783885
#> IND171 IND165 IND395 IND436 IND475 IND167
#> 0.002110163 -0.083110045 -0.540800075 -0.680934271 -0.444409304 -0.160315170
#> IND111 IND473 IND379 IND501 IND164 IND360
#> -0.189364791 -0.415996781 -0.534266922 -0.437482227 -0.269667607 -0.244362503
#> IND390 IND376 IND425 IND66 IND243 IND39
#> -0.016687981 -0.300365083 -0.559132052 -0.397677303 -0.253962997 -0.157969029
#> IND125 IND59 IND218 IND361 IND469 IND321
#> 0.198755812 -0.270404134 -0.157473118 -0.435929952 -0.198513588 -0.471754809This model object is a list with a Model and a
R2 slot. The Model slot stores a list of
models for each isoform.
We can convert this model from a list to a tibble using
the convert_model() function. Make sure you have an
annotation object to have position and REF/ALT allele information.
model_tsv = convert_model(model_isotwas,
snp_annot = snps$map,
snp_var = 'marker.ID')
model_tsv
#> # A tibble: 700 × 10
#> SNP Weight Transcript R2 R2.P chromosome genetic.dist physical.pos
#> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <int>
#> 1 SNP10 -9.57e-5 Isoform1 0.317 0 1 0 10099
#> 2 SNP100 -2.42e-3 Isoform3 0.425 0 1 0 100107
#> 3 SNP1009 2.15e-3 Isoform3 0.425 0 1 0 1009108
#> 4 SNP101 8.68e-3 Isoform2 0.463 0 1 0 101086
#> 5 SNP1012 -1.99e-2 Isoform1 0.317 0 1 0 1012093
#> 6 SNP1017 -7.31e-3 Isoform2 0.463 0 1 0 1017093
#> 7 SNP1020 4.89e-3 Isoform2 0.463 0 1 0 1020103
#> 8 SNP1025 3.36e-2 Isoform1 0.317 0 1 0 1025113
#> 9 SNP103 -5.00e-2 Isoform2 0.463 0 1 0 103078
#> 10 SNP1042 -1.09e-2 Isoform2 0.463 0 1 0 1042088
#> # ℹ 690 more rows
#> # ℹ 2 more variables: allele1 <chr>, allele2 <chr>Method Recommendations
Based on simulations and real data analyses:
Default recommendation: Use
run_all = TRUEto evaluate all methods and let the function select the best model based on cross-validated R².For speed: If computation time is a concern, start with
c('multi_enet', 'sgl', 'univariate')which provides a good balance of multivariate and univariate approaches.When isoforms share structure: Use
graph_regwith a similarity matrix derived from shared exons. This helps when isoforms have correlated genetic effects due to shared regulatory regions.For robust predictions: The
stackingmethod combines multiple base methods and often provides stable predictions, especially when different methods perform well for different isoforms.For joint feature selection: Use
mtlassoorsglwhen you expect SNPs to affect multiple isoforms simultaneously.