Skip to contents

Introduction

This module will cover how to use functions from isotwas to train an multivariate isotwas model for isoform-level expression. Along the way, we’ll be covering some practical considerations for computation and feature selection.

Train predictive model

To train a multivariate model, we need:

  1. a matrix of dosages of SNPs, and

  2. a matrix of isoform-level expressions for a given gene.

Isoform expression estimated from pseudo-alignment methods like salmon and kallisto provide bootstrapped estimates expression across a number of iterations. This information can be included into model training, as illustrated below.

Practical consideration: Ideally, to reduce your memory usage, we recommend storing your genetic data as PLINK format data (ideally as .bed/bim/fam or .pgen/.pvar/.psam). For a given gene, use PLINK to select only SNPs on the same chromosome and within 1 Megabase of the gene body. Then, you can use the bigsnpr package read in this “condensed” file set. Make sure your isoform expression matrix is log-transformed and residualized by covariates that are normally used in a QTL analysis (i.e., age, sex, principal components of the genotype matrix, PEER factors or hidden covariates with prior for the full expression matrix). In order to ensure maximal overlap between the model and the GWAS we will eventually use for trait mapping, we suggest restricting the SNPs in the QTL panel to only those present in the LD reference panel or to those annotated by the HapMap3 Project. This will ensure proper standard error calculations.

For this vignette about training an model, we’ll use some toy genetic and transcriptomic data provided.

train_bed = system.file("extdata", "train.bed", package = "isotwas")
snps = bigsnpr::snp_attach(bigsnpr::snp_readBed2(train_bed,
                                                 backingfile = tempfile()))
dim(snps$genotypes[])
#> [1]  300 4542

isoform_mat = readRDS(system.file("extdata", "isoform_exp.RDS", package = "isotwas"))
dim(isoform_mat)
#> [1] 300   3

This reads in a SNP matrix with dimensions 300×4542300 \times 4542, with samples along the rows and SNPs along the columns and a isoform-level expression matrix of dimension 300×3300 \times 3, with isoforms along the columns.

Practical consideration: Though this is not a requirement for the isotwas pipeline, we recommend a feature selection step based on cis-heritability of isoform-level expression. Prior of model training for isoforms of a given gene, we recommend that users compute the heritability of all expressed isoforms of a gene using GCTA (ideally the gcta-nr-robust executable in FUSION). We recommend users only train models for isoforms that have positive heritability at nominal P<0.05P < 0.05 or 0.010.01.

Practical consideration: Prior to heritability analysis and model training, it is imperative that your expression data is residualized by relevant covariates (i.e. principal components of genotype matrix, hidden covariates or PEER factors of the expression matrix, clinical/demographic covariates). A general rule of thumb is if you would include the covariate in an eQTL/isoform-level eQTL analysis, you should include it in the residualization process.

Let’s jitter our expression matrix a little to generate 10 bootstrapped estimates of the matrix. Keep in mind that these replicates, theoretically, should not affect the effect size estimates but will help in seeding the estimation process of the correlation between isoforms/error in the multivariate model.

boot_list = lapply(1:10,function(i){jitter(isoform_mat)})
isoform_mat_rep = rlist::list.rbind(boot_list)
rownames(isoform_mat_rep) = rep(paste0('Sample',1:nrow(snps$fam)),10)
colnames(isoform_mat_rep) = paste0('Isoform',1:ncol(isoform_mat_rep))
dim(isoform_mat_rep)
#> [1] 3000    3

Available Prediction Methods

The compute_isotwas() function supports 9 multivariate prediction methods:

Method Description
mrce_lasso Multivariate Regression with Covariance Estimation (MRCE) - jointly estimates regression coefficients and error covariance
multi_enet Multivariate elastic net - fits elastic net for each isoform with shared penalty
joinet Joinet stacked elastic net - uses cross-predictions to improve multivariate prediction
spls Sparse partial least squares - finds latent components that predict all isoforms
sgl Sparse group lasso - encourages both within-group and across-group sparsity
mtlasso Multi-task lasso (L21) - joint feature selection across all isoforms
stacking Super learner stacking - combines predictions from multiple base methods
graph_reg Graph-regularized regression - incorporates isoform similarity structure
univariate Best of univariate methods (elastic net, BLUP, SuSiE) per isoform

Training a Model

Now, we can train the model using the compute_isotwas() function. Make sure the SNP matrix has clear labels for the SNP identifiers that are relevant to your study.

The function now provides detailed progress output by default (verbose = TRUE), showing:

  • Data summary (samples, SNPs, transcripts)
  • Method-by-method progress with timing
  • Per-method R² performance
  • Final results comparison table
snp_mat = as.matrix(snps$genotypes[])
colnames(snp_mat) = snps$map$marker.ID
rownames(snp_mat) = snps$fam$sample.ID

# Train with all methods (recommended for best model selection)
isotwas_model = compute_isotwas(
  X = snp_mat,
  Y = isoform_mat,
  Y.rep = isoform_mat_rep,
  R = 10,
  id = rownames(isoform_mat_rep),
  omega_est = 'replicates',
  omega_nlambda = 5,
  nfolds = 5,
  verbose = TRUE,
  seed = 1789,
  run_all = TRUE,        # Run all 9 methods
  return_all = TRUE      # Return results from all methods
)

Running Specific Methods

If you want to run only specific methods (e.g., for faster computation), set run_all = FALSE and specify the methods:

# Run only a subset of methods
isotwas_model = compute_isotwas(
  X = snp_mat,
  Y = isoform_mat,
  Y.rep = isoform_mat_rep,
  R = 10,
  id = rownames(isoform_mat_rep),
  method = c('mrce_lasso', 'multi_enet', 'sgl', 'univariate'),
  nfolds = 5,
  verbose = TRUE,
  seed = 1789,
  run_all = FALSE
)

Using Graph-Regularized Regression

The graph_reg method can incorporate prior knowledge about isoform similarity (e.g., shared exon structure). You can create a similarity matrix from a GTF annotation file:

# Create similarity matrix from GTF (based on shared exons)
sim_result = similarity_from_gtf(
  gtf_path = "gencode.v45.annotation.gtf.gz",
  gene = "BRCA1",
  method = "jaccard"  # Options: "jaccard", "overlap_coef", "binary"
)

# Use in model training
isotwas_model = compute_isotwas(
  X = snp_mat,
  Y = isoform_mat,
  Y.rep = isoform_mat_rep,
  R = 10,
  id = rownames(isoform_mat_rep),
  similarity_matrix = sim_result$similarity_matrix,
  method = c('multi_enet', 'graph_reg', 'univariate'),
  run_all = FALSE,
  verbose = TRUE,
  seed = 1789
)

Function Parameters

A few notes on the key options:

  • omega_est: By selecting omega_est = 'replicates', we use the bootstrapped replicates to estimate the correlation between isoforms. The alternative is omega_est = 'mean', which will use whatever matrix you submit for Y.

  • omega_nlambda: Number of steps between fully dense to fully sparse in the correlation matrix estimation. We recommend 5-10 as the default.

  • method: Vector of methods to use. By default with run_all = TRUE, all 9 methods are run and the best model is selected.

  • similarity_matrix: Optional matrix for graph_reg method. Can be created using similarity_from_gtf() or create_similarity_from_exons().

  • predict_nlambda: Number of penalty variables for mrce_lasso. We recommend 10-20.

  • alpha: Elastic net mixing parameter (0 = ridge, 1 = lasso, 0.5 = elastic net).

  • verbose: If TRUE (default), prints detailed progress including per-method timing and performance.

  • run_all: If TRUE, all methods will be run regardless of method parameter.

  • return_all: If TRUE, results from all methods are returned (useful for method comparison).

Examining Results

The output from compute_isotwas() is an isotwas_result object with several components:

# Print summary
print(isotwas_model)

# Access the comparison table (R² for each method)
isotwas_model$comparison

# Access the best model
isotwas_model$best_models

# Access individual transcript results
isotwas_model$best_models$Isoform1$r2
isotwas_model$best_models$Isoform1$weights

# Get the full weight matrix
weight_mat = get_weight_matrix(isotwas_model$best_models)

Converting the Model

We can convert the model from a list to a tibble or data.frame using convert_model():

Let’s take a look at what a sample model looks like:

model_isotwas = readRDS(system.file("extdata", "model_example.RDS", package = "isotwas"))
class(model_isotwas)
#> [1] "list"
length(model_isotwas)
#> [1] 2
names(model_isotwas)
#> [1] "Model" "R2"
model_isotwas$Model[[1]]
#> $Transcript
#> [1] "Isoform1"
#> 
#> $Model
#> # A tibble: 187 × 2
#>    SNP        Weight
#>    <chr>       <dbl>
#>  1 SNP10  -0.0000957
#>  2 SNP19  -0.00158  
#>  3 SNP80  -0.00375  
#>  4 SNP89  -0.000884 
#>  5 SNP94  -0.00504  
#>  6 SNP124  0.00586  
#>  7 SNP195 -0.0201   
#>  8 SNP232 -0.00627  
#>  9 SNP233  0.000960 
#> 10 SNP261 -0.00688  
#> # ℹ 177 more rows
#> 
#> $R2
#> [1] 0.3174829
#> 
#> $P
#> value 
#>     0 
#> 
#> $Pred
#>       IND190       IND380       IND130       IND314       IND462       IND137 
#> -0.395051620 -0.405521454 -0.405642239 -0.513143425 -0.219109235 -0.472395293 
#>       IND191       IND228       IND253       IND358       IND318         IND3 
#> -0.186011347 -0.047232532 -0.039820719 -0.203023847 -0.279393749 -0.177154706 
#>        IND28       IND182        IND91       IND476       IND255       IND209 
#> -0.261468441 -0.572125420 -0.351933854 -0.493579401 -0.417614473 -0.059488250 
#>       IND135       IND223       IND340       IND120       IND115       IND391 
#> -0.318086976 -0.319871820 -0.135863999  0.083039071 -0.469286056 -0.281497318 
#>       IND304       IND163       IND386        IND69       IND121       IND269 
#> -0.342756842 -0.169790486 -0.552650084 -0.339994140  0.018921426 -0.359762240 
#>        IND24        IND31       IND349        IND32       IND332       IND495 
#> -0.171493417 -0.450912106 -0.411439831 -0.220513041 -0.445241556 -0.302148360 
#>       IND328       IND371       IND272        IND23        IND55       IND463 
#> -0.572640195 -0.090248595 -0.172602006 -0.107964558 -0.450873553 -0.297293819 
#>       IND194       IND441       IND113        IND47       IND502       IND409 
#>  0.002500033 -0.082561510 -0.053926154  0.014315858 -0.263294983 -0.424747059 
#>        IND82        IND48       IND303       IND176       IND393        IND52 
#> -0.215259827 -0.180472021 -0.566428240 -0.154369420 -0.292469824 -0.125122866 
#>       IND458       IND426        IND15        IND10       IND211       IND244 
#> -0.297137749 -0.571241703 -0.271153164 -0.372562537 -0.139310663  0.058872696 
#>        IND65        IND50       IND442       IND213       IND157       IND300 
#> -0.353977257 -0.370280430 -0.369661968 -0.308463656 -0.015128680 -0.372327867 
#>        IND25       IND254        IND49       IND199       IND189       IND295 
#> -0.294685131  0.207142198 -0.265891894 -0.210070398 -0.125493031 -0.208084001 
#>        IND88       IND459       IND339        IND46       IND325       IND419 
#> -0.188952067 -0.341369214 -0.323699038 -0.194042453 -0.177038660  0.040164417 
#>       IND302       IND216       IND178       IND282       IND186       IND239 
#> -0.388742683 -0.144205644 -0.215211141 -0.347436381  0.070054061 -0.016246983 
#>        IND89        IND71        IND33       IND494       IND369       IND109 
#> -0.558938422 -0.009956788 -0.518804964 -0.397442503 -0.462158482 -0.620882241 
#>        IND64       IND364         IND1       IND323        IND60       IND151 
#> -0.470768744 -0.505031962 -0.474205183 -0.369106906 -0.183303208 -0.474395370 
#>       IND293       IND437        IND85       IND324       IND375       IND423 
#> -0.216436666 -0.452684526 -0.437603184 -0.588708106 -0.588703304 -0.443682464 
#>       IND367       IND516        IND95       IND461       IND451        IND76 
#> -0.601498157 -0.233798606 -0.410275308 -0.511844894 -0.326213821 -0.370696978 
#>       IND298       IND310       IND222       IND102       IND491       IND422 
#> -0.502436692 -0.408828852 -0.111287912 -0.226551260 -0.223498530  0.158679573 
#>       IND351       IND309       IND277       IND214         IND0       IND438 
#> -0.344041151 -0.424716100 -0.612379239 -0.231148880 -0.417714624 -0.438855224 
#>       IND453       IND400       IND443       IND227       IND331       IND169 
#> -0.594149200 -0.385144435 -0.456708590 -0.532575545 -0.315295504 -0.235696550 
#>       IND197       IND142       IND416        IND51       IND346        IND27 
#> -0.298466128  0.034104976 -0.526523901  0.508010660 -0.126042297 -0.128398444 
#>       IND297       IND479       IND488       IND290        IND44       IND308 
#> -0.259539348 -0.454747553 -0.504412520 -0.510470021 -0.142340184  0.072124794 
#>       IND486         IND7       IND284       IND444       IND220       IND374 
#> -0.329956853 -0.350941227 -0.163496962 -0.413414477 -0.593246582 -0.818095517 
#>       IND500        IND11       IND354       IND246       IND439       IND392 
#> -0.501920608 -0.214219476  0.007729149 -0.275404004 -0.644399750 -0.285348795 
#>       IND448       IND506       IND192       IND484       IND203       IND452 
#> -0.492513383 -0.310190245 -0.024709405 -0.435482280 -0.269321957 -0.552898172 
#>        IND92       IND134       IND287       IND424       IND307       IND122 
#> -0.101055241  0.008905316 -0.306358542 -0.461001763 -0.356879362  0.058371001 
#>       IND414       IND212        IND20       IND127       IND236       IND258 
#> -0.311646715 -0.111162671 -0.193471243  0.028880043 -0.407672480  0.142610577 
#>       IND305       IND268       IND513       IND496       IND147       IND322 
#> -0.340565157 -0.471679317 -0.404866473 -0.395120760 -0.072070907 -0.277843970 
#>        IND45        IND68       IND201       IND348        IND43        IND75 
#>  0.035081966 -0.363338384 -0.652974372 -0.412158936 -0.071163309 -0.080312324 
#>        IND90        IND41       IND387       IND276       IND465       IND224 
#>  0.025514564 -0.323200418 -0.477190145 -0.154720858 -0.341964318 -0.196673915 
#>       IND481       IND143       IND226       IND480       IND126       IND498 
#> -0.663275058 -0.056703898 -0.082764588 -0.210979906  0.048488465 -0.283945815 
#>       IND233       IND195        IND61       IND173        IND97       IND478 
#> -0.266472867 -0.217675991  0.014263476 -0.454493334 -0.550484287 -0.041911623 
#>        IND18       IND397       IND180       IND315       IND338       IND359 
#> -0.330665666 -0.436392508 -0.609082779 -0.569036586 -0.212491654 -0.343991921 
#>       IND508       IND101       IND232       IND454       IND145       IND370 
#> -0.477263014 -0.464338378 -0.072071562 -0.227516541 -0.389293368 -0.173894914 
#>       IND159        IND12        IND77       IND421       IND112       IND289 
#>  0.119232022 -0.094999753 -0.481360824 -0.433878747 -0.100941994 -0.581080319 
#>       IND353       IND235        IND29       IND221       IND335       IND249 
#> -0.332866777 -0.078403530 -0.340509458 -0.130391230 -0.223907121 -0.298415252 
#>        IND84       IND296       IND183       IND344       IND259       IND384 
#> -0.199496411 -0.410031721 -0.377059571 -0.693998245 -0.370623474 -0.246649423 
#>       IND148       IND485       IND264        IND54       IND119       IND146 
#> -0.271335029 -0.225853454 -0.261874967  0.087807798 -0.158416463 -0.214176185 
#>        IND93        IND80        IND16       IND350       IND503       IND490 
#> -0.420587615 -0.521984344 -0.070014457 -0.477095206 -0.366749255 -0.412342395 
#>       IND270        IND70       IND385       IND433       IND428       IND280 
#> -0.502675568 -0.087020068 -0.312957603 -0.141636588 -0.630821691 -0.190763136 
#>       IND234       IND104       IND155       IND417       IND207        IND67 
#> -0.145463209 -0.065338172 -0.102717325 -0.406053957 -0.487929907 -0.542512235 
#>       IND471         IND6       IND413       IND412       IND245        IND96 
#>  0.034250790 -0.327396727 -0.497126708 -0.185326538 -0.331678057 -0.078888704 
#>        IND63       IND301       IND401       IND263       IND352       IND316 
#> -0.238519359 -0.297359129 -0.424067348 -0.282617895 -0.050625075 -0.255118778 
#>       IND341       IND489       IND388       IND457       IND181       IND431 
#> -0.459988827 -0.313308038 -0.311893938  0.040361232 -0.454678575 -0.575772676 
#>       IND140       IND512       IND330       IND420       IND377       IND139 
#> -0.218079182 -0.400390897 -0.392403815 -0.258866066 -0.353999148  0.032783885 
#>       IND171       IND165       IND395       IND436       IND475       IND167 
#>  0.002110163 -0.083110045 -0.540800075 -0.680934271 -0.444409304 -0.160315170 
#>       IND111       IND473       IND379       IND501       IND164       IND360 
#> -0.189364791 -0.415996781 -0.534266922 -0.437482227 -0.269667607 -0.244362503 
#>       IND390       IND376       IND425        IND66       IND243        IND39 
#> -0.016687981 -0.300365083 -0.559132052 -0.397677303 -0.253962997 -0.157969029 
#>       IND125        IND59       IND218       IND361       IND469       IND321 
#>  0.198755812 -0.270404134 -0.157473118 -0.435929952 -0.198513588 -0.471754809

This model object is a list with a Model and a R2 slot. The Model slot stores a list of models for each isoform.

We can convert this model from a list to a tibble using the convert_model() function. Make sure you have an annotation object to have position and REF/ALT allele information.

model_tsv = convert_model(model_isotwas,
                          snp_annot = snps$map,
                          snp_var = 'marker.ID')
model_tsv
#> # A tibble: 700 × 10
#>    SNP        Weight Transcript    R2  R2.P chromosome genetic.dist physical.pos
#>    <chr>       <dbl> <chr>      <dbl> <dbl>      <int>        <int>        <int>
#>  1 SNP10    -9.57e-5 Isoform1   0.317     0          1            0        10099
#>  2 SNP100   -2.42e-3 Isoform3   0.425     0          1            0       100107
#>  3 SNP1009   2.15e-3 Isoform3   0.425     0          1            0      1009108
#>  4 SNP101    8.68e-3 Isoform2   0.463     0          1            0       101086
#>  5 SNP1012  -1.99e-2 Isoform1   0.317     0          1            0      1012093
#>  6 SNP1017  -7.31e-3 Isoform2   0.463     0          1            0      1017093
#>  7 SNP1020   4.89e-3 Isoform2   0.463     0          1            0      1020103
#>  8 SNP1025   3.36e-2 Isoform1   0.317     0          1            0      1025113
#>  9 SNP103   -5.00e-2 Isoform2   0.463     0          1            0       103078
#> 10 SNP1042  -1.09e-2 Isoform2   0.463     0          1            0      1042088
#> # ℹ 690 more rows
#> # ℹ 2 more variables: allele1 <chr>, allele2 <chr>

Method Recommendations

Based on simulations and real data analyses:

  1. Default recommendation: Use run_all = TRUE to evaluate all methods and let the function select the best model based on cross-validated R².

  2. For speed: If computation time is a concern, start with c('multi_enet', 'sgl', 'univariate') which provides a good balance of multivariate and univariate approaches.

  3. When isoforms share structure: Use graph_reg with a similarity matrix derived from shared exons. This helps when isoforms have correlated genetic effects due to shared regulatory regions.

  4. For robust predictions: The stacking method combines multiple base methods and often provides stable predictions, especially when different methods perform well for different isoforms.

  5. For joint feature selection: Use mtlasso or sgl when you expect SNPs to affect multiple isoforms simultaneously.