Create Isoform Similarity Matrix from GTF
Parses a GTF file and creates a similarity matrix for isoforms of a specified
gene based on shared exon structure. This matrix can be used with
multivariate_graph_reg for graph-regularized prediction.
Usage
similarity_from_gtf(
gtf_path,
gene,
method = c("jaccard", "overlap_coef", "binary"),
transcript_ids = NULL,
min_transcripts = 2,
verbose = TRUE
)
Arguments
- gtf_path
character, path to GTF file (can be gzipped)
- gene
character, gene name (HGNC symbol) or Ensembl gene ID
- method
character, similarity method:
"jaccard": Jaccard index based on exon overlap (default)
"overlap_coef": Overlap coefficient = overlap / min(length_i, length_j)
"binary": 1 if any overlap, 0 otherwise
- transcript_ids
character vector, optional subset of transcript IDs to include. If NULL, uses all transcripts for the gene.
- min_transcripts
integer, minimum number of transcripts required (default 2)
- verbose
logical, print progress messages
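As a rough illustration of the input that gtf_path and gene refer to, the sketch below selects exon records for one gene from a few made-up GTF lines using base R only. The gene name GENEA, the transcript IDs T1 and T2, and the coordinates are invented for illustration, and this is not necessarily how similarity_from_gtf parses the file internally.

```r
# Toy GTF content: 9 tab-separated columns, attributes in the last column.
# All identifiers and coordinates below are made up for illustration.
gtf_text <- paste(
  'chr1\ttoy\tgene\t100\t399\t.\t+\t.\tgene_id "G1"; gene_name "GENEA";',
  'chr1\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENEA";',
  'chr1\ttoy\texon\t300\t399\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENEA";',
  'chr1\ttoy\texon\t150\t249\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "GENEA";',
  'chr1\ttoy\texon\t900\t999\t.\t+\t.\tgene_id "G2"; transcript_id "T9"; gene_name "GENEB";',
  sep = "\n"
)

# quote = "" keeps the double quotes inside the attribute column intact.
gtf <- read.delim(
  text = gtf_text, header = FALSE, comment.char = "#", quote = "",
  stringsAsFactors = FALSE,
  col.names = c("seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute")
)

# Keep exon rows belonging to the requested gene.
is_exon <- gtf$feature == "exon"
is_gene <- grepl('gene_name "GENEA"', gtf$attribute, fixed = TRUE)
exons <- gtf[is_exon & is_gene, ]

# Pull the transcript ID out of the attribute string.
exons$transcript_id <- sub('.*transcript_id "([^"]+)".*', "\\1", exons$attribute)

# Exon counts per transcript, analogous to the n_exons element of the result.
n_exons <- table(exons$transcript_id)
```

Real annotation files are large, so a dedicated GTF reader is usually a better choice than read.delim; the point here is only the column layout and the attribute matching.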
Value
A list containing:
similarity_matrix: symmetric matrix of pairwise isoform similarities
transcript_ids: character vector of transcript IDs (matrix row/col names)
gene_id: matched gene identifier
n_transcripts: number of transcripts
n_exons: named vector of exon counts per transcript
Details
The function computes pairwise similarity between isoforms based on their exon structure. Isoforms that share more exonic sequence are considered more similar, reflecting the biological intuition that they likely share more cis-regulatory effects.
The Jaccard similarity is computed as: $$J(i,j) = \frac{|\mathrm{overlap}(\mathrm{exons}_i, \mathrm{exons}_j)|}{|\mathrm{union}(\mathrm{exons}_i, \mathrm{exons}_j)|}$$
where overlap and union are computed in terms of genomic base pairs.
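The three method options can be worked through by hand on a toy pair of isoforms. The sketch below expands each isoform's exons into the set of base-pair positions they cover and then applies the definitions given above. The helper exon_bases and the coordinates are hypothetical, and the function's own implementation may compute overlaps differently (for example with interval arithmetic rather than explicit position sets).

```r
# Hypothetical helper (not part of the package): expand exon intervals
# (1-based, inclusive, as in GTF) into the base-pair positions they cover.
exon_bases <- function(starts, ends) {
  unique(unlist(Map(seq, starts, ends)))
}

# Two made-up isoforms.
tx_a <- exon_bases(starts = c(100, 300), ends = c(199, 399))  # covers 200 bp
tx_b <- exon_bases(starts = c(150, 300), ends = c(249, 349))  # covers 150 bp

shared <- length(intersect(tx_a, tx_b))  # base pairs covered by both isoforms
total  <- length(union(tx_a, tx_b))      # base pairs covered by either isoform

jaccard      <- shared / total                              # "jaccard"
overlap_coef <- shared / min(length(tx_a), length(tx_b))    # "overlap_coef"
binary       <- as.numeric(shared > 0)                      # "binary"
```

Here the isoforms share 100 bp out of a 250 bp union, so the Jaccard index is 0.4, while the overlap coefficient is 100 / 150 because the shorter isoform covers 150 bp. The overlap coefficient is never smaller than the Jaccard index, which makes it more forgiving when one isoform is largely contained in another.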
Examples
if (FALSE) { # \dontrun{
# Create similarity matrix for BRCA1 isoforms
sim <- similarity_from_gtf("gencode.v45.annotation.gtf.gz", "BRCA1")
# Use with graph-regularized prediction
model <- multivariate_graph_reg(X, Y, similarity_matrix = sim$similarity_matrix)
} # }