First, I load the Salmon quantification results. I use a transcript-to-gene map file to summarise the transcript counts at the gene level. Subsequently, I subset all expression matrices to the gene set common between the Salmon results and the htseq-count results.
# CDS only counts
salmon_cds_dir <- file.path(PROJHOME, "data", "salmon", "cds_only")
tx2gene_cds <- file.path(PROJHOME, "reference", "cds_transcript_gene.map")
salmon_cds <- load_salmon(salmon_cds_dir, tx2gene_cds)
counts_cds <- round(salmon_cds$counts)
# Full gene counts
salmon_full_dir <- file.path(PROJHOME, "data", "salmon", "full_gene")
tx2gene_full <- file.path(PROJHOME, "reference", "all_transcript_gene.map")
salmon_full <- load_salmon(salmon_full_dir, tx2gene_full)
counts_full <- round(salmon_full$counts)
counts_htseq_all <- set_rownames(expr, genes)
common_genes <- intersect(rownames(counts_cds), rownames(counts_htseq_all))
counts_cds <- counts_cds[common_genes, ]
counts_full <- counts_full[common_genes, ]
counts_htseq <- counts_htseq_all[common_genes, ]
counts_df <-
dplyr::bind_rows(
expr_matrix_to_df(counts_cds, "salmon_cds"),
expr_matrix_to_df(counts_full, "salmon_full"),
expr_matrix_to_df(counts_htseq, "htseq_cds")) %>%
tidyr::spread(method, expr)
Read coverage in NFKBIZ was heavily skewed towards the 3’ end. This skewness was expected to affect the expression estimates. For this reason, we opted for CDS-only gene models for estimating expression using htseq-count.
I was interested in employing Salmon for estimating expression in order to include non-coding genes while using the latest genome build (GRCh38). However, the same concern persisted: will including untranslated regions (esp. UTRs) affect expression estimates?
In this analysis, I’m simply comparing NFKBIZ expression across all samples depending on whether which gene model was used for the Salmon quantification (CDS only or full gene).
nfkbiz_counts_df <-
counts_df %>%
dplyr::filter(gene == "NFKBIZ")
nfkbiz_cds_vs_full <-
nfkbiz_counts_df %>%
ggplot(aes(x = salmon_cds, y = salmon_full)) +
geom_abline(slope = 1, colour = "dodgerblue", size = 1) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
coord_fixed() +
labs(
title = "NFKBIZ Expression",
x = "Salmon (CDS only; log scale)",
y = "Salmon (full gene; log scale)")
nfkbiz_cds_vs_full
Visually, the correlation between the two gene models looks very good. Quantitatively, the Pearson correlation coefficient is 0.9997833, which is extremely high. In essence, Salmon seems to be robust to skewness in the read coverage across the gene. It properly handles the large number of reads aligned to the 3’UTR in the case of NFKBIZ.
After confirming that the gene model has little effect on the expression estimates produced by Salmon, I next wanted to see how well Salmon’s estimates agree with HTSeq’s for NFKBIZ.
nfkbiz_salmon_vs_htseq <-
nfkbiz_counts_df %>%
ggplot(aes(x = salmon_cds, y = htseq_cds)) +
geom_abline(slope = 1, colour = "dodgerblue", size = 1) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
coord_fixed() +
labs(
title = "NFKBIZ Expression",
x = "Salmon (CDS only; log scale)",
y = "HTSeq (CDS only; log scale)")
nfkbiz_salmon_vs_htseq
As you can see above, the correlation between methods is worse than between gene models. The Pearson correlation coefficient is not as bad as it seems due to the log scale: 0.9571582. Given that all data points lie below the identity line (y = x
), HTSeq seems to be consistently underestimating expression compared to Salmon. It is unclear whether this is due to an increase in sensitivity or a decrease in specificity associated with Salmon.
After inspecting how different gene models and methods affect NFKBIZ expression, I am curious about their effects genome-wide. Because of the higher dimensionality, I am
cds_vs_full_patient_cor_df <-
counts_df %>%
dplyr::group_by(patient) %>%
dplyr::summarise(pearson = cor(salmon_cds, salmon_full, method = "pearson"),
spearman = cor(salmon_cds, salmon_full, method = "spearman"))
cds_vs_full_patient_cor_plot <-
cds_vs_full_patient_cor_df %>%
ggplot(aes(x = pearson, y = spearman, label = patient)) +
geom_point() +
labs(title = "Correlation coefficients between patients\nComparing gene models using Salmon")
cds_vs_full_patient_cor_plot
cds_vs_full_gene_cor_df <-
counts_df %>%
dplyr::group_by(gene) %>%
dplyr::mutate(min_median = min(median(salmon_cds), median(salmon_full))) %>%
dplyr::filter(min_median > 1) %>%
dplyr::summarise(pearson = cor(salmon_cds, salmon_full, method = "pearson"),
spearman = cor(salmon_cds, salmon_full, method = "spearman"))
cds_vs_full_gene_cor_plot <-
cds_vs_full_gene_cor_df %>%
ggplot(aes(x = pearson, y = spearman, label = gene)) +
geom_point() +
labs(title = "Correlation coefficients between genes\nComparing gene models using Salmon")
cds_vs_full_gene_cor_plot
salmon_vs_htseq_patient_cor_df <-
counts_df %>%
dplyr::group_by(patient) %>%
dplyr::summarise(pearson = cor(salmon_cds, htseq_cds, method = "pearson"),
spearman = cor(salmon_cds, htseq_cds, method = "spearman"))
salmon_vs_htseq_patient_cor_plot <-
salmon_vs_htseq_patient_cor_df %>%
ggplot(aes(x = pearson, y = spearman, label = patient)) +
geom_point() +
labs(title = "Correlation coefficients between patients\nComparing Salmon and HTSeq")
salmon_vs_htseq_patient_cor_plot
salmon_vs_htseq_gene_cor_df <-
counts_df %>%
dplyr::group_by(gene) %>%
dplyr::mutate(min_median = min(median(salmon_cds), median(htseq_cds))) %>%
dplyr::filter(min_median > 1) %>%
dplyr::summarise(pearson = cor(salmon_cds, htseq_cds, method = "pearson"),
spearman = cor(salmon_cds, htseq_cds, method = "spearman"))
salmon_vs_htseq_gene_cor_plot <-
salmon_vs_htseq_gene_cor_df %>%
ggplot(aes(x = pearson, y = spearman, label = gene)) +
geom_point() +
labs(title = "Correlation coefficients between genes\nComparing Salmon and HTSeq")
salmon_vs_htseq_gene_cor_plot