For unsupervised clustering, typically only the most variably expressed genes are considered. There are multiple approaches to determining variability. The most common is to simply use variance. I was initially concerned that using raw variance would bias the selection towards more highly expressed genes, because higher expression values generally results in higher absolute variance values. Therefore, I was thinking of using the coefficient of variation instead.
However, if you compare gene expression before and after DESeq2 variance-stabilizing transformation, as expected, the relationship between the mean and variance is greatly reduced. Therefore, the need to normalize for mean expression is also greatly reduced.
scatterplot_expr_gene_mean_sd$labels$title <- "Raw expression counts per gene"
scatterplot_expr_gene_mean_sd$layers[3] <- NULL
scatterplot_texpr_gene_mean_sd <-
scatterplot_expr_gene_mean_sd %+% texpr_gene_stats +
labs(title = "Transformed expression counts per gene")
scatterplot_compare_expr_texpr <-
cowplot::plot_grid(scatterplot_expr_gene_mean_sd, scatterplot_texpr_gene_mean_sd)
scatterplot_compare_expr_texpr
That being said, variance is susceptible to outliers. A more robust measure of variability is the median absolute deviation about the median, or MAD. The formula for MAD is:
\[MAD = median( |X_i - median(X)| )\]
Because we are interested in exploring existing subtypes that should affect most samples, we aren’t as interested in variation that only arises in few samples (i.e. outliers). Hence, I will be using MAD as the metric for variability.
First, we will select the most variably expressed genes based on this metric.
top_mad <- most_mad(texpr, ntop)
texpr_top_mad <- texpr[top_mad, ]
Let’s quickly explore the properties of the selected genes relative to other genes. As we would expect, the selected genes are among those with the highest standard deviation (SD). However, there are a number of genes with high SD but have not been selected. This is most likely due to the robustness of MAD to outliers. Presumably, the high-SD genes that were not selected have high SD because of a few outlier samples.
texpr_gene_stats$selected <- "No"
texpr_gene_stats[top_mad, "selected"] <- "Yes"
scatterplot_texpr_gene_mean_sd_highlight_top_mad <-
dplyr::filter(texpr_gene_stats, selected == "No") %>%
ggplot(aes(x = mean, y = sd, colour = selected, text = gene)) +
geom_point(alpha = 0.5) +
geom_point(data = dplyr::filter(texpr_gene_stats, selected == "Yes"),
alpha = 0.5) +
scale_color_manual(values = c(No = "grey", Yes = "red"), name = "Among selected?") +
place_legend(c(1,1)) +
labs(title = "Mean and SD of expression of selected versus other genes",
x = "Mean", y = "SD")
scatterplot_texpr_gene_mean_sd_highlight_top_mad
Now that we have selected the 500 most variably expressed genes, we can perform unsupervised clustering using different methods.
Principle component analysis on the 500 most variably expressed genes (defined by MAD) shows mild clustering on the NanoString COO status (coo_ns
), explained by PC2.
pca_data <- calc_pca(tdds, intgroup = "coo_ns", ntop = ntop)
pca_percentVar <- attr(pca_data, "percentVar")
cat(paste0("Variance explained by PC1: ", round(pca_percentVar[1]*100, 1), "%"),
paste0("Variance explained by PC2: ", round(pca_percentVar[2]*100, 1), "%"),
sep = "\n")
Variance explained by PC1: 42.6%
Variance explained by PC2: 6.1%
pca_texpr <-
ggplot2::ggplot(pca_data, aes(x = PC1, y = PC2, colour = coo_ns)) +
geom_point(data = dplyr::select(pca_data, -coo_ns),
size = 2, colour = "grey", alpha = 0.5) +
geom_point(size = 2) +
scale_colour_manual(values = colours$coo) +
labs(title = paste0("Principle component analysis (PCA)\n",
"According to the ", ntop, " most variable genes\n",
"Split by Lymph2Cx COO status"),
colour = "Cell-of-origin")
pca_texpr
pca_texpr_split <-
pca_texpr +
facet_wrap(~coo_ns) +
scale_colour_manual(values = colours$coo, breaks = NULL)
pca_texpr_split
While samples with the same Lymph2Cx COO status tend to cluster together, there are larger sources of variation that prevent COO clustering cohort-wide.
hm_annot <- coo[,-1, drop = FALSE]
plot_heatmap(texpr_top_mad, hm_annot, annotation_colors = colours)
A good sanity check of the NanoString COO assignments is to cluster while only using genes whose expression define COO. For this, we are using genes reported by Wright et al.. There are several unsupervised clustering methods available to us.
Hierarchical clustering of the Wright genes is much better at separating ABC and GCB DLBCL cases (assuming the NanoString COO status is correct).
hm_annot <-
hm_annot %>%
tibble::rownames_to_column("patient") %>%
dplyr::left_join(tibble::rownames_to_column(heatmap_mut_rf_prob_annot, "patient")) %>%
tidyr::replace_na(list(gcb_num_feats = "NA", abc_num_feats = "NA", abc_gcb_diff = "NA")) %>%
tibble::column_to_rownames("patient")
texpr_wright <- texpr[gene_ids_filt %in% gene_ids$wright, ]
pheatmap_wright_hclust <- plot_heatmap(texpr_wright, hm_annot, colours)
On the other hand, k-means clustering performs very poorly.
wright_kmeans <- kmeans(t(texpr_wright), 2)
pheatmap_wright_kmeans <- plot_heatmap(texpr_wright[, order(wright_kmeans$cluster)],
hm_annot, colours, cluster_cols = FALSE)
PCA on the Wright genes definitely separates the ABC and GCB cases with some overlap remaining. Furthermore, the NA cases are spread out across the PCA plot, while the UNC cases are clearly located at the interface between ABC and GCB, which is expected.
pca_wright_data <- DESeq2::plotPCA(tdds[gene_ids_filt %in% gene_ids$wright, ],
intgroup = "coo_ns", returnData = TRUE)
pca_wright <-
ggplot2::ggplot(pca_wright_data, aes(x = PC1, y = PC2, colour = coo_ns)) +
geom_point(data = dplyr::select(pca_wright_data, -coo_ns),
size = 2, colour = "grey", alpha = 0.5) +
geom_point(size = 2) +
scale_colour_manual(values = colours$coo) +
labs(title = paste0("Principle component analysis (PCA)\n",
"Restricted to Wright et al. genes\n",
"Split by Lymph2Cx COO status"))
pca_wright
pca_wright_split <-
pca_wright +
facet_wrap(~coo_ns) +
scale_colour_manual(values = colours$coo, breaks = NULL)
pca_wright_split
And the same can be done using the genes on the Lymph2cx NanoString assay.
texpr_lymph2cx <- texpr[gene_ids_filt %in% gene_ids$lymph2cx,]
plot_heatmap(texpr_lymph2cx, hm_annot, annotation_colors = colours)