hello,大家好,今天我们来分享一个新的空间注释的方法,当然主要思想还是解卷积,关于空间转录组的注释方法,我已经分享了很多了,在这里列举出来,For your information。
MIA for combined single-cell and spatial analysis
10X single cell and spatial joint analysis by cell2location
Summary of methods for joint analysis of 10X spatial transcriptome and 10X single-cell data
10X Single Cell Spatial Joint Analysis IV ----DSTG
10X Single Cell Spatial Joint Analysis III ----Spotlight
10X single-cell spatial joint analysis of five ----spatialDWLS
10X single-cell spatial joint analysis VI (single-cell spatial joint analysis based on the number of cells per spot) ----Tangram
10X Single Cell-10X Spatial Transcriptome Co-analysis VII ----CellDART
There are, of course, methods for annotating spatial transcriptomes on the basis of markers thatSummary of Ideas for 10X Spatial Transcriptome Data Analysis (for tumor samples)
10X Single Cell-10X Spatial Transcriptome Co-analysis VIII ----STRIDE (3D Reconstruction)
Of course, there are many methods, not necessarily all suitable for their own sample analysis, each method has its advantages and shortcomings, we have to choose according to their own actual situation. Well, to start our sharing today, the article is atReference-free cell-type deconvolution of pixel-resolution spatially resolved transcriptomics data, the method used is STdeconvolve, below put a few analyze the effect of the picture.
Of course, the space transcriptome can be analyzed is too much, if you are interested, you can comment at the bottom of the article, if there are more people, I will give you an open class, we can communicate with each other, well, start our content sharing today.
Abstract
Recent technological advances have made spatially resolved transcriptome analysis possible, but at multicellular pixel resolution, thereby hindering the identification of spatial co-localization patterns of cell types. The authors developed STdeconvolve as an unsupervised method to deconvolve underlying cell types containing such multicellular pixel resolution spatially resolved transcriptomics datasets. Practical application shows that STdeconvolve effectively recovers the putative transcriptomic features of the cell types and their scaled representation within spatially-resolved pixels without relying on external single-cell transcriptomic references (which would really be perfect if it were true that there were no single-cell data to come, let's take a look down the road).
Main
Depicting the spatial organization of transcriptionally distinct cell types within tissues is critical for understanding the cellular basis of tissue function. Recent techniques have enabled spatially resolved transcriptome (ST) analysis within tissues at multi-cell pixel resolution. As such, these ST measurements represent a mixture of cells that may contain multiple cell types. This lack of single-cell resolution hinders the characterization of cell type-specific spatial organization. To address this challenge, recently developedSPOTlight and supervised deconvolution methods such as RCTD to predict the proportion of cell types within the ST pixel. However, these supervised deconvolution methods rely on the availability of a suitable single-cell reference, and limitations may arise if such a reference is not available due to budgetary, technical, or biological constraints. Here, the authors developedSTdeconvolve作为一种用于解卷积多细胞像素分辨率 ST 数据的无监督、无参考的分析方法(chart below)。
Given a count matrix of ST data, STdeconvolve infers putative transcriptome features for different cell types and their proportional representation within each multicellular spatially resolved ST pixel. Briefly, STdeconvolve's first feature selects genes that are likely to provide information about transcription of different cell types. STdeconvolve then estimates the number of transcriptionally distinct cell type K's based on Latent Dirichlet Allocation (LDA) and deconvolves these K cell types within the ST pixel. STdeconvolve utilizes several unique features of ST data that make LDA particularly suitable for this application.
STdeconvolve was first evaluated in recovering scaled representations of cell types and their transcriptomic profiles using simulated ST dataperformances. This was accomplished by aggregating single-cell resolution multiple error robust fluorescence in situ hybridization (MERFISH) data from the mouse medial preoptic area (MPOA) into 100 μm2 pixels.
- classifier for sums of money:Ground truth single-cell resolution MERFISH data of one section of the MPOA partitioned into 100 μm2 pixels (black dashed squares). Each dot is a single cell colored by its ground truth cell-type label.
Applying STdeconvolve, cell types with different K=9 transcripts were identified and their transcriptome profiles and scaled representations were back-convoluted in each simulated pixel (lower panel).
- classifier for sums of money:B. Proportions of deconvolved cell-types from STdeconvolve, represented as pie charts for each simulated pixel。
To infer the identity of deconvoluted cell types for benchmarking, their deconvoluted transcriptional profiles were matched to those of real cell types (below).
- 注:(B)Pearson’s correlation between the transcriptional profiles of the 9 ground truth cell-types in the MERFISH MPOA data and the 9 deconvolved cell-types by STdeconvolve.(C). Distributions of Pearson’s correlations between the transcriptional profiles or pixel proportions of matched deconvolved and ground truth cell-types. Matched cell-types are indicated as highlighted boxes in B.
观察到每个去卷积细胞类型的转录组谱与跨基因匹配的真实细胞类型之间的强相关性,同样地,在模拟像素中每个去卷积细胞类型和匹配的真实细胞类型的比例之间存在强相关性。
- 注:(C)The ranking of each gene based on its expression level in the transcriptional profiles of the deconvolved cell-types, compared to its gene rank in the transcriptional profile of the matched ground truth cell-type. (D). Heatmap of Pearson’s correlations between the proportions of the deconvolved cell-types and ground truth cell-types across simulated pixels. Ground truth cell-types are ordered by their frequencies in the ground truth dataset. Matched deconvolved and ground truth cell-types are boxed.
一些去卷积的细胞类型,如细胞类型 X2 和 X8 都与兴奋性神经元匹配,而细胞类型 X4 和 X7 All matched with inhibitory neurons。 Based on previous notes,Further division of true excitatory and inhibitory cell types into other subtypes(common 76 kind),These deconvoluted cell types were found to be associated with specific combinations of neuronal subtypes
When further expanding the number of deconvolved cell types to K=76, it was possible to identify deconvolved cell typessuch as pericytes and microglia that were highly correlated in terms of transcriptional profiles and pixel ratios with finer neuronal subtypes as well as rare cell types.In addition, since current ST technology allows for spatial transcriptome analyses to be carried out at varying resolutions, it was further simulated another 20 μm2 resolution ST dataset and observed a similar strong correlation between the transcriptome profiles of the deconvolved cell types and the ratio of the underlying facts of STdeconvolvement (lower panel).
- classifier for sums of money:Comparison of STdeconvolve cell-types to ground truth cell-types of a 20 µm2 197 pixel MERFISH MPOA ST dataset。
Using simulated MPOA's 100 μm2 resolution ST data, STdeconvolve was also combined with an existing supervised deconvolution approachSPOTlight and RCTD were compared. For ideal single-cell transcriptomics reference, raw single-cell MERFISH data were used. The performance of each method was evaluated using the root-mean-square error (RMSE) of the deconvolved cell type proportions versus what is true on the simulated pixel (Methods). Overall, the performance of STdeconvolve was found to be comparable to that ofSPOTlightand RCTD are comparable. A potential limitation of such existing supervised deconvolution methods is their dependence on a suitable single-cell reference. Therefore, an attempt was made to evaluate their performance in the absence of a suitable single-cell reference by removing neuronal cell types from the MERFISH single-cell reference. Re-evaluating the performance resulted inSPOTlightand RCTD's RMSE increase.
Next, the performance of STdeconvolve was evaluated by analyzing 100 μm2 resolution ST data from the mouse main olfactory bulb (MOB), which consists of multiple bilaterally symmetric and transcriptionally distinct cell layers due to topographically organized sensory inputs. While previous cluster analysis of MOB ST data revealed the coarse spatial organization of the thick cell layer, finer structures such as rostral migratory flow (RMS) could not be readily observed.We applied STdeconvolve to identify K=12 transcriptionally distinct cell-types that either overlapped with or further split coarse cell layers previously identified from clustering analysis.
- classifier for sums of money:Deconvolved cell-type proportions for ST data of the MOB from STdeconvolve, represented as pie charts for each ST pixel. Pixels are outlined with colors based on the pixel transcriptional cluster assignment corresponding to MOB coarse cell layers.
In particular, although the deconvoluted cell type X7 overlaps with the granular cell layer previously identified by clustering analysis, it is spatially located at the position of the expected RMS.
Genes known to be up-regulated in their deconvolutional transcriptional profiles, including Nrep, Sox11, and Dcx, are associated with neuronal differentiation or up-regulated in neuronal precursor cells within the RMS and mark the expected spatial organization based on ISH staining.
This suggests that deconvolved cell type X7 corresponds to a neuronal precursor cell type within the RMS not identified from the clustering analysis. Similarly, using the appropriate MOB scRNA-seq reference to compare STdeconvolve with theSPOTlightand RCTD were compared and a high degree of correspondence was found between all assessment methods.
The performance of this supervised deconvolution method in the absence of a suitable single-cell reference was again compared by removing olfactory ensheathing cells (OECs) from the MOB scRNA-seq reference. OECs were initially predicted to be enriched in the olfactory nerve layer by all methods evaluated
However, given this new reference without OEC.SPOTlightand RCTD incorrectly predicted that N2 cells are enriched and highly abundant in the olfactory neuropil, despite the fact that all methods initially predicted that N2 cells are rare. In addition, trained on scRNA-seq references from mouse cortexSPOTlightand RCTD, resulting in cortical references to vascularized soft meninges (VLMC) cell clusters being incorrectly predicted to be highly enriched in the olfactory neuropil . Thus, supervised deconvolution methods may be sensitive to the single-cell transcriptomics reference used.
Finally, to demonstrate the potential of an unsupervised, unreferenced approach, STdeconvolve was applied to 100 μm2 resolution ST data from 4 breast cancer sections. Here, a matching scRNA-seq reference was not available and the use of a scRNA-seq reference from another breast cancer sample may not be appropriate due to potential inter-tumor heterogeneity. Transcript clustering of ST pixels previously identified 3 transcriptionally distinct clusters corresponding to 3 histological regions of tissue: ductal carcinoma in situ, invasive carcinoma, and nonmalignant .
However, the tumor microenvironment is a complex environment for many other cell types. Applying STdeconvolve to identify additional cell types and interrogate their spatial organization leads to K=15 identification of cell types.
- classifier for sums of money:(A)Deconvolved cell-type pixel proportions for ST data of a breast cancer tissue section, represented as pie charts. Pixels are outlined with colors based on the pixel transcriptional cluster assignment corresponding to 3 pathological annotations. (B). Highlight of deconvolved cell-type X15. Pixel proportion of deconvolved cell-type X15 are indicated as black slices in pie charts. Pixels are outlined with colors as in A). An H&E-stained image of the breast cancer tissue is shown in the background.
where the deconvoluted cell types X3, X13, and X15 corresponded spatially and the transcriptional profiles of deconvolution were consistent with nonmalignant, ductal carcinoma in situ, and invasive carcinoma annotations, respectively. However, deconvoluted cell type X15 was spatially enriched at the interface of cancerous and nonmalignant regions of the tissue. Highly expressed genes of de-convoluted cell type X15 include immune genes such as CD74 and CXCL10, and gene set enrichment analyses showed significant enrichment in immune processes suggesting that de-convoluted cell type 15 may correspond to immune cells. This spatial organization of immune cells along the tumor border has previously been suggested to play a role in breast cancer prognosis. Thus STdeconvolve may help to deconvolve different cell types transcribed in heterogeneous tissues to uncover potentially clinically relevant spatial organization.
In summary, the authors developed STdeconvolve as a tool for analyzing ST data to recover cell type ratios and transcriptional profiles without relying on single-cell transcriptomics references. The analysis demonstrates that STdeconvolve can generalize the expected biology and provide competitive performance with existing supervised methods when a suitable single-cell reference is available, as well as potentially superior performance when a suitable single-cell reference is not available. In general, it is expected that STdeconvolve will be useful for studying the spatial relationships between transcriptionally distinct cell types in complex heterogeneous tissues.
Method (headache)arithmeticIt's happening again.)
LDA Modeling (this model, is what we should focus on)
Selection of genes for LDA model
Selection of LDA model with optimal number of cell-types
Comparison between methods
Finally, let's take a look at the sample code
(of cargo etc) load
-
require(remotes)
-
remotes::install_github('JEFworks-Lab/STdeconvolve')
-
library(STdeconvolve)
-
data(mOB)
-
pos <- mOB$pos
-
cd <- mOB$counts
-
annot <- mOB$annot
STdeconvolve
The first feature selects genes that are most likely to be relevant for distinguishing cell types by looking for genes that are highly overdispersed across ST pixels. Pixels with too few genes or genes with too few reads can also be removed.
-
## remove pixels with too few genes
-
counts <- cleanCounts(counts = cd,
-
.size = 100,
-
= 1,
-
= 1)
-
## feature select for genes
-
corpus <- restrictCorpus(counts,
-
removeAbove=1.0,
-
removeBelow = 0.05,
-
alpha = 0.05,
-
plot = TRUE,
-
verbose = TRUE)
STdeconvolve
Latent Dirichlet allocation (LDA), a generative statistical model commonly used in natural language processing, is then applied to find theK
A potential cell type.STdeconvolve
fit a range of LDA models to inform the bestK
The choice.
-
## Note: the input corpus needs to be an integer count matrix of pixels x genes
-
ldas <- fitLDA(t(as.matrix(corpus)), Ks = seq(2, 9, by = 1), plot=TRUE, verbose=TRUE)
In this example, we will use the model with the lowest model perplexity.
The shaded region indicates where a fitted model for a given K had an alpha > 1. alpha is an LDA parameter that is solved for during model fitting and corresponds to the shape parameter of a symmetric Dirichlet distribution. In the model, this Dirichlet distribution describes the cell-type proportions in the pixels. A symmetric Dirichlet with alpha > 1 would lead to more uniform cell-type distributions in the pixels and difficulty identifying distinct cell-types. Instead, we want models with alphas < 1, resulting in sparse distributions where only a few cell-types are represented in a given pixel.
The resulting theta matrix can be interpreted as the proportion of each deconvolved cell-type across each spatially resolved pixel. The resulting beta matrix can be interpreted as the putative gene expression profile for each deconvolved cell-type normalized to a library size of 1. This beta matrix can be scaled by a depth factor (ex. 1000) for interpretability.
-
## select model with minimum perplexity
-
optLDA <- optimalModel(models = ldas, opt = "min")
-
## extract pixel cell-type proportions (theta) and cell-type gene expression profiles (beta) for the given dataset
-
results <- getBetaTheta(optLDA, corpus = t(as.matrix(corpus)))
-
deconProp <- results$theta
-
deconGexp <- results$beta*1000
-
vizAllTopics(deconProp, pos,
-
groups = annot,
-
group_cols = rainbow(length(levels(annot))),
-
r=0.4)
It is also possible to visualize the top marker genes for each deconvoluted cell type. We will use the deconvoluted cell types 5 and 1 as examples here. We define the top marker genes here as genes that are highly expressed in the deconvoluted cell types (counts > 5), and which also have the top 4 highest log2 (fold change) deconvoluted cell type expression profiles when comparing the expression profiles of the deconvoluted cell types to the average of all other cell types.
-
celltype <- 5
-
## highly expressed in cell-type of interest
-
highgexp <- names(which(deconGexp[celltype,] > 5))
-
## high log2(fold-change) compared to other deconvolved cell-types
-
log2fc <- sort(log2(deconGexp[celltype,highgexp]/colMeans(deconGexp[-celltype,highgexp])), decreasing=TRUE)
-
markers <- names(log2fc)[1:4]
-
## visualize
-
df <- merge(as.data.frame(pos),
-
as.data.frame(t(as.matrix(counts[markers,]))),
-
by = 0)
-
ps <- lapply(markers, function(marker) {
-
vizGeneCounts(df = df,
-
gene = marker,
-
size = 2, stroke = 0.1,
-
plotTitle = marker,
-
winsorize = 0.05,
-
showLegend = FALSE)
-
})
-
gridExtra::(
-
grobs = ps,
-
layout_matrix = rbind(c(1, 2),
-
c(3, 4))
-
)
-
celltype <- 1
-
## highly expressed in cell-type of interest
-
highgexp <- names(which(deconGexp[celltype,] > 5))
-
## high log2(fold-change) compared to other deconvolved cell-types
-
log2fc <- sort(log2(deconGexp[celltype,highgexp]/colMeans(deconGexp[-celltype,highgexp])), decreasing=TRUE)
-
markers <- names(log2fc)[1:4]
-
## visualize
-
df <- merge(as.data.frame(pos),
-
as.data.frame(t(as.matrix(counts[markers,]))),
-
by = 0)
-
ps <- lapply(markers, function(marker) {
-
vizGeneCounts(df = df,
-
gene = marker,
-
size = 2, stroke = 0.1,
-
plotTitle = marker,
-
winsorize = 0.05,
-
showLegend = FALSE)
-
})
-
gridExtra::(
-
grobs = ps,
-
layout_matrix = rbind(c(1, 2),
-
c(3, 4))
-
)
Of course.hardwareThere are some other features
In this tutorial, we will describe additional functionalities and advanced features of STdeconvolve.
Preprocessing
For LDA model fitting, STdeconvolve requires a “corpus of documents”, which is represented as a pixel (rows) x feature (columns) matrix of non-negative integer gene counts. To effectively deconvolve latent cell-types, the features in the corpus should be limited to genes that are variable across cell-types. Without prior knowledge of cell-types and their marker genes, one could look for over dispersed genes across the pixels, assuming that the underlying cell-types are present at variable proportions.
Additionally, it is useful to reduce the number of features in the corpus to those which are the most informative, to not only improve deconvolution but also increase speed of model fitting. To this end, one may wish to remove genes that are present in most or all pixels, or occur in only a small fraction of the pixels. This is because LDA identifies latent topics, or cell-types, by looking for sets of genes that occur frequently together in pixels. Therefore, removing genes that are present in most or all pixels will help restrict to gene that are cell-type specific, assuming that cell-types are not present across all pixels. Conversely, genes that are present in only a few pixels could represent noise and may not actually represent robust cell-type specific signatures.
As previously mentioned, a set of marker genes may be known a priori and a user may want to include these in the final corpus.
We include 2 different functions with STdeconvolve.
The first is restrictCorpus()
, which is highlighted in the getting_started
tutorial(It's the above.). This function first feature selects for over dispersed genes in the corpus and then allows a user to restrict the over dispersed genes to those present in less than or more than specified fractions of pixels in thedataset. This function does not filter for poor pixels or genes itself.
The second is preprocess()
, which is a larger wrapper function and allows for a much greater and specified range of filtered and feature
-
library(STdeconvolve)
-
data(mOB)
-
pos <- mOB$pos
-
cd <- mOB$counts
-
annot <- mOB$annot
In general, preprocess() includes a step to remove poor pixels and genes, allows one to select for specific genes to include or remove, allows an option to select for over dispersed genes, and options to remove top expressed genes, or genes present in less than or more than specified fractions of pixels in the dataset. Further, it returns an organized list that contains the filtered corpus, and the positions of the pixels retained in the filtered corpus if the information is present in the pixel names originally (for example, if the name of a pixel is in the format “XxY”). Otherwise this can be appended after.
Lastly, preprocess() can take as input a pixel (row) x gene (columns) matrix or a path to the file.
Order of filtering options in which they occur: 1. Selection to use specific genes only 2. cleanCounts to remove poor pixels and genes 3. Remove top expressed genes in matrix 4. Remove specific genes based on grepl pattern matching 5. Remove genes that appear in more/less than a percentage of pixels 6. Use the over dispersed genes computed from the remaining genes after filtering steps 1-5 (if selected) 7. Choice to use the top over dispersed genes based on -log10()
-
mobCorpus1 <- preprocess(t(cd),
-
alignFile = NA, # if there is a file to adjust pixel coordinates this can be included.
-
extractPos = FALSE, # optional argument
-
= NA, #
-
nTopGenes = 3, # remove the top 3 expressed genes (genes with most counts) in dataset
-
genes.to.remove = c("^Trmt"), # ex: remove tRNA methyltransferase genes (gene names that begin with "Trmt")
-
removeAbove = 0.95, # remove genes present in 95% or more of pixels
-
removeBelow = 0.05, # remove genes present in 5% or less of pixels
-
= 10, # minimum number of reads a gene must have across pixels
-
.size = 100, # minimum number of reads a pixel must have to keep (before gene filtering)
-
= 1, # minimum number of pixels a gene needs to have been detected in
-
ODgenes = TRUE, # feature select for over dispersed genes
-
nTopOD = 100, # number of top over dispersed genes to use, otherwise use all that pass filters if `NA`
-
= 0.05, # alpha param for over dispersed genes
-
= 5, # gam param for over dispersed genes
-
verbose = TRUE)
-
mobCorpus1$pos <- pos[rownames(mobCorpus1$corpus), ] # because positions were not available in the counts matrix itself, append after.
-
mobCorpus1$slm
-
## A 260x100 simple triplet matrix.
-
print(mobCorpus1$corpus[1:10,1:10])
-
## Bpifb9a Bpifb9b Col1a1 Dcn Cyp2a5 Sox11 Omp Ogn Prokr2 Ptn
-
## ACAACTATGGGTTGGCGG 0 0 1 0 0 1 0 0 0 2
-
## ACACAGATCCTGTTCTGA 1 1 0 0 0 0 6 0 0 22
-
## ACATCACCTGCGCGCTCT 0 0 1 2 0 6 0 0 0 0
-
## ACATTTAAGGCGCATGAT 0 0 0 1 0 0 1 0 1 1
-
## ACCACTGTAATCTCCCAT 0 0 0 1 1 0 1 0 0 1
-
## ACCAGAGCCGTTGAGCAA 0 0 0 0 0 1 3 0 0 2
-
## ACCCGGCGTAACTAGATA 0 1 0 1 0 3 0 0 1 2
-
## ACCGGAGTAAATTAGCGG 0 0 0 0 0 2 2 0 0 2
-
## ACCTGACAGCGGAAACTT 0 0 1 1 0 0 8 0 0 7
-
## ACGGAAATCAGTGGTATT 0 1 0 1 0 4 3 0 1 0
-
print(mobCorpus1$pos[1:10,])
-
## x y
-
## ACAACTATGGGTTGGCGG 16.001 16.036
-
## ACACAGATCCTGTTCTGA 26.073 15.042
-
## ACATCACCTGCGCGCTCT 13.048 21.079
-
## ACATTTAAGGCGCATGAT 13.963 18.117
-
## ACCACTGTAATCTCCCAT 24.104 13.074
-
## ACCAGAGCCGTTGAGCAA 9.040 12.046
-
## ACCCGGCGTAACTAGATA 15.941 12.112
-
## ACCGGAGTAAATTAGCGG 7.949 16.058
-
## ACCTGACAGCGGAAACTT 9.039 13.047
-
## ACGGAAATCAGTGGTATT 20.959 15.073
preprocess can also be used to build a corpus using a specific list of genes:
-
## get list of genes of interest, for an example.
-
counts <- cleanCounts(counts = cd,
-
.size = 100,
-
= 1,
-
= 1)
-
odGenes <- getOverdispersedGenes(as.matrix(counts),
-
=5,
-
alpha=0.05,
-
plot=FALSE,
-
use.=FALSE,
-
=TRUE,
-
=1e3,
-
=1e-3,
-
verbose=FALSE, details=TRUE)
-
genes <- odGenes$ods
-
length(genes)
-
## build corpus using just the selected genes
-
mobCorpus2 <- preprocess(t(cd),
-
= genes,
-
# can then proceed to filter this list, if desired
-
# = 1,
-
.size = 1, # can still filter pixels
-
= 1, # can still filter to make sure the selected genes are present in at least 1 pixel
-
ODgenes = FALSE, # don't select the over dispersed genes
-
verbose = TRUE)
-
Selecting Optimal K
One limitation of LDA is that one must select the number of predicted cell-types (K) to be returned, a priori. Thus, one must either have knowledge of the number of cell-types present in the dataset of interest, or a way to select the model with the “optimal K”, or the model that best describes the dataset and captures the latent cell-types.
To do this, STdeconvolve fits a number of different LDA models with different K’s to the dataset and computes several different metrics to help guide users in the choice of K.
First, the perplexity of each fitted model is computed with respect to it’s K. This can be done using the entire real dataset the model was fitted to, or, users can chose a certain fraction of randomly selected pixels to be held out and used as a test set to compute model perplexity. The optimal K can either be the model with the K that returns the lowest perplexity (“min”), or we compute a “knee” metric (similar to choosing the number of principle components in PCA), which is the maximum second derivative, a reasonable choice for the inflection point (the elbow or knee).
Second, as K increases, the additional cell-types that are predicted tend to be represented present at small proportions in the pixels and thus contribute little to the predicted pixel cell-type profiles. To help put an upper limit on K, we also measure the number of predicted cell-types with mean pixel proportion less than 5%. After a certain K, the number of “rare” predicted cell-types, or those with low proportions across the pixels, steadily increases, suggesting that increasing K is no longer returning informative topics, or cell-types.
-
## fit LDA models to the corpus
-
ks <- seq(from = 2, to = 18, by = 1) # range of K's to fit LDA models with given the input corpus
-
ldas <- fitLDA((mobCorpus2$corpus),
-
Ks = ks,
-
ncores = parallel::detectCores(logical = TRUE) - 1, # number of cores to fit LDA models in parallel
-
testSize = 0.2, # fraction of pixels to set aside for test corpus when computing perplexity
-
plot=TRUE, verbose=FALSE)
While technically the lowest perplexity computed is when K=8, perplexity appears to be relatively stable between K=7 and K=18. Additionally, we expect there to be more than 4 cell-types and thus K should be greater than 4 (based on “knee” metric).
However, the number of cell-types with mean proportion <5% doesn’t start steadily increasing until around K=15, suggesting that the number of predicted cell-types are likely informative until this chosen K.
Once the LDA models are fitted, beta and theta matrices can be extracted for a given model. The simplest way to do this is with optimalModel() to get the specific model of interest:
-
## `optimalModel()` can take as arguments:
-
optimalModel(models = ldas, opt = "min") # "min" = K that resulted in minimum perplexity
-
## A LDA_VEM topic model with 8 topics.
-
optimalModel(models = ldas, opt = "kneed") # "kneed" = K that resulted in inflection perplexity
-
## A LDA_VEM topic model with 4 topics.
-
optimalModel(models = ldas, opt = 15) # or extract the model for any K that was used
-
## A LDA_VEM topic model with 15 topics.
Then, getBetaTheta() can be used to get the beta (cell-type gene expression profiles) and theta (pixel cell-type proportions) matrices.
-
results <- getBetaTheta(lda = optimalModel(models = ldas, opt = "15"),
-
corpus = mobCorpus2$corpus)
-
print(names(results))
-
## [1] "beta" "theta"
Alternatively, buildLDAobject() is a wrapper for getBetaTheta(), clusterTopics() and combineTopics(). Cell types with similar gene expression profiles are clustered and an alternative set of "cell type clusters" is provided, where a cell type cluster is an aggregated beta and theta matrix of cell types within a given cluster.
-
results <- buildLDAobject(LDAmodel = optimalModel(models = ldas, opt = "15"),
-
corpus = mobCorpus2$corpus,
-
deepSplit = 4,
-
colorScheme = "rainbow")
Here, the result is a list of beta and theta matrices for individual predicted cell types, and combined beta and theta matrices for clusters of cell types. "cluster" is the factor indicating the specified cluster for each predicted cell type, and "tree" is a dendrogram of the clustered cell types relative to their predicted gene expression profiles. "cols" and "clustCols" are factors indicating the color labels of cell types or clusters, which can be selectively used for visualization purposes.
Clustering Cell-types
As mentioned above, predicted cell types can be clustered into "cell type clusters" based on their predicted gene expression profiles (β-matrix). A group of predicted cell types may represent components of a larger cell layer in a tissue, so it may be useful to cluster cell types together to visualize and evaluate the layer as individual features.
As shown earlier, the simplest way to do this is to use buildLDAobject(), which returns not only the beta and theta matrices of a single cell type, but also the cell type cluster and its associated beta and theta matrices.
However, with a beta matrix of cell types, it is possible to cluster cell types by calling clusterTopics() directly, which allows specifying the type of clustering performed. Assigning a cell type to a specific cluster is accomplished using the R package dynamicTreeCut using dynamic tree splitting.
-
results <- getBetaTheta(lda = optimalModel(models = ldas, opt = "15"),
-
corpus = mobCorpus2$corpus)
-
clust <- clusterTopics(beta = results$beta,
-
clustering = "", # type of clustering
-
dynamic = "hybrid", # method to assign cell-types to clusters (see `dynamicTreeCut` options)
-
deepSplit = 4, # dynamic tree cutting sensitivity parameter
-
plot = TRUE)
To get the beta and theta matrices of the cell-type clusters, combineTopics() is used to aggregate the beta or theta matrices of the cell-types within assigned clusters:
-
betaCombn <- combineTopics(results$beta, clusters = clust$clusters, type = "b")
-
## [1] "cell-types combined."
-
thetaCombn <- combineTopics(results$theta, clusters = clust$clusters, type = "t")
-
## [1] "cell-types combined."
-
clust$clusters
-
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-
## 2 1 5 1 3 4 3 5 2 3 4 2 1 1 1
-
## Levels: 1 2 3 4 5
-
dim(betaCombn)
-
## [1] 5 345
-
dim(thetaCombn)
-
## [1] 260 5
visualization
-
results <- buildLDAobject(LDAmodel = optimalModel(models = ldas, opt = "15"),
-
corpus = mobCorpus2$corpus,
-
deepSplit = 4,
-
colorScheme = "rainbow",
-
plot = FALSE)
-
m <- results$theta
-
p <- pos
-
plt <- vizAllTopics(theta = m,
-
pos = p,
-
topicOrder=seq(ncol(m)),
-
topicCols=rainbow(ncol(m)),
-
groups = NA,
-
group_cols = NA,
-
r = 0.4, # size of scatterpies; adjust depending on the coordinates of the pixels
-
lwd = 0.1,
-
showLegend = TRUE,
-
plotTitle = "K=15")
-
plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
-
plt
Scatterpie borders can be colored, either custom, or based on group membership
-
m <- results$theta
-
p <- pos
-
plt <- vizAllTopics(theta = m,
-
pos = p,
-
topicOrder=seq(ncol(m)),
-
topicCols=rainbow(ncol(m)),
-
groups = rep("0", dim(m)[1]),
-
group_cols = c("0" = "black"),
-
r = 0.4,
-
lwd = 0.4, # adjust thickness of the scatterpie borders
-
showLegend = TRUE,
-
plotTitle = "K=15")
-
plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
-
plt
Based on group membership (let’s use the coarse cell layers of the MOB)
-
m <- results$theta
-
p <- pos
-
plt <- vizAllTopics(theta = m,
-
pos = p,
-
topicOrder=seq(ncol(m)),
-
topicCols=rainbow(ncol(m)),
-
groups = annot,
-
group_cols = rainbow(length(levels(annot))),
-
r = 0.4,
-
lwd = 0.4, # adjust thickness of the scatterpie borders
-
showLegend = TRUE,
-
plotTitle = "K=15")
-
plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
-
plt
Now let’s visualize the cell-type clusters:
-
m <- results$thetaCombn
-
p <- pos
-
plt <- vizAllTopics(theta = m,
-
pos = p,
-
topicOrder=seq(ncol(m)),
-
topicCols=rainbow(ncol(m)),
-
groups = rep("0", dim(m)[1]),
-
group_cols = c("0" = "black"),
-
r = 0.4,
-
lwd = 0.2,
-
showLegend = TRUE,
-
plotTitle = "K=15 cell-type clusters")
-
# plt <- plt + ggplot2::guides(fill=guide_legend(ncol=2))
-
plt
Individual Cell-types and Cell-type clusters
-
m <- results$theta
-
p <- pos[rownames(results$theta),]
-
vizTopicClusters(theta = m,
-
pos = p,
-
clusters = results$cols,
-
sharedCol = TRUE, # cell-types in a cluster share the same color
-
groups = rep("0", dim(m)[1]),
-
group_cols = c("0" = "black"),
-
r = 0.4,
-
lwd = 0.3,
-
showLegend = TRUE)
-
m <- results$theta
-
p <- pos[rownames(results$theta),]
-
vizTopicClusters(theta = m,
-
pos = p,
-
clusters = results$cols,
-
sharedCol = FALSE, # cell-types in a cluster on a color gradient
-
groups = rep("0", dim(m)[1]),
-
group_cols = c("0" = "black"),
-
r = 0.4,
-
lwd = 0.3,
-
showLegend = TRUE)
-
m <- results$theta[,c("14", "12")]
-
p <- pos
-
other <- 1 - rowSums(m)
-
m <- cbind(m, other)
-
colnames(m) <- c("14", "12", "Other")
-
vizAllTopics(theta = m,
-
pos = p,
-
topicOrder=seq(ncol(m)),
-
topicCols=c(transparentCol("red", percent = 50), # BONUS: can make colors transparent if overlaying on top of an image
-
"black",
-
"white"),
-
groups = rep("0", dim(m)[1]),
-
group_cols = c("0" = "white"), # make scatterpie borders white to focus directly on the cell-type.
-
r = 0.4,
-
lwd = 0.1,
-
showLegend = TRUE,
-
overlay = NA) # BONUS: plot the scatterpies on top of a raster image of the H&E tissue, if set equal to the rgb matrix
Very good method, well worth the effort!
Life is good, and it's better with you.