Scoring gene sets at the single-cell level can be done in addition to enrichment as with analyses such as GSVA by cellmarker gene set does subpopulation annotation (need to pay attention to the problem of cell ratio, the number of small groups does not work, can be used as an aid). The previously written scMetabolism package scores metabolic pathways, same idea.
AUCell is ranking-based, so it doesn't matter exactly which normalization method is used, as long as it doesn't affect the ordering of gene expression within each cell. In general, it is divided into three steps:
1,build the rankings
2,calculate the AUC
3,set the assignment threshold
Installation of R packages
-
if (!requireNamespace("BiocManager", quietly=TRUE))
-
("BiocManager")
-
-
Biocmanager::install("AUCell")
-
-
# To support paralell execution:
-
BiocManager::install(c("doMC", "doRNG","doSNOW"))
-
# For the main example:
-
BiocManager::install(c("mixtools", "GEOquery", "SummarizedExperiment"))
-
# For the examples in the follow-up section of the tutorial:
-
BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",
-
"dynamicTreeCut","R2HTML","Rtsne", "zoo"))
- load the scRNA expression matrix and gene set
-
# . Reading from a text file
-
#exprMatrix <- ("")
-
#exprMatrix <- (exprMatrix)
-
-
# or using an expression set
-
#exprMatrix <- exprs(myExpressionSet)
-
-
-
-
-
# (This may take a few minutes)
-
library(GEOquery)
-
attemptsLeft <- 20
-
while(attemptsLeft>0)
-
{
-
geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity)
-
if(methods::is(geoFile,"error"))
-
{
-
attemptsLeft <- attemptsLeft-1
-
(5)
-
}
-
else
-
attemptsLeft <- 0
-
}
-
-
gzFile <- grep(".", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)
-
message(gzFile)
-
txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)
-
message(txtFile)
-
gunzip(filename=gzFile, destname=txtFile, remove=TRUE)
-
-
library()
-
geoData <- fread(txtFile, sep="\t")
-
geneNames <- unname(unlist(geoData[,1, with=FALSE]))
-
exprMatrix <- (geoData[,-1, with=FALSE])
-
rm(geoData)
-
dim(exprMatrix)
-
rownames(exprMatrix) <- geneNames
-
exprMatrix[1:5,1:4]
-
-
# Remove file
-
(txtFile)
-
-
# Save for future use
-
mouseBrainExprMatrix <- exprMatrix
-
save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")
pick random 5000 genes to release computational burden
-
# load("exprMatrix_AUCellVignette_MouseBrain.RData")
-
(333)
-
exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]
-
-
dim(exprMatrix)
the next thing we need to prepare is the gene set. use ?AUCell_calcAUC to check accepted forms.
-
library(GSEABase)
-
-
#gene list version
-
genes <- c("gene1", "gene2", "gene3")
-
geneSets <- GeneSet(genes, setName="geneSet1")
-
geneSets
-
-
#gmt version
-
library(AUCell)
-
library(GSEABase)
-
gmtFile <- paste((('examples', package='AUCell')), "", sep="/")
-
geneSets <- getGmt(gmtFile)
-
-
#check how many of these genes are in the expression matrix
-
geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix))
-
cbind(nGenes(geneSets))
-
-
#add the gene set size to its name
-
geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))
-
-
#let’s also add a few sets of random genes and 100 genes expressed in many cells (. housekeeping-like):
-
-
# Random
-
(321)
-
extraGeneSets <- c(
-
GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),
-
GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))
-
-
countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))
-
# Housekeeping-like
-
extraGeneSets <- c(extraGeneSets,
-
GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))
-
-
geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))
-
names(geneSets)
-
-
- build gene expression rankings for each cell
-
cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)
-
-
cells_rankings
-
#The “rankings” can be seen as a new representation of the original dataset. Once #they are calculated, they can be saved for future analyses.
-
-
save(cells_rankings, file="cells_rankings.RData")
- calculate enrichment for the signature (AUC)
To determine whether the gene set is enriched at the top of the gene-ranking for each cell, AUCell uses the “Area Under the Curve” (AUC) of the recovery curve.
-
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
-
save(cells_AUC, file="cells_AUC.RData")
col is the cell id, row is gene set, AUC in each cell.
- determine the cells with given gene signatures or active gene sets
The AUC estimates the proportion of genes in the gene-set that are highly expressed in each cell.
The ideal situation will be a bi-modal distribution, in which most cells in the dataset have a low “AUC” compared to a population of cells with a clearly higher value
Note that if the gene set is a cellular marker, the cellular subpopulation ratio must be taken into account. If the subpopulation ratio is very high, then the low peaks in the double peaks may be masked and behave more like house keeping genes (single peak or even constant expression); similarly, if the ratio is very small, the peaks may be so insignificant that they cannot be seen.
In addition, the number of genes in the gene set should not be too small, or there is a possibility that none of the genes in the gene set can match the top 5% of the matrix, at which time the AUC=0. Therefore, the length of the gene set is more appropriate at 100-2k.
To ease the exploration of the distributions, the function AUCell_exploreThresholds()
automatically plots all the histograms and calculates several thresholds that could be used to consider a gene-set ‘active’ (returned in $aucThr
). The distributions are plotted as dotted lines over the histogram and the corresponding thresholds as vertical bars in the matching color. The thicker vertical line indicates the threshold selected by default ($aucThr$selected
): the highest value to reduce the false positives.
-
(123)
-
par(mfrow=c(3,3))
-
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)
-
# Look at the alerts stored in the results
-
warningMsg <- sapply(cells_assignment, function(x) x$aucThr$comment)
-
warningMsg[which(warningMsg!="")]
-
-
-
-
## Microglia_lavin (165g)
-
## "The AUC might follow a normal distribution (random gene-set?). "
-
## Random (500g)
-
## "The AUC might follow a normal distribution (random gene-set?). The global distribution overlaps the partial distributions. "
The thresholds calcuated for each gene set are stored in the $aucThr
slot. For example, the thresholds suggested for the oligodendrocyte gene-set:
cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds
-
## threshold nCells
-
## Global_k1 0.2419241 739
-
## L_k2 0.2437017 733
-
## R_k3 0.1766730 1008
-
## minimumDens 0.2417192 740
To obtain the threshold selected automatically for a given gene set(. Oligodendrocytes):
cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected
Cells assigned at this threshold (AUC over the threshold)
-
oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment
-
length(oligodencrocytesAssigned)
-
head(oligodencrocytesAssigned)
Plotting the AUC histogram of a specific gene set, and setting a new threshold:
-
geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]
-
AUCell_plotHist(cells_AUC[geneSetName,], aucThr=0.25)
-
abline(v=0.25)
Assigning cells to this new threshold:
-
newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName,]>0.08))
-
length(newSelectedCells)
head(newSelectedCells)
Now, with the gene set scored in each cell, explore common visualization ideas, see the official documentation.
AUCell: Identifying cells with active gene sets
(1) Heat map (on-off binary or continuous values)
(2)feature plot