[AUCell] Scoring of single-cell gene sets

Scoring gene sets at the single-cell level can be done in addition to enrichment as with analyses such as GSVA by cellmarker gene set does subpopulation annotation (need to pay attention to the problem of cell ratio, the number of small groups does not work, can be used as an aid). The previously written scMetabolism package scores metabolic pathways, same idea.

AUCell is ranking-based, so it doesn't matter exactly which normalization method is used, as long as it doesn't affect the ordering of gene expression within each cell. In general, it is divided into three steps:

1，build the rankings

2，calculate the AUC

3，set the assignment threshold

Installation of R packages





if (!requireNamespace("BiocManager", quietly=TRUE))




    ("BiocManager")




 



Biocmanager::install("AUCell")




 



# To support paralell execution:



BiocManager::install(c("doMC", "doRNG","doSNOW"))




# For the main example:



BiocManager::install(c("mixtools", "GEOquery", "SummarizedExperiment"))




# For the examples in the follow-up section of the tutorial:



BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",




                       "dynamicTreeCut","R2HTML","Rtsne", "zoo"))

load the scRNA expression matrix and gene set




# . Reading from a text file



#exprMatrix <- ("")



#exprMatrix <- (exprMatrix)



 



# or using an expression set



#exprMatrix <- exprs(myExpressionSet)



 



 



 



 



# (This may take a few minutes)



library(GEOquery)




attemptsLeft <- 20




while(attemptsLeft>0)




{



  geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity) 



  if(methods::is(geoFile,"error")) 



  {




    attemptsLeft <- attemptsLeft-1




    (5)




  }




  else




    attemptsLeft <- 0



}



 



gzFile <- grep(".", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)




message(gzFile)




txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)




message(txtFile)




gunzip(filename=gzFile, destname=txtFile, remove=TRUE)




 



library()




geoData <- fread(txtFile, sep="\t")




geneNames <- unname(unlist(geoData[,1, with=FALSE]))




exprMatrix <- (geoData[,-1, with=FALSE])




rm(geoData)





dim(exprMatrix)




rownames(exprMatrix) <- geneNames



exprMatrix[1:5,1:4]




 



# Remove file



(txtFile)




 



# Save for future use



mouseBrainExprMatrix <- exprMatrix



save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")

pick random 5000 genes to release computational burden




# load("exprMatrix_AUCellVignette_MouseBrain.RData")



(333)




exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]




 




dim(exprMatrix)

the next thing we need to prepare is the gene set. use ?AUCell_calcAUC to check accepted forms.




library(GSEABase)




 



#gene list version



genes <- c("gene1", "gene2", "gene3")




geneSets <- GeneSet(genes, setName="geneSet1")




geneSets



 



#gmt version



library(AUCell)




library(GSEABase)




gmtFile <- paste((('examples', package='AUCell')), "", sep="/")




geneSets <- getGmt(gmtFile)




 



#check how many of these genes are in the expression matrix



geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix)) 



cbind(nGenes(geneSets))




 



#add the gene set size to its name



geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))




 



#let’s also add a few sets of random genes and 100 genes expressed in many cells (. housekeeping-like):



 



# Random



(321)




extraGeneSets <- c(




  GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),




  GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))




 



countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))




# Housekeeping-like



extraGeneSets <- c(extraGeneSets,




                   GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))




 



geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))





names(geneSets)

build gene expression rankings for each cell




cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)




 



cells_rankings




#The “rankings” can be seen as a new representation of the original dataset. Once #they are calculated, they can be saved for future analyses.



 



save(cells_rankings, file="cells_rankings.RData")

calculate enrichment for the signature (AUC)

To determine whether the gene set is enriched at the top of the gene-ranking for each cell, AUCell uses the “Area Under the Curve” (AUC) of the recovery curve.




cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)




save(cells_AUC, file="cells_AUC.RData")

col is the cell id, row is gene set, AUC in each cell.

determine the cells with given gene signatures or active gene sets

The AUC estimates the proportion of genes in the gene-set that are highly expressed in each cell.

The ideal situation will be a bi-modal distribution, in which most cells in the dataset have a low “AUC” compared to a population of cells with a clearly higher value

Note that if the gene set is a cellular marker, the cellular subpopulation ratio must be taken into account. If the subpopulation ratio is very high, then the low peaks in the double peaks may be masked and behave more like house keeping genes (single peak or even constant expression); similarly, if the ratio is very small, the peaks may be so insignificant that they cannot be seen.

In addition, the number of genes in the gene set should not be too small, or there is a possibility that none of the genes in the gene set can match the top 5% of the matrix, at which time the AUC=0. Therefore, the length of the gene set is more appropriate at 100-2k.

To ease the exploration of the distributions, the function AUCell_exploreThresholds() automatically plots all the histograms and calculates several thresholds that could be used to consider a gene-set ‘active’ (returned in $aucThr). The distributions are plotted as dotted lines over the histogram and the corresponding thresholds as vertical bars in the matching color. The thicker vertical line indicates the threshold selected by default ($aucThr$selected): the highest value to reduce the false positives.




(123)




par(mfrow=c(3,3)) 



cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)




# Look at the alerts stored in the results



warningMsg <- sapply(cells_assignment, function(x) x$aucThr$comment)




warningMsg[which(warningMsg!="")]




 



 



 



##                                                                                                         Microglia_lavin (165g) 



##                                                              "The AUC might follow a normal distribution (random gene-set?). " 



##                                                                                                                  Random (500g) 



## "The AUC might follow a normal distribution (random gene-set?).  The global distribution overlaps the partial distributions. "

The thresholds calcuated for each gene set are stored in the $aucThr slot. For example, the thresholds suggested for the oligodendrocyte gene-set:

cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds




##             threshold nCells



## Global_k1   0.2419241    739




## L_k2        0.2437017    733




## R_k3        0.1766730   1008




## minimumDens 0.2417192    740

To obtain the threshold selected automatically for a given gene set(. Oligodendrocytes):

cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected

Cells assigned at this threshold （AUC over the threshold）




oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment




length(oligodencrocytesAssigned)




head(oligodencrocytesAssigned)

Plotting the AUC histogram of a specific gene set, and setting a new threshold:




geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]





AUCell_plotHist(cells_AUC[geneSetName,], aucThr=0.25)




abline(v=0.25)

Assigning cells to this new threshold:




newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName,]>0.08))




length(newSelectedCells)

head(newSelectedCells)

Now, with the gene set scored in each cell, explore common visualization ideas, see the official documentation.

AUCell: Identifying cells with active gene sets

(1) Heat map (on-off binary or continuous values)

（2）feature plot