library(dplyr)
library(Seurat)
# 10X data can be used using Read10Xfunction, a UMI count matrix will be returned, where each value indicates the number of molecules per gene (row) in each cell (column)
# class()
# str()
# Use this UMI count matrix to build a Seurat object (build an object first using non-standardized raw data);
# Seurat object is a container that contains data (such as expression matrix) and analysis results (such as PCA,Clustering). For a detailed explanation of the seurat object, see: /satijalab/seurat/wiki/Seurat#slots
# When CreateSeuratObject, the number of unique genes and total molecules will be automatically calculated, which can be found in object meta data.
pbmc
pbmc #The result is an active assay, which stores the expression matrix, and you can access it with pbmc[["RNA"]]@counts
str(pbmc) # You can use str to view the information of seurat object.
pbmc[[]] #Use [[ ]] is a quick access to assay.
pbmc@ #It can also be accessed through object@.
pbmc@version #If you want to view other information, you can use @
pbmc[["RNA"]]@counts #
#About seurat objects, further learning is needed!
# ------------------------------------------------------------------------
# CompareSparse matrixThe size of the so-called matrix (no!) ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# # What does data in a count matrix look like?
[c("CD3D", "TCL1A", "MS4A1"), 1:30] #What does this UMI count matrix look like? Is it the same as the bulk transcriptome matrix we usually see? Find a few genes and see what happens in the first 30 cells
# From the results, we can find that it is a sparse Matrix, which is very messy. Use . to indicate that it is empty, with a value of 0, that is, no molecule was detected. Because single-cell data and bulk transcriptome data are not the same, most of the values of scRNA are 0. First, the initial reaction of cells during library establishmenttemplateThe concentration is very low; secondly, there is dropout phenomenon in sequencing (the gene was originally expressed in this cell, but it was not measured)
# Seurat adopts a clever reproduction method, which greatly reduces the space usage compared to the original matrix represented by 0. You can compare it:
/ #Compare the memory size between sparse matrices and matrices, and find that sparse matrices are more space-saving;
# -----------------------------------------------------------------------
# Standard pre-processing workflow ----------------------------------------
#The following is the standard preprocessing process for scRNA-seq data in Seurat. Including: selection and filtering of cells based on QC indicators; normalization and scaling of data; search for highly heterogeneous genes.
# * * * Quality Control (QC) and Screening Cells ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Seurat commonly used QC indicators:
# 1)* Number of unique genes detected in each cell. (Because there are very few genes for inferior cells or empty droplets; there are many genes for multiple cells in a droplet;)
# 2)* Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes) ???
# 3)* ratio of reads on map to mitochondrial genome. (Because low-quality or dead cells will have a lot of mitochondrial contamination;)
# This article /pmc/articles/PMC4758103/ introduces some QC standards
# Where are QC metrics stored in Seurat? ----------------------------------
#Quality control filtering, add columns to the object through [[ ]]. Here, the [[ ]] symbol adds new columns in the metadata subset of pbmc. This method can also select other features; storing the mitochondrial QC in the expression matrix separately in the Seurat object is equivalent to adding a column to the metadata object.
head(pbmc@, 5) #At this time, I found that there are only three columns
pbmc[[""]]
# # Check metadata to see where the QC indicator is stored.
head(pbmc@, 5) #If there is no previous step, pbmc[[""]] will not have this column! There is an extra column of content inside.
# Visualize the QC indicator and filter the cells according to the QC indicator; # Remove all unique genes with a total number of more than 2500 or less than 200, and remove all mitochondrial counts >5%.
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", ""), ncol = 3)
# You can also use the FeatureScatter function to draw a scatter plot to visualize the correlation between two sets of feature information. You can use anything in the Seurat object, such as columns in the object, PC scores, etc.
plot1
plot2
CombinePlots(plots = list(plot1, plot2)) #The results of this figure show that the number of reads with expression is generally inversely proportional to the number of reads in mitochondria. In cases where mitochondrial reads account for a very high proportion, there are usually very few expressions; on the contrary, if it is a truly expressed gene reads, it rarely comes from the mitochondrial genome; (right???) is the number of genes in the comparison results mentioned earlier, which is generally proportional to the reads count value obtained by sequencing.
# Cell filtration -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pbmc 200 & nFeature_RNA < 2500 & < 5) # Filter out unwanted cells,
# Data normalization ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Why is normalize called "normalization"?
# There is no need to worry about the name, the principle is easy to understand: because we are concerned about differentially expressed genes, that is, the difference between a gene in different samples (different cells), but this difference is often not directly compared, because the sequencing depth and library size of different samples may be different (for example, the primers and enzymes added to each EP tube when building a single cell library), so some factors may be different from the difference in gene expression.
#In order to more truly reflect the biological significance of gene differences, the data must be converted (such as RPKM, TPM, etc.), so that the same gene can be comparable in different samples. The starting line of "returning" is the one that comes to "returning". In addition, the original expression amount is a very high degree of discreteness.
#Some highly expressed genes may have thousands of expression, but some are only a few dozen. Even if some normalized algorithms are used to eliminate some technical errors, the actual poor expression volume often affects the aesthetics of drawing heat maps and box maps later. Therefore, you can use log to zoom in an interval (but it will not affect the real value).
#For example, if the original expression amount is 1, the result will still be 1 after conversion with log2(1+1); if the original expression amount is 1000, the original expression amount is 10, the conversion with log2(1000+1) is about 10. So the changes that were originally 1000 times different now are only 10 times different, making the data more concentrated without destroying the difference in the original data.
#About Seurat normalization, you can read this article: /content/biorxiv/early/2019/03/18/
pbmc
# The result is stored in: pbmc[["RNA"]]@data
# Identify differential genes ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Compare between cells and select the one with the largest difference in expression (that is, the same gene is highly expressed in some cells and very little expressed in some cells). Use the FindVariableFeatures function to calculate a mean-variance result, that is, the relationship between the mean value of the expression amount and the variance is given and top variable features are obtained.
# There are three algorithms for calculating top variable features, namely: vst (default), and dispersion
pbmc
top10
# plot variable features with and without labels
plot1
plot2
CombinePlots(plots = list(plot1, plot2))
# Data standardization ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The purpose of # scale is to realize linear conversion of data, which is also a standard preprocessing before dimensionality reduction processing (such as PCA). Mainly utilizes the ScaleData function. The previous normalize is log processing, which treats the expression of all genes uniformly and is finally placed on a starting line. However, in order to truly find the difference, we must also consider the impact of different samples on the expression volume based on this starting line.
# scale makes the mean expression of each gene in all cells 0 and the variance is 1
# The main purpose of this step is to prepare for the next step of dimensionality reduction, and using all genes is relatively slow. If you want to increase the speed, you can only operate on the identified HVGs (high heterogeneity genes, 2,000 foundVariableFeatures were set before). In fact, this function is to operate on some differential genes by default.
# The dimensionality reduction and clustering will not be affected in the results of this operation, but only the HVGs genes are standardized, and the heat maps drawn below will still be affected by certain extreme values in all genes. Therefore, for the results of the heat map, it is better to standardize all genes.
# ScaleData This function has two default parameters: = TRUE and = TRUE, and then you need to enter the gene name for scale/center (the default is to take the result of the previous FindVariableFeatures). Center means subtracting the mean of the expression of each gene in different cells; scale means dividing the value after centering by the standard deviation (that is, a z-score operation is performed).
# * * * Use all genes for scale ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
length()
pbmc
# * * * Scale only the identified HVGs ------------------------------------------------------------------------------------------------------------------
pbmc
# * * * Remove unwanted sources of differences -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# How to remove unwanted sources of differences? For example, if the differences caused by mitochondrial pollution affect the overall differential distribution, you can scale them in the following ways.
pbmc
# Linear Dimension Reduction --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# The real purpose of dimensionality reduction is to minimize the number of variables with correlation as much as possible. For example, there were 700 samples, that is, 700 dimensions/variables, but in fact, they will have some correlations based on the gene expression in the sample.
# In this way, some related samples will gather into a small group, indicating that their "feature distance" is similar. Finally, directly analyze these "small" and use the least variables to represent the most actual characteristics.
# Use the data after scale to perform PCA dimensionality reduction. It defaults to select the previously identified differential genes (2000) as input, but can be set using features; the default number of principal components analyzed is 50
pbmc
# Check PCA results
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5) # Or use print(pbmc@reductions$pca, dims = 1:3, nfeatures = 5)
#Visualize PCA results through VizDimReduction, DimPlot and DimHeatmap;
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca") # Extract two principal components (the first two by default, you can modify the dims option) to draw a scatter plot. Each point in this figure is a sample, which draws the result coordinates of the PCA, and this coordinate is saved in:
# head(pbmc[['pca']]@)[1:2,1:2]
# #It also supports defining grouping: grouping according to the ident in pbmc@
# table(pbmc@$) #Of course there is only one group here
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE) #Each cell and gene is sorted according to the PCA result score. By default, the first 30 genes (nfeatures setting), 1 principal component (dims setting), and there is no default value for the number of cells (cells setting); among which balanced represents positive and negative scores, half of each gene
#It is also possible DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
# Determine the dimension of the dataset ----------------------------------------------------------------------------------------------------------------------
# Since each principal component represents a small group of variables with correlation (called "metafeature" and meta-eigenset), then the question is: How to choose the appropriate number of principal components to represent the entire data set after dimensionality reduction?
# You can judge it by JackStraw or gravel map; # JackStrawPlot is more time-consuming and gravel map is faster.
# * * * JackStraw method --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pbmc
pbmc
JackStrawPlot(pbmc, dims = 1:15)
# * * * Gravel gravel method ----------------------------------------------------------------------------------------------------------------------
ElbowPlot(pbmc) #It is a graph sorted according to the percentage contribution of each principal component to the overall variation level. We mainly focus on the PC of the "elbow", which is a turning point. It is generally believed that the PC (primary component) before the inflection point can better represent the overall change.
# When you are uncertain at the beginning, you should select more principal components. If you choose too little, you may make mistakes in subsequent analysis.
# Cell Clustering ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pbmc
pbmc
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5) ## The results are gathered into several categories, and use Idents to view them
table(Idents(sce)) # can also be viewed into several categories.
#-----------------------------------------------------------------------
# One feature of linear dimensionality reduction PCA is that when it places dissimilar data points in the high dimension in a lower dimension, it will retain as much difference in the original data as possible, which leads to the dissimilar data being far apart, and most of the data being placed overlapping.
# tSNE takes into account high-dimensional dimensionality reduction + low-dimensional display. Those unsimilar data points after dimensionality reduction can still be placed closer, ensuring that the points are both partially scattered and gathered globally.
# UMAP developed later can show more dimensions than tSNE, but the tSNE diagram is better (reference: https://sci-hub.tw//10.1038/nbt.4314)
# Another dimensionality reduction method - nonlinear dimensionality reduction (UMAP/tSNE) --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pbmc
DimPlot(pbmc, reduction = "umap") # You can also set label = TRUE to let the numbers be displayed on each cluster
saveRDS(pbmc, file = "C:/Users/HASEE/Desktop/Seurat/pbmc_tutorial.rds") #Save the results that are time-consuming to calculate the process.
# Find differentially expressed genes (cluster biomarkers) ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# The purpose of finding differentially expressed genes is to distinguish genes from different cell populations based on the characteristics of different expression levels (that is, to find HVGs with biological significance, namely biomarkers).markerIt can be understood as a marker, which means that when you see these genes, it means that it is this cell population.
# Use the FindAllMarkers() function to compare a single cell population with several other cells to find positive and negative expression biomarker (the positive and negative here means up-regulating and down-regulating genes; positive marker means that the expression in my cluster is high, but low in other clusters)
head(, n = 5)
head(, n = 5)
%>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC) ## This step of filtering is well understood! (Classification, sorting, and picking the top 2)
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# Multiple visualization methods ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# VlnPlot、FeaturePlot、DoHeatmap
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE) #Use violin diagram to plot the expression amount of certain genes in all clusters
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")) #The most commonly used visualization method is to project gene expression into the dimensionality reduction clustering results
top10 % group_by(cluster) %>% top_n(n = 10, wt = avg_logFC) #Plot heat maps for a given cell and gene, for example, plot the first 20 markers for each cluster
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
# Assign each cluster cell type ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# The focus is on understanding background knowledge, and can correspond to the marker with cell type. For example, the marker found in each cluster here corresponds to different cells (this process is a test of the degree of understanding of the topic)
names()
pbmc
DimPlot(pbmc, reduction = "umap", label = TRUE, = 0.5) + NoLegend()
saveRDS(pbmc, file = "C:/Users/HASEE/Desktop/Seurat/pbmc3k_final.rds")