Notes on the use of seurat (data processing, PCA, clustering)

I. Data pre-processing




# Keep the good habit of clearing the environment first and setting the save paths



rm(list = ls())




setwd("")

1. Data reading

load a package




#Load the program package



#multtest package installation tips.



#("BioManager")



#BiocManager::install("multtest"),



##If there is an error showing that the R version does not match do not rush to change the R version, directly open the R software to select the nearest mirror to install.



library(multtest)




library(Seurat)




library(dplyr)




library(patchwork)




library()

Import data

CreateSeuratObject is used to create a Seurat object.




 <- Read10X( = "hg19")




#Load the PBMC dataset, create Seurat object, counts for the source file to read, project for the Seurat object want to save the file name.



#Can add qualifications: for the minimum number of cells isolated in a tissue, for the minimum number of genes measured in a cell



pbmc <- CreateSeuratObject(counts = , project = "pbmc3k",  = 3,  = 200)




#3 ways to see the number of cells in pbmc



pbmc



## An object of class Seurat 



## 13714 features across 2700 samples within 1 assay ,The number of cells is2700classifier for individual things or people, general, catch-all classifier



## Active assay: RNA (13714 features, 0 variable features)



ncol(pbmc)




## [1] 2700



ncol()

2. Conducting quality control

When we refer to quality control of single-cell data, we are generally referring to the filtering of cells, which is in fact the process of filtering data from anbarcode The X gene matrix filters out a subset of barcodes that are not cellular, such as cellular fragmentation, bicellularity, dead cells, etc. The X gene matrix also filters out a subset of barcodes that are not cellular. These three types of barcodes can be characterized by their corresponding gene expression profiles:nCount (total gene expression count), nFeature (total gene count),
(erythrocyte gene expression ratio), (mitochondrial gene expression ratio)High values of nCount and nFeature may indicate a double cell, while low values may indicate a fragmented cell. Scale of red blood cells, scale of cell state, too high may be dying cells.

PercentageFeatureSet function (math.)is a scoring based on dividing the total number of counts: sum of counts for that gene set / sum of counts for all genes.

Percentage of gene set = numerator / denominator * 100;
Molecule: The sum of counts species for the specified gene, which in this analysis is theNumber of mitochondrial gene transcripts in the cell

Denominator: Get the nCount_RNA column from, which is the sum of the counts of all genes in each cell.

This function means that the percentage of genes per mitochondrion per cell/total genes per cell will be added as a column of data to pbmc's metadata, which has the column name ""




#QC was performed and the percentage (%) of the number of mitochondrial gene transcripts per cell was calculated and stored into metadata using the [[ ]] operator;



#Confirm the species when analyzing, if it's mouse, use mt



pbmc[[""]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") 



# During the creation of the object CreateSeuratObject(), the number of unique genes in the cell versus the total number of genes can be automatically calculated and can be found in the target object



head(pbmc@,5)




 



                  nCount_RNA nFeature_RNA 



AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759




AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958




AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363




AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845




AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898

3. Visualization of QC results

Purpose: To further remove low-quality data by visualizing quality control results.

（1）Violin diagram, graph the 3 metrics in meta@data in pbmc




#(1) display QC results as violin graphs, names in features are in quotes, ncol=3 means graphs are displayed in three columns



VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", ""), ncol = 3)

（2）scatterplot




 



plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "")




plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

(3) Removing the extremes

The principle of data quality control is: You can't stop it if you can't see it. i.e., if you can't identify garbage in the data, keep it, because what you're removing may be an important gene or an important factor or an important subpopulation of rare cells, so be careful when removing extreme values, similar to drawing the door when running a flow-through.




#Delete the extremes



pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 &  < 5)




# of cells after deletion of extremes



ncol((pbmc[["RNA"]]@counts))




##2638

4. Data standardization

Method 1normalizationfunction, which defaults to LogNormalize's method

pbmc <- NormalizeData(pbmc,  = "LogNormalize",  = 10000)

Method II

SCTransform, a three-in-one approach that combines QC, normalization and de-identification of highly variable genes in one package

pbmc <- SCTransform(pbmc,  = "", verbose = FALSE)

5. Selection of characterizing genes

Calculate the subset of genes in the dataset that exhibit high cell-to-cell variation (i.e., they are highly expressed in some cells and lowly expressed in others). Focusing on these genes in downstream analyses helps to highlight biosignatures in single-cell datasets. By default, 2,000 highly variable genes per dataset are selected for downstream analysis.

The FindVariableFeatures function has three methods for selecting highly expressed variants, which can be selected by the parameters: vst (default), and dispersion. The default value of the nfeatures parameter is 2000, which can be changed. If the parameter is , then there is no need to specify the number of highly expressed variants.arithmeticThe appropriate number will be selected automatically.




#Selection of highly variable genes, using the "vst" method to select 2,000 highly variable genes



pbmc <- FindVariableFeatures(pbmc,  = "vst", nfeatures = 2000)




#Store the top 10 highly variable genes



top10 <- head(VariableFeatures(pbmc), 10)




plot3 <- VariableFeaturePlot(pbmc)#Highly variable genes scatterplot




plot4 <- LabelPoints(plot=plot3, 



                     points = top10,




                     repel=TRUE)#top10 tagged with gene names




CombinePlots(plots = list(plot3, plot4), ncol =2)# Combined into one chart

II. Linear dimensionality reduction-PCA analysis

data

Jr. Q: What is the difference between NormalizeData() and ScaleData() in single-cell analysis, which also normalizes data?

NormalizeData() actually removes the effect of different cell sequencing depths, thesequencing depthIn short it is the ratio between the number of bases obtained by sequencing and the number of genomic bases. In general, the greater the sequencing depth for a cell, the more reads will be detected for each gene. This actually leads to the question, when dealing with single cell data, if a gene has a small number of reads in a cell, is it because the gene itself is rarely expressed, or is it because the cell itself has a small number of reads from sequencing? There are many genes in single-cell data that have a large number of reads, even thousands of reads, but there are many genes that have single digit or even zero reads, so this kind of data has a large degree of dispersion. However, we will find that when we take the logarithm of 1000 as the base of 10, the number becomes 3, and when we take the logarithm of 10 as the base of 10, the number becomes 1, which realizes the purpose of lowering the degree of dispersion of the data. But there is some problem with this, if a gene has a reads number of 0, wouldn't it be impossible to take a logarithm? Wouldn't we be able to solve this problem by adding 1 to all values?
So to summarize, the function NormalizeData() is a function that first corrects the number of reads of a gene for the size of the same library, and then performs a logarithmic operation on the corrected value.

ScaleData() is a z-score transformation of gene expression values to make them obey a normal distribution, and also paves the way for later pca analysis, which defaults the data to obey a normal distribution.




#Scale data



#Data with a homogeneity of 0 and a standard deviation of 1 were recorded as standardized data and used for the next step of PCA analysis.



## Alter the expression of each gene so that the average expression between cells is 0;



## Scale the expression of each gene so that the difference between cells is 1 (this step is given the same weight in downstream analysis so that highly expressed genes will then not affect subsequent analysis);



## Scaled data is stored in pbmc[["RNA"]]@ / pbmc@assays$RNA@.



 <- rownames(pbmc)




pbmc <- ScaleData(pbmc,features = )




#Decimate the scaled data, down to 50 dimensions by default.



pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

2、Downscaling results visualization

The goal is to see which dimensions need to be retained or excluded and not just see them as a bunch of letters. Now we know that if most of the genes in each PC are cell-type makergene, then that PC is cell-type specific and must be retained.




#Select the first 5 dimensions to view



print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)




PC_ 1 



Positive:  CST3, TYROBP, LST1, AIF1, FTL 



Negative:  MALAT1, LTB, IL32, IL7R, CD2 



PC_ 2 



Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 



Negative:  NKG7, PRF1, CST7, GZMB, GZMA 



PC_ 3 



Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 



Negative:  PPBP, PF4, SDPR, SPARC, GNG11 



PC_ 4 



Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 



Negative:  VIM, IL7R, S100A6, IL32, S100A8 



PC_ 5 



Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 



Negative:  LTB, IL7R, CKB, VIM, MS4A7 



# Use the DimPlot function to extract the two principal components (the first two by default, you can modify the dims option) to plot a scatterplot.



# Each point in this plot is a sample, which is plotted based on the coordinates of the PCA results, which are saved in the center



DimPlot(pbmc, reduction = "pca")




head(pbmc[['pca']]@)[1:2,1:2]




VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")




# Use the DimHeatmap visualization function to view the heatmap of the first 500 cells in the sample over the first 6 PCAs:



DimHeatmap(pbmc, dims = 1:6, cells = 500, balanced = TRUE)

3. Determine the dimensions of the data set

Based on the previous visualization of the dimensionality reduction results, combined with the JackStraw() or ElbowPlot() function to determine how many dimensions should be retained appropriate for subsequent analysis.

In JackStraw() function, the null distribution based permutation test is used. A portion of genes is randomly selected (default 1%) then pca analysis is performed to obtain pca scores. pca scores of this portion of genes are compared with previously calculated pca scores to obtain the significance p-Value. principal components are selected based on the judgment of the p-value of the genes contained in the principal component (pc). The final result is the p-Value of each gene associated with each principal component. the principal components that are retained are those that are enriched for small p-Value genes. This JackStrawPlot() function provides a visualization tool for comparing the distribution of p-values for each PC to a uniform distribution (dotted line). " Significant" PCs will show a large number of features with low p-values (solid line above the dashed line). In this case, the significance seems to drop off dramatically after the first 10-12 PCs.
ElbowPlot shows how well each principal component explains the variance of the data (expressed as a percentage) in a graph, sorted. Choose the principal components according to your needs, and the graph finds that the 9th principal component is an inflection point, and subsequent principal components (PCs) do not change much anymore.

JackStraw() function




pbmc <- JackStraw(pbmc,  = 100)




pbmc <- ScoreJackStraw(pbmc, dims = 1:20)




JackStrawPlot(pbmc, dims = 1:15)

or the ElbowPlot() function




#ElbowPlot



ElbowPlot(pbmc, ndims = 20)

III. Cell clustering

1. Clustering




#cell clustering



# Calculate nearest neighbor distance



pbmc <- FindNeighbors(pbmc, dims = 1:10)




# Clustering, contains the resolution parameter resolution which sets the "interval scale" for downstream clustering, increasing the value will result in more clusters.



pbmc <- FindClusters(pbmc, resolution = 0.5)




# Clustering cases can be found using the idents function:



#View clustered ids for the first 5 cells



head(Idents(pbmc), 5)

and umap visualization

Note:The use of umap need to build a python environment on the computer, it is recommended to download version 3.6 or above. Just pick a tutorial on the b-site and follow it.

Once installed, open it directly in RstudioTerminalEnter:sudo /usr/local/bin/pip install --upgrade virtualenv

Run the following code again to check if the download success flag is:library(umap)Can it work, this is the basis of the Seurat call.Finally, take special care to restart Rstudio and run it againRunUMAP

reticulate::py_install(packages ='umap-learn')




# install UMAP： reticulate::py_install(packages ='umap-learn')



pbmc <- RunUMAP(pbmc, dims = 1:10)




pbmc <- RunTSNE(pbmc, dims = 1:10)




#Drawing



DimPlot(pbmc, reduction = "umap")




DimPlot(pbmc, reduction = "tsne")




 



#add cell taxon xiba labeling



DimPlot(pbmc, reduction = "umap",label = TRUE)




LabelClusters(DimPlot(pbmc, reduction = "umap"),id = 'ident')




 



# can be saved here in the following form, eliminating the need for the above steps



saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")

IV. Analysis and visualization of differences

1. Analysis of variances

The FindAllMarkers() function compares a class in the cluster with those remaining to find the differential genes. Its counterpart has a FindMarkers() function that specifies which two clusters to compare. Indicates the logfc threshold.

Parameter:Set the percentage of cells detected in either of the two cell populations to shorten the program run time by not detecting rarely expressed genes by this setting, default 0.1




# Variance analysis



#Looking for differential genes in cluster1



 <- FindMarkers(pbmc, ident.1 = 1,  = 0.25)




# Find the markgene between cluster5 and clusters0,3)



 <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3),  = 0.25)




#Find the markgene for each cluster compared to the rest of the clusters and report only the positive markgene



 <- FindAllMarkers(pbmc,  = TRUE,  = 0.25,  = 0.25)




## All genes are first grouped and then sorted according to avg_log2FC



 %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)

2. Add cell annotation information

About how about passing themarkerDetermine the cell type and refer to the URL:Xiuer! 10+ raw letter analysis of the biggest difficulties here! 30 kinds of methods how to choose? Help you out today! _pbmc ()




# Add cell annotations, if you know the cell taxa represented by each cluster.



 <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 



                     "NK", "DC", "Platelet")





names() <- levels(pbmc)




pbmc <- RenameIdents(pbmc, )




DimPlot(pbmc, reduction = "umap", label = TRUE,  = 0.5) + NoLegend()

Gene Visualization

（1）vlnplot




#VlnPlot: Gene expression probability distribution based on cellular taxa



VlnPlot(pbmc, features = c("MS4A1", "CD79A"))




VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

(2) View the distribution of expression of specific genes in a clustered graph




FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 



                               "CD8A"))

(3) View a ridge map of the expression of specific genes in cluster




features <- c("LYZ", "CCL5")




RidgePlot(pbmc, features = features, ncol = 2)

(4) View a map of the distribution points of specific genes in cluster




#View a dot plot of the distribution of a specific gene in cluster, the size of the dots represents the percentage of cells expressing the gene and the color represents the average expression level



DotPlot(pbmc, features = features) + RotatedAxis()

(5) DoHeatmap draws expression heatmaps for specified cells and genes




#DoHeatmap draws expression heatmaps for specified cells and genes. Default display of top 20 labeled genes per taxon



top10 <-  %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)




DoHeatmap(pbmc, features = top10$gene) + NoLegend()