I. Data preparation
A 10X single-cell transcriptome run produces three files that are read into R for Seurat analysis: matrix.mtx, barcodes.tsv, and genes.tsv (called features.tsv in newer Cell Ranger versions). The files genes.tsv and barcodes.tsv hold the row names (genes) and column names (cell barcodes) of the expression matrix.
Note that the three values in the header of matrix.mtx (for example 33694, 2049, and 1878957) are the number of genes (rows), the number of cells (columns), and the number of non-zero expression values, respectively.
For downstream processing, it is important that all three files exist at the same time and sit in the same folder: three files per sample, with every sample processed by the same code.
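As a concrete illustration, here is a minimal sketch of reading one sample into Seurat. The folder path and the object name pbmc are placeholders; point data.dir at whichever folder holds your three files.

```r
library(Seurat)

# Read the 10X files (matrix.mtx + barcodes.tsv + genes/features.tsv)
# from one folder; the path below is a placeholder for your own sample.
counts <- Read10X(data.dir = "sample1/filtered_feature_bc_matrix/")

# Build a Seurat object; min.cells/min.features apply a first rough filter.
pbmc <- CreateSeuratObject(counts = counts, project = "sample1",
                           min.cells = 3, min.features = 200)
pbmc
```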
II. General process
(i) Data pre-processing: quality control and data filtering
1. Cell selection and filtering based on QC metrics (i.e., quality control)
2. Data normalization and scaling (i.e., data standardization)
3. Detection of highly variable features (selection of characteristic genes)
(ii) PCA analysis: linear dimensionality reduction
Run PCA and determine the dimensionality (number of principal components) to use for downstream processing
(iii) Cell clustering
Draw edges between cells with similar gene expression patterns, then partition the graph into highly interconnected communities (clusters)
(iv) Non-linear dimensionality reduction: tSNE and UMAP analyses
(v) Differential expression analysis: finding marker genes
Find the marker genes of each cluster through differential expression. The differential analysis can take many forms: finding the markers of one cluster against all others (e.g., the marker genes of cluster 1 are the genes differentially expressed in cluster 1 relative to all remaining clusters), differential analysis between two clusters, differential analysis between two samples within a given cluster, and so on (see the sketch after this list).
(vi) Visualization of marker genes, i.e. cell annotation
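Putting the six steps together, a minimal Seurat workflow might look like the sketch below. Parameter values such as dims = 1:10 and resolution = 0.5 are common illustrative defaults, not prescriptions from this text.

```r
# (i) normalize, find variable features, scale
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
pbmc <- ScaleData(pbmc)

# (ii) linear dimensionality reduction
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

# (iii) graph-based clustering on the selected PCs
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

# (iv) non-linear dimensionality reduction for visualization
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap", label = TRUE)

# (v) marker genes: each cluster vs. all remaining cells
all.markers <- FindAllMarkers(pbmc, only.pos = TRUE,
                              min.pct = 0.25, logfc.threshold = 0.25)

# ...or one cluster vs. another, e.g. cluster 1 vs. cluster 2
markers.1v2 <- FindMarkers(pbmc, ident.1 = 1, ident.2 = 2)
```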
III. Quality control analysis (QC)
1. Why do we need quality control?
Cell damage during cell isolation or failures in library preparation (ineffective reverse transcription or PCR amplification failures) often introduce some low-quality data. The main characteristics of these low-quality data are:
Low overall counts for the cell (the matrix columns)
Few genes expressed (the matrix rows)
A relatively high proportion of mitochondrial genes or spike-ins
If these damaged rows or columns are not removed, they may distort the downstream analysis results, so these low-quality rows and columns must be removed before proceeding with the analysis. (This is a first-pass understanding; after working through the whole pipeline it should deepen, and detailed additions will be made later.)
2. Indicators for quality control
Sum of counts values for all genes per cell
During library preparation, RNA may be lost due to cell lysis or inefficient cDNA capture and amplification. Cells with a small total count are considered low quality and are candidates for removal.
Number of individual genes expressed per cell
When the diverse transcript population is not successfully captured, very few genes are detected in the cell; such cells are considered low quality and are candidates for removal.
Proportion of spike-in sequences or mitochondrial genes in each cell's total counts
The spike-in sequences added to each cell (artificially added external RNA references) are at equal concentration in every cell, so if the spike-in proportion is high, a large fraction of the endogenous transcripts must have been lost during the experiment. Similarly, a high proportion of mitochondrial reads suggests a damaged cell whose cytoplasmic RNA has leaked out.
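A sketch of computing and filtering on these metrics in Seurat. The cutoffs 200, 2500, and 5% are illustrative values, not universal thresholds, and the "^MT-" pattern assumes human gene names.

```r
# nCount_RNA (total counts) and nFeature_RNA (genes detected) are created
# automatically by CreateSeuratObject; add the mitochondrial percentage.
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Inspect the three QC metrics before choosing cutoffs.
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# Keep cells within illustrative bounds; tune these per dataset.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
```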
IV. PCA analysis
PCA (Principal Component Analysis) is one of the most widely used dimensionality reduction algorithms. The main idea of PCA is to map n-dimensional features onto k dimensions (k < n), reducing the data dimensionality to speed up subsequent analysis while preserving as much information as possible from the original data.
The process is to find, sequentially, a set of mutually orthogonal axes in the original high-dimensional space; the choice of each new axis is determined by the data itself.
The first new axis is chosen as the direction of largest variance in the original data; the second is the direction of largest variance within the subspace orthogonal to the first axis; the third maximizes the variance within the subspace orthogonal to the first two; and so on, until n such axes are obtained. Most of the variance is contained in the first k axes, while the variance along the later axes is almost 0. We can therefore ignore the remaining axes and keep only the first k, which contain most of the variance, achieving dimensionality reduction of the data.
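A small base-R sketch of this idea on a made-up toy matrix: the proportion of variance along the orthogonal axes drops off quickly, so the first few components carry most of the information.

```r
set.seed(1)
# Toy data: 100 "cells" x 10 "features", with extra variance along two directions.
x <- matrix(rnorm(100 * 10), nrow = 100)
x[, 1] <- x[, 1] * 5   # inflate variance of the first two features
x[, 2] <- x[, 2] * 3

pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Proportion of variance captured by each orthogonal axis (PC):
summary(pca)$importance["Proportion of Variance", ]
```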
V. Determining the dimensionality of the dataset
Purpose: each principal component (PC) essentially represents a "meta-feature" that combines information across a correlated set of features, so the top principal components are the most likely to represent the dataset. But how many principal components should we keep so that the retained data contain the majority of the original information?
Methods: (1) The JackStraw() function uses a permutation test based on a null distribution. A subset of genes (1% by default) is randomly selected, PCA is rerun, and the resulting PCA scores are compared with the previously computed scores to obtain a significance p-value for each gene's association with each principal component. Principal components are then judged by the p-values of the genes they contain: the PCs to retain are those enriched for genes with small p-values. The JackStrawPlot() function visualizes the distribution of p-values for each principal component, with the dashed line showing the uniform (null) distribution; significant PCs are enriched for small p-values, so their solid lines lie above and to the left of the dashed line.
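A sketch of this procedure; 100 replicates and dims 1:20 are common illustrative settings.

```r
# Permutation test; can be slow on large datasets.
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)

# Significant PCs sit above/left of the dashed uniform line.
JackStrawPlot(pbmc, dims = 1:15)
```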
(2) "The ElbowPlot function, based on the ordering of the percentage of variance explained by each principal component, looks for "inflection points" to determine how many dimensions can contain most of the information in the data.