Contents
About the UMI matrix
Calculate feature, counts values with umi
①Meta Data Viewing
②Count and Feature calculation (automatically calculated when generating Seurat)
1) Extracting the UMI matrix
2) Calculations
Other indicators
Assessment of quality indicators (focus)
1) UMI Count
2) Gene counting
3) UMIs vs. genes detected
4) Mitochondrial count ratio
5) Integrated filtration
Filtering to extract subsets
About the UMI matrix
Because 10X data are UMI-based, there is no need to correct for gene length, but differences in sequencing depth between cells still have to be accounted for. This is done with LogNormalize: each gene's UMI count is divided by the total UMI count of the cell, multiplied by 10,000, and then log-transformed (natural log of 1 + x). After normalizing by column (cell), the data can be scaled by row (gene) to reduce the influence of genes with very large or very small values. Source: "About the processing of single-cell TPM, Count data_What is umicount and readcount of rds file" - CSDN Blogs
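The two steps described above can be sketched in base R on a toy matrix. This is only an illustration under stated assumptions: the counts are made up, and the names `counts`, `scale_factor`, `norm` and `scaled` are hypothetical. In a Seurat workflow the corresponding steps would normally be done with NormalizeData() and ScaleData() rather than by hand.

```r
# Toy UMI count matrix (hypothetical values): rows are genes, columns are cells
counts <- matrix(c(0, 2, 8,
                   5, 0, 5,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2", "cell3")))

scale_factor <- 10000

# Column-wise step: divide each count by the cell's total UMIs,
# multiply by the scale factor, then take log(1 + x)
lib_size <- colSums(counts)
norm <- log1p(sweep(counts, 2, lib_size, "/") * scale_factor)

# Row-wise step: center and scale each gene across cells
scaled <- t(scale(t(norm)))

print(round(norm, 3))
print(round(scaled, 3))
```

After the row-wise step every gene has mean 0 across cells, so highly and lowly expressed genes contribute on a comparable scale.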
Calculate feature, counts values with umi
scRNA-seq quality control process: scRNA-seq-Quality Control
①Meta Data Viewing
Reference: ③ Seurat process for single-cell learning-pbmc_seurat Deletion of discrete cells - CSDN Blogs
Seurat automatically creates some metadata for each cell, stored in pbmc@meta.data.
rm(list = ls())
library(dplyr)
library(Seurat)
library(patchwork)

## Read the data
pbmc.data <- Read10X(data.dir = "F:/##24 years of single-cell processing##/pbmc3k_filtered_gene_bc_matrices")

## Create the Seurat object
pbmc <- CreateSeuratObject(counts = pbmc.data,
                           project = "pbmc3k",
                           min.cells = 3,      # min.cells: keep features (genes) expressed in at least this many cells
                           min.features = 200) # min.features: keep cells with at least this many detected features
pbmc

## View the metadata
data <- pbmc@meta.data
> head(data)
                 orig.ident nCount_RNA nFeature_RNA
AAACATACAACCAC-1     pbmc3k       2419          779
AAACATTGAGCTAC-1     pbmc3k       4903         1352
AAACATTGATCAGC-1     pbmc3k       3147         1129
AAACCGTGCTTCCG-1     pbmc3k       2639          960
AAACCGTGTATGCG-1     pbmc3k        980          521
AAACGCACTGGTAC-1     pbmc3k       2163          781
- orig.ident: usually contains the sample identity; it defaults to the project name we assigned
- nCount_RNA: number of UMIs per cell
- nFeature_RNA: number of genes detected per cell (number of non-zero entries)
② Count and Feature calculation (calculated automatically when the Seurat object is created)
1) Extracting the UMI matrix
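# Extract the UMI matrix
exp <- GetAssayData(pbmc, slot = "counts", assay = "RNA")
# Convert the sparse matrix to a dense one before building the data frame
# (data.frame() replaces "-" in the cell barcodes with ".")
umi_df <- data.frame(as.matrix(exp))
umi_df[1:5, 1:3]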
> umi_df[1:5,1:3]
              AAACATACAACCAC.1 AAACATTGAGCTAC.1 AAACATTGATCAGC.1
AL627309.1                   0                0                0
AP006222.2                   0                0                0
RP11-206L10.2                0                0                0
RP11-206L10.9                0                0                0
LINC00115                    0                0                0
2) Calculations
dat <- cbind(colnames(umi_df),
             colSums(umi_df),       # total UMIs per column (cell)
             colSums(umi_df != 0))  # number of genes detected per column (cell), i.e. features
dat1 <- apply(dat[, c(2, 3)], 2, as.numeric)  # convert from character to numeric
rownames(dat1) <- row.names(dat)              # restore the cell barcodes as row names
colnames(dat1) <- c("UMI1", "UMI2")           # UMI1 = UMI count, UMI2 = number of detected genes
dat1 <- as.data.frame(dat1)
> head(dat1)
                 UMI1 UMI2
AAACATACAACCAC.1 2419  779
AAACATTGAGCTAC.1 4903 1352
AAACATTGATCAGC.1 3147 1129
AAACCGTGCTTCCG.1 2639  960
AAACCGTGTATGCG.1  980  521
AAACGCACTGGTAC.1 2163  781
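The same calculation can be tried on a small matrix where the sums are easy to check by hand. The sketch below uses made-up counts and hypothetical names (`umi`, `qc`); it builds the result directly as a data frame, which avoids the character conversion that cbind() triggers when cell names are mixed with numbers.

```r
# Toy UMI matrix (hypothetical values): rows are genes, columns are cells
umi <- matrix(c(0, 3, 1,
                2, 0, 0,
                0, 0, 4,
                5, 1, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", 1:4),
                              c("cell1", "cell2", "cell3")))

qc <- data.frame(
  nCount   = colSums(umi),       # total UMIs per cell
  nFeature = colSums(umi != 0)   # genes with a non-zero count per cell
)
print(qc)
```

On the real object, these two columns should correspond to nCount_RNA and nFeature_RNA in the metadata, as the outputs above show.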
Other indicators
- Number of genes detected per UMI: this metric gives us an idea of the complexity of the dataset (the more genes detected per UMI, the more complex the data).
- Mitochondrial ratio: this metric gives the percentage of a cell's reads that originate from mitochondrial genes.
The number of genes per UMI for each cell is easy to calculate. We log10-transform the result so that samples can be compared more easily.
# Add number of genes per UMI for each cell to metadata
pbmc$log10GenesPerUMI <- log10(pbmc$nFeature_RNA) / log10(pbmc$nCount_RNA)
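As a quick worked check of the formula, the sketch below recomputes the score for three of the cells listed in the metadata output (UMI and gene counts copied from this document). The vector names and the 0.8 cutoff variable are illustrative only.

```r
# UMI and gene counts for three cells, taken from the head(data) output
n_count   <- c(2419, 4903, 980)
n_feature <- c(779, 1352, 521)

# Genes detected per UMI on the log10 scale
novelty <- log10(n_feature) / log10(n_count)
print(round(novelty, 7))

# Flag cells at or above an assumed complexity cutoff of 0.8
novelty_cut <- 0.8
print(novelty >= novelty_cut)
```

The cell with the fewest UMIs (980) has the highest score, which illustrates why shallowly sequenced cells tend to look more complex under this metric.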
PercentageFeatureSet() takes a pattern and searches the gene identifiers for it. For each cell, the function calculates the percentage of all counts that belong to the matching subset of features.
# Compute percent mito ratio
pbmc$mitoRatio <- PercentageFeatureSet(object = pbmc, pattern = "^MT-")
# Convert the percentage to a fraction between 0 and 1
pbmc$mitoRatio <- pbmc@meta.data$mitoRatio / 100
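Note: the pattern provided ("^MT-") applies to human gene names. Other organisms may need a different pattern (for example, mouse mitochondrial gene symbols start with "mt-").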
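To make the calculation concrete, here is a minimal base-R sketch of what a pattern-based mitochondrial ratio amounts to. It is not Seurat's implementation; the counts are made up and the names `counts`, `is_mito` and `mito_ratio` are hypothetical.

```r
# Toy count matrix (hypothetical values) with two mitochondrial genes
counts <- matrix(c(1, 0, 3,
                   1, 0, 3,
                   4, 5, 2,
                   4, 5, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "GAPDH"),
                                 c("cell1", "cell2", "cell3")))

# Select genes whose name matches the human mitochondrial pattern
is_mito <- grepl("^MT-", rownames(counts))

# Fraction of each cell's counts that come from the matched genes
mito_ratio <- colSums(counts[is_mito, , drop = FALSE]) / colSums(counts)
print(mito_ratio)
```

Multiplying `mito_ratio` by 100 gives a percentage, which is the scale PercentageFeatureSet() returns before the division by 100 shown above.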
Viewing the updated metadata
data1 <- pbmc@meta.data
head(data1)
> head(data1)
                 orig.ident nCount_RNA nFeature_RNA log10GenesPerUMI
AAACATACAACCAC-1     pbmc3k       2419          779        0.8545652
AAACATTGAGCTAC-1     pbmc3k       4903         1352        0.8483970
AAACATTGATCAGC-1     pbmc3k       3147         1129        0.8727227
AAACCGTGCTTCCG-1     pbmc3k       2639          960        0.8716423
AAACCGTGTATGCG-1     pbmc3k        980          521        0.9082689
AAACGCACTGGTAC-1     pbmc3k       2163          781        0.8673469
Assessment of quality indicators (focus)
- Cell count
- UMI counts per cell
- Genes detected per cell
- UMIs vs. genes detected
- Mitochondrial ratio
- Novelty
1) UMI Count
nCount_RNA: number of UMIs per cell. No QC filtering is applied to this metric here.
Visualization
# Visualize the number UMIs/transcripts per cell
library(ggplot2)
data1 %>%
  ggplot(aes(color = '', x = nCount_RNA, fill = '')) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  ylab("Cell density") +
  geom_vline(xintercept = 500)
2) Gene counting
nFeature_RNA: number of genes detected per cell
# Visualize the distribution of genes detected per cell
data1 %>%
  ggplot(aes(color = '', x = nFeature_RNA, fill = '')) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  ylab("Cell density") +
  geom_vline(xintercept = 500)
3) UMIs vs. genes detected
Two metrics that are typically evaluated together are the number of UMIs and the number of genes detected per cell. Here we plot the number of genes against the number of UMIs, colored by the fraction of mitochondrial reads. The mitochondrial read fraction is high (the darkest points on the gray-to-black scale used in the code) only in cells with exceptionally low counts and few detected genes. These could be damaged or dead cells whose cytoplasmic mRNA has leaked out through a ruptured membrane, so that only the mRNA located in the mitochondria is conserved. Such cells are filtered out by our count and gene-number thresholds. Visualizing the count and gene thresholds together shows their combined filtering effect.
Poor-quality cells are likely to have a low number of genes and UMIs per cell and correspond to the data points in the lower left quadrant of the plot. Good cells typically exhibit both more genes per cell and a higher number of UMIs.
With this plot we also assess the slope of the fitted line and any scatter of data points in the lower right quadrant. Cells there have a large number of UMIs but only a few genes. They may be dying cells, but they may also represent a population of a low-complexity cell type (e.g., erythrocytes).
# Visualize the correlation between genes detected and number of UMIs and determine whether strong presence of cells with low numbers of genes/UMIs
p <- data1 %>%
  ggplot(aes(x = nCount_RNA, y = nFeature_RNA, color = mitoRatio)) +
  geom_point() +
  scale_colour_gradient(low = "gray90", high = "black") +
  stat_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic() +
  geom_vline(xintercept = 500) +
  geom_hline(yintercept = 250) +
  facet_wrap(~orig.ident)  # one panel per sample; the variable name was lost in the original and orig.ident is assumed
p
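The quadrants discussed above can also be counted rather than only inspected by eye. The sketch below is an illustration on a made-up metadata table: the first three rows reuse values from this document, the last two are hypothetical, and the names `meta`, `umi_cut` and `gene_cut` are not part of the original code. The cutoffs match the lines drawn in the plot (500 UMIs, 250 genes).

```r
# Toy metadata: three cells from the document plus two hypothetical ones
meta <- data.frame(
  nCount_RNA   = c(2419, 4903, 980, 300, 5000),
  nFeature_RNA = c(779, 1352, 521, 120, 200),
  row.names    = c("cellA", "cellB", "cellC", "cellD", "cellE")
)

umi_cut  <- 500   # vertical line in the plot
gene_cut <- 250   # horizontal line in the plot

# Assign each cell to a quadrant relative to the two thresholds
meta$quadrant <- with(meta, ifelse(
  nCount_RNA >= umi_cut,
  ifelse(nFeature_RNA >= gene_cut, "upper right", "lower right"),
  ifelse(nFeature_RNA >= gene_cut, "upper left", "lower left")
))

print(table(meta$quadrant))
```

Cells in the lower left are candidates for low quality, while cells in the lower right (many UMIs, few genes) are the low-complexity cases mentioned above.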
4) Mitochondrial count ratio
This metric identifies whether there is a large amount of mitochondrial contamination from dead or dying cells. We define poor-quality cells, in terms of mitochondrial counts, as those with a mitochondrial ratio above 0.2, unless a high mitochondrial content is expected in your sample.
# Visualize the distribution of mitochondrial gene expression detected per cell
p1 <- data1 %>%
  ggplot(aes(color = orig.ident, x = mitoRatio, fill = orig.ident)) +  # grouping variable was lost in the original; orig.ident is assumed
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  geom_vline(xintercept = 0.2)
p1
5) Integrated filtration
Samples in which each cell was sequenced less deeply show a higher overall complexity, because sequencing has not yet started to saturate any given gene in those samples. Outlier cells in these samples may be cells with a less complex set of RNA species than other cells. Sometimes contamination by low-complexity cell types, such as red blood cells, can be detected with this metric. In general, we expect the novelty score to be 0.80 or higher.
# Visualize the overall complexity of the gene expression by visualizing the genes detected per UMI
p3 <- data1 %>%
  ggplot(aes(x = log10GenesPerUMI, color = orig.ident, fill = orig.ident)) +  # grouping variable was lost in the original; orig.ident is assumed
  geom_density(alpha = 0.2) +
  theme_classic() +
  geom_vline(xintercept = 0.8)
p3
Filtering to extract subsets
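# Filtering
# The variable name in the third condition was lost in the original ("& < 5").
# A 5% mitochondrial cutoff is assumed, expressed with the mitoRatio fraction computed above.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & mitoRatio < 0.05)
pbmc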
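The effect of combining the thresholds can be checked on a small table before applying them to the full object. This is a sketch under assumptions: the metadata values are made up, the names `meta`, `keep` and `filtered` are hypothetical, and the third condition is assumed to be a 5% mitochondrial cutoff written as a fraction.

```r
# Toy per-cell metadata (hypothetical values)
meta <- data.frame(
  nFeature_RNA = c(779, 1352, 150, 2600, 900),
  mitoRatio    = c(0.03, 0.01, 0.02, 0.04, 0.12),
  row.names    = paste0("cell", 1:5)
)

# A cell is kept only if it passes all three conditions
keep <- with(meta, nFeature_RNA > 200 &
                   nFeature_RNA < 2500 &
                   mitoRatio < 0.05)   # assumed 5% cutoff

filtered <- meta[keep, ]
print(filtered)
cat(sum(keep), "of", nrow(meta), "cells kept\n")
```

Here cell3 fails the lower gene bound, cell4 the upper gene bound, and cell5 the mitochondrial cutoff, so only cell1 and cell2 remain.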