I. Introduction to monocle3
There are 3 main functions that monocle3 can perform
Clustering, sorting and counting of cells.Monocle 3 recognizes new (and possibly rare) subtypes of cells
Constructing single-cell trajectories.pass (a bill or inspection etc)proposed chronological analysisIt helps to parse the changes that occur in cells during the development of organisms, diseases, etc. This is the primary function.
Differential expression analysis.Monocle 3 includes a sophisticated but easy-to-use differential expression system that allows the characterization of new cell types and states to begin with comparisons to other, better understood cells.
Main Processes
1. Read the data and create the cell_data_set object.
2. Data preprocessing: standardization, removal of batch effects
3.(math.) lower dimensionality
4.clustering
5. Perform differential gene expression analysis
6. To bechronological analysis
II. Approaches to reading different data
(i) Bioconductor's ExpressionSet object:
The data read by monocle3 is to contain 3 parts:
- expression_matrix: numeric matrix of expression values, where rows are genes and columns are cells
-
cell_metadata
:
Data box where rows are cells and columns are cell attributes (e.g., cell type, culture conditions, acquisition date, etc.) -
gene_metadata
:
Data frame where rows are features (e.g. genes) and columns are gene attributes such as biotype, GC content, etc.
There are a couple of "equations" you should make sure of before typing.
- expression_matrix number of columns = cell_metadata number of rows and the two should match.
- expression_matrix number of rows = gene_metadata number of columns and the two should match.
- One of the columns of gene_metadata should be "gene_short_name", indicating the gene symbol or short name of each gene.
Example of Object Creation
-
# Load the data
-
expression_matrix <- readRDS(url(":/trapnell-lab/software/monocle3/celegans/data/cao_l2_expression.rds"))
-
cell_metadata <- readRDS(url(":/trapnell-lab/software/monocle3/celegans/data/cao_l2_colData.rds"))
-
gene_annotation <- readRDS(url(":/trapnell-lab/software/monocle3/celegans/data/cao_l2_rowData.rds"))
-
-
-
# Make the CDS object
-
cds <- new_cell_data_set(expression_matrix,
-
cell_metadata = cell_metadata,
-
gene_metadata = gene_annotation)
(ii) Generate cell_data_set from 10X outputs
To find the correct file, the path to the folder containing the unmodified Cell Ranger "outs" folder must be provided. The file structure should be as follows: 10x_data/outs/filtered_feature_bc_matrix/, wherefiltered_feature_bc_matrix contains files, and. (It can also process Cell Ranger V2 data, whichFor "feature" read "gene"., and the file is not gz-compressed. (This is easy to read and report an error)
Read the example: Note the use offunction (math.)beload_cellranger_data
-
# Provide the path to the Cell Ranger output.
-
cds <- load_cellranger_data("~/Downloads/10x_data")
Or if there are 3 files as follows, they can be read directly: note that the function to use isload_mm_data
, and
-
cds <- load_mm_data(mat_path = "~/Downloads/",
-
feature_anno_path = "~/Downloads/",
-
cell_anno_path = "~/Downloads/")
(iii) Handling large amounts of data
When the amount of cells or genes is relatively large, you can use the sparse matrix method to read the
-
cds <- new_cell_data_set(as(umi_matrix, "sparseMatrix"),
-
cell_metadata = cell_metadata,
-
gene_metadata = gene_metadata)
(iv) Combination of multiple cds objects
-
# make a fake second cds object for demonstration
-
cds2 <- cds[1:100,]
-
-
big_cds <- combine_cds(list(cds, cds2))
II. Data pre-processing
(i) Data standardization
-
cds <- preprocess_cds(cds, num_dim = 100)# Mainly to complete standardization of data + PCA
-
plot_pc_variance_explained(cds)# Viewed are the variance scores for each dimension
-
cds <- align_cds(cds, alignment_group = "plate")#Remove the batch effect
View the variant scores for each PC:plot_pc_variance_explained()
(ii) Dimensional analysis and visualization
Monocle 3 uses UMAP downscaling by default, and there are two downscaling methods to choose from: PCA and LSI, PCA is chosen for RNA-seq data, and LSI is used in case of ATAC-seq.
cds <- reduce_dimension(cds,preprocess_method = 'PCA')
- dimensionality reduction method
method = c("PCA", "LSI")
The default is PCA. - Default dimension
num_dim
It's 50. - Normalization before preliminary dimensionality reduction
norm_method
The default is log processing - Normalization is automatically performed when the downscaling method is set to PCA
Visualization:
plot_cells(cds)
Coloring of subgroups
-
colnames(colData(cds))# Can pick any one for coloring
-
#[1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue"
-
#[5] "Size_Factor"
-
plot_cells(cds, color_cells_by="cao_cell_type")
View the distribution of expression of certain genes
-
ciliated_genes <- c("che-1",
-
"hlh-17",
-
"nhr-6",
-
"dmd-6",
-
"ceh-36",
-
"ham-1")
-
-
plot_cells(cds,
-
genes=ciliated_genes,
-
label_cell_groups=FALSE,
-
show_trajectory_graph=FALSE)
tsne dimensionality reduction
As you can see the results are not very good
-
# can also be downscaled with tsne's method, but it doesn't work well
-
cds <- reduce_dimension(cds, reduction_method="tSNE")
-
plot_cells(cds, color_cells_by="cao_cell_type", reduction_method="tSNE")
-
plot_cells(cds, reduction_method="tSNE",
-
color_cells_by="cao_cell_type",
-
cell_size = 2,
-
group_label_size = 3)
Third, cell clustering (cluster your cells)
This is a very important step in the process of cell sorting.Monocle utilized Louvaincommunity detection This unsupervised clustering method (which isphenoGraph
algorithm), which is not quite the same as our common Kmeans, hclust hierarchical clustering, etc., this approachPrefer to find a well-connected part of the networkpoints, while often ignoring the properties of the nodes; and weCommon clusteringWhat about it?Generally ignoring the connections between the parts of the whole, theby means ofCalculate the distance between two nodes (targets) (e.g. Euclidean distance, Manhattan distance, cosine similarity, etc.) Finding similar characteristics
(i) Clustering
The clustering here is no longer clustering alone, according to the authors, it is partitioning the cells, and the trajectory analyses that go after will be performed for this partition. partition is an aggregation of some small clusters, and it will look messy if the coloring is based on the clusters.
-
cds <- cluster_cells(cds)#After clustering, each cluster will form its own "proposed temporal trajectory".
-
plot_cells(cds, color_cells_by = "partition")
(ii) Find the primary path in each partition using learn graph
Identify trajectories: for each clusterprincipal component analysis (PCA)After this step, the proposed timing map has been initially generated.
-
cds <- learn_graph(cds)# Perform principal component analysis on each cluster, after this step the proposed timing map has been initially generated
-
plot_cells(cds,
-
color_cells_by = "cao_cell_type",
-
label_groups_by_cluster=FALSE,
-
label_leaves=FALSE,
-
label_branch_points=FALSE)
-
plot_cells(cds,
-
color_cells_by = "cao_cell_type",
-
label_cell_groups=FALSE,
-
label_leaves=TRUE,#Show Branches
-
label_branch_points=TRUE,# Show branch nodes
-
graph_label_size=1.5)
- figureblack lineIt's the whole architecture (notice that the whole graph is not fully connected, after all, it's clustered by PARTITION and the cells in each PARTITION vary a bit);
-
Light gray circleDenote "leaves" to indicate the different endings in the developmental trajectory (one can think of the graph to be analyzed chronologically as a "root-stem-leaf" structure), with the parameter
label_leaves
Control; -
Black circledenotes a branching node, which predicts that the cells in it will develop in different directions, with the parameter
label_branch_points
Control; - The size of the numbers in the circles indicates the order of appearance
IV. Proposed time-series analysis
(i) First tell monocle3 which is the starting point for trajectory analysis
1. Manually click to add
-
cds <- order_cells(cds)
-
plot_cells(cds,
-
color_cells_by = "pseudotime",
-
label_cell_groups=FALSE,
-
label_leaves=FALSE,
-
label_branch_points=FALSE,
-
graph_label_size=1.5)
Demonstrate the expression of certain genes over time trajectories
-
# A demonstration of a gene
-
plot_genes_in_pseudotime(cds[1:5,],
-
color_cells_by="cao_cell_type",
-
min_expr=0.5)
2. code added, get_earliest_principal_node() function can specify the name of the starting point, in this case specify Am/PH sheath cells near this cell type as the starting point
-
# Method I
-
get_earliest_principal_node <- function(cds, my_select="Am/PH sheath cells"){
-
cell_ids <- which(colData(cds)[, "cao_cell_type"] == my_select)
-
closest_vertex <-
-
cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
-
closest_vertex <- (closest_vertex[colnames(cds), ])
-
root_pr_nodes <-
-
igraph::V(principal_graph(cds)[["UMAP"]])$name[(names
-
((table(closest_vertex[cell_ids,]))))]
-
-
root_pr_nodes
-
}# Select the cell type you want to define as a starting point
-
cds <- order_cells(cds, root_pr_nodes=get_earliest_principal_node(cds))# Define the starting point
-
# Another way
-
myselect <- function(cds,,my_select){
-
cell_ids <- which(colData(cds)[,] == my_select)
-
closest_vertex <-
-
cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
-
closest_vertex <- (closest_vertex[colnames(cds), ])
-
root_pr_nodes <-
-
igraph::V(principal_graph(cds)[["UMAP"]])$name[(names
-
((table(closest_vertex[cell_ids,]))))]
-
-
root_pr_nodes
-
}
-
cds <- order_cells(cds,
-
root_pr_nodes=myselect(cds, = 'cao_cell_type',
-
my_select = "Body wall muscle")
-
# Drawing
-
plot_cells(cds,
-
color_cells_by = "pseudotime",
-
label_cell_groups=FALSE,
-
label_leaves=FALSE,
-
label_branch_points=FALSE,
-
graph_label_size=1.5)|plot_cells(cds,
-
color_cells_by = "cao_cell_type",
-
label_cell_groups=FALSE,
-
label_leaves=FALSE,
-
label_branch_points=FALSE,
-
graph_label_size=1.5)# Two types of coloring
You can see that if you color the area based on pseudotime, it is the purple area starting from the Body wall muscle, and the serial number 1 is the area where this cell is located.
(ii) 3D version of the proposed timing analysis
The 3D trajectory is actually the selection of the first 3 principal components when downscaling =>max_components = 3
The follow-ups are all similar to the 2D
-
cds_3d <- reduce_dimension(cds, max_components = 3)#Downgrade
-
cds_3d <- cluster_cells(cds_3d)#clustering
-
cds_3d <- learn_graph(cds_3d)# Trajectory learning
-
cds <- order_cells(cds,
-
root_pr_nodes=myselect(cds, = 'cao_cell_type',
-
my_select = "Body wall muscle")
-
)# Select the starting point of the proposed timing analysis, as in the previous 2D, with the code
-
cds_3d_plot_obj <- plot_cells_3d(cds_3d, color_cells_by="cao_cell_type")# Drawing
V. Differential gene analysis
Differential Expression Analysis, which performs a trajectory analysis of genes to find out which genes are most variable in this trajectory, thegraph_test()
This function performs a spatial autocorrelation analysis method called Moran's I
subset_pr_test_res <- graph_test(cds, neighbor_graph="principal_graph", cores=4)#cores is the specified number of cores
After running it well you will get a table that
Moran's I
This indicator is called: the Moran Index, proposed by the Australian statistician Moran in 1950. It is used toMeasuring spatial correlationAn indicator that the data will fall between -1.0 and 1.0 after normalization for variance. A value greater than 0 indicates a positive spatial correlation, the larger the value the stronger the spatial correlation; equal to 0 the space is randomly distributed; less than 0 indicates a negative spatial correlation, the smaller the value the greater the spatial variance
Morans_I (spatial co-expression), whose value is closer to 1 means that the gene is more similarly expressed in spatially distant cells, and 0 means that there is no spatial co-expression effect.
Genetic Characterization Map
-
#We'll choose the first ten genes here according to the size of morans_I
-
top10 <- subset_pr_test_res %>% top_n(n=10, morans_I) %>%
-
pull(gene_short_name) %>% ()
-
-
#Gene Expression Trend Plot, Plotting, can't draw, keeps reporting error.
-
p <- plot_genes_in_pseudotime(cds[top10,],
-
color_cells_by="cao_cell_type",
-
min_expr=0.5,
-
ncol = 4)
-
#Gene Characterization Maps
-
plot_cells(cds, genes=top10, show_trajectory_graph=FALSE,
-
label_cell_groups=FALSE, label_leaves=FALSE)
Finding co-expressed gene modules
This one is looking for the correlation of modules with cells/clusters/partitions, where modules are co-expressed gene modules, and the genes in each module may have similar expression trends.
-
## Finding co-expression modules
-
genelist <- pull(subset_pr_test_res, gene_short_name) %>% ()
-
gene_module <- find_gene_modules(cds[genelist,], resolution=1e-2, cores = 5)
-
cell_group <- tibble::tibble(cell=(colData(cds)),
-
cell_group=colData(cds)$celltype)
-
agg_mat <- aggregate_gene_expression(cds, gene_module, cell_group)
-
(agg_mat) <- stringr::str_c("Module ", (agg_mat))
-
pheatmap::pheatmap(agg_mat, scale="column", clustering_method="ward.D2")