GO and KEGG are databases of gene function annotations organized according to different classification schemes, and enrichment analysis is a statistical method that summarizes these annotations for a list of genes. GO (Gene Ontology) describes a gene from three aspects: its molecular function (MF), its cellular component (CC), and the biological process (BP) in which it takes part. For example, a gene may have catalytic activity as its molecular function, the cell membrane as its cellular component (that is, where it is localized in the cell), and protein transport as its biological process. This is how a gene is described under the three classifications.
The main difference between KEGG and GO is that the three aspects of GO are not connected to one another, whereas KEGG not only provides gene sets but also defines the complex interrelationships between genes and metabolites. That is why a KEGG entry can be called a pathway; it is somewhat similar to a biological process in GO.
The basic unit of the GO database is the GO term. GO terms are organized in a tree-like hierarchy with some redundancy (strictly speaking a directed acyclic graph, since a term can have more than one parent), and the hierarchy has three root nodes: BP, CC, and MF. KEGG, by contrast, is a collection of manually curated metabolic pathways, each of which is network-like. A GO term is purely a set of genes, with no relationships defined among the genes in it, while a KEGG pathway also defines how its genes and metabolites relate to one another. Viewed purely as gene sets, GO BP terms and KEGG pathways are quite similar. GO is generally used to look for functional changes caused by differentially expressed genes, while KEGG is used to look for effects on pathways.
But both GO and KEGG enrichment are based on the hypergeometric distribution. Suppose there are m background genes, n of which are annotated to a certain pathway, and our gene set contains k genes, l of which are annotated to that pathway. In simple terms, the test asks whether l/k is significantly higher than n/m, and it computes a p-value to decide whether an overlap this large could have happened by chance. If it could, the gene set cannot be said to be enriched in the pathway, because the overlap is just a coincidence; if it could not, the gene set really is enriched in that pathway.
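The comparison of l/k with n/m can be sketched as an upper-tail hypergeometric test. This is only a minimal illustration using the Python standard library: the function name enrichment_p_value is hypothetical, the symbols m, n, k, l follow the text, and real tools typically also correct the p-values for multiple testing, which is not shown here.

```python
from math import comb


def enrichment_p_value(m, n, k, l):
    """Probability of seeing at least l pathway genes in a gene set of size k.

    m: number of background genes
    n: background genes annotated to the pathway
    k: genes in our gene set
    l: genes in our gene set that are annotated to the pathway
    """
    total = comb(m, k)            # all ways to draw k genes from the background
    upper = min(n, k)             # the overlap cannot exceed either set size
    favourable = sum(
        comb(n, i) * comb(m - n, k - i)   # i pathway genes, k - i other genes
        for i in range(l, upper + 1)
    )
    return favourable / total


# Illustrative numbers only (not taken from any real dataset).
p = enrichment_p_value(m=20000, n=200, k=500, l=15)
print(f"l/k = {15 / 500:.3f}, n/m = {200 / 20000:.3f}, p = {p:.3g}")
```

A small p-value means that an overlap of l or more genes would rarely arise by chance, which is the sense in which the gene set is called enriched.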
By contrast, Gene Set Enrichment Analysis (GSEA) has some advantages over GO (Gene Ontology) and KEGG pathway analysis. The main advantages of GSEA are:
Ordinary differential analysis (GO and pathway enrichment) tends to focus on comparing gene expression between two groups and concentrates on a small number of significantly up-regulated or down-regulated genes. It therefore tends to miss genes that are not significantly differentially expressed but are biologically important. For example, when the fold-change thresholds for screening differential genes are set to 0.1 and 0.25, genes below the cutoff are discarded, and valuable information about their biological properties, their relationships within gene regulatory networks, and their functions and significance is lost. GSEA does not need a threshold. It ranks all genes by their expression change, compares the ranked list with the gene sets in the GSEA database, and gives each gene set an enrichment score (ES). Another difference between GSEA and GO/KEGG is that GSEA needs gene expression values as input, while the other two need only a list of genes.
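The difference between the two kinds of input can be sketched as follows. The gene names and values are made up for illustration, and treating the 0.25 cutoff as a threshold on the absolute log2 fold change is an assumption, not something the text states.

```python
# Hypothetical log2 fold changes (group 1 vs group 2); made-up values.
log2_fc = {
    "geneA": 2.1,
    "geneB": 0.2,
    "geneC": -0.15,
    "geneD": -1.8,
    "geneE": 0.05,
}

threshold = 0.25  # assumed to apply to |log2 fold change|

# Input for GO/KEGG enrichment: only genes that pass the cutoff are kept.
gene_list = [gene for gene, fc in log2_fc.items() if abs(fc) >= threshold]

# Input for GSEA: every gene is kept, sorted from most positive to most negative.
ranked = sorted(log2_fc.items(), key=lambda item: item[1], reverse=True)

print("threshold-based list:", gene_list)
print("ranked list for GSEA:", [gene for gene, _ in ranked])
```

Genes with small changes drop out of the threshold-based list entirely, whereas in the ranked list they still occupy a position and so can still contribute to a gene set's score.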
Principles of GSEA:
Call the ordered list of genes that we have measured the target gene list L. Call the set of genes predefined on the basis of prior knowledge the functional gene set S, and call the members of this gene set s.
GSEA works by determining whether the members s of the functional gene set S are randomly distributed across the target gene list L or are mainly clustered at its top or bottom. If the members of a functional gene set S are significantly clustered at the top or bottom of L, that gene set is one we want to focus on.
As an example, in this figure the target gene list L is all the differentially expressed genes in C2 and C4, and the functional gene set S is all the genes in C2 and C4 that are related to the cell cycle. The key result from GSEA is the enrichment score, which is the blue line in the figure.
The Enrichment Score, or ES, reflects the extent to which the members s of the gene set are enriched at either end of the target gene list L. It is calculated as a running statistic, starting from the first gene in L. When a gene that belongs to the functional gene set S is encountered, the statistic increases; when a gene that does not belong to S is encountered, the statistic decreases. The size of the step is related to the degree of change in gene expression. The ES is the largest deviation from zero that the running statistic reaches while walking down the list. A positive ES indicates that the gene set is enriched at the top of the list, and a negative ES indicates that it is enriched at the bottom.
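The running statistic can be sketched as below. The text does not give a formula, so this follows the commonly described weighted running sum as an assumption: each hit adds the gene's absolute ranking score (raised to a power weight) divided by the total over all hits, and each miss subtracts one over the number of misses. The function name enrichment_score and the toy data are hypothetical, and the permutation step used to judge significance is not shown.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Walk down a ranked gene list and return (ES, running curve).

    ranked:   list of (gene, score) pairs sorted from highest to lowest score
    gene_set: the functional gene set S
    weight:   exponent applied to |score| for hits (assumed default of 1)
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_total = sum(
        abs(score) ** weight for (_, score), hit in zip(ranked, hits) if hit
    )
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one hit with nonzero score and one miss")

    running = 0.0
    curve = []
    for (_, score), hit in zip(ranked, hits):
        if hit:
            running += abs(score) ** weight / hit_total   # step up on a hit
        else:
            running -= 1.0 / n_miss                       # step down on a miss
        curve.append(running)

    es = max(curve, key=abs)   # largest deviation from zero, sign preserved
    return es, curve


# Toy ranked list L (made-up scores) and two toy gene sets.
L = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0), ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]
es_top, curve_top = enrichment_score(L, {"g1", "g2"})
es_bottom, curve_bottom = enrichment_score(L, {"g5", "g6"})
print("set at the top of L:   ", es_top)
print("set at the bottom of L:", es_bottom)
```

In this toy case the set whose members sit at the top of L gets a positive ES and the set whose members sit at the bottom gets a negative ES. The list of running values is what an enrichment plot draws as its curve.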
That is how we can tell from these plots that the functional genes are concentrated in the C4 cell cluster.
The barcode-like black lines in the middle mark the positions of the gene set's genes within the background gene list. Each vertical line represents one gene in the pathway: a "hit" is drawn as a black line, and a "miss" has no line.
Butterfly diagram: as the functional gene set S is checked against the sorted target gene list L from top to bottom, the green area at the bottom shows the result of sorting the genes, which depends on the grouping. The genes are ranked from positive to negative values: positive values are associated with the first group (C2) and negative values with the second group (C4). The height of the green area corresponds to the expression level of the genes.