Relational database theory is a set of principles and norms about how to design, construct, manage and use relational databases. It was proposed by Edgar F. Codd and is considered the basis of modern databases.
The main idea of this theory is to use mathematical methods to express the relationships between data tables as a relational model, enabling databases to store, organize, manage and access data in a standardized and rigorous way. The core of the relational model is the relation: a table in which each tuple (row) can be uniquely identified and is described by a set of attributes (fields). Based on the principles of set theory and logic, the relational model provides an accurate, consistent and understandable way to describe data structures and the relationships between data.
Relational database theory includes the following main contents:
- Database design: including normal forms, the basic standards of the relational model, which are used to judge whether a database table design is reasonable.
- Data integrity: how to maintain the integrity and consistency of data in a database.
- Query processing and optimization: including how SQL queries are executed and how query performance can be optimized.
- Database security: how to protect the data in a database from illegal access, damage or loss.
In short, relational database theory provides a unified way to describe and operate data. It is an important theoretical system in the database field and has important guiding significance for the design and management of databases.
4.1 Problems in relational schema design
A relational schema defines the structure of database tables and the relationships between tables in a relational database. In relational schema design, you may encounter the following problems:
Data redundancy issues
Data redundancy refers to the phenomenon of the same data being stored repeatedly in database tables. Redundant data occupies storage space, slows down queries, and increases the complexity and error rate of data updates. The usual way to eliminate data redundancy is to split the data into different tables and then recombine the tables with join (JOIN) operations when needed.
Normal form design issues
Normal forms are criteria used to measure whether a database table design is reasonable. Table designs should follow the normal forms to avoid data redundancy and maintenance difficulties. Each normal form corresponds to different rules, such as the first normal form (1NF), the second normal form (2NF) and the third normal form (3NF). Typically, designers should design database tables that satisfy at least the third normal form.
Entity identification issues
An entity is an object or concept in the real world, such as a person, object or event. In relational schema design, entities must be correctly defined and given unique identifiers. If an identifier is not unique, queries will produce errors or redundant data will be generated.
Attribute definition issues
Attributes are the properties an entity has. In relational schema design, the attributes of an entity must be correctly defined, and each attribute must be given the correct data type and length. An incorrectly defined attribute leads to data type mismatches or insufficient storage space.
Association relationship issues
Tables in a relational database can be associated with each other to form more complex data structures. In relational schema design, the relationships between tables must be correctly defined, and the correct join conditions and types must be specified for each relationship. An incorrect relationship definition may cause query errors, performance degradation, or inconsistent data.
When designing relational schemas, designers need to consider the issues above. A good relational schema should minimize data redundancy, comply with the normal forms, correctly define entities, attributes and associations, and be able to satisfy query requests with good query performance.
4.2 Functional Dependencies
Functional dependency (FD) is a concept in relational database design that describes relationships between the attributes of a table. Functional dependencies are often used to eliminate redundant data and improve the normalization and consistency of data.
Basic concepts of functional dependencies:
Attribute: in a relational schema, an attribute is a column of the relation table, such as a student's student number, name or age.
Key: in a relational schema, a key is an attribute or combination of attributes whose value uniquely identifies a tuple (a row of the relation table). In general, a relation table may have several attributes or attribute combinations that qualify as keys.
Functional dependency: attribute A functionally determines attribute B (written A → B) if and only if, in every instance of the relation table, the value of A determines the value of B; that is, whenever two tuples agree on attribute A, they must also agree on attribute B.
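To make the definition concrete, here is a minimal sketch that checks whether a functional dependency holds in a given set of tuples; the table data and column names are illustrative, not from any real schema:

```python
# Check whether the FD lhs -> rhs holds in a set of tuples: whenever two
# tuples agree on all lhs attributes, they must agree on all rhs attributes.
def fd_holds(rows, lhs, rhs):
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False  # two tuples agree on lhs but differ on rhs
        seen[key] = val
    return True

# Illustrative data: student_no determines name in these tuples.
rows = [
    {"student_no": "S01", "name": "Alice", "age": 20},
    {"student_no": "S02", "name": "Bob",   "age": 21},
    {"student_no": "S01", "name": "Alice", "age": 20},
]
print(fd_holds(rows, ["student_no"], ["name"]))  # True
print(fd_holds(rows, ["age"], ["student_no"]))   # True in this sample only
```

Note that a functional dependency is a constraint on all possible instances of the table; holding in one sample, as in the second call, does not by itself establish the dependency.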
Terms and symbols for functional dependencies:
Left part: the determining attribute(s) in a functional dependency; in A → B, A is the left part (the determinant).
Right part: the attribute(s) determined by the left part; in A → B, B is the right part.
Arrow symbol: the arrow (→) separates the left and right parts and indicates that the left part determines the right part.
Closure symbol: A+ denotes the attribute closure of A, i.e. the set of all attributes that can be derived from A using the functional dependencies of the schema.
Inference rules for functional dependencies (Armstrong's axioms and rules derived from them):
Reflexivity: if B is a subset of A, then A → B (in particular, A → A).
Augmentation: if A → B, then AC → BC for any attribute set C.
Transitivity: if A → B and B → C, then A → C.
Union: if A → B and A → C, then A → BC.
Decomposition: if A → BC, then A → B and A → C.
Closure of a functional dependency set
The closure of a functional dependency set F, denoted F+, is the set of all functional dependencies that can be derived from F. It can be computed as follows:
Initialization: F+ = F.
Repeat the following until F+ no longer changes: apply the inference rules above to the dependencies in F+ and add every new dependency obtained; for example, if A → B and B → C are in F+, add A → C.
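In practice, instead of materializing all of F+, one usually computes the attribute closure A+ directly, since it answers the questions that matter (does A → B hold? is A a key?). A minimal sketch, with an illustrative dependency set:

```python
# Attribute closure: compute A+, the set of all attributes functionally
# determined by attrs under the dependencies in fds.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is in the closure, add the right side.
            if lhs <= result and not rhs <= result:
                result |= set(rhs)
                changed = True
    return result

# Illustrative dependencies: A -> B, B -> C, AC -> D.
F = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"A", "C"}, {"D"})]
print(closure({"A"}, F))  # {'A', 'B', 'C', 'D'}, so A -> D is in F+
```

A dependency A → B then holds exactly when B is a subset of A+.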
Candidate keys
A candidate key is a minimal set of attributes that uniquely identifies a record. A relation table may have several candidate keys. They can be found roughly as follows:
Collect all functional dependencies that hold in the relation table into a set F.
Where several dependencies determine the same attributes, keep the ones with the smaller left sides and discard those with redundant left attributes.
Treat the remaining left-side attribute sets in F as candidate attribute sets for candidate keys.
Check whether each candidate attribute set uniquely identifies a record, i.e. whether its attribute closure contains all attributes of the table. If it still does after deleting some attribute, delete that attribute and check again, until all minimal sets, the candidate keys, have been found.
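Under the same assumptions, candidate keys can also be found by brute force using the attribute closure from the sketch above: a set is a candidate key if its closure covers all attributes and no proper subset's closure does.

```python
from itertools import combinations

# Find all candidate keys of a relation with attributes all_attrs and
# functional dependencies fds, reusing closure() from the earlier sketch.
def candidate_keys(all_attrs, fds):
    keys = []
    # Smallest sets first: any superset of a found key is only a superkey.
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            k = set(combo)
            if any(found <= k for found in keys):
                continue  # contains an already-found key
            if closure(k, fds) == set(all_attrs):
                keys.append(k)
    return keys

F = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"A", "C"}, {"D"})]
print(candidate_keys({"A", "B", "C", "D"}, F))  # [{'A'}]
```

This search is exponential in the number of attributes, which is acceptable for the small schemas typical of design exercises.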
Minimal functional dependency set
A minimal functional dependency set (minimal cover) of F is an equivalent dependency set with all redundancy removed: every right side is a single attribute, no left side contains an unnecessary attribute, and no dependency can be derived from the others. It is obtained roughly as follows:
Split every dependency in F so that each right side consists of a single attribute.
For each dependency, try deleting an attribute from its left side; if the dependency still follows from F, keep the smaller left side. Finally, delete every dependency that can be derived from the remaining ones.
By computing the minimal cover, unnecessary functional dependencies are removed, which reduces data redundancy.
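A minimal sketch of these steps, again reusing closure() from above; the dependency set is illustrative:

```python
# Compute a minimal cover of fds: single-attribute right sides, no
# extraneous left-side attributes, no redundant dependencies.
def minimal_cover(fds):
    # 1. Split right sides into single attributes.
    G = [(set(l), {a}) for l, r in fds for a in r]
    # 2. Drop extraneous left-side attributes: an attribute is extraneous
    #    if the dependency still follows from G without it.
    changed = True
    while changed:
        changed = False
        for i, (lhs, rhs) in enumerate(G):
            for a in sorted(lhs):
                smaller = lhs - {a}
                if smaller and rhs <= closure(smaller, G):
                    G[i] = (smaller, rhs)
                    changed = True
                    break
    # 3. Drop duplicates and any dependency derivable from the rest.
    result = []
    for fd in G:
        if fd not in result:
            result.append(fd)
    i = 0
    while i < len(result):
        lhs, rhs = result[i]
        rest = result[:i] + result[i + 1:]
        if rhs <= closure(lhs, rest):
            result.pop(i)   # redundant: implied by the others
        else:
            i += 1
    return result

F = [({"A"}, {"B", "C"}), ({"B"}, {"C"}), ({"A", "B"}, {"C"})]
print(minimal_cover(F))  # [({'A'}, {'B'}), ({'B'}, {'C'})]
```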
4.3 Normal Forms
In relational databases, normal forms are standards for normalizing the design of database tables, with the aim of reducing redundant data, improving the efficiency of data storage, and reducing the complexity of data updates. The three most commonly used normal forms are the first, the second and the third.
First Normal Form (1NF)
The basic requirement of the first normal form is that every attribute in a database table is atomic, that is, indivisible.
Simply put, the first normal form requires that each column in the table contains only a single value; multiple values must not be stored in the same column. For example, if a student's course grades are stored in one column separated by commas, the design does not conform to the first normal form.
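A minimal sketch of the repair: the comma-separated grades column is split into one row per value, so that every field is atomic. Data and names are illustrative.

```python
# Non-1NF: one column packs several course grades into a single string.
non_1nf = [
    {"student": "Alice", "course_grades": "math:85,physics:90"},
    {"student": "Bob",   "course_grades": "math:78"},
]

# 1NF: one row per (student, course, grade), every column atomic.
normalized = []
for row in non_1nf:
    for pair in row["course_grades"].split(","):
        course, grade = pair.split(":")
        normalized.append(
            {"student": row["student"], "course": course, "grade": int(grade)}
        )

print(normalized)
```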
Second Normal Form (2NF)
The second normal form requires that, in addition to satisfying the first normal form, every non-primary-key column in a table depends on the entire primary key and not on just part of it.
For example, if students and course grades are stored in one table whose primary key is the combination of student ID and course code, the course name depends only on the course code, i.e. on part of the primary key, so the design violates the second normal form.
Third Normal Form (3NF)
On the basis of satisfying the second normal form, the third normal form additionally requires that no non-primary-key column depends transitively on the primary key; in other words, a non-key column must not be determined by another non-key column.
4.4 Decomposition of relational schemas
Relational schema decomposition is the process of splitting one relational schema (table) into two or more relational schemas (tables). It is an important step in database design.
Database designers usually need to decompose a large relational schema into several smaller ones, which can improve query performance, data integrity and flexibility. The decomposition can follow these steps:
Determine the relational schemas to be decomposed: first analyze all relational schemas in the database, identify which of them should be decomposed, and record them in a list.
Principles for judging the decomposition: decomposition is usually guided by two kinds of dependencies: functional dependencies and multivalued dependencies. In a functional dependency, one attribute or attribute set (the determinant) determines the value of another attribute; in a multivalued dependency, one attribute or attribute set determines a set of values of another attribute set.
Perform the decomposition: once the schema to be decomposed and the guiding dependencies have been determined, decompose the original schema into two or more schemas.
Determine the new schemas: the decomposed schemas need to meet certain conditions, for example each schema needs a primary key, and attributes should not be duplicated across schemas unnecessarily.
Determine foreign keys: in the decomposed schemas, foreign keys need to be defined to establish the relationships between them and ensure data integrity.
It should be noted that decomposition is not always necessary, and a poor decomposition can itself introduce unnecessary redundancy. Therefore, the database architecture should be analyzed and designed thoroughly before a relational schema is decomposed.
Suppose there is a relational schema (table) that stores students and the grades of the courses they have selected:
Student_Course_Grade (student name, course number, course name, grade)
This relational schema has a composite primary key made up of two attributes (student name and course number), but the course name depends only on the course number, i.e. on part of the key, so the schema does not satisfy the second normal form. It therefore needs to be decomposed so that student information, course information and grade information are stored separately.
The first step is to determine the relational schema to be decomposed, i.e. Student_Course_Grade.
The second step is to identify the functional dependencies that guide the decomposition. For this schema, the following dependencies can be observed:
Student name → student information (for example gender, age, etc., where such attributes are stored)
Course number → course name
(Student name, course number) → grade
According to these functional dependencies, the schema should be decomposed into the following relational schemas:
Student (student name, student information)
Course (course number, course name)
Grade (student name, course number, grade)
In this decomposition, the student name becomes the primary key of the Student table, the course number becomes the primary key of the Course table, and the grade is stored in the Grade table, whose primary key is the combination of student name and course number (both of which are also foreign keys).
Such a decomposition not only meets the requirements of the second normal form, but also improves expressiveness and standardization, while improving the efficiency of data storage and reducing the complexity of data updates.
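A decomposition should also be lossless: joining the pieces back together must reconstruct exactly the original data. For a split into two schemas there is a standard test: the decomposition is lossless if the shared attributes functionally determine all attributes of one side. A minimal sketch, reusing closure() from section 4.2, with attribute names abbreviated to English identifiers:

```python
# Binary lossless-join test: r1 and r2 are attribute sets, fds the FDs.
# Lossless iff (r1 & r2) -> (r1 - r2) or (r1 & r2) -> (r2 - r1) is in F+.
def lossless(r1, r2, fds):
    plus = closure(r1 & r2, fds)
    return (r1 - r2) <= plus or (r2 - r1) <= plus

F = [({"course_no"}, {"course_name"}),
     ({"student_name", "course_no"}, {"grade"})]

# Splitting the course name out of Student_Course_Grade:
r1 = {"student_name", "course_no", "grade"}
r2 = {"course_no", "course_name"}
print(lossless(r1, r2, F))  # True: course_no -> course_name
```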
4.5 Query Optimization
4.5.1 SQL query processing
SQL query processing refers to the process of converting SQL query statements into actual operations on the database. Query processing includes the following steps:
1. Lexical analysis (Lexer): decompose the SQL query statement into tokens, such as keywords, column names, operators and separators.
2. Parsing (Parser): according to the SQL grammar rules, combine the tokens produced by the lexical analyzer into a parse tree that represents the semantic structure of the SQL statement.
3. Query Transformation: in this step, the optimizer analyzes the query statement and converts it into a more efficient query plan, so that the request can be processed at the lowest cost.
4. Query Execution: in this step, the database management system executes the query plan generated in the transformation step, reads the data from the database, and returns the result set to the user. This is usually the most computation- and time-consuming step.
5. Result Presentation: display or output the query results in the required form, such as text tables, graphics, or JSON.
In query processing, the optimizer plays an important role: it rewrites query statements so that the resulting query plan is more efficient. There are many optimization techniques, such as choosing the right query algorithm or merging multiple query tasks.
In short, SQL query processing takes a written query statement through the lexical analysis, parsing, query transformation and query execution steps to produce the final result. During query transformation, the query optimizer works to improve query efficiency and performance so that the query takes less time.
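These steps can be observed from Python using SQLite (part of the standard library): EXPLAIN QUERY PLAN asks the engine to report the plan its optimizer chose instead of executing the query. The table is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO student (name) VALUES (?)",
                 [("Alice",), ("Bob",)])

# Lexing, parsing and optimization all happen here; the returned rows
# describe the chosen plan (e.g. a primary-key search) rather than data.
for row in conn.execute("EXPLAIN QUERY PLAN "
                        "SELECT name FROM student WHERE id = 1"):
    print(row)
conn.close()
```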
4.5.2 SQL query optimization
SQL query optimization is a broad concept that covers almost every aspect of query optimization. Among its techniques, algebraic optimization and physical optimization are the two main aspects.
Algebraic optimization
Algebraic optimization is an optimization technique for SQL query processing, designed on the basis of relational algebra. It focuses on optimizing the structure of the query parse tree and improving the efficiency of query algorithms, thereby improving overall query efficiency.
Algebraic optimization performs high-level rewrites of query statements, such as:
Merging query operations with the same conditions into one more efficient operation to avoid repeated computation.
Moving conditions (predicates) to better positions to improve query performance and avoid temporary tables.
Rewriting query statements into equivalent but faster forms.
After an SQL query statement has been converted into a query parse tree, algebraic optimization tries to simplify and reorganize the tree reasonably to achieve faster query speed and lower resource usage.
Heuristic rules for SQL algebraic optimization
- Merge queries
Combining multiple query operations into one can reduce query complexity, avoid repeated scans of the same table, and improve query performance.
- Push down selections and projections
Perform selection (filtering) and projection operations as early as possible to reduce the number of rows and columns carried through the query, thus reducing the amount of computation; see the sketch after this list.
- Eliminate useless relational operations
When redundant relational operations exist in a query, eliminating them as early as possible simplifies the query and improves performance.
- Merge adjacent relational operations
Merging adjacent relational operations (such as UNION, INTERSECT, EXCEPT, etc.) can reduce the amount of computation and improve query performance.
- Convert correlated subqueries into joins
Converting correlated subqueries into join operations can reduce the amount of computation and improve query performance.
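The effect of pushing selections down can be seen even without a database engine. The following minimal sketch joins two illustrative in-memory tables twice: once filtering after the join and once before it, with identical results but far fewer intermediate tuples.

```python
# Illustrative tables: 1000 students (10% in CS) and one grade row each.
students = [{"sid": i, "dept": "CS" if i % 10 == 0 else "EE"}
            for i in range(1000)]
grades = [{"sid": i, "grade": 60 + i % 40} for i in range(1000)]

# Naive plan: join first, filter afterwards.
joined = [(s, g) for s in students for g in grades if s["sid"] == g["sid"]]
late = [p for p in joined if p[0]["dept"] == "CS"]

# Pushed-down plan: filter students first, then join the smaller input.
cs_students = [s for s in students if s["dept"] == "CS"]
early = [(s, g) for s in cs_students for g in grades
         if s["sid"] == g["sid"]]

assert len(late) == len(early)          # same answer either way
print(len(joined), "intermediate tuples vs", len(early))  # 1000 vs 100
```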
Physical Optimization
Physical optimization is another optimization technique for SQL query processing; its purpose is to let the SQL engine make better use of the underlying storage devices and computing resources. Physical optimization pays more attention to how the query plan is executed, adjusting the allocation and use of computer resources to improve the efficiency of the individual query operations.
Specific examples of physical optimization include:
Selecting the sorting method according to the size of the table and the availability of indexes.
Analyzing disk and memory usage during the query and choosing an appropriate storage strategy.
Executing multiple queries in parallel.
Predicting cache read traffic and taking steps to relieve I/O pressure.
It can be seen that algebraic optimization mainly reorganizes and simplifies SQL query statements, while physical optimization involves more technologies and focuses on the actual physical execution of the query plan and on the use of resources such as the cache, disk and memory. Both are common techniques in SQL query optimization and can greatly improve database performance and query speed.
Heuristic rules for physical optimization
- Use the right storage engine
Different storage engines have different characteristics and applicable scenarios. Choosing the appropriate storage engine based on factors such as data volume, data type, query mode, etc. can improve query performance and processing speed.
- Select the right index
Choosing the right index is key to improving query performance. By analyzing the data and observing query behavior, determining which columns to index and which index types to use helps the database system access data more efficiently; see the sketch after this list.
- Data partitioning and segmentation
If a query involves a large amount of data, you can divide the data into multiple parts and then use a different query algorithm for each part, or process it in parallel on different nodes to improve query efficiency.
- Execution plan caching
Caching execution plans avoids repeatedly optimizing the same query and improves performance; it suits scenarios with a high rate of repeated queries.
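The index rule is easy to demonstrate with SQLite: the same query is planned as a full table scan until a suitable index exists, after which the planner switches to an index search. Table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")

query = ("EXPLAIN QUERY PLAN "
         "SELECT * FROM orders WHERE customer = 'Alice'")
print(conn.execute(query).fetchall())  # plan reports a full table SCAN

# After creating an index on the filtered column, the plan changes.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
print(conn.execute(query).fetchall())  # plan now reports an index SEARCH
conn.close()
```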
The above are some of the more common heuristic rules in SQL algebraic optimization and physical optimization. In different scenarios, more specific rules can be formulated based on the actual conditions to achieve more precise optimization goals.