Data Warehouse Knowledge 10_Data Generalization

Data generalization (attribute-oriented induction; from specific to general)

1. Definition

Data generalization: summarize data by replacing low-level concepts (e.g., numeric age values) with higher-level concepts (e.g., youth, middle-aged, elderly), or by reducing the number of dimensions so that the data is summarized in a smaller concept space (e.g., dropping the birthday and phone number attributes when summarizing a group of students).
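A minimal sketch of both ideas in the definition, for illustration only. The age cut-offs, the sample records, and the names generalize_age and summarize are assumptions, not part of the original notes.

```python
from collections import Counter

# Hypothetical concept hierarchy for age; the cut-off values are assumptions.
def generalize_age(age):
    if age < 35:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "elderly"

# Made-up sample records.
students = [
    {"name": "A", "age": 19, "birthday": "03-01", "phone": "555-0101", "major": "CS"},
    {"name": "B", "age": 22, "birthday": "07-15", "phone": "555-0102", "major": "CS"},
    {"name": "C", "age": 41, "birthday": "11-30", "phone": "555-0103", "major": "Math"},
]

def summarize(records):
    """Drop name, birthday and phone; replace age with a higher-level concept."""
    counts = Counter()
    for r in records:
        counts[(generalize_age(r["age"]), r["major"])] += 1
    return dict(counts)

print(summarize(students))
```

Each key of the result is a generalized tuple and each value is the number of original records it covers.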

2. Two generalization methods

(1) Data cube-based summarization (the OLAP approach):

a. Complex data types and aggregation
Data warehouses and OLAP tools are based on a multidimensional data model that views data in the form of a data cube, consisting of dimensions (or attributes) and measures (aggregate functions). However, many OLAP systems restrict dimensions to non-numeric data and measures to numeric data, whereas a database may contain attributes of many types, including numeric, non-numeric, spatial, text, and image.

b. User control versus automatic processing
Online analytical processing in a data warehouse is a user-controlled process. The selection of dimensions and the application of OLAP operations (roll-up, drill-down, slice, dice) are directed and controlled by the user.
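As a rough illustration of user-directed OLAP operations, the sketch below performs a roll-up and a slice on a tiny fact table. The data, the city-to-country hierarchy, and the function names are all hypothetical; a real OLAP tool would expose these operations through its own interface.

```python
from collections import defaultdict

# Hypothetical fact rows: two dimensions (city, quarter) and one measure (sales).
facts = [
    ("Vancouver", "Q1", 100),
    ("Vancouver", "Q2", 150),
    ("Toronto", "Q1", 200),
    ("Chicago", "Q1", 300),
]

# Hypothetical concept hierarchy on the location dimension.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up_city_to_country(rows):
    """Roll-up: climb the location hierarchy and re-aggregate the measure."""
    totals = defaultdict(int)
    for city, quarter, sales in rows:
        totals[(city_to_country[city], quarter)] += sales
    return dict(totals)

def slice_quarter(rows, quarter):
    """Slice: fix one dimension (quarter) to a single value."""
    return [row for row in rows if row[1] == quarter]

print(roll_up_city_to_country(facts))
print(slice_quarter(facts, "Q1"))
```

Here the user chooses which dimension to roll up and which value to slice on, which is the user-controlled aspect described above.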

(2) Attribute-oriented induction

First collect the task-relevant data with a database query, then generalize according to the number of distinct values of each attribute. There are generally two methods:

Attribute removal: an attribute in the initial working relation has a large number of distinct values, but either there is no generalization operator for it or its higher-level concepts are already expressed by other attributes. Such an attribute should be removed.

Attribute generalization: an attribute in the initial working relation has a large number of distinct values, and a set of generalization operators exists for it. A generalization operator should be selected and applied to the attribute.

Summary: attributes with a large number of distinct values should be either removed or further generalized.
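The two rules above can be sketched as a small decision function. The function name, its parameters, and the "keep" outcome for attributes that already have few distinct values are assumptions made for illustration.

```python
def plan_attribute(values, threshold, has_hierarchy, covered_by_other=False):
    """Decide what to do with one attribute of the working relation.

    values           -- the attribute's values in the working relation
    threshold        -- maximum number of distinct values allowed (assumed parameter)
    has_hierarchy    -- True if a generalization operator exists for the attribute
    covered_by_other -- True if its higher-level concepts appear in another attribute
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "keep"          # few distinct values: nothing to do
    if not has_hierarchy or covered_by_other:
        return "remove"        # attribute removal
    return "generalize"        # attribute generalization

print(plan_attribute(["Ann", "Bob", "Cat", "Eve"], 2, has_hierarchy=False))
print(plan_attribute([23, 27, 24, 31], 2, has_hierarchy=True))
print(plan_attribute(["CS", "Physics", "CS"], 2, has_hierarchy=True))
```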

3. Generalization control

Generalizing an attribute too far leads to over-generalization, and the result carries little useful information.

Generalizing too little leaves the result too detailed to be informative.

Method 1: Attribute generalization threshold control
Set a threshold on the number of distinct values allowed for each attribute, typically 2 to 8. Depending on the result, the user can then roll up or drill down further.

Method 2: Generalized relation threshold control
Set a threshold on the number of tuples in the generalized relation, typically 10 to 30.
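A minimal sketch of attribute threshold control, assuming the concept hierarchy is given as a list of lookup tables, one per level. The location hierarchy and the function name are hypothetical.

```python
def generalize_to_threshold(values, levels, threshold):
    """Climb the hierarchy one level at a time until the number of
    distinct values is at most `threshold`, or the hierarchy runs out
    (in which case the top level is returned even if still above it)."""
    current = list(values)
    for mapping in levels:
        if len(set(current)) <= threshold:
            break
        current = [mapping[v] for v in current]
    return current

# Hypothetical two-level hierarchy: city -> province/state -> country.
city_to_region = {"Vancouver": "BC", "Victoria": "BC", "Toronto": "Ontario",
                  "Ottawa": "Ontario", "Seattle": "Washington"}
region_to_country = {"BC": "Canada", "Ontario": "Canada", "Washington": "USA"}

cities = ["Vancouver", "Victoria", "Toronto", "Ottawa", "Seattle"]
levels = [city_to_region, region_to_country]

print(generalize_to_threshold(cities, levels, threshold=3))
print(generalize_to_threshold(cities, levels, threshold=2))
```

A generalized relation threshold could be checked in the same spirit by counting the distinct generalized tuples and generalizing further while that count exceeds the limit.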

4. The implementation process of attribute oriented induction

(1) The first step of the algorithm is essentially a relational query that collects the task-relevant data into the working relation W. Its efficiency depends on the query processing method used.

(2) Collect statistics on the working relation. This requires scanning the relation at most once.

(3) Derive the prime relation P by scanning each tuple of the working relation and inserting the generalized tuple into P.
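The three steps above can be sketched as follows. This is a simplified illustration, not a full implementation: each attribute is generalized by at most one level, identical generalized tuples are merged with a count, and all names (aoi, age_band, the sample rows) are assumptions.

```python
from collections import Counter

def age_band(age):
    # Hypothetical one-level generalization operator for age.
    return "25 or under" if age <= 25 else "over 25"

def aoi(rows, select, attributes, hierarchies, attr_threshold):
    # Step 1: relational query, collect task-relevant data into W.
    W = [{a: r[a] for a in attributes} for r in rows if select(r)]

    # Step 2: one scan of W to collect the distinct values of each attribute.
    distinct = {a: set() for a in attributes}
    for t in W:
        for a in attributes:
            distinct[a].add(t[a])

    # Decide per attribute: keep as is, generalize, or remove.
    kept, mappers = [], {}
    for a in attributes:
        if len(distinct[a]) <= attr_threshold:
            kept.append(a)
            mappers[a] = lambda v: v
        elif a in hierarchies:
            kept.append(a)
            mappers[a] = hierarchies[a]
        # otherwise: no generalization operator, so the attribute is removed

    # Step 3: scan W again, insert each generalized tuple into P,
    # merging identical tuples and accumulating their count.
    P = Counter()
    for t in W:
        P[tuple(mappers[a](t[a]) for a in kept)] += 1
    return kept, dict(P)

rows = [
    {"name": "Ann", "status": "graduate", "major": "CS", "age": 23},
    {"name": "Bob", "status": "graduate", "major": "Physics", "age": 27},
    {"name": "Cat", "status": "graduate", "major": "CS", "age": 24},
    {"name": "Dan", "status": "undergraduate", "major": "Math", "age": 20},
    {"name": "Eve", "status": "graduate", "major": "Physics", "age": 31},
]

kept, P = aoi(rows, lambda r: r["status"] == "graduate",
              ["name", "major", "age"], {"age": age_band}, attr_threshold=2)
print(kept)
print(P)
```

In this run, name is removed (many distinct values, no operator), major is kept unchanged, and age is generalized.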

5. Attribute-oriented induction for class comparison

(1) The process of class comparison

a. Data collection: collect the relevant data from the database through query processing and divide it into a target class and one or more contrasting classes.
b. Dimension relevance analysis: if there are many dimensions, perform dimension relevance analysis on the classes.
c. Synchronous generalization: generalize the target class to the level controlled by the dimension threshold specified by the user or a domain expert, producing the prime target class relation.
d. Presentation of the derived comparison.
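A rough sketch of steps a, c, and d, for illustration only. It skips dimension relevance analysis (step b), generalizes the target and contrasting classes with the same operators so their tuples are comparable, and uses made-up data; the names generalize, compare, and age_band are assumptions.

```python
from collections import Counter

def age_band(age):
    # Hypothetical generalization operator for age.
    return "25 or under" if age <= 25 else "over 25"

def generalize(rows, mappers):
    """Apply the same generalization operators to every row and count tuples."""
    return Counter(tuple(f(r[a]) for a, f in mappers.items()) for r in rows)

def compare(target, contrast, mappers):
    """Return {generalized tuple: (count in target, count in contrast)}."""
    t = generalize(target, mappers)
    c = generalize(contrast, mappers)
    return {k: (t[k], c[k]) for k in sorted(set(t) | set(c))}

rows = [
    {"status": "graduate", "major": "CS", "age": 23},
    {"status": "graduate", "major": "CS", "age": 28},
    {"status": "graduate", "major": "Physics", "age": 30},
    {"status": "undergraduate", "major": "CS", "age": 19},
    {"status": "undergraduate", "major": "Math", "age": 21},
]

# Step a: split the collected data into a target class and a contrasting class.
target = [r for r in rows if r["status"] == "graduate"]
contrast = [r for r in rows if r["status"] == "undergraduate"]

# Step c: generalize both classes to the same level; step d: present the result.
mappers = {"major": lambda v: v, "age": age_band}
for key, (n_target, n_contrast) in compare(target, contrast, mappers).items():
    print(key, "target:", n_target, "contrast:", n_contrast)
```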