1. Full scale: all the latest data on the daily status,
2. Incremental table: New data added every day, incremental data is the new data after the last export.
3. Zipper list: a table that maintains historical status and the latest status data. Zipper list is actually equivalent to a snapshot according to the different zipper granularity. It is just optimized to remove some unchanged records. The zipper list can easily restore the customer records at the zipper point.
4. Flow table: Every modification of the table will be recorded and can be used to reflect the changes in actual records.
Zipper lists are usually the result of processing and retaining historical changes in account information, and the flow table is the history of daily transactions;
The flow table is used to count business-related situations, and the zipper list is used to count accounts and customers' situations.
Zipper list of data warehouse (principles, design and implementation in Hive)
In some cases, in order to maintain some states in history, it is necessary to use a zipper list, which can save space while preserving all states.
Zipper list is suitable for the following situations
The amount of data is a bit large, and some fields in the table have changed, but the frequency of changes is not very high. Business needs to count this change status. It is a bit unrealistic to calculate the full amount every day.
Not only is it a waste of storage space, but sometimes business statistics may also be a bit troublesome. At this time, the role of the zipper list is withdrawn, saving space and meeting the needs.
Generally, it is represented by adding begin_date and en_date in the number bin. In the following example, the last two columns are start_date and end_date.
1 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-20
1 2016-08-20 2016-08-21 Payment 2016-08-21 2016-08-21
1 2016-08-20 2016-08-22 Completed 2016-08-22 9999-12-31
2 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-20
2 2016-08-20 2016-08-21 Completed 2016-08-21 9999-12-31
3 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-21
3 2016-08-20 2016-08-22 Payment 2016-08-22 9999-12-31
4 2016-08-21 2016-08-21 Created 2016-08-21 2016-08-21
4 2016-08-21 2016-08-22 Payment 2016-08-22 9999-12-31
5 2016-08-22 2016-08-22 Created 2016-08-22 9999-12-31
begin_date represents the life cycle start time of the record, and end_date represents the life cycle end time of the record;
end_date = ‘9999-12-31’ means that the record is currently in valid status;
If all currently valid records are query, select * from order_his where dw_end_date = ‘9999-12-31'
If you query the historical snapshot of 2016-08-21, select * from order_his where begin_date <= ‘2016-08-21′ and end_date >= ‘2016-08-21’
Let me briefly introduce the update of the zipper list:
Suppose it is based on the dimension of the day, and the last state of the day is the final state of the day.
Taking an order table as an example, the following is the original data and the daily order status details
1 2016-08-20 2016-08-20 Created
2 2016-08-20 2016-08-20 Created
3 2016-08-20 2016-08-20 Created
1 2016-08-20 2016-08-21 Payment
2 2016-08-20 2016-08-21 Completed
4 2016-08-21 2016-08-21 Created
1 2016-08-20 2016-08-22 Completed
3 2016-08-20 2016-08-22 Payment
4 2016-08-21 2016-08-22 Payment
5 2016-08-22 2016-08-22 Created
According to the zipper list, what we want is
1 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-20
1 2016-08-20 2016-08-21 Payment 2016-08-21 2016-08-21
1 2016-08-20 2016-08-22 Completed 2016-08-22 9999-12-31
2 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-20
2 2016-08-20 2016-08-21 Completed 2016-08-21 9999-12-31
3 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-21
3 2016-08-20 2016-08-22 Payment 2016-08-22 9999-12-31
4 2016-08-21 2016-08-21 Created 2016-08-21 2016-08-21
4 2016-08-21 2016-08-22 Payment 2016-08-22 9999-12-31
5 2016-08-22 2016-08-22 Created 2016-08-22 9999-12-31
It can be seen that 1, 2, 3, 4 has the status of each order, and the current valid status can also be counted.
This example takes hive as an example, only considers the implementation and has nothing to do with performance
Create the table first
CREATE TABLE orders (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
) row format delimited fields terminated by '\t'
CREATE TABLE ods_orders_inc (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
) PARTITIONED BY (day STRING)
row format delimited fields terminated by '\t'
CREATE TABLE dw_orders_his (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING,
dw_start_date STRING,
dw_end_date STRING
) row format delimited fields terminated by '\t' ;
First, we will update the data until 2016-08-20.
Initialize, first initialize the data from 2016-08-20
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2016-08-20')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE createtime < '2016-08-21' and modified time <'2016-08-21';
Flash to dw
INSERT overwrite TABLE dw_orders_his
SELECT orderid,createtime,modifiedtime,status,
createtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2016-08-20';
The following results
select * from dw_orders_his;
OK
1 2016-08-20 2016-08-20 Created 2016-08-20 9999-12-31
2 2016-08-20 2016-08-20 Created 2016-08-20 9999-12-31
3 2016-08-20 2016-08-20 Created 2016-08-20 9999-12-31
The remaining needs incremental updates
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2016-08-21')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE (createtime = '2016-08-21' and modifiedtime = '2016-08-21') OR modifiedtime = '2016-08-21';
select * from ods_orders_inc where day='2016-08-21';
OK
1 2016-08-20 2016-08-21 Payment 2016-08-21
2 2016-08-20 2016-08-21 Completed 2016-08-21
4 2016-08-21 2016-08-21 Created 2016-08-21
First put it in the incremental table, then associate it to a temporary table, and insert it into a new table
DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
SELECT,
,
,
,
a.dw_start_date,
CASE WHEN IS NOT NULL AND a.dw_end_date > '2016-08-21' THEN '2016-08-21' ELSE a.dw_end_date END AS dw_end_date
FROM dw_orders_his a
left outer join (SELECT * FROM ods_orders_inc WHERE day = '2016-08-21') b
ON ( = )
UNION ALL
SELECT orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2016-08-21'
) x
ORDER BY orderid,dw_start_date;
INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;
According to the above steps, update the data of 2016-08-22, and the final result is as follows
select * from dw_orders_his;
OK
1 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-20
1 2016-08-20 2016-08-21 Payment 2016-08-21 2016-08-21
1 2016-08-20 2016-08-22 Completed 2016-08-22 9999-12-31
2 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-20
2 2016-08-20 2016-08-21 Completed 2016-08-21 9999-12-31
3 2016-08-20 2016-08-20 Created 2016-08-20 2016-08-21
3 2016-08-20 2016-08-22 Payment 2016-08-22 9999-12-31
4 2016-08-21 2016-08-21 Created 2016-08-21 2016-08-21
4 2016-08-21 2016-08-22 Payment 2016-08-22 9999-12-31
5 2016-08-22 2016-08-22 Created 2016-08-22 9999-12-31
At this point, we get the data we want.
It is worth noting that the data in the order table has multiple status updates on the same day, and the last status of the day should be the final status of the day. For example, the order status is created, paid, and completed within one day. The final status should be pulled for training table updates, otherwise the subsequent data may have abnormalities, such as
6 2016-08-22 2016-08-22 Created 2016-08-22 9999-12-31
6 2016-08-22 2016-08-22 Payment 2016-08-22 9999-12-31
6 2016-08-22 2016-08-22 Completed 2016-08-22 9999-12-31
/wujin/p/
/p/799252156379
/archives/2015/04/