web123456

MySql million data de-duplication to keep one copy, one-to-many merge processing

I recently helped the Ops team with a rush of data that was on the shelf, about 1.4 million volumes, so let's look at how it was handled.

Most of this data is duplicated, and the processed data is around 300,000, and then a one-to-many merge process is performed.

The following provides two methods or ideas, the first idea is to provide more of a kind of online, that is, through the uniqueness of the constraints on the field for sorting, and then query processing out of these data are duplicates of the first data, so the id is bound to be the smallest of these duplicates, and then delete these duplicates of the id is not the smallest of these duplicates can be derived from the screening results.

Of course the data above the design to commercial confidentiality, here I will create a demo myself.

The sql is as follows:

  1. // Filter out the duplicate data with the smallest id and delete it in the second step
  2. select id FROM haha
  3. where CONCAT(namea,sex,num) in (
  4. SELECT CONCAT(namea,sex,num) FROM haha
  5. GROUP BY CONCAT(namea,sex,num)
  6. HAVING count(id> 1)
  7. )
  8. //Perform a delete operation to remove redundant data
  9. DELETE FROM haha where id in (
  10. select id FROM haha
  11. where CONCAT(namea,sex,num) in (
  12. SELECT CONCAT(namea,sex,num) FROM haha
  13. GROUP BY CONCAT(namea,sex,num)
  14. HAVING count(id> 1)
  15. )
  16. and id not in (
  17. SELECT MIN(id) as id FROM haha
  18. GROUP BY CONCAT(namea,sex,num)
  19. HAVING COUNT(id > 1)
  20. ))

Of course there were some errors encountered in writing the sql.

1, select * from haha where id in (select id , namea from haha ) select in the right side can only be id, so namea should be removed, in addition, name in theMySqlThe keyword is displayed in the field, so be careful when naming this field to avoid causing unnecessary trouble or errors.

2、ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your

This error is saying that there is an error in the symbols in the sql statement, so we need to check the punctuation, whether the comma is in English or not, and then whether the parentheses are too much or too little

3、To note the place, MySql does not allow direct deletion of the query out of the data, so we query out of the data to be stored in a temporary table in the delete

The second method, because we know that the final result of the data to be processed is about 300,000, which for 1.4 million, go to delete the remaining 1.1 million data, obviously we only need to focus on the 300,000 data can be, so we directly import the query out of the data into a new table, the

  1. create table finish as
  2. select * FROM haha
  3. where CONCAT(namea,sex,num) in (
  4. SELECT CONCAT(namea,sex,num) FROM haha
  5. GROUP BY CONCAT(namea,sex,num)
  6. HAVING count(id> 1)
  7. )

The data is then processed 1-to-many:

The original data looked like this:

The results are as follows: