postgres remove duplicates sql - ghdrako/doc_snipets GitHub Wiki
The standard approach is to GROUP BY on the duplicate columns and keep one remaining row using the MIN(id) or MAX(id) value.
Using row identity
BEGIN;
LOCK TABLE new_cust IN SHARE ROW EXCLUSIVE MODE;
CREATE TEMPORARY TABLE dups_cust AS
SELECT customerid, min(ctid) AS min_ctid
FROM new_cust
GROUP BY customerid
HAVING count(*) > 1;
DELETE FROM new_cust
USING dups_cust
WHERE new_cust.customerid = dups_cust.customerid
AND new_cust.ctid != dups_cust.min_ctid;
COMMIT;
VACUUM new_cust;
This simple way of deleting duplicate rows will not work when additional requirements have to be followed:
- Instead of deleting all duplicate rows, some should be kept. Duplicate rows may be a valid use-case for some applications but they should be limited to e.g. the last five ones created.
- The remaining row should not be the first or last one created. In some cases, additional columns set a priority to keep a row: A verified user should not be deleted to keep an unverified one.
Using window function
- The rows are partitioned by the columns indicating a duplicate row. For every combination of the specified columns a partition is automatically created to collect the duplicate rows.
- Every partition is sorted by a number of columns to mark their importance. If e.g. only the last five records should be kept, the partition's rows should be sorted by their creation date in descending order.
- The sorted rows within a partition are assigned an incrementing number by the ROW_NUMBER window function.
- Any row can be deleted according to the desired number of remaining rows. When e.g. only the last five rows should be kept, any row with a row number greater than five can be deleted.
WITH duplicates AS (
SELECT id, ROW_NUMBER() OVER(
PARTITION BY firstname, lastname, email
ORDER BY age DESC
) AS rownum
FROM contacts
)
DELETE FROM contacts
USING duplicates
WHERE contacts.id = duplicates.id AND duplicates.rownum > 1;
MySQL
WITH duplicates AS (
SELECT id, ROW_NUMBER() OVER(
PARTITION BY firstname, lastname, email
ORDER BY age DESC
) AS rownum
FROM contacts
)
DELETE contacts
FROM contacts
JOIN duplicates USING(id)
WHERE duplicates.rownum > 1