postgres remove duplicates sql - ghdrako/doc_snipets GitHub Wiki

The standard approach is to GROUP BY on the duplicate columns and keep one remaining row using the MIN(id) or MAX(id) value.

Using row identity

BEGIN;
LOCK TABLE new_cust IN SHARE ROW EXCLUSIVE MODE;

CREATE TEMPORARY TABLE dups_cust AS
SELECT customerid, min(ctid) AS min_ctid
FROM new_cust
GROUP BY customerid
HAVING count(*) > 1;

DELETE FROM new_cust
USING dups_cust
WHERE new_cust.customerid = dups_cust.customerid
AND new_cust.ctid != dups_cust.min_ctid;

COMMIT;

VACUUM new_cust;

This simple way of deleting duplicate rows will not work when additional requirements have to be followed:

  • Instead of deleting all duplicate rows, some should be kept. Duplicate rows may be a valid use-case for some applications but they should be limited to e.g. the last five ones created.
  • The remaining row should not be the first or last one created. In some cases, additional columns set a priority to keep a row: A verified user should not be deleted to keep an unverified one.

Using window function

  1. The rows are partitioned by the columns indicating a duplicate row. For every combination of the specified columns a partition is automatically created to collect the duplicate rows.
  2. Every partition is sorted by a number of columns to mark their importance. If e.g. only the last five records should be kept, the partition's rows should be sorted by their creation date in descending order.
  3. The sorted rows within a partition are assigned an incrementing number by the ROW_NUMBER window function.
  4. Any row can be deleted according to the desired number of remaining rows. When e.g. only the last five rows should be kept, any row with a row number greater than five can be deleted.
WITH duplicates AS (
  SELECT id, ROW_NUMBER() OVER(
    PARTITION BY firstname, lastname, email
    ORDER BY age DESC
  ) AS rownum
  FROM contacts
)
DELETE FROM contacts
USING duplicates
WHERE contacts.id = duplicates.id AND duplicates.rownum > 1;

MySQL

WITH duplicates AS (
  SELECT id, ROW_NUMBER() OVER(
    PARTITION BY firstname, lastname, email
    ORDER BY age DESC
  ) AS rownum
  FROM contacts
)
DELETE contacts
FROM contacts
JOIN duplicates USING(id)
WHERE duplicates.rownum > 1