SQL: Deletes Duplicate Rows

DELETE FROM employee e

WHERE e.emp_id NOT IN (

SELECT MIN(emp_id)

FROM employee

GROUP BY name, email

);

💡 Explanation:

This query deletes duplicate rows in the EMPLOYEE table, while keeping the row with the minimum emp_id for each combination of name and email.

The subquery identifies the first occurrence (based on the minimum emp_id) of each combination of name and email and keeps it.

The outer query deletes the rest of the rows where the emp_id is not in that list.

🧪 Test Before Deleting

You can verify the duplicates you’re about to delete with this query:

sql

Copy

Edit

SELECT name, email, COUNT(*)

FROM employee

GROUP BY name, email

HAVING COUNT(*) > 1;

This will give you a list of duplicates based on the name and email.

⚠️ Notes:

This method ensures you're keeping one row per duplicate group based on the emp_id — you can adjust it to keep other criteria (like the latest row or the one with the highest salary, etc.).

If your table has more columns and you need a more complex comparison for duplicates (e.g., on multiple columns), you can modify the GROUP BY clause accordingly.

Let me know if you need help adapting this to a more complex schema or business rule!

You said:

Explain me with sample data how this query will delete duplicate rows

ChatGPT said:

Absolutely! Let’s walk through a real example using sample data to show how the query works.

🔹 Sample Table: EMPLOYEE

EMP_ID NAME EMAIL DEPARTMENT

1 John Doe john.doe@email.com HR

2 Jane Smith jane.smith@email.com IT

3 John Doe john.doe@email.com HR

4 Alice Brown alice.brown@email.com Finance

5 John Doe john.doe@email.com HR

6 Jane Smith jane.smith@email.com IT

Here, we have duplicate rows based on the combination of NAME and EMAIL. For example:

There are 3 rows for John Doe with the same email (john.doe@email.com).

There are 2 rows for Jane Smith with the same email (jane.smith@email.com).

Our goal is to delete the duplicates, but keep one row per unique combination of NAME and EMAIL.

🔹 The Query

sql

Copy

Edit

DELETE FROM employee e

WHERE e.emp_id NOT IN (

SELECT MIN(emp_id)

FROM employee

GROUP BY name, email

);

Let’s break it down step-by-step:

1. Subquery:

sql

Copy

Edit

SELECT MIN(emp_id)

FROM employee

GROUP BY name, email;

This part of the query identifies the first occurrence of each duplicate group by selecting the minimum emp_id for each unique combination of NAME and EMAIL.

Output of the subquery:

NAME EMAIL MIN(EMP_ID)

John Doe john.doe@email.com 1

Jane Smith jane.smith@email.com 2

Alice Brown alice.brown@email.com 4

For John Doe and john.doe@email.com, the minimum emp_id is 1.

For Jane Smith and jane.smith@email.com, the minimum emp_id is 2.

For Alice Brown and alice.brown@email.com, the minimum emp_id is 4.

2. Main Query:

sql

Copy

Edit

DELETE FROM employee e

WHERE e.emp_id NOT IN (

SELECT MIN(emp_id)

FROM employee

GROUP BY name, email

);

This deletes all rows where the emp_id is not in the list of minimum emp_ids returned by the subquery.

Applying the query:

It will keep the row with emp_id = 1 (the first row for John Doe).

It will keep the row with emp_id = 2 (the first row for Jane Smith).

It will keep the row with emp_id = 4 (the only row for Alice Brown).

It will delete the other rows for John Doe and Jane Smith because their emp_ids are not the minimum (3 and 5 for John Doe; 6 for Jane Smith).

🔹 Final Result

After running the query, the remaining rows in the EMPLOYEE table will be:

EMP_ID NAME EMAIL DEPARTMENT

1 John Doe john.doe@email.com HR

2 Jane Smith jane.smith@email.com IT

4 Alice Brown alice.brown@email.com Finance

Rows Deleted:

Row with emp_id = 3 for John Doe.

Row with emp_id = 5 for John Doe.

Row with emp_id = 6 for Jane Smith.

✅ Conclusion:

The query successfully keeps one row per duplicate group based on the NAME and EMAIL.

It deletes the extra duplicates while preserving the row with the minimum emp_id for each combination of NAME and EMAIL.

⚠️ What to Keep in Mind:

This query deletes duplicates based on a specific set of columns — in our case, NAME and EMAIL.

You can adjust the query to keep other criteria (e.g., the latest row or highest emp_id).

If you want to delete duplicates based on all columns, you can modify the GROUP BY clause to include all relevant columns.

Let me know if you need any further clarification or adjustments!

You said:

What if John Doe data is also duplicated

ChatGPT said:

Great question! Let's expand the scenario to handle the case where John Doe's data is duplicated across all columns, not just based on NAME and EMAIL.

🔹 New Scenario: All Columns are Considered for Duplicates

Suppose the EMPLOYEE table now looks like this, where all columns are duplicated for John Doe:

EMP_ID NAME EMAIL DEPARTMENT

1 John Doe john.doe@email.com HR

2 Jane Smith jane.smith@email.com IT

3 John Doe john.doe@email.com HR

4 Alice Brown alice.brown@email.com Finance

5 John Doe john.doe@email.com HR

6 Jane Smith jane.smith@email.com IT

In this case, we want to delete all duplicate rows for John Doe, considering all columns (NAME, EMAIL, DEPARTMENT) for comparison.

🔹 The Updated Query

To delete rows where all columns are duplicated, you can modify the GROUP BY clause to group by all the columns in the table:

sql

Copy

Edit

DELETE FROM employee e

WHERE e.emp_id NOT IN (

SELECT MIN(emp_id)

FROM employee

GROUP BY name, email, department

);

💡 How This Query Works:

Subquery:

sql

Copy

Edit

SELECT MIN(emp_id)

FROM employee

GROUP BY name, email, department;

This subquery groups rows by NAME, EMAIL, and DEPARTMENT, and selects the minimum emp_id for each unique combination of these columns.

For example, if there are multiple rows with the same NAME, EMAIL, and DEPARTMENT, it will select the row with the smallest emp_id.

Main Query:

sql

Copy

Edit

DELETE FROM employee e

WHERE e.emp_id NOT IN (

SELECT MIN(emp_id)

FROM employee

GROUP BY name, email, department

);

The outer query deletes rows where the emp_id is not in the list of minimum emp_ids returned by the subquery.

It ensures that only one row per unique combination of NAME, EMAIL, and DEPARTMENT is kept, and duplicates are deleted.

🔹 Sample Data: How the Query Will Work

Given the following table:

EMP_ID NAME EMAIL DEPARTMENT

1 John Doe john.doe@email.com HR

2 Jane Smith jane.smith@email.com IT

3 John Doe john.doe@email.com HR

4 Alice Brown alice.brown@email.com Finance

5 John Doe john.doe@email.com HR

6 Jane Smith jane.smith@email.com IT

The subquery would return:

NAME EMAIL DEPARTMENT MIN(EMP_ID)

John Doe john.doe@email.com HR 1

Jane Smith jane.smith@email.com IT 2

Alice Brown alice.brown@email.com Finance 4

What happens next?

For John Doe, the minimum emp_id is 1, so we keep the row with emp_id = 1 (the first occurrence of John Doe).

For Jane Smith, the minimum emp_id is 2, so we keep the row with emp_id = 2.

For Alice Brown, there is only one row, so we keep it.

The duplicate rows for John Doe (with emp_id = 3 and emp_id = 5) will be deleted.

🔹 Final Result:

After running the query, the table will look like this:

EMP_ID NAME EMAIL DEPARTMENT

1 John Doe john.doe@email.com HR

2 Jane Smith jane.smith@email.com IT

4 Alice Brown alice.brown@email.com Finance

Deleted Rows:

Row with emp_id = 3 for John Doe.

Row with emp_id = 5 for John Doe.

✅ Summary:

The query ensures that only one row per unique combination of NAME, EMAIL, and DEPARTMENT is kept.

It deletes any duplicate rows where all columns (NAME, EMAIL, DEPARTMENT) match.

⚠️ Additional Considerations:

Ordering of rows: If you want to keep a different row (e.g., the latest or the one with the highest emp_id), you can change the MIN(emp_id) to MAX(emp_id) in the subquery.

If you want to identify or prevent any remaining duplicates, you can run this query before deleting:

sql

Copy

Edit

SELECT name, email, department, COUNT(*)

FROM employee

GROUP BY name, email, department

HAVING COUNT(*) > 1;

This will list any combinations that are still duplicated.

SQL

Sunday, April 6, 2025

Deletes Duplicate Rows

No comments:

Post a Comment

SQL Interview Questions