SQL – Finding Duplicate Records based certain criteria

SQL – Finding Duplicate Records based certain criteria

Problem Description:

I have these records in the table – employee_projects

idemployee_idproject_idstatus
1emp1proj1VERIFIED
2emp2proj2REJECTED
3emp1proj1VERIFIED
4emp1proj3REJECTED
5emp2proj2REQUIRED
6emp3proj4SUBMITTED
7emp4proj5VERIFIED
8emp4proj6VERIFIED
9emp3proj4REQUIRED

Here are the criteria for determining duplicates:

  1. Same employee ID, same project ID under the same status (Example: rows 1 and 3 are duplicates)
  2. Same employee ID, same project ID but in different status (Example: rows 6 and 9 are duplicates).
    An exception to duplication criteria#2 is if one project is REQUIRED and the same project is also REJECTED under the same employee, this is NOT considered a duplicate. For example, rows 2 and 5 are NOT duplicates.

I have a query for the first criterion:

select
    emp_id,
    proj_id,
    status,
    COUNT(*)
from
    employee_projects
group by
    emp_id,
    proj_id,
    status
having
    COUNT(*) > 1

What I’m struggling to construct is the SQL for the second criterion.

Solution – 1

maybe a self join can help you.

with t (employee_id ,project_id,status)
as
(
select 'emp1',  'proj1' ,   'VERIFIED'
Union all select 'emp2',    'proj2' ,   'REJECTED'
Union all select 'emp1',    'proj1' ,   'VERIFIED'
Union all select 'emp1',    'proj3' ,   'REJECTED'
Union all select 'emp2',    'proj2' ,   'REQUIRED'
Union all select 'emp3',    'proj4' ,   'SUBMITTED'
Union all select 'emp4',    'proj5' ,   'VERIFIED'
Union all select 'emp4',    'proj6' ,   'VERIFIED'
Union all select 'emp3',    'proj4' ,   'REQUIRED'
)
select
    t.employee_id,
    t.project_id,
    t.status,
    '' as status,
    'criteria#1' as SQL
from
    t
group by
    t.employee_id,
    t.project_id,
    t.status
having
    COUNT(*) > 1
union all
SELECT 
    t.employee_id,
    t.project_id,
    t.status,
    a.status,
    'criteria#2' as SQL
FROM
    t
    left join t as a on 
        t.employee_id = a.employee_id and
        t.project_id = a.project_id
where 
    t.status != a.status and
    concat(t.status,a.status) != 'REQUIREDREJECTED' and
    concat(t.status,a.status) != 'REJECTEDREQUIRED'

Solution – 2

Try the following:

select T.emp_id, T.proj_id, T.status, D.dup_cnt
from employee_projects T join
(
  select emp_id, proj_id, count(*) as dup_cnt
  from employee_projects
  group by emp_id, proj_id
  having count(*) > 1 and
    count(distinct case when status in ('REQUIRED', 'REJECTED') then status end) < 2
) D
on T.emp_id = D.emp_id and T.proj_id = D.proj_id
order by T.emp_id, T.proj_id

If you want to consider an employee with statuses (‘REQUIRED’, ‘REJECTED’, any other statuses) as duplicate, modify the having clause as the following:

select T.emp_id, T.proj_id, T.status, D.dup_cnt
from employee_projects T join
(
  select emp_id, proj_id, count(*) as dup_cnt
  from employee_projects
  group by emp_id, proj_id
  having count(*) > 1 and
    (count(distinct case when status in ('REQUIRED', 'REJECTED') then status end) < 2 or count(distinct status) > 2)
) D
on T.emp_id = D.emp_id and T.proj_id = D.proj_id
order by T.emp_id, T.proj_id

See a demo.

Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject