Ad

SQL: How To Remove Duplicate Rows?

- 1 answer

I have the following data that has certain rows which are implicitly duplicated, but don't appear as such. I want to identify them based on certain conditions, compare them and then remove them.

| ID   | COUNTRY | NAME  | UNITS |
|------|---------|-------|-------|
| 1000 | Canada  | Andy  | 1     |
| 1000 | Canada  | Andy  | 2     |
| 1000 | USA     | Andy  | 3     |
| 1000 | China   | Bob   | 3     |
| 2000 | Canada  | Chase | 2     |
| 2000 | Canada  | David | 3     |
| 2000 | USA     | David | 4     |
| 3000 | Canada  | John  | 1     |
| 3000 | USA     | John  | 1     |
| 3000 | China   | Dave  | 3     |

Condition 1: For ID = 1000, Andy is shown to be part of Canada and USA. In this case, I want to compare the sum of units for the two countries and keep the first country (alphabetically) if the sum of the Units is same. If the sum of Units for USA is greater, then keep the second Country (e.g., USA)

Output:

| ID   | COUNTRY | NAME  | UNITS |
|------|---------|-------|-------|
| 1000 | Canada  | Andy  | 1     |
| 1000 | Canada  | Andy  | 2     |
| 1000 | China   | Bob   | 3     |

Condition 2: For ID = 2000 and name David, the Country with the most Units is USA. So, I want to keep the second row in that case.

Output:

| ID   | COUNTRY | NAME  | UNITS |
|------|---------|-------|-------|
| 2000 | Canada  | Chase | 2     |
| 2000 | USA     | David | 4     |

Condition 3: For ID = 3000, the units in both case is the same, so keep the first country alphabetically.

Output:

| ID   | COUNTRY | NAME  | UNITS |
|------|---------|-------|-------|
| 3000 | Canada  | John  | 1     |
| 3000 | China   | Dave  | 3     |

Final Output

| ID   | COUNTRY | NAME  | UNITS |
|------|---------|-------|-------|
| 1000 | Canada  | Andy  | 1     |
| 1000 | Canada  | Andy  | 2     |
| 1000 | China   | Bob   | 3     |
| 2000 | Canada  | Chase | 2     |
| 2000 | USA     | David | 4     |
| 3000 | Canada  | John  | 1     |
| 3000 | China   | Dave  | 3     |

DDL Statement

CREATE TABLE #tmptbl (
   id INT,
   type VARCHAR(20),
   name VARCHAR(20),
   qty INT
)

INSERT INTO #tmptbl VALUES 
(1000, 'Canada', 'Andy', 1),
(1000, 'Canada', 'Andy', 2),
(1000, 'USA', 'Andy', 3),
(1000, 'China', 'Bob', 3),
(2000, 'Canada', 'Chase', 2),
(2000, 'Canada', 'David', 3),
(2000, 'USA', 'David', 4),
(3000, 'Canada', 'John', 1),
(3000, 'USA', 'John', 1),
(3000, 'China', 'Dave', 3)
Ad

Answer

Seems like this is a spin on the classic Get top 1 row of each group, but where you need to first get the windowed SUM and instead use DENSE_RANK (or RANK):

WITH Totals AS
    (SELECT tt.ID,
            tt.Country,
            tt.Name,
            tt.Units,
            SUM(tt.Units) OVER (PARTITION BY tt.ID, tt.Name, tt.Country) AS TotalUnits
     FROM #tmptbl tt),
Ranks AS
    (SELECT T.ID,
            T.Country,
            T.Name,
            T.Units,
            T.TotalUnits,
            DENSE_RANK() OVER (PARTITION BY ID, Name ORDER BY T.TotalUnits DESC, T.Country ASC) AS DenseRank
     FROM Totals T)
SELECT R.ID,
       R.Country,
       R.Name,
       R.Units
FROM Ranks R
WHERE R.DenseRank = 1
ORDER BY ID,
         Name,
         Units;

db<>fiddle

Ad
source: stackoverflow.com
Ad