Wednesday, October 3, 2012

How It Works: DeleteDuplicates

Delete Duplicates

While Union is commonly used to select all unique elements from a List, including a set of Lists, DeleteDuplicates is commonly used to select unique elements from a single List, which Union can do, too. Union sorts the result, while DeleteDuplicates leaves the result in its original order. Consequently, DeleteDuplicates is a faster function if you do not need the Sort. Both functions include an optional second argument to specify the function used to remove duplicates, which greatly increases their power and versatility. First, here is DeleteDuplicates' basic functionality.



Note that if you do feed DeleteDuplicates a set of Lists, you do need to enclose the Lists in curly brackets.


DeleteDuplicates::argb: DeleteDuplicates called with 5 arguments; between 1 and 3 arguments are expected. >>



You can use DeleteDuplicates' second argument to increase its breadth by specifying how it will detect the duplicates. So in the example above, by default neither Union nor DeleteDuplicates treats Lists with the same elements as equivalent, as would be the case in set theory, while this can be done with their sameness test.

It is relatively straightforward to construct the second argument if you keep in mind that the default is DeleteDuplicates[expr, SameQ] and therefore extensions of the function can take the form DeleteDuplicates[expr, f@#~SameQ~f#2&], where the comparison function f can be as complex as you wish. Here we need Sort because:





Here is a second, neat example from the Doc Center. Extending the power of DeleteDuplicates, this function uses Equal instead of SameQ, possibly since Equal will yield True for Reals and non-Reals. I've modified the example to show that.





list1 = {{0,0,0,1,0},{1,0,1,0,1},{1.,1.,1.,0.,0.},{0,0,0,0,1},{1,1,1,0,1}};



Here is a potential issue that limits the power of DeleteDuplicates. Trace tells us that the second 4 gets removed before it can be compared to the 16.




We're asking DeleteDuplicates to do something beyond deleting duplicates. We should use DeleteCases to do this job.



How It Works

Somewhere Roman Maeder gives a solution to deleting duplicates and implies that his solution is efficient. From memory here is the solution (with my more efficient syntax). We create a simple list of duplicate integers.

dupeList = Table[{i, i}, {i, 10}] // Flatten

{1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10}

We partition them into sets of two with offset 1 (meaning overlap of one on the original List).

dupeListPartitioned2Offset1 = Partition[dupeList, 2, 1]

{{1, 1}, {1, 2}, {2, 2}, {2, 3}, {3, 3}, {3, 4}, {4, 4}, {4, 5}, {5, 5}, {5, 6}, {6, 6}, {6, 7}, {7, 7}, {7, 8}, {8, 8}, {8, 9}, {9, 9}, {9, 10}, {10, 10}}

Now it's a simple matter to Select the sets where Part 1 is not the same as Part 2. Select always takes a predicate, sometimes of the form testQ, but here with an abbreviated operator, UnsameQ. Unequal would work for numerical entries, but UnsameQ will also work for symbols and Strings.

dupeListdupeSetsDeleted = Select[dupeListPartitioned2Offset1, #[[1]] =!= #[[2]] &]

{{1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}}

We're still left with duplicates. So we take the First entry of each set, and Append the Last entry of the Last set, which would have been left out. I remember thinking at this point that I would have spent time trying to not 'hack' this last part--somehow capture that last entry without another operation--but if it's good enough for Maeder, it's good enough for me. The lesson is to do what is expedient and move on to the next task.

Append[First /@ dupeListdupeSetsDeleted, Last@Last@dupeListdupeSetsDeleted]

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}



  1. Hi. I think you can do similar computation with Union using SameTest options no? There is the structure:

    1. Hi Murta,

      Thanks for your note, I corrected the post.