Showing posts with label deleteduplicates. Show all posts
Showing posts with label deleteduplicates. Show all posts

Wednesday, September 23, 2015

How It Works: Union

Here is a recursive version of the set-theoretic function, Union, which returns the common elements of two or more sets (represented as Lists in Mathematica).

There are two base cases: 1) The union of any set with the empty set is the set itself, and 2) The union of any set with a single element set is the single element set appended to the other set. The recursive case is: 3) The union of any two Lists is the union of the first List with the First element of the second List (i.e. the single element base case), which, calling itself, recurses down the Rest of the second List until it hits the empty set base case (an empty List), at which point it unwinds back out to produce the overall union.

Union produces a set of unique elements  — no duplicates — so we need to de-dupe the List. An easy way to do that is Sort it and use a replacement Rule (we could also use the built-in DeleteDuplicates or How It Works: DeleteDuplicates).

I show two ways to control the Precedence of the abbreviate operators for ReplaceRepeated vs Postfix. Since with Precedence of 110 ReplaceRepeated is a little 'stickier' than Postfix at 70 Sort will stick to the ReplaceRepeated function unless we do something. So we can either enclose the compound function in parentheses before applying ReplaceRepeated as in the 2nd base case or we can use Postfix with Slot (#) and pure Function (&) as typical in my extended Postfix syntax as in the recursive case.

Clear@myUnion;

myUnion[list1_List,{}]:=list1; (* 1st base case *)

myUnion[list1_List,element_]:=(If[MemberQ[list1,element],list1,list1/.{x___}->{x,element}] //Sort)//.{a___,b_,b_,c___}->{a,b,c}(* 2nd base case *)

myUnion[list1_List,list2_List]:=Module[{union},union=myUnion[myUnion[list1,First@list2],Rest@list2]//Sort//#//.{a___,b_,b_,c___}->{a,b,c}&
] (* recursive case *)


myUnion[{1,2,3,4},{}]

{1,2,3,4}

myUnion[{1,2,3,4},5]

{1,2,3,4,5}

myUnion[{1,2,3,4},2]

{1,2,3,4}

This shows the need to add a delete duplicates function:

myUnion[{1,2,2,3},2]

{1,2,2,3}

A simple replacement Rule does the job, but use ReplaceRepeated or only the first instance of a duplicate will be replaced:

{1,2,2,3,4,4}/.{a___,b_,b_,c___}->{a,b,c}

{1,2,3,4,4}

{4,4,1,3,3,2,2}//.{a___,b_,b_,c___}->{a,b,c}

This works without the delete duplicates function.

myUnion[{1,2,3,4},{3,4,5,6}]

{1,2,3,4,5,6}

But this would fail without it. First without delete duplicates.

myUnion[{1,2,2,3,4},{3,4,5}]

{1,2,2,3,4,5}

And now with delete duplicates.

myUnion[{1,2,2,3,4},{3,4,5}]

{1,2,3,4,5}

And the base case.

myUnion[{3,2,1,2,2,3},3]

{1,2,3}

And the recursive case.

myUnion[{1,2},{2,3,4}]

{1,2,3,4}

Wednesday, October 3, 2012

How It Works: DeleteDuplicates


Delete Duplicates

While Union is commonly used to select all unique elements from a List, including a set of Lists, DeleteDuplicates is commonly used to select unique elements from a single List, which Union can do, too. Union sorts the result, while DeleteDuplicates leaves the result in its original order. Consequently, DeleteDuplicates is a faster function if you do not need the Sort. Both functions include an optional second argument to specify the function used to remove duplicates, which greatly increases their power and versatility. First, here is DeleteDuplicates' basic functionality.

DeleteDuplicates@{c,a,b,d,a,c,a,e,e,a,a,e}

{c,a,b,d,e}

Note that if you do feed DeleteDuplicates a set of Lists, you do need to enclose the Lists in curly brackets.

DeleteDuplicates[{c,a,b},{d,a,c},{a,e},{e,a},{a,e}]

DeleteDuplicates::argb: DeleteDuplicates called with 5 arguments; between 1 and 3 arguments are expected. >>

DeleteDuplicates[{{c,a,b},{d,a,c},{a,e},{e,a},{a,e}}]

{{c,a,b},{d,a,c},{a,e},{e,a}}


You can use DeleteDuplicates' second argument to increase its breadth by specifying how it will detect the duplicates. So in the example above, by default neither Union nor DeleteDuplicates treats Lists with the same elements as equivalent, as would be the case in set theory, while this can be done with their sameness test.


It is relatively straightforward to construct the second argument if you keep in mind that the default is DeleteDuplicates[expr, SameQ] and therefore extensions of the function can take the form DeleteDuplicates[expr, f@#~SameQ~f#2&], where the comparison function f can be as complex as you wish. Here we need Sort because:

{a,b}~SameQ~{b,a}

False

DeleteDuplicates[{{c,a,b},{d,a,c},{a,e},{e,a},{a,e}},Sort@#~SameQ~Sort@#2&]

{{c,a,b},{d,a,c},{a,e}}

Here is a second, neat example from the Doc Center. Extending the power of DeleteDuplicates, this function uses Equal instead of SameQ, possibly since Equal will yield True for Reals and non-Reals. I've modified the example to show that.

5==5.

True

5===5.

False

list1 = {{0,0,0,1,0},{1,0,1,0,1},{1.,1.,1.,0.,0.},{0,0,0,0,1},{1,1,1,0,1}};

DeleteDuplicates[list1,Total@#==Total@#2&]

{{0,0,0,1,0},{1,0,1,0,1},{1,1,1,0,1}}

Here is a potential issue that limits the power of DeleteDuplicates. Trace tells us that the second 4 gets removed before it can be compared to the 16.

squaresList=Table[{x,x^2},{x,2,4}]//Flatten

{2,4,3,9,4,16}

DeleteDuplicates[squaresList,#2==#^2&]//Trace
{{squaresList,{2,4,3,9,4,16}},DeleteDuplicates[{2,4,3,9,4,16},#2==#1^2&],{(#2==#1^2&)[2,4],4==2^2,{2^2,4},4==4,True},{(#2==#1^2&)[2,3],3==2^2,{2^2,4},3==4,False},{(#2==#1^2&)[2,9],9==2^2,{2^2,4},9==4,False},{(#2==#1^2&)[2,4],4==2^2,{2^2,4},4==4,True},{(#2==#1^2&)[2,16],16==2^2,{2^2,4},16==4,False},{(#2==#1^2&)[3,9],9==3^2,{3^2,9},9==9,True},{(#2==#1^2&)[3,4],4==3^2,{3^2,9},4==9,False},{(#2==#1^2&)[3,16],16==3^2,{3^2,9},16==9,False},{2,3,16}}

We're asking DeleteDuplicates to do something beyond deleting duplicates. We should use DeleteCases to do this job.

DeleteCases[squaresList,x_/;(Sqrt@x//IntegerQ)]

{2,3}

How It Works

Somewhere Roman Maeder gives a solution to deleting duplicates and implies that his solution is efficient. From memory here is the solution (with my more efficient syntax). We create a simple list of duplicate integers.

dupeList = Table[{i, i}, {i, 10}] // Flatten

{1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10}

We partition them into sets of two with offset 1 (meaning overlap of one on the original List).

dupeListPartitioned2Offset1 = Partition[dupeList, 2, 1]

{{1, 1}, {1, 2}, {2, 2}, {2, 3}, {3, 3}, {3, 4}, {4, 4}, {4, 5}, {5, 5}, {5, 6}, {6, 6}, {6, 7}, {7, 7}, {7, 8}, {8, 8}, {8, 9}, {9, 9}, {9, 10}, {10, 10}}

Now it's a simple matter to Select the sets where Part 1 is not the same as Part 2. Select always takes a predicate, sometimes of the form testQ, but here with an abbreviated operator, UnsameQ. Unequal would work for numerical entries, but UnsameQ will also work for symbols and Strings.

dupeListdupeSetsDeleted = Select[dupeListPartitioned2Offset1, #[[1]] =!= #[[2]] &]

{{1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}}

We're still left with duplicates. So we take the First entry of each set, and Append the Last entry of the Last set, which would have been left out. I remember thinking at this point that I would have spent time trying to not 'hack' this last part--somehow capture that last entry without another operation--but if it's good enough for Maeder, it's good enough for me. The lesson is to do what is expedient and move on to the next task.

Append[First /@ dupeListdupeSetsDeleted, Last@Last@dupeListdupeSetsDeleted]

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

DeleteDuplicates