Showing posts with label SortBy. Show all posts
Showing posts with label SortBy. Show all posts

Wednesday, April 16, 2014

String Operations: The Essential Function StringSplit

Split a String and Gather its Words by Their First Letter

Let's process a nice sonnet I just read in Ray Kurzweil's excellent book on what he thinks is the heart of cognitive science and AI, How to Create a Mind: The Secret of Human Thought Revealed.
To start I copy and paste the text from the web between quotation marks.

In[377]:= sonnet73=
"That time of year thou mayst in me behold,
When yellow leaves, or none, or few, do hang
Upon those boughs which shake against the cold,
Bare ruined choirs, where late the sweet birds sang.
In me thou seest the twilight of such day,
As after sunset fadeth in the west,
Which by and by black night doth take away,
Death's second self, that seals up all in rest.
In me thou seest the glowing of such fire,
That on the ashes of his youth doth lie,
As the death-bed whereon it must expire,
Consumed with that which it was nourished by.
This thou perceiv'st, which makes thy love more strong,
To love that well, which thou must leave ere long.";

Can we use ToLowerCase without Map? Yes, since it's Listable.

In[380]:= sonnet73LowerCase=ToLowerCase@%377

Out[380]= that time of year thou mayst in me behold,
when yellow leaves, or none, or few, do hang
upon those boughs which shake against the cold,
bare ruined choirs, where late the sweet birds sang.
in me thou seest the twilight of such day,
as after sunset fadeth in the west,
which by and by black night doth take away,
death's second self, that seals up all in rest.
in me thou seest the glowing of such fire,
that on the ashes of his youth doth lie,
as the death-bed whereon it must expire,
consumed with that which it was nourished by.
this thou perceiv'st, which makes thy love more strong,
to love that well, which thou must leave ere long.

We need StringSplit, not SplitBy, since we are working with a String. StringSplit is a powerful and essential function when importing and transforming large files to the format you need. Examples in the Doc Center like this one show its versatility:

In[1]:= StringSplit["a-b:c-d:e-f-g",{":","-"}]

Out[1]= {a,b,c,d,e,f,g}

Using WordBoundary as the pattern test for where to split works, but leaves much more "noise". Here is the output. Using InputForm is often handy to see what is going in while processing Strings.

In[409]:= StringSplit[sonnet73LowerCase,WordBoundary]//InputForm

Out[409]//InputForm=
{"that", " ", "time", " ", "of", " ", "year", " ", "thou", " ", "mayst", " ", "in",
 " ", "me", " ", "behold", ",\n", "when", " ", "yellow", " ", "leaves", ", ", "or",
 " ", "none", ", ", "or", " ", "few", ", ", "do", " ", "hang", "\n", "upon", " ",
 "those", " ", "boughs", " ", "which", " ", "shake", " ", "against", " ", "the", " ",
 "cold", ",\n", "bare", " ", "ruined", " ", "choirs", ", ", "where", " ", "late", " ",
 "the", " ", "sweet", " ", "birds", " ", "sang", ".\n", "in", " ", "me", " ", "thou",
 " ", "seest", " ", "the", " ", "twilight", " ", "of", " ", "such", " ", "day", ",\n",
 "as", " ", "after", " ", "sunset", " ", "fadeth", " ", "in", " ", "the", " ", "west",
 ",\n", "which", " ", "by", " ", "and", " ", "by", " ", "black", " ", "night", " ",
 "doth", " ", "take", " ", "away", ",\n", "death", "'", "s", " ", "second", " ",
 "self", ", ", "that", " ", "seals", " ", "up", " ", "all", " ", "in", " ", "rest",
 ".\n", "in", " ", "me", " ", "thou", " ", "seest", " ", "the", " ", "glowing", " ",
 "of", " ", "such", " ", "fire", ",\n", "that", " ", "on", " ", "the", " ", "ashes",
 " ", "of", " ", "his", " ", "youth", " ", "doth", " ", "lie", ",\n", "as", " ", "the",
 " ", "death", "-", "bed", " ", "whereon", " ", "it", " ", "must", " ", "expire",
 ",\n", "consumed", " ", "with", " ", "that", " ", "which", " ", "it", " ", "was", " ",
 "nourished", " ", "by", ".\n", "this", " ", "thou", " ", "perceiv", "'", "st", ", ",
 "which", " ", "makes", " ", "thy", " ", "love", " ", "more", " ", "strong", ",\n",
 "to", " ", "love", " ", "that", " ", "well", ", ", "which", " ", "thou", " ", "must",
 " ", "leave", " ", "ere", " ", "long", "."}

What a mess! And using this lengthy construct to DeleteCases of punctuation still left an imperfect result (which I won't show in case you're about to eat).

%//DeleteCases[#,""|" "|","|", "|" ,"|"'"|"s"|"st"|",\n"|",\n"|".\n"|"\n"|"-"|"."|" . "|"  . "]&

Compare the result from proper use of StringSplit.

In[405]:= sonnet73Split=StringSplit[sonnet73LowerCase,{Whitespace,".",","}]//DeleteCases[#,""]&

Out[405]= {that,time,of,year,thou,mayst,in,me,behold,when,yellow,leaves,or,none,or,few,do,hang,upon,those,boughs,which,shake,against,the,cold,bare,ruined,choirs,where,late,the,sweet,birds,sang,in,me,thou,seest,the,twilight,of,such,day,as,after,sunset,fadeth,in,the,west,which,by,and,by,black,night,doth,take,away,death's,second,self,that,seals,up,all,in,rest,in,me,thou,seest,the,glowing,of,such,fire,that,on,the,ashes,of,his,youth,doth,lie,as,the,death-bed,whereon,it,must,expire,consumed,with,that,which,it,was,nourished,by,this,thou,perceiv'st,which,makes,thy,love,more,strong,to,love,that,well,which,thou,must,leave,ere,long}

Here GatherBy groups sub-lists by the first character of each word.

In[411]:= sonnet73SplitGathered=GatherBy[sonnet73Split,First@Characters@#&]

Out[411]= {{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{of,or,or,of,of,on,of},{year,yellow,youth},{mayst,me,me,me,must,makes,more,must},{in,in,in,in,in,it,it},{behold,boughs,bare,birds,by,by,black,by},{when,which,where,west,which,whereon,with,which,was,which,well,which},{leaves,late,lie,love,love,leave,long},{none,night,nourished},{few,fadeth,fire},{do,day,doth,death's,doth,death-bed},{hang,his},{upon,up},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{against,as,after,and,away,all,ashes,as},{cold,choirs,consumed},{ruined,rest},{glowing},{expire,ere},{perceiv'st}}

To see the Lists sorted by their initial letter, we should use SortBy, but this didn't work.

In[412]:= SortBy[%,First@Characters@#&]

We need to add another First to apply the sorting function to the First word in each sublist.

In[413]:= SortBy[%,First@Characters@First@#&]

Out[413]= {{against,as,after,and,away,all,ashes,as},{behold,boughs,bare,birds,by,by,black,by},{cold,choirs,consumed},{do,day,doth,death's,doth,death-bed},{expire,ere},{few,fadeth,fire},{glowing},{hang,his},{in,in,in,in,in,it,it},{leaves,late,lie,love,love,leave,long},{mayst,me,me,me,must,makes,more,must},{none,night,nourished},{of,or,or,of,of,on,of},{perceiv'st},{ruined,rest},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{upon,up},{when,which,where,west,which,whereon,with,which,was,which,well,which},{year,yellow,youth}}

Sunday, February 12, 2012

Sort, SortBy, Ordering, OrderedQ


This small family of functions is essential to programming in Mathematica. Indeed, sorting has been central to computer programming and algorithm analysis since programming's inception in the late 1950s.

See posts for:

Sort, SortBy, Ordering, OrderedQ

OrderingQ


OrderingQ

OrderingQ is the predicate of the family, which by default returns True of False if a List (or List of arguments of any function) is in "canonical" order, defined as lowest numerical or symbolic value to highest.

In[269]:= numericalList = Range@12 // RandomSample[#, 12] &

Out[269]= {11, 2, 5, 9, 1, 3, 8, 7, 4, 12, 10, 6}

In[270]:= OrderedQ@numericalList

Out[270]= False

Sort by default sorts from low to high.

In[271]:= Sort@numericalList

Out[271]= {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

In[272]:= OrderedQ@%

Out[272]= True

In[273]:= alphaList = CharacterRange["a", "z"] // RandomSample[#, 12] &

Out[273]= {"l", "d", "f", "x", "o", "z", "i", "y", "w", "s", "m", "g"}

In[274]:= OrderedQ@alphaList

Out[274]= False

Again, Sort by default sorts from low to high, technically using the ASCII code for the characters.

In[275]:= Sort@alphaList

Out[275]= {"d", "f", "g", "i", "l", "m", "o", "s", "w", "x", "y", "z"}

In[276]:= OrderedQ@%

Out[276]= True

In[277]:= ToCharacterCode@alphaList

Out[277]= {{108}, {100}, {102}, {120}, {111}, {122}, {105}, {121}, {119}, {115}, {109}, \
{103}}

In[278]:= Sort@%

Out[278]= {{100}, {102}, {103}, {105}, {108}, {109}, {111}, {115}, {119}, {120}, {121}, \
{122}}

We can give OrderedQ our own ordering function, as complex as we like, to decide if the List is ordered. In this example we simply override OrderedQ's default, which is normally in accord with Sort's of ordering from small to large, and ask OrderedQ if the List is ordered from large to small.

In[281]:= OrderedQ[Sort@numericalList, Greater]

Out[281]= False

[b]

Ordering


Ordering reads out the order of the elements of a List if they were sorted. So in the Table below, the first row is the list itself, the second row shows the list sorted, and the third row shows the Ordering of the list: the eighth element (1) would be first, the third element (2) would be second, the fifth element (3) would be third, the 4th element would be last, etc. Incidentally I use SeedRandom here so that if you evaluate the code you will get the same random sample as I got.

In[208]:= SeedRandom@135; aList = RandomSample[Range@10, 10]

Out[208]= {7, 5, 2, 10, 3, 6, 4, 1, 9, 8}

In[209]:= {aList, Sort@aList, Ordering@aList} //
 TableForm[#, TableHeadings -> {{"aList", "Sort", "Ordering"}, {}}] &





What are the Positions of the largest two elements in aList? Use this syntax:

In[210]:= Ordering[aList, {-2, -1}]

Out[210]= {9, 4}

SortBy


Let's say we want to Sort a List as if a function were applied to the List but return the original elements of the List. That is what distinguishes SortBy from Sort. For example, say you want to sort 10 rolls of a die 3 times by the number of 6s that come up.

In[265]:= rollADie = Table[RandomChoice[Range@6, 3], {10}]

Out[265]= {{3, 5, 4}, {2, 1, 5}, {6, 3, 4}, {6, 6, 6}, {5, 2, 2}, {1, 3, 1}, {5, 4,
  4}, {2, 4, 1}, {2, 2, 3}, {4, 1, 4}}

In[266]:= SortBy[rollADie, Count[#, 6] &] // Reverse

Out[266]= {{6, 6, 6}, {6, 3, 4}, {5, 4, 4}, {5, 2, 2}, {4, 1, 4}, {3, 5, 4}, {2, 4,
  1}, {2, 2, 3}, {2, 1, 5}, {1, 3, 1}}

A way to think about SortBy is to consider that it can only "see" the result of applying the function to the list elements and Sorts by that result. For instance, we can write more compact expressions for sorting by one List position than with Sort. Here SortBy only sees the 2nd element of each List and Sorts by it (in ascending order since that is Sort's default).

In[267]:= SortBy[rollADie, #[[2]] &]

Out[267]= {{2, 1, 5}, {4, 1, 4}, {2, 2, 3}, {5, 2, 2}, {1, 3, 1}, {6, 3, 4}, {2, 4,
  1}, {5, 4, 4}, {3, 5, 4}, {6, 6, 6}}

SortBy can take more than one sorting function and applies them in order to break ties. In this sense SortBy can be used like successive sorts in a database or program like Microsoft Excel used as a database.

In[268]:= SortBy[rollADie, {#[[2]] &, Last}]

Out[268]= {{4, 1, 4}, {2, 1, 5}, {5, 2, 2}, {2, 2, 3}, {1, 3, 1}, {6, 3, 4}, {2, 4,
  1}, {5, 4, 4}, {3, 5, 4}, {6, 6, 6}}

 SortBy[e,f] is equivalent to Sort[{f[#],#}&/@e][[All,-1]].

Last, the Doc Center unveils the function underlying SortBy: SortBy[e,f]is equivalent to Sort[{f[#],#}&/@e][[All,-1]]. Translated into English, this says "make a List of the function f applied to each argument and that argument, then Sort by the first element of the List (i.e. the function applied to the argument), then only return the 2nd element of each List (i.e., the original List elements, but now re-ordered by the Sort.

Sort


Sort is simple in concept: It sorts a List according to a function that dictates the ordering. The default ordering function is smaller to larger (the function Less).

In[216]:= SeedRandom@345; aList1 = RandomInteger[100, 10]

Out[216]= {84, 52, 80, 41, 53, 24, 22, 0, 47, 42}

In[218]:= {Sort@aList1, Sort[aList1, Less]} // TableForm





We can specify the ordering function that Sort uses.

In[219]:= Sort[aList1, Greater]

Out[219]= {84, 80, 53, 52, 47, 42, 41, 24, 22, 0}

The ordering function always takes two arguments and compares them. Here we Sort by the second element of each sublist.

In[222]:= Sort[{{a, 2, 85}, {c, 1, 0.61}, {d, 3, 5^5}}, #[[2]] < #2[[2]] &]

Out[222]= {{c, 1, 0.61}, {a, 2, 85}, {d, 3, 3125}}

Whatever the algorithmic magic used by Sort to do its job efficiently, note that Slot1 picks out the first element to compare to the second element, which is picked out by Slot2. Here I show a Slot3 as well.

{#1[[2]], #2[[2]], #3[[2]]} & @@ {{a, 2}, {c, 1}, {d, 3}}

{2, 1, 3}

So under the hood Sort must start by doing something like applying the ordering function as a predicate, but then in architecture using the latest, greatest sorting algorithm from the computational complexity literature.

#1[[2]] < #2[[2]] &[{a, 2}, {c, 1}]

False

As holds throughout Mathematica, Sort works on symbolic arguments. If the symbols had been assigned values, of course those values would be used.

In[228]:= Clear[d, b, c, a]; Sort@{d, b, c, a}

Out[228]= {a, b, c, d}

In[229]:= a = 7; b = 2;

Sort@{d, b, c, a}

Out[230]= {2, 7, c, d}