Wednesday, April 16, 2014

String Operations: The Essential Function StringSplit

Split a String and Gather its Words by Their First Letter

Let's process a nice sonnet, Shakespeare's Sonnet 73, which I just read in Ray Kurzweil's excellent book How to Create a Mind: The Secret of Human Thought Revealed, about what he thinks is the heart of cognitive science and AI.
To start, I copy the text from the web and paste it between quotation marks.

In[377]:= sonnet73=
"That time of year thou mayst in me behold,
When yellow leaves, or none, or few, do hang
Upon those boughs which shake against the cold,
Bare ruined choirs, where late the sweet birds sang.
In me thou seest the twilight of such day,
As after sunset fadeth in the west,
Which by and by black night doth take away,
Death's second self, that seals up all in rest.
In me thou seest the glowing of such fire,
That on the ashes of his youth doth lie,
As the death-bed whereon it must expire,
Consumed with that which it was nourished by.
This thou perceiv'st, which makes thy love more strong,
To love that well, which thou must leave ere long.";

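Can we use ToLowerCase without Map? Yes. Here sonnet73 is a single String, so ToLowerCase applies to it directly, and since ToLowerCase is Listable it would also thread over a List of Strings without Map.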

In[380]:= sonnet73LowerCase=ToLowerCase@sonnet73

Out[380]= that time of year thou mayst in me behold,
when yellow leaves, or none, or few, do hang
upon those boughs which shake against the cold,
bare ruined choirs, where late the sweet birds sang.
in me thou seest the twilight of such day,
as after sunset fadeth in the west,
which by and by black night doth take away,
death's second self, that seals up all in rest.
in me thou seest the glowing of such fire,
that on the ashes of his youth doth lie,
as the death-bed whereon it must expire,
consumed with that which it was nourished by.
this thou perceiv'st, which makes thy love more strong,
to love that well, which thou must leave ere long.
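
To check the Listable claim, here is a small sketch on a made-up two-element list (not part of the session above). ToLowerCase threads over the list with no Map needed.

In[ ]:= MemberQ[Attributes[ToLowerCase],Listable]

Out[ ]= True

In[ ]:= ToLowerCase[{"That","Time"}]

Out[ ]= {that,time}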

We need StringSplit, not SplitBy, since we are working with a String rather than a List. StringSplit is a powerful and essential function for importing large files and transforming them into the format you need. Examples in the Documentation Center like this one show its versatility:

In[1]:= StringSplit["a-b:c-d:e-f-g",{":","-"}]

Out[1]= {a,b,c,d,e,f,g}

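Using WordBoundary as the pattern for where to split works, but leaves much more "noise", because every run of spaces and punctuation between words comes back as an element of its own. Here is the output. InputForm is often handy to see what is going on while processing Strings, since it shows the quotation marks and escapes such as \n.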

In[409]:= StringSplit[sonnet73LowerCase,WordBoundary]//InputForm

Out[409]//InputForm=
{"that", " ", "time", " ", "of", " ", "year", " ", "thou", " ", "mayst", " ", "in",
 " ", "me", " ", "behold", ",\n", "when", " ", "yellow", " ", "leaves", ", ", "or",
 " ", "none", ", ", "or", " ", "few", ", ", "do", " ", "hang", "\n", "upon", " ",
 "those", " ", "boughs", " ", "which", " ", "shake", " ", "against", " ", "the", " ",
 "cold", ",\n", "bare", " ", "ruined", " ", "choirs", ", ", "where", " ", "late", " ",
 "the", " ", "sweet", " ", "birds", " ", "sang", ".\n", "in", " ", "me", " ", "thou",
 " ", "seest", " ", "the", " ", "twilight", " ", "of", " ", "such", " ", "day", ",\n",
 "as", " ", "after", " ", "sunset", " ", "fadeth", " ", "in", " ", "the", " ", "west",
 ",\n", "which", " ", "by", " ", "and", " ", "by", " ", "black", " ", "night", " ",
 "doth", " ", "take", " ", "away", ",\n", "death", "'", "s", " ", "second", " ",
 "self", ", ", "that", " ", "seals", " ", "up", " ", "all", " ", "in", " ", "rest",
 ".\n", "in", " ", "me", " ", "thou", " ", "seest", " ", "the", " ", "glowing", " ",
 "of", " ", "such", " ", "fire", ",\n", "that", " ", "on", " ", "the", " ", "ashes",
 " ", "of", " ", "his", " ", "youth", " ", "doth", " ", "lie", ",\n", "as", " ", "the",
 " ", "death", "-", "bed", " ", "whereon", " ", "it", " ", "must", " ", "expire",
 ",\n", "consumed", " ", "with", " ", "that", " ", "which", " ", "it", " ", "was", " ",
 "nourished", " ", "by", ".\n", "this", " ", "thou", " ", "perceiv", "'", "st", ", ",
 "which", " ", "makes", " ", "thy", " ", "love", " ", "more", " ", "strong", ",\n",
 "to", " ", "love", " ", "that", " ", "well", ", ", "which", " ", "thou", " ", "must",
 " ", "leave", " ", "ere", " ", "long", "."}

What a mess! Even this lengthy DeleteCases construct for stripping the punctuation still left an imperfect result (which I won't show in case you're about to eat).

%//DeleteCases[#,""|" "|","|", "|" ,"|"'"|"s"|"st"|",\n"|",\n"|".\n"|"\n"|"-"|"."|" . "|"  . "]&

Compare the result of giving StringSplit the list of delimiters directly.

In[405]:= sonnet73Split=StringSplit[sonnet73LowerCase,{Whitespace,".",","}]//DeleteCases[#,""]&

Out[405]= {that,time,of,year,thou,mayst,in,me,behold,when,yellow,leaves,or,none,or,few,do,hang,upon,those,boughs,which,shake,against,the,cold,bare,ruined,choirs,where,late,the,sweet,birds,sang,in,me,thou,seest,the,twilight,of,such,day,as,after,sunset,fadeth,in,the,west,which,by,and,by,black,night,doth,take,away,death's,second,self,that,seals,up,all,in,rest,in,me,thou,seest,the,glowing,of,such,fire,that,on,the,ashes,of,his,youth,doth,lie,as,the,death-bed,whereon,it,must,expire,consumed,with,that,which,it,was,nourished,by,this,thou,perceiv'st,which,makes,thy,love,more,strong,to,love,that,well,which,thou,must,leave,ere,long}
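
The DeleteCases step is needed because a comma followed by a newline counts as two adjacent delimiters with an empty String between them. Here is a minimal sketch on a made-up two-word String (not taken from the session above). The second input is one possible alternative, not the approach used in this post: wrapping the delimiters in Alternatives and Repeated (..) lets a run of delimiters match as a single separator, so no empty Strings appear.

In[ ]:= StringSplit["behold,\nwhen",{Whitespace,".",","}]//InputForm

Out[ ]//InputForm= {"behold", "", "when"}

In[ ]:= StringSplit["behold,\nwhen",(Whitespace|"."|",")..]//InputForm

Out[ ]//InputForm= {"behold", "when"}

Under that assumption, StringSplit[sonnet73LowerCase,(Whitespace|"."|",")..] should give the same list of words without the trailing DeleteCases.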

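Here GatherBy groups the words into sublists by the first character of each word. The sublists appear in the order in which each initial letter first occurs in the sonnet.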

In[411]:= sonnet73SplitGathered=GatherBy[sonnet73Split,First@Characters@#&]

Out[411]= {{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{of,or,or,of,of,on,of},{year,yellow,youth},{mayst,me,me,me,must,makes,more,must},{in,in,in,in,in,it,it},{behold,boughs,bare,birds,by,by,black,by},{when,which,where,west,which,whereon,with,which,was,which,well,which},{leaves,late,lie,love,love,leave,long},{none,night,nourished},{few,fadeth,fire},{do,day,doth,death's,doth,death-bed},{hang,his},{upon,up},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{against,as,after,and,away,all,ashes,as},{cold,choirs,consumed},{ruined,rest},{glowing},{expire,ere},{perceiv'st}}
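
As a small sketch of the same grouping on a made-up four-word list (not from the session above): StringTake[#,1]& is an equivalent way to get the first character, and mapping over the gathered sublists gives a quick count per initial letter.

In[ ]:= GatherBy[{"that","of","time","or"},StringTake[#,1]&]

Out[ ]= {{that,time},{of,or}}

In[ ]:= {StringTake[First@#,1],Length@#}&/@%

Out[ ]= {{t,2},{o,2}}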

To see the lists sorted by their initial letter, we can use SortBy, but my first attempt didn't work: the sorting function now receives a whole sublist of words, not a single word.

In[412]:= SortBy[%,First@Characters@#&]
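
A quick sketch of what goes wrong, using a made-up sublist: Characters is Listable, so it threads over every word, and First then returns the whole character list of the first word rather than its first letter.

In[ ]:= First@Characters@{"of","or"}

Out[ ]= {o,f}

SortBy therefore compares Lists of characters, and the canonical order used by Sort usually puts shorter expressions first, so the sublists end up ordered by the length of their first word before any letters are compared.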

We need to add another First so that the sorting function is applied to the first word in each sublist.

In[413]:= SortBy[%,First@Characters@First@#&]

Out[413]= {{against,as,after,and,away,all,ashes,as},{behold,boughs,bare,birds,by,by,black,by},{cold,choirs,consumed},{do,day,doth,death's,doth,death-bed},{expire,ere},{few,fadeth,fire},{glowing},{hang,his},{in,in,in,in,in,it,it},{leaves,late,lie,love,love,leave,long},{mayst,me,me,me,must,makes,more,must},{none,night,nourished},{of,or,or,of,of,on,of},{perceiv'st},{ruined,rest},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{upon,up},{when,which,where,west,which,whereon,with,which,was,which,well,which},{year,yellow,youth}}
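
Two shorter variants are sketched below on a made-up pair of sublists (not the session's data). Since every sublist starts with a different letter, sorting by the first word itself gives the same order as sorting by its first letter. The second input assumes version 10 or later, where GroupBy and KeySort are available; it returns an Association keyed by the initial letter instead of a plain list of lists.

In[ ]:= SortBy[{{"that","time"},{"of","or"}},First]

Out[ ]= {{of,or},{that,time}}

In[ ]:= KeySort@GroupBy[{"that","of","time","or"},StringTake[#,1]&]

Out[ ]= <|o->{of,or},t->{that,time}|>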
