Friday, May 29, 2015

String Processing: StringCases and StringMatchQ

I recently got inconsistent results from StringCases and StringMatchQ (and used Pick instead of Cases). When using Cases or Select, you typically debug issues by mapping MatchQ onto the List that you're sifting for patterns. Similarly, StringMatchQ is the tool used to debug StringCases. However, while I don't believe there is any difference between the patterns accepted by MatchQ and Cases, there is a significant difference between the patterns accepted by StringMatchQ and StringCases.

StringMatchQ will accept what are called 'abbreviated String metacharacters'. These are the asterisk wildcard "*" for zero or more characters, the "@" wildcard for one or more characters that are not upper-case letters (so "@" matches any non-empty String containing no upper-case letters), and double backslashes for literal characters (e.g. "\\*" to match an asterisk).
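A quick way to get a feel for the metacharacters is to try them on a few short Strings. This is a minimal sketch; the test Strings are made up for illustration and are not taken from the data file.

```wolfram
(* "*" matches zero or more characters of any kind *)
StringMatchQ["VCathode=2.75", "VCathode*"]   (* True *)

(* "@" matches one or more characters, none of which may be an upper-case letter *)
StringMatchQ["abc=1.5", "@"]                 (* True *)
StringMatchQ["aBc", "@"]                     (* False *)

(* "\\*" matches a literal asterisk *)
StringMatchQ["a*b", "a\\*b"]                 (* True *)
StringMatchQ["axb", "a\\*b"]                 (* False *)
```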

But StringCases will not; it treats a plain String as literal text, asterisk included, and only accepts StringExpression patterns or RegularExpressions as patterns. Here is an example. I want to extract all the "VCathode=..." values from this datafile header:

data = {"%", "cln1x", "V", "(V)", "@", "VCathode=2.75392", "V", "(V)", "@", "VCathode=2.29025", "V", "(V)", "@", "VCathode=1.86092", "V", "(V)", "@", "VCathode=1.59846", "V", "(V)", "@", "VCathode=1.40122", "V", "(V)", "@", "VCathode=1.28137", "V", "(V)", "@", "VCathode=1.15876", "V", "(V)", "@", "VCathode=1.06451", "V", "(V)", "@", "VCathode=1.0037", "V", "(V)", "@", "VCathode=0.950678", "V", "(V)", "@", "VCathode=0.881652"};

StringCases with an asterisk fails:

In[190]:= StringCases[data,"VCathode*"]
Out[190]= {{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}}

StringMatchQ with an asterisk works:

In[191]:= StringMatchQ[data,"VCathode*"]
Out[191]= {False,False,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True}

Here is the solution for StringCases using a StringExpression. I use Flatten to remove the empty Lists ("{}") returned for the Strings that don't match.

In[192]:= StringCases[#,"VCathode"~~___]&/@data//Flatten
Out[192]= {VCathode=2.75392,VCathode=2.29025,VCathode=1.86092,VCathode=1.59846,VCathode=1.40122,VCathode=1.28137,VCathode=1.15876,VCathode=1.06451,VCathode=1.0037,VCathode=0.950678,VCathode=0.881652}

And here is the solution for StringCases using a RegularExpression. Here I use DeleteCases instead of Flatten to remove the empty Lists, which leaves each returned value inside its own List brackets.

In[195]:= StringCases[#,RegularExpression["VCathode.*"]]&/@data//DeleteCases[#,{}]&
Out[195]= {{VCathode=2.75392},{VCathode=2.29025},{VCathode=1.86092},{VCathode=1.59846},{VCathode=1.40122},{VCathode=1.28137},{VCathode=1.15876},{VCathode=1.06451},{VCathode=1.0037},{VCathode=0.950678},{VCathode=0.881652}}
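Two variations that may be handy, offered as a sketch rather than as the recommended method: StringCases threads over a List of Strings, so the explicit Map is optional, and a named pattern with NumberString can pull out just the voltage as a number. The short List header below is a made-up excerpt of data.

```wolfram
header = {"%", "cln1x", "V", "(V)", "@", "VCathode=2.75392", "V", "(V)", "@", "VCathode=2.29025"};

(* StringCases accepts the List directly; Flatten drops the empty Lists *)
Flatten@StringCases[header, "VCathode" ~~ ___]
(* {"VCathode=2.75392", "VCathode=2.29025"} *)

(* Name the numeric part and convert it from a String to a number *)
Flatten@StringCases[header, "VCathode=" ~~ (v : NumberString) :> ToExpression[v]]
(* {2.75392, 2.29025} *)
```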

Thank you to Xin Xiao from Wolfram Technical Support.



Wednesday, May 27, 2015

An Example Using Pick to Select Data by Pattern Instead of Cases, Select, or StringCases

Prefatory remark: To learn Mathematica efficiently you should focus on the most-frequently-used functions. I will provide a list of those soon. For most users Pick[] will not be a frequently-used function. You can learn "esoteric" Mathematica functions just for fun or because in your work certain esoteric functions are handy.

Here, nonetheless, is an example spawned from my not being able to get StringCases to work (solved and explained here). Strangely, StringMatchQ did work, even though it is the standard way to check whether StringCases will work.

This is a file header from a parameter sweep I did in the finite element modeling program COMSOL. The goal is to select the different parameters used in the sweep, which are voltages such as "VCathode=2.29025"; I don't need all the other noise in there.

The way the command works is: 1) StringMatchQ is mapped onto data, which is a List of Strings, and returns True for each String that matches the pattern with the "*" wildcard, or False otherwise; 2) Pick then returns the elements of data whose corresponding entry in that List matches the selection pattern. Since Pick's default selection pattern is True, I didn't need to use its 3rd argument, which specifies a different selection pattern (such as "1" vs. "0").

data = {"%","cln1x","V","(V)","@","VCathode=2.75392","V","(V)","@","VCathode=2.29025","V","(V)","@","VCathode=1.86092","V","(V)","@","VCathode=1.59846","V","(V)","@","VCathode=1.40122","V","(V)","@","VCathode=1.28137","V","(V)","@","VCathode=1.15876","V","(V)","@","VCathode=1.06451","V","(V)","@","VCathode=1.0037","V","(V)","@","VCathode=0.950678","V","(V)","@","VCathode=0.881652","V","(V)","@","VCathode=0.836608","V","(V)","@","VCathode=0.795234","V","(V)","@","VCathode=0.753297","V","(V)","@","VCathode=0.73737","V","(V)","@","VCathode=0.725229","V","(V)","@","VCathode=0.703015","V","(V)","@","VCathode=0.68059","V","(V)","@","VCathode=0.646921","V","(V)","@","VCathode=0.62848"};

Pick[data,StringMatchQ[#,"VCathode*"]&/@data]

{VCathode=2.75392,VCathode=2.29025,VCathode=1.86092,VCathode=1.59846,VCathode=1.40122,VCathode=1.28137,VCathode=1.15876,VCathode=1.06451,VCathode=1.0037,VCathode=0.950678,VCathode=0.881652,VCathode=0.836608,VCathode=0.795234,VCathode=0.753297,VCathode=0.73737,VCathode=0.725229,VCathode=0.703015,VCathode=0.68059,VCathode=0.646921,VCathode=0.62848}
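A few variations on the same idea, as a minimal sketch on a made-up excerpt of data (the name sample is only for illustration): StringMatchQ threads over a List, so the Map can be dropped; Pick's 3rd argument selects on something other than True; and Select gives the same result without building the List of True and False explicitly.

```wolfram
sample = {"V", "(V)", "@", "VCathode=2.75392", "V", "VCathode=2.29025"};

(* StringMatchQ threads over a List, so the Map is optional *)
Pick[sample, StringMatchQ[sample, "VCathode*"]]
(* {"VCathode=2.75392", "VCathode=2.29025"} *)

(* The 3rd argument selects on something other than True, here 1 vs. 0 *)
Pick[{"a", "b", "c"}, {1, 0, 1}, 1]
(* {"a", "c"} *)

(* Select applies the test to each element directly *)
Select[sample, StringMatchQ[#, "VCathode*"] &]
(* {"VCathode=2.75392", "VCathode=2.29025"} *)
```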

My next step will be to use Table and Join to insert the correct parameter in the header of its matching data file from COMSOL.

Trivia: Pick was originally called BinarySelect, and Stephen Wolfram renamed it, as is his wont, to a short Anglo-Saxon term (a large metal rod is called a pick, or think of using a toothpick to pick up food). I ran into him at an ice cream shop, complimented him on the function, and asked if he had renamed it. When he was much younger Stephen ate a lot of chocolate and didn't take care of himself physically. But that has changed: he is careful with his diet and spends a lot of time on the treadmill staying in good physical condition.

Tuesday, May 19, 2015

Does Pi Occur in E - Find a Digit Sequence in any Number

Recently Stephen Wolfram and my friend and lab head, neurosurgeon and computer scientist Jeff Arle, were talking about MyPiDay.com and Stephen said, "Let's write a program to find a number sequence, such as a birthday, in the sequence of Pi." Jeff said it took him just a minute. Here is the code, done as a couple of one-liners, and a sample birthday:

PiString=ToString[N[Pi,10^7]];

First@StringPosition[PiString,"421962"]

{33321,33326}

This reminded me of Carl Sagan's idea (see his novel Contact) that a message could be embedded in a transcendental number. Recall that a transcendental number is one that is not the root of any nonzero polynomial with integer coefficients; every transcendental number is irrational, so its digit expansion never terminates or repeats in any base. Since I see nothing that can be parameterized in a transcendental number, I don't see how anyone, even God (in this case the deistic creator or source of the universe), could embed a message in a transcendental number. On the other hand, if the number is normal, as Pi and E are widely conjectured (but not proven) to be, then every finite digit sequence occurs somewhere in it, so you could specify a start location for any message of any length. Being transcendental is not enough by itself: Liouville's constant is transcendental, yet its decimal digits are all 0s and 1s. (Similar short discussion here.)

Thus, if E is normal, the digit sequence of any finite approximation to Pi is contained somewhere in E, and vice versa if Pi is normal. The following is my program to find any digit sequence in the digits of any other number.

Clear@findSequence;
findSequence[sequence_?NumericQ,sourceNumber_,maxDigits_Integer,base_Integer:10]:=

Module[
{stringSequence=StringReplace[ToString@sequence,"."->""],
numberDigitSequence,partitionLength,partitionedNumber,
sourceSequence={}(*initialize sourceSequence*),
startPositionOfSequence=1,
targetSequence,targetSequenceLength},

targetSequenceLength=StringLength@stringSequence;

targetSequence=RealDigits@ToExpression@stringSequence//First;
Print@targetSequence;

numberDigitSequence=N[sourceNumber,maxDigits]//RealDigits//First;
partitionedNumber=Partition[numberDigitSequence,targetSequenceLength,1];

(* The length of the partitioned sequence is the number of partitions against which to try to match the target sequence. startPositionOfSequence is the Position in the source sequence of the first digit of the subsequence being matched. Note that using While means only the first match is found if there is more than one. Note also that the base argument is accepted but not yet used: the digits are always taken in base 10. *)

While[startPositionOfSequence<=Length@partitionedNumber&&sourceSequence=!=targetSequence,sourceSequence=partitionedNumber[[startPositionOfSequence]];startPositionOfSequence++];

If[sourceSequence==targetSequence,StringForm["Sequence starts at position ``.",startPositionOfSequence-1 (*subtract 1 since first comparison is between empty List {} and targetSequence*)],"Sequence not found."]
]

Timing@findSequence[3.14159,E,10000000]

{3,1,4,1,5,9}
{9.328860,Sequence starts at position 1436936.}
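For comparison, here is a shorter sketch of the same search using the built-in SequencePosition, which I am assuming is available (it was added in version 10.1). It is not a replacement for findSequence, only an illustration of the sliding-window match on a small number of digits.

```wolfram
(* First 1000 decimal digits of E as a List of integers *)
digits = First@RealDigits[N[E, 1000]];

(* Start and end Position of the first occurrence of 7,1,8,2,8;
   the final 1 asks for at most one match *)
SequencePosition[digits, {7, 1, 8, 2, 8}, 1]
(* {{2, 6}} *)
```

An empty List means the sequence does not occur in the digits examined.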

I thought Contact was a fun read and great movie, but IMHO The Demon-Haunted World should be required reading for every scientist.

Tuesday, December 30, 2014

String Replacement Methods: StringTemplate

A String Replacement Overview is here.

StringTemplate

StringTemplate saves you the trouble of searching for a String subset within a String to replace or setting up your own marker to flag the StringPosition in the String at which to perform a replacement.

Further, good programming practice dictates that we use selectors and constructors (specialized, dedicated functions to extract a subset of a file or to change a subset of a file) and that we always use those rather than ad hoc one-liners scattered through our functions and programs [1, 2]. StringTemplate conveniently formalizes and enforces the use of selectors and constructors.

StringForm is simpler to understand and use than StringTemplate, so I use StringForm when I need to output a message from a function. In the next command I don't end the line with a semicolon, so you can see the TemplateObject that StringTemplate returns, including its default Options.

stringTemplate1=StringTemplate@"The quick brown `` jumped over the lazy white ``."

TemplateObject[{The quick brown ,TemplateSlot[1], jumped over the lazy white ,TemplateSlot[2],.},InsertionFunction->TextString,CombinerFunction->StringJoin]

You can directly Apply any StringTemplate as a function to a List of its arguments that fits its requirements, or use TemplateApply to do the same thing.

stringTemplate1@@{"mink","peccadillo"}

The quick brown mink jumped over the lazy white peccadillo.

Equivalently, here StringTemplate is used as a function as you would any other function – use it as the Head of an Expression with its arguments.

stringTemplate1["mink","peccadillo"]

The quick brown mink jumped over the lazy white peccadillo.

Equivalently, using TemplateApply:

TemplateApply[stringTemplate1,{"mink","peccadillo"}]

The quick brown mink jumped over the lazy white peccadillo.
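StringTemplate also accepts named slots, which are filled from an Association, so the order of the arguments no longer matters. A minimal sketch; the slot names animal1 and animal2 and the variable namedTemplate are my own choices for illustration.

```wolfram
namedTemplate = StringTemplate["The quick brown `animal1` jumped over the lazy white `animal2`."];

(* Fill the slots by name from an Association *)
namedTemplate[<|"animal1" -> "mink", "animal2" -> "peccadillo"|>]
(* "The quick brown mink jumped over the lazy white peccadillo." *)

(* TemplateApply does the same; the order of the keys is irrelevant *)
TemplateApply[namedTemplate, <|"animal2" -> "peccadillo", "animal1" -> "mink"|>]
(* "The quick brown mink jumped over the lazy white peccadillo." *)
```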

1. Maeder, Roman, Computer Science with Mathematica. Cambridge: Cambridge University Press, 2000. Chapter 5.3. Design of Abstract Data Types.

2. Maeder, Roman, M220: Programming in Mathematica  (course given by Wolfram Education Group, which I have taken twice and recommend).

String Replacement Methods: Overview

Here are String replacement methods that I have used in code from one-liners up to programs producing hundreds of thousands of text and html files. In general, use the simplest method or one that you understand clearly. Use StringForm to output messages from your functions and programs. For longer functions or programs, StringTemplate is the new best practice.

There is a function I don't discuss, StringInsert, which inserts a substring at a given StringPosition in a control String. I don't advocate its use since it's very brittle in that if you add or delete even one character before the StringPosition then the insertion point will be wrong.

StringForm

Literal Replacement, Markers, and Delimiters

String Replacement Methods: Literal Replacement, Markers, and Delimiters

A String Replacement Overview is here.

Note that the next three methods all use StringReplace. This is in keeping with my principle that the fastest way to learn Mathematica is to become a power user of its 70 or so core functions. In String processing, for instance, StringInsert is not a function you need to know. Instead learn to use the more powerful and robust function, StringReplace.

Literal Replacement

Literal replacement works by using StringReplace to find a literal substring within a String and substitute another substring for it. Literal replacement is very simple and easy to use.

string1="The quick brown fox jumped over the lazy white dog.";

StringReplace[string1,{"fox"->"mink","dog"->"peccadillo"}]

The quick brown mink jumped over the lazy white peccadillo.

Markers

Using markers to indicate the replacement position can improve code legibility. Use StringReplace to replace just the marked text.

string2="The quick brown <animal1> jumped over the lazy white <animal2>.";

StringReplace[string2,{"<animal1>"->"mink","<animal2>"->"peccadillo"}]

The quick brown mink jumped over the lazy white peccadillo.

Delimiters

Use StringReplace to replace text between the delimiters. This is very useful when you want to replace a lot of text in a document, especially in a long document. However, the new function StringTemplate is a superior method overall.

sitemapTemplate="<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
<!-- put list of urls here with a line feed after each one -->
</urlset>";

urls="<url><loc>http://www.blah.net/page1.html</loc></url>
<url><loc>http://www.blah.net/page2.html</loc></url>";

Note that you use StringExpression (shorthand "~~") to join quoted Strings with pattern objects such as Blanks in the pattern that StringReplace searches for, but you must use StringJoin (shorthand "<>") to concatenate Strings in the replacement String.

sitemapTemplateWithURLs=StringReplace[sitemapTemplate,"<!-- put list"~~urlsList__~~"each one -->"->urls]

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.blah.net/page1.html</loc></url>
<url><loc>http://www.blah.net/page2.html</loc></url>
</urlset>
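One caveat with delimiter patterns, sketched here on a made-up String: a double Blank (__) is greedy, so if the delimiters occur more than once, the match runs from the first opening delimiter to the last closing one. Wrapping the Blanks in Shortest limits each match to a single delimited block.

```wolfram
text = "a <!-- x --> b <!-- y --> c";

(* Greedy: one match spanning both comments *)
StringReplace[text, "<!--" ~~ __ ~~ "-->" -> "#"]
(* "a # c" *)

(* Shortest: each comment is replaced separately *)
StringReplace[text, "<!--" ~~ Shortest[__] ~~ "-->" -> "#"]
(* "a # b # c" *)
```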

String Replacement Methods: StringForm

A String Replacement Overview is here.

StringForm

StringForm is a simple, elegant String template function. Use it in your functions where a fixed message isn't enough because you need to fill in some variables, such as values calculated on the fly. Inside a function, wrap it in Print as shown in the following form, since lines there end with an output-suppressing semicolon.

Print@StringForm["control string", variables];

Here the double backtick marks tell Mathematica where to fill in the blanks with arguments you give in the order in which they are inserted into the String. An argument can be a String or an Expression of unlimited complexity, which will be evaluated before insertion. If you don't want the inserted Expression to be evaluated, though, use HoldForm (example below).

StringForm["Use `` for relatively short and simple String templates, such as output messages in your functions. For example, the cube root of `` is ``.","StringForm",27,27^(1/3)]

Use StringForm for relatively short and simple String templates, such as output messages in your functions. For example, the cube root of 27 is 3.
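Note that StringForm returns a StringForm object that only displays as text. If you need an actual String, for example to write to a file or to pass to StringJoin, wrapping it in ToString is one way to get it. A minimal sketch; the variable names message and messageString are made up for illustration.

```wolfram
message = StringForm["The cube root of `` is ``.", 27, 27^(1/3)];
Head[message]          (* StringForm *)

(* Convert the displayed form into a real String *)
messageString = ToString[message];
Head[messageString]    (* String *)
messageString          (* "The cube root of 27 is 3." *)
```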

If you're going to use an argument twice, switch the order, or use a number of arguments and you want to prevent mistakes, use numbered, rather than ordered, backticks. You often want a line break, for which the \n escape character is used within the quotation marks that are Mathematica's String delimiters.

StringForm["Flying or gliding mammals include `1`, `2`,\n`3`, `4`, `5`, `6`, and `7`.\nThe most common species are in the `3` family.","flying possums","greater glider","bats","flying squirrels","flying lemurs","flying monkeys","cats"]

Flying or gliding mammals include flying possums, greater glider,
bats, flying squirrels, flying lemurs, flying monkeys, and cats.
The most common species are in the bats family.

To prevent the inserted Expression from being evaluated, use HoldForm:

StringForm["For example, the sixth term of the Fibonacci series is the sum of the preceding two terms: ``.",HoldForm[3+5=8]]

For example, the sixth term of the Fibonacci series is the sum of the preceding two terms: 3+5=8.