Friday, May 29, 2015

String Processing: StringCases and StringMatchQ

I recently got inconsistent results from StringCases and StringMatchQ (and used Pick instead of Cases). When using Cases or Select, you typically debug issues by mapping MatchQ onto the list that you're sifting for patterns. Similarly, StringMatchQ is the tool used to debug StringCases. However, while I don't believe there is any difference between the patterns accepted by MatchQ and by Cases, there is a significant difference between the patterns accepted by StringMatchQ and by StringCases.

StringMatchQ will accept what are called 'abbreviated String metacharacters' inside a plain string pattern. These are the asterisk wildcard "*" for zero or more characters, the "@" wildcard for one or more characters that are not uppercase letters (so "@@@@" matches any string of four or more such characters, not exactly four), and a double backslash for a literal metacharacter (e.g. "\\*" to match an asterisk).
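As a quick illustration of these metacharacters, here is a minimal sketch. The test strings other than the VCathode one are made up for this example.

```wolfram
(* "*" matches zero or more characters of any kind *)
StringMatchQ["VCathode=2.75392", "VCathode*"]   (* True *)

(* "@" matches one or more characters, none of them an uppercase letter *)
StringMatchQ["abc", "@"]   (* True *)
StringMatchQ["aBc", "@"]   (* False: "B" is uppercase *)

(* "\\*" matches a literal asterisk instead of acting as a wildcard *)
StringMatchQ["a*b", "a\\*b"]   (* True *)
StringMatchQ["axb", "a\\*b"]   (* False *)
```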

But StringCases will not; it treats a plain string as literal text, so wildcards have to be written as StringExpression patterns or RegularExpressions. Here is an example. I want to extract all the "VCathode*" values from this data file header:

data = {"%", "cln1x", "V", "(V)", "@", "VCathode=2.75392", "V", "(V)", "@", "VCathode=2.29025", "V", "(V)", "@", "VCathode=1.86092", "V", "(V)", "@", "VCathode=1.59846", "V", "(V)", "@", "VCathode=1.40122", "V", "(V)", "@", "VCathode=1.28137", "V", "(V)", "@", "VCathode=1.15876", "V", "(V)", "@", "VCathode=1.06451", "V", "(V)", "@", "VCathode=1.0037", "V", "(V)", "@", "VCathode=0.950678", "V", "(V)", "@", "VCathode=0.881652"};

StringCases with an asterisk fails:

In[190]:= StringCases[data,"VCathode*"]
Out[190]= {{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}}
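The empty results make sense once you see that StringCases reads the asterisk as an ordinary character. A small check, assuming a made-up string that really does contain an asterisk (it is not part of the header above):

```wolfram
(* In StringCases a plain string is literal, so "*" only matches an actual asterisk *)
StringCases["VCathode*=1", "VCathode*"]        (* {"VCathode*"} *)

(* A real header entry has no asterisk after "VCathode", so nothing is found *)
StringCases["VCathode=2.75392", "VCathode*"]   (* {} *)
```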

StringMatchQ with an asterisk works:

In[191]:= StringMatchQ[data,"VCathode*"]
Out[191]= {False,False,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True,False,False,False,True}

Here is the solution for StringCases using a StringExpression. I use Flatten to remove the empty lists ("{}") returned for strings that contain no match.

In[192]:= StringCases[#,"VCathode"~~___]&/@data//Flatten
Out[192]= {VCathode=2.75392,VCathode=2.29025,VCathode=1.86092,VCathode=1.59846,VCathode=1.40122,VCathode=1.28137,VCathode=1.15876,VCathode=1.06451,VCathode=1.0037,VCathode=0.950678,VCathode=0.881652}

And here is the solution for StringCases using a RegularExpression. Here I use DeleteCases instead of Flatten to remove the empty lists, which leaves each returned value wrapped in its own List brackets.

In[195]:= StringCases[#,RegularExpression["VCathode.*"]]&/@data//DeleteCases[#,{}]&
Out[195]= {{VCathode=2.75392},{VCathode=2.29025},{VCathode=1.86092},{VCathode=1.59846},{VCathode=1.40122},{VCathode=1.28137},{VCathode=1.15876},{VCathode=1.06451},{VCathode=1.0037},{VCathode=0.950678},{VCathode=0.881652}}
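If only the whole matching strings are needed, the StringMatchQ result can also serve directly as a mask, which keeps the abbreviated "*" pattern usable. The following is a sketch under that assumption, run on a shortened copy of the header; the name sample is hypothetical and used only for illustration.

```wolfram
(* Shortened copy of the header, for illustration only *)
sample = {"%", "cln1x", "V", "(V)", "@", "VCathode=2.75392",
          "V", "(V)", "@", "VCathode=2.29025"};

(* Pick keeps the elements whose StringMatchQ result is True *)
Pick[sample, StringMatchQ[sample, "VCathode*"]]
(* {"VCathode=2.75392", "VCathode=2.29025"} *)

(* Select with StringMatchQ as the test gives the same list *)
Select[sample, StringMatchQ[#, "VCathode*"] &]

(* StringCases also accepts the whole list at once, so mapping over it is optional *)
Flatten[StringCases[sample, "VCathode" ~~ ___]]
```

All three forms return a flat list of strings; Pick and Select return the original elements unchanged, while StringCases returns the matched substrings, which here happen to be the whole strings.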

Thank you to Xin Xiao from Wolfram Technical Support.


