The Way of Mathematica: 2014

Tuesday, December 30, 2014

String Replacement Methods: StringTemplate

A String Replacement Overview is here.

StringTemplate

StringTemplate saves you the trouble of searching for a String subset within a String to replace or setting up your own marker to flag the StringPosition in the String at which to perform a replacement.

Further, good programming practice dictates that we use selectors and constructors – specialized, dedicated functions to extract a subset of a file or to change a subset of a file – and to always use those rather than ad hoc one liners scattered in our functions and programs.^1,2 StringTemplate conveniently formalizes and enforces the use of selectors and constructors.

StringForm is simpler to understand and use than StringTemplate, so I use StringForm when you need to output a message from your function. I don't end the command with a semi-colon so you can see the InputForm of a TemplateObject including its default Options.

stringTemplate1=StringTemplate@"The quick brown `` jumped over the lazy white ``."

TemplateObject[{The quick brown ,TemplateSlot[1], jumped over the lazy white ,TemplateSlot[2],.},InsertionFunction->TextString,CombinerFunction->StringJoin]

You can directly Apply any StringTemplate as a function to a List of its arguments that fits its requirements, or use TemplateApply to do the same thing.

stringTemplate1@@{"mink","peccadillo"}

The quick brown mink jumped over the lazy white peccadillo.

Equivalently, here StringTemplate is used as a function as you would any other function – use it as the Head of an Expression with its arguments.

stringTemplate1["mink","peccadillo"]

The quick brown mink jumped over the lazy white peccadillo.

Equivalently, using TemplateApply:

TemplateApply[stringTemplate1,{"mink","peccadillo"}]

The quick brown mink jumped over the lazy white peccadillo.

1. Maeder, Roman, Computer Science with Mathematica. Cambridge: Cambridge University Press, 2000. Chapter 5.3. Design of Abstract Data Types.

2. Maeder, Roman, M220: Programming in Mathematica (course given by Wolfram Education Group, which I have taken twice and recommend).

String Replacement Methods: Overview

Here are String replacement methods that I have used in code from one-liners up to programs producing hundreds of thousands of text and html files. In general, use the simplest method or one that you understand clearly. Use StringForm to output messages from your functions and programs. For longer functions or programs, StringTemplate is the new best practice.

There is a function I don't discuss, StringInsert, which inserts a substring at a given StringPosition in a control String. I don't advocate its use since it's very brittle in that if you add or delete even one character before the StringPosition then the insertion point will be wrong.

StringForm

Literal Replacement, Markers, and Delimiters

StringTemplate

String Replacement Methods: Literal Replacement, Markers, and Delimiters

A String Replacement Overview is here.

Note that the next three methods all use StringReplace. This is in keeping with my principle that the fastest way to learn Mathematica is to become a power user of its 70 or so core functions. In String processing, for instance, StringInsert is not a function you need to know. Instead learn to use the more powerful and robust function, StringReplace.

Literal Replacement

Literal replacement works by using StringReplace to find a literal substring within a String and substitute another substring for it. Literal replacement is very simple and easy to use.

string1="The quick brown fox jumped over the lazy white dog.";

StringReplace[string1,{"fox"->"mink","dog"->"pecadillo"}]

The quick brown mink jumped over the lazy white pecadillo.

Markers

Using markers to indicate the replacement position can improve code legibility. Use StringReplace to replace just the marked text.

string2="The quick brown <animal1> jumped over the lazy white <animal2>.";

StringReplace[string2,{"<animal1>"->"mink","<animal2>"->"pecadillo"}]

The quick brown mink jumped over the lazy white pecadillo.

Delimiters

Use StringReplace to replace text between the delimiters. This is very useful when you want to replace a lot of text in a document, especially in a long document. However, the new function StringTemplate is a superior method overall.

sitemapTemplate="<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">

</urlset>";

urls="<url><loc>http://www.blah.net/page1.html</loc></url>
<url><loc>http://www.blah.net/page2.html</loc></url>";

Note that you use StringExpression (shorthand "~~") to concatenate quoted Strings with Blanks in the String to be found by StringReplace, but you must use StringJoin (shorthand "<>") if you concatenate different Strings in the replacement String.

sitemapTemplateWithURLs=StringReplace[sitemapTemplate,""->urls]

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.blah.net/page1.html</loc></url>
<url><loc>http://www.blah.net/page2.html</loc></url>
</urlset>

String Replacement Methods: StringForm

A String Replacement Overview is here.

StringForm

StringForm is a simple, elegant String template function. Use it in your functions where Print isn't enough since you need to fill in some variables, such as calculations on the fly. In a function use the following form with Print to output it since lines end with an output-suppressing semi-colon.

Print@StringForm["control string", variables];

Here the double backtick marks tell Mathematica where to fill in the blanks with arguments you give in the order in which they are inserted into the String. An argument can be a String or an Expression of unlimited complexity, which will be evaluated before insertion. If you don't want the inserted Expression to be evaluated, though, use HoldForm (example below).

StringForm["Use `` for relatively short and simple String templates, such as output messages in your functions. For example, the cube root of `` is ``.","StringForm",27,27^(1/3)]

Use StringForm for relatively short and simple String templates, such as output messages in your functions. For example, the cube root of 27 is 3.

If you're going to use an argument twice, switch the order, or use a number of arguments and you want to prevent mistakes, use numbered, rather than ordered, backticks. You often want a line break, for which the \n escape character is used within the quotation marks that are Mathematica's String delimiters.

StringForm["Flying or gliding mammals include `1`, `2`,\n`3`, `4`, `5`, `6`, and `7`.\nThe most common species are in the `3` family.","flying possums","greater glider","bats","flying squirrels","flying lemurs","flying monkeys","cats"]

Flying or gliding mammals include flying possums, greater glider,
bats, flying squirrels, flying lemurs, flying monkeys, and cats.
The most common species are in the bats family.

To prevent the inserted Expression from being evaluated, use HoldForm:

StringForm["For example, the sixth term of the Fibonacci series is the sum of the preceding two terms: ``.",HoldForm[1+1+2+3+5=8]]

For example, the sixth term of the Fibonacci series is the sum of the preceding two terms: 1+1+2+3+5=8.

Tuesday, December 23, 2014

Memory Management Tools

While Mathematica is designed to manage memory for you, under certain circumstances it can get bogged down, mainly because it keeps a record of all your inputs and outputs with In and Out. So if you're using functions that output a lot of computation, or working with large files, you may notice Mathematica slowing down.

There are a number of ways that you can manage memory in Mathematica. Here is a summary (see also How to Find Memory Used in Computations).

Command	Effect
?Global`*	Shows all Symbols in a non-accessible table
Names@”Global`*”	Returns a List of all Symbols that you can access
Clear@symbol	Clears the value of symbol but leaves its name in memory
Clear@”Global`*”	Clears the values of all Symbols but leaves their names in memory
Remove@symbol	Removes the name symbol and its value from memory
Remove@”Global`*”	Removes all Symbols and their values from memory

If you're going to go as far as removing all Global Symbols, consider starting a new session by entering Quit[] in your Notebook or Quit Kernel → Local under the Evaluation menu.

Beginners hesitate to Quit the kernel, but there's little downside. Even if you haven't saved your Notebooks, the kernel is a separate entity and you can save them.

To automate resuming after quitting the kernel or in general, use Initialization Cells. You can set Initialization in the menu under Cell → Cell Properties or by right-clicking on the cell and selecting Initialization Cell. A little downward tick mark appears in the upper right corner of the cell.

Then when you re-start the kernel by selecting any cell, selecting Evaluation → Evaluate Initialization Cells, or re-open the Notebook, all the Initialization cells are automatically re-Evaluated. In this way you lose very little time by quitting the kernel and re-starting.

Memory-Management Commands to Use Occasionally

Memory currently used by the kernel:

In[157]:= MemoryInUse[]

Out[157]= 135450976

Memory currently used by the front end (all of your open Notebooks):

In[158]:= MemoryInUse@$FrontEnd

Out[158]= 543264768

The maximum memory used by the kernel during your current Mathematica session:

In[159]:= MaxMemoryUsed[]

Out[159]= 137155304

Clear a cell that consumed lots of memory in your session:

Unprotect[Out]; Out[537] =.;
Protect@Out;

Easy Ways to Create Variations of Function

You will often need to create more than one version of a function. Here are two different methods, a piecewise function and using an Option. Either are fine, but in general simply for conciseness I use a piecewise function when the function is short (like less than a dozen lines) and an Option when the function is longer.

Here is the dataset for the examples. To see the 6 functions used to create arrays in Mathematica, see Ways to Create Arrays.

dataset=Array[List,{5,3}]

{{{1,1},{1,2},{1,3}},{{2,1},{2,2},{2,3}},{{3,1},{3,2},{3,3}},{{4,1},{4,2},{4,3}},{{5,1},{5,2},{5,3}}}

Here's the first version. It takes a List of data as a first argument and an Integer as its second argument. This simple example uses the index to pick out a Part from the data.

function1[data_List,index_Integer]:=data[[index]]

function1[dataset,4]

{{4,1},{4,2},{4,3}}

The Piecewise Approach

Now we need a second version to take a List of indices in the second argument. We can simply use datatyping by specifying the Head of the second argument. This is one way to do the "piecewise" method. The domain of possible inputs is split into pieces (sub-domains) and each variation of a function is designed to pick out the correct piece (sub-domain) for which it is designed.

function2[data_List,index_List]:=data[[index]]

function2[dataset,{2,4}]

{{{2,1},{2,2},{2,3}},{{4,1},{4,2},{4,3}}}

The Options Approach

You can see that if there were a minor change in a long function, copying the function is not as concise as the following approach using Options. There is more overhead in this example to create the Option and handle it in the function, but in a long function that overhead is less than copying as in the Piecewise approach.

Note that for concise creation of Piecewise mathematical functions, the built-in function Piecewise should be used. For an explanation of how I use Options, see A Template for Optional Arguments [to be published].

ClearAll@function3;
Options@function3="indexOption"->"Integer";

function3[data_List,index_,options___?OptionQ]:=
Module[{indexOption="indexOption"/.{options}/.Options@function3},

If[indexOption=="Integer",dataset[[index]] ];
If[indexOption=="List",dataset[[{index}]] ];

(*endModule*)]

Here is the function that prompted me to write this primer. It takes the membrane voltage time series from simulated spiking neuron cells and counts how many spikes occur, with cell range and time series range as arguments. I needed to expand it so I could specify counting spikes in several time series ranges. I didn't want to copy and vary it with the piecewise approach, so I used the options approach. The changes to the original function to implement the new ones are italicized.

ClearAll@batchMeanFiringRateTable;
Options@batchMeanFiringRateTable={"Export"->Off,"MultipleBatch"->Off,"MultipleTimeSeries"->False};

batchMeanFiringRateTable[spikeFileDirectory_String,cellRange_List,timeSeriesRange_List,leftColumnHeading_String:"Neural Group",rightColumnHeading_String:"Average Firing Rate",options___?OptionQ]:=
Module[{exportFileName,firingData,
spikeFiles=getSpikeFiles@spikeFileDirectory,

exportOption="Export"/.{options}/.Options@batchMeanFiringRateTable,
multipleBatchOption="MultipleBatch"/.{options}/.Options@batchMeanFiringRateTable,
multipleTimeSeriesOption="MultipleTimeSeries"/.{options}/.Options@batchMeanFiringRateTable},

(*With "Batch"\[Rule]On, meanFiringRateFromSpikeFile will just return the averageSpikeRate each time it's called*)

(*Need FileBaseName to identify cell group in left column if TableForm; need First@timeSeriesRange to remove the outer List for a single time series range*)

If[multipleTimeSeriesOption==False,firingData=Table[{FileBaseName@batchFile,meanFiringRateFromSpikeFile[batchFile,cellRange,First@timeSeriesRange,"Batch"->On]},{batchFile,spikeFiles}] (*endIf*)];

If [multipleTimeSeriesOption==True,firingData=Table[{FileBaseName@batchFile,meanFiringRateFromSpikeFile[batchFile,cellRange,timeSeries,"Batch"->On]},{batchFile,spikeFiles},{timeSeries,timeSeriesRange}]/.{{x_,y_},{x_,z_}}->{x,y,z};(*Table will take each spike file and iterate over timeSeriesRange. Don't need First@timeSeriesRange here. *) (*endIf*)];

If[multipleBatchOption==Off,Print@"Multiple batch is off.";Print[firingData//TableForm[#,TableAlignments->Left,TableHeadings->{None,{leftColumnHeading,"Ave AP Rate: "<>ToString@timeSeriesRange}}]&],Return@firingData (*endIf*)];

If[exportOption==On&&multipleBatchOption==Off,exportFileName=FileNameJoin@{DirectoryName@spikeFiles[[1]],ToString@FileNameTake[DirectoryName@spikeFiles[[1]],-1]<>"-BatchFileTable.xlsx"};
Export[exportFileName,firingData];Print@StringForm["File exported to ``.",exportFileName]];
]

batchMeanFiringRateTable["C:\\Users\\Public\\Documents\\UNCuS16_09_2013\\DataFiles\\WDR-Abeta-14-12only\\WDR-Abeta-14-12only-CS8-9-20-RS0p28\\",{1,80},{{1,45},{46,85}},"Neural Group","Method"->"OverSpikingCells","MultipleBatch"->Off,"Export"->Off,"MultipleTimeSeries"->True]

Ways to Create Arrays

Programmers often have to create an array. Of course, in Mathematica, you don't have to allocate memory for an array or dimension it or any of that outdated nonsense. Here are the six standard functions and common methods used to create arrays in Mathematica. Use the easiest-to-implement and most readable method for your application.

First there is the versatile function Array that applies a function to the array indices of any desired array dimension.

Array[g,{2,3}]

{{g[1,1],g[1,2],g[1,3]},{g[2,1],g[2,2],g[2,3]}}

To simply create a nested List of Integers you can use List as the function to apply to the array. "Give me 5 rows of 3 elements each:"

dataset=Array[List,{5,3}]

{{{1,1},{1,2},{1,3}},{{2,1},{2,2},{2,3}},{{3,1},{3,2},{3,3}},{{4,1},{4,2},{4,3}},{{5,1},{5,2},{5,3}}}

It is quite easy to create a similar array with one of the most versatile and commonly-used functions in functional programming, Table. It pays handsomely to become a power user of Table.

Table[{i,j},{i,5},{j,3}]

{{{1,1},{1,2},{1,3}},{{2,1},{2,2},{2,3}},{{3,1},{3,2},{3,3}},{{4,1},{4,2},{4,3}},{{5,1},{5,2},{5,3}}}

Table is highly versatile and can certainly be used to do many things. In some cases there is a built-in function that is more specialized. For instance, Table can be used to create a constant array. "Give me 25 7s:"

Table[7,{25}]

{7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7}

ConstantArray is a specialized function that does the same thing, but except for a little bit of clarity in reading the code doesn't do more than Table.

ConstantArray[7,25]

{7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7}

Table can generate random numbers using any of the random function family:

Table[RandomInteger[],{10}]

{1,1,1,0,0,0,0,1,1,1}

But for simple as well as higher-dimension arrays, the several half-dozen built-in random functions are superior and more readable than doing the same thing in Table.

RandomInteger[{1,5},{2,3}]

{{2,4,3},{2,1,3}}

Likewise, while Table can easily create a range of numbers, Range is superior.

Table[i,{i,7}]

{1,2,3,4,5,6,7}

Range@7

{1,2,3,4,5,6,7}

From 14 to 28 in steps of 2:

Range[14,28,2]

{14,16,18,20,22,24,26,28}

Those applications yield vectors, but here is a lesser-known usage of Range with a List as argument that generates an array. For each number in the List, Range yields a List of integers from 1 up to that number.

Range[{2,5,3}]

{{1,2},{1,2,3,4,5},{1,2,3}}

To more carefully control aspects of the array such as start, end, and step size, you may need to manually assemble the array with Table or Range.

{Range[14,28,2],Range[10^6,10^5,-2*10^5]}

{{14,16,18,20,22,24,26,28},{1000000,800000,600000,400000,200000}}

{Table[i,{i,14,28,2}],Table[j,{j,10^6,10^5,-2*10^5}]}

{{14,16,18,20,22,24,26,28},{1000000,800000,600000,400000,200000}}

Finally, MapIndexed is used in general to generate a List where a function is applied to a List of arguments along with an index number, which is the second argument.

MapIndexed[f,{a,b,c,d}]

{f[a,{1}],f[b,{2}],f[c,{3}],f[d,{4}]}

Using First strips off the enclosing List of the index.

MapIndexed[{#^2,First@#2}&,Range@5]

{{1,1},{4,2},{9,3},{16,4},{25,5}}

Monday, May 26, 2014

How to Speed Up Processing Large Files

Here are five useful tips to speed up Mathematica, especially when working with large files.

First, use simple file formats such as .txt or .csv, thereby avoiding the complicated superstructure of, for example, Microsoft Excel. I have processed files with between 10^5 and 10^6 elements by saving an Excel file as CSV.

Second, memory is cheap — the simplest way is to increase the random access memory allocated to Mathematica. To increase RAM (Random Access Memory, which is 100 - 1000 times faster than standard hard drive memory, but solid state drives are changing that) allocated to Mathematica, the command is

ReinstallJava[JVMArguments->"Xmx6000m"]

where the final piece "6000m" tells your operating system to allocate more memory to the Java Virtual Machine, in this case 6000 MB (= 6 GB), but it can be as large as your RAM will tolerate.

I have several machines with 16 GB of RAM and one with 32 GB. I've needed up to 9 GB ("9000m") allocated to Mathematica for generating large plots of trees as well as processing large files.

First load the JLink Package and then reinstall Java with a memory spec:

Needs@"JLink`";
ReinstallJava[JVMArguments -> "-Xmx3000m"]

Whatever you get for Output is fine as long as there is no error message.

Third, instead of using Import, I have found that the lower-level ReadList function is one or two orders of magnitude faster. See http://reference.wolfram.com/mathematica/ref/ReadList.html. For instance, after trying Import on a 150 GB file and it never finished, I used ReadList and it took 72 seconds!

Fourth, get the data in parts if getting all the records at once is slow. See the Elements section of http://reference.wolfram.com/mathematica/ref/format/XLSX.html for details on getting parts of data. For example, I've found breaking a list of 10^6 records into sublists of 50K records works wonders.

Fifth, I don't know why, but Mathematica is slow at exporting HTML format. When working with thousands of files It is far quicker to save them as text and then change the file extensions with a shell command such as

ren *.txt *.html

Wednesday, April 16, 2014

String Operations: The Essential Function StringSplit

Split a String and Gather its Words by Their First Letter

Let's process a nice sonnet I just read in Ray Kurzweil's excellent book on what he thinks is the heart of cognitive science and AI, How to Create a Mind: The Secret of Human Thought Revealed.
To start I copy and paste the text from the web between quotation marks.

In[377]:= sonnet73=
"That time of year thou mayst in me behold,
When yellow leaves, or none, or few, do hang
Upon those boughs which shake against the cold,
Bare ruined choirs, where late the sweet birds sang.
In me thou seest the twilight of such day,
As after sunset fadeth in the west,
Which by and by black night doth take away,
Death's second self, that seals up all in rest.
In me thou seest the glowing of such fire,
That on the ashes of his youth doth lie,
As the death-bed whereon it must expire,
Consumed with that which it was nourished by.
This thou perceiv'st, which makes thy love more strong,
To love that well, which thou must leave ere long.";

Can we use ToLowerCase without Map? Yes, since it's Listable.

In[380]:= sonnet73LowerCase=ToLowerCase@%377

Out[380]= that time of year thou mayst in me behold,
when yellow leaves, or none, or few, do hang
upon those boughs which shake against the cold,
bare ruined choirs, where late the sweet birds sang.
in me thou seest the twilight of such day,
as after sunset fadeth in the west,
which by and by black night doth take away,
death's second self, that seals up all in rest.
in me thou seest the glowing of such fire,
that on the ashes of his youth doth lie,
as the death-bed whereon it must expire,
consumed with that which it was nourished by.
this thou perceiv'st, which makes thy love more strong,
to love that well, which thou must leave ere long.

We need StringSplit, not SplitBy, since we are working with a String. StringSplit is a powerful and essential function when importing and transforming large files to the format you need. Examples in the Doc Center like this one show its versatility:

In[1]:= StringSplit["a-b:c-d:e-f-g",{":","-"}]

Out[1]= {a,b,c,d,e,f,g}

Using WordBoundary as the pattern test for where to split works, but leaves much more "noise". Here is the output. Using InputForm is often handy to see what is going in while processing Strings.

In[409]:= StringSplit[sonnet73LowerCase,WordBoundary]//InputForm

Out[409]//InputForm=
{"that", " ", "time", " ", "of", " ", "year", " ", "thou", " ", "mayst", " ", "in",
" ", "me", " ", "behold", ",\n", "when", " ", "yellow", " ", "leaves", ", ", "or",
" ", "none", ", ", "or", " ", "few", ", ", "do", " ", "hang", "\n", "upon", " ",
"those", " ", "boughs", " ", "which", " ", "shake", " ", "against", " ", "the", " ",
"cold", ",\n", "bare", " ", "ruined", " ", "choirs", ", ", "where", " ", "late", " ",
"the", " ", "sweet", " ", "birds", " ", "sang", ".\n", "in", " ", "me", " ", "thou",
" ", "seest", " ", "the", " ", "twilight", " ", "of", " ", "such", " ", "day", ",\n",
"as", " ", "after", " ", "sunset", " ", "fadeth", " ", "in", " ", "the", " ", "west",
",\n", "which", " ", "by", " ", "and", " ", "by", " ", "black", " ", "night", " ",
"doth", " ", "take", " ", "away", ",\n", "death", "'", "s", " ", "second", " ",
"self", ", ", "that", " ", "seals", " ", "up", " ", "all", " ", "in", " ", "rest",
".\n", "in", " ", "me", " ", "thou", " ", "seest", " ", "the", " ", "glowing", " ",
"of", " ", "such", " ", "fire", ",\n", "that", " ", "on", " ", "the", " ", "ashes",
" ", "of", " ", "his", " ", "youth", " ", "doth", " ", "lie", ",\n", "as", " ", "the",
" ", "death", "-", "bed", " ", "whereon", " ", "it", " ", "must", " ", "expire",
",\n", "consumed", " ", "with", " ", "that", " ", "which", " ", "it", " ", "was", " ",
"nourished", " ", "by", ".\n", "this", " ", "thou", " ", "perceiv", "'", "st", ", ",
"which", " ", "makes", " ", "thy", " ", "love", " ", "more", " ", "strong", ",\n",
"to", " ", "love", " ", "that", " ", "well", ", ", "which", " ", "thou", " ", "must",
" ", "leave", " ", "ere", " ", "long", "."}

What a mess! And using this lengthy construct to DeleteCases of punctuation still left an imperfect result (which I won't show in case you're about to eat).

%//DeleteCases[#,""|" "|","|", "|" ,"|"'"|"s"|"st"|",\n"|",\n"|".\n"|"\n"|"-"|"."|" . "|" . "]&

Compare the result from proper use of StringSplit.

In[405]:= sonnet73Split=StringSplit[sonnet73LowerCase,{Whitespace,".",","}]//DeleteCases[#,""]&

Out[405]= {that,time,of,year,thou,mayst,in,me,behold,when,yellow,leaves,or,none,or,few,do,hang,upon,those,boughs,which,shake,against,the,cold,bare,ruined,choirs,where,late,the,sweet,birds,sang,in,me,thou,seest,the,twilight,of,such,day,as,after,sunset,fadeth,in,the,west,which,by,and,by,black,night,doth,take,away,death's,second,self,that,seals,up,all,in,rest,in,me,thou,seest,the,glowing,of,such,fire,that,on,the,ashes,of,his,youth,doth,lie,as,the,death-bed,whereon,it,must,expire,consumed,with,that,which,it,was,nourished,by,this,thou,perceiv'st,which,makes,thy,love,more,strong,to,love,that,well,which,thou,must,leave,ere,long}

Here GatherBy groups sub-lists by the first character of each word.

In[411]:= sonnet73SplitGathered=GatherBy[sonnet73Split,First@Characters@#&]

Out[411]= {{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{of,or,or,of,of,on,of},{year,yellow,youth},{mayst,me,me,me,must,makes,more,must},{in,in,in,in,in,it,it},{behold,boughs,bare,birds,by,by,black,by},{when,which,where,west,which,whereon,with,which,was,which,well,which},{leaves,late,lie,love,love,leave,long},{none,night,nourished},{few,fadeth,fire},{do,day,doth,death's,doth,death-bed},{hang,his},{upon,up},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{against,as,after,and,away,all,ashes,as},{cold,choirs,consumed},{ruined,rest},{glowing},{expire,ere},{perceiv'st}}

To see the Lists sorted by their initial letter, we should use SortBy, but this didn't work.

In[412]:= SortBy[%,First@Characters@#&]

We need to add another First to apply the sorting function to the First word in each sublist.

In[413]:= SortBy[%,First@Characters@First@#&]

Out[413]= {{against,as,after,and,away,all,ashes,as},{behold,boughs,bare,birds,by,by,black,by},{cold,choirs,consumed},{do,day,doth,death's,doth,death-bed},{expire,ere},{few,fadeth,fire},{glowing},{hang,his},{in,in,in,in,in,it,it},{leaves,late,lie,love,love,leave,long},{mayst,me,me,me,must,makes,more,must},{none,night,nourished},{of,or,or,of,of,on,of},{perceiv'st},{ruined,rest},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{upon,up},{when,which,where,west,which,whereon,with,which,was,which,well,which},{year,yellow,youth}}

File Operations: Add the Total Byte Size of Files in Folders

Mostly from the Carnegie-Mellon Pronouncing Dictionary (link available here: http://www.speech.cs.cmu.edu/cgi-bin/cmudict), I have created 138,418 files that I wanted to group into 3 folders of equal size. The files are in sub-folders according to the first letter of their filename, and a first pass at dividing them equally by dividing the Length of the original List by 3 split the last two Lists in the middle of files beginning with "p". I want to verify that the 3 sub-folders contain files of equal size or even them out. This involves some file operations and Select.

In[312]:= fileNames=FileNames["*.zip",{"C:\\Users\\kwcarlso\\Documents\\Kris\\Megapedia-Local\\Reference-English\\Target Files\\Reference-English"}]

Here and below I've abbreviated the output with "{filename, ..., filename}".

Out[312]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\a.zip, ... ,C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\z.zip}

How to best tally the file byte size in each sub-folder? I considered using GatherBy to group the FileNames into the same groups as the sub-folders, and that probably would have worked, but would have been complicated to use three predicates, each with a range of letters in it. Discretion is always the better part of valor in programming, even to the point of just adding up the byte sizes by hand if this task were a one-off and to let me get on to the next task.

I decided to apply the principle of "divide-and-conquer" and use Select to group each set of files individually, then add byte sizes for each result. The predicate in Select is a set-theoretic operation, which indicated using a set-theoretic function, MemberQ. An original design principle of Mathematica's pre-cursor, Symbolic Manipulation Program (SMP), was to transparently map to all common mathematical functions and syntax, and it is simplest to do just that if possible. I use the infix functional style for MemberQ because it seems a bit more readable than functional bracketed style (but that's a trivial personal preference).

My first attempt failed since I forgot that CharacterRange is case-sensitive. I changed "A" and "F" to "a" and "f" and it worked perfectly. We need the final #& since the Select test works by simply applying the predicate to each element of the first argument List that you provide it.

In[316]:= fileNamesAF=Select[fileNames,CharacterRange["a","f"]~MemberQ~First@Characters@FileNameTake@#&]

Out[316]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\a.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\f.zip}

In[317]:= fileNamesGP=Select[fileNames,CharacterRange["g","p"]~MemberQ~First@Characters@FileNameTake@#&]

Out[317]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\g.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\p.zip}

In[318]:= fileNamesQZ=Select[fileNames,CharacterRange["q","z"]~MemberQ~First@Characters@FileNameTake@#&]
Out[318]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\q.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\z.zip}

At this point all we need to do is Map FileByteCount onto the filename in each folder, which means at Level 2 only (note the brackets around 2).

In[319]:= Map[FileByteCount,{fileNamesAF,fileNamesGP,fileNamesQZ},{2}]

Out[319]= {{11154319,14865868,17087067,12155371,7279606,8313716},{8802167,9846927,5719874,2458146,6091626,8605940,14729204,4588698,4529191,13363263},{795821,11363572,22685379,9138068,3902314,3644541,6431739,119495,1027761,1314296}}

Replacing List with Plus to add the numbers in each List is a common operation. When referring to Output I often use the line number instead of % since I can then change the function and re-Evaluate without modifiying it (to %%, %3, or then using the line number).

In[321]:= Plus@@#&/@%319

Out[321]= {70855947,78735036,60422986}

The result told me that moving the files beginning with "P" from the middle set into the last set would even those last two groups out.

Friday, April 11, 2014

An Advanced Predicate and Analysis of How It Works with Trace and Timing

Wagner gives a different version of allodd@x that is more efficient than the one shown here. This function stops testing as soon as it finds a number that is not odd. Contrary to his predisposition, Wagner uses a procedural loop, "de-constructing," as he calls it, the List to operate on its Parts.

In[269]:= allOddProcedural@aList_List:=Module[{i},
For[i=1,i<=Length@aList,i++,If[!OddQ[aList[[i]]],Break[]]];i>Length@aList]

But I bet OddQ is Listable and therefore its C compilation would be much faster than the For loop. First let's compare the Wagner allOddQ from above with his procedural one.

Wagner's function Breaks after OddQ returns False when testing the third argument, 6 (near the end of the Trace). Note the 100-fold difference in Timing!

In[275]:= list2={3,5,6,7};

In[276]:= allOdd@list2//Trace

Out[276]= {{list2,{3,5,6,7}},allOdd[{3,5,6,7}],Length[Select[{3,5,6,7},OddQ]]==Length[{3,5,6,7}],{{Select[{3,5,6,7},OddQ],{OddQ[3],True},{OddQ[5],True},{OddQ[6],False},{OddQ[7],True},{3,5,7}},Length[{3,5,7}],3},{Length[{3,5,6,7}],4},3==4,False}

In[277]:= allOddProcedural@list2//Trace

Out[277]= {{list2,{3,5,6,7}},allOddProcedural[{3,5,6,7}],Module[{i$},For[i$=1,i$<=Length[{3,5,6,7}],i$++,If[!OddQ[{3,5,6,7}[[i$]]],Break[]]];i$>Length[{3,5,6,7}]],{For[i$41099=1,i$41099<=Length[{3,5,6,7}],i$41099++,If[!OddQ[{3,5,6,7}[[i$41099]]],Break[]]];i$41099>Length[{3,5,6,7}],{For[i$41099=1,i$41099<=Length[{3,5,6,7}],i$41099++,If[!OddQ[{3,5,6,7}[[i$41099]]],Break[]]],{i$41099=1,1},{{i$41099,1},{Length[{3,5,6,7}],4},1<=4,True},{{{{{i$41099,1},{3,5,6,7}[[1]],3},OddQ[3],True},!True,False},If[False,Break[]],Null},{i$41099++,{i$41099,1},{i$41099=2,2},1},{{i$41099,2},{Length[{3,5,6,7}],4},2<=4,True},{{{{{i$41099,2},{3,5,6,7}[[2]],5},OddQ[5],True},!True,False},If[False,Break[]],Null},{i$41099++,{i$41099,2},{i$41099=3,3},2},{{i$41099,3},{Length[{3,5,6,7}],4},3<=4,True},{{{{{i$41099,3},{3,5,6,7}[[3]],6},OddQ[6],False},!False,True},If[True,Break[]],Break[]}},{{i$41099,3},{Length[{3,5,6,7}],4},3>4,False},False},False}

In[262]:= list1=Range[1,10^7,2];

Timing[#@list1]&/@{allOddProcedural,allodd}

{{9.297660,True},{0.109201,<<1>>}}

Why Use Built-in Functions? They're Already Written, De-bugged, and Fast

Now let's try my premise - first, OddQ is Listable as I thought.

In[260]:= Attributes@OddQ

Out[260]= {Listable,Protected}

Let's write the simple function:

In[279]:= allOddFunctional@aList_List:=If[!OddQ@aList,i>Length@aList]

In[280]:= Timing[allOddFunctional@list1]

Out[280]= {0.093601,<<1>>}

allOddFunctional is 10X faster than Wagner's procedural function, even though it appears to check every argument:

In[281]:= allOddFunctional@list2//Trace

Out[281]= {{list2,{3,5,6,7}},allOddFunctional[{3,5,6,7}],If[!OddQ[{3,5,6,7}],i>Length[{3,5,6,7}]],{{OddQ[{3,5,6,7}],{OddQ[3],OddQ[5],OddQ[6],OddQ[7]},{OddQ[3],True},{OddQ[5],True},{OddQ[6],False},{OddQ[7],True},{True,True,False,True}},!{True,True,False,True}},If[!{True,True,False,True},i>Length[{3,5,6,7}]]}

The lesson is simply that built-in functions are almost always faster than ones we make if for no other reason than they are compiled into C.

Here Wagner shows the use of While instead of For to stop when it finds an odd number.

allodd@aList_List:=Module[{i=1},While[i<=Length@aList&&OddQ@aList[[i]],i++];i>Length@aList]

Here is a Dropbox link to Wagner's book. High kudos to "Mr Wizard Todd" for asking McGraw-Hill for the right to distribute Wagner's out-of-print book. Here's his post on the StackExchange Mathematica forum.

https://www.dropbox.com/s/kllwg6y44p8va5g/Wagner%20All%20Parts-RC.pdf

Here is a link to a compressed file: http://www.verbeia.com/mathematica/PowerProgMa.zip

More Complex Predicates: allOddQ, allIdenticalQ, subsetQ

"Mr Wizard Todd", Manfred Plagmann, and Sophia Scheibe have done Mathematica users a nice favor by getting permission from McGraw-Hill to distribute licensed copies of David Wagner' s superb book, Power Programming in Mathematica. (Dropbox link: https://www.dropbox.com/s/j2dsyvptnxjd369/Wagner%20All%20Parts-RC.pdf)

A related post is How to See the Equivalence of Select and Cases.

Here are some nice examples of predicates from Wagner (re - written in my Prefix/Postfix dialect).Test a list to see if all of its entries are odd integers.

allOdd@aList_List:=Length@Select[aList,OddQ]==Length@aList

allOdd@{2,3,5,6}

False

allOdd@{3,5,7,9}

True

This predicate tests a List to see if its parts are identical. Count returns all Parts that are Equal to the First Part.

identicalListPartsQ@aList_List:=Count[aList, First@aList] == Length@aList

identicalListPartsQ@{a,b,c,4}

False

identicalListPartsQ@{a,a,a,a}

True

SameQ vs. Equal and Testing for Lists

SubsetQ determines if its first argument is a subset of its second argument and returns True or False. Wagner's function did not include a List test and he notes on Equal vs.SameQ : "The use of === rather than == makes SubsetQ return False if either set1 or set2 is not a list. Try it with Equal."

I use the Head test: set1_List, which rejects a non-List argument before evaluation. And more importantly, SameQ should always be used to test non-numerical equivalence. In fact using Equal here can give screwy results in part because of its usage as the mathematical "equals" (=) in Mathematica's syntax for equations.

When using abbreviated operators, we should first determine their Precedence:

Precedence/@{Union,Equal,SameQ}

{300.,290.,290.}

Since Union is slightly stickier than SameQ, the Union will be performed before the SameQ test.

subsetQ[set1_,set2_]:=set1∪set2===set2

subsetQ[{a,b,c},{a,b,d}]

False

subsetQ[{a,b},{a,b,2}]

False

Union Sorts Its Result

Whoops! Wagner's function didn't work. Let's find out why:

subsetQ[{a,b},{a,b,2}]//Trace

{subsetQ[{a,b},{a,b,2}],{a,b}∪{a,b,2}==={a,b,2},{{a,b}∪{a,b,2},{2,a,b}},{2,a,b}==={a,b,2},False}

The fact that Union returns a sorted List fouls up the comparison. Let's try a fix with Sort:

Clear@subsetQ;subsetQ[set1_,set2_]:=set1∪set2===Sort@set2

And it works - you see in the last step Sort fixes the order.

subsetQ[{a,b},{a,b,2}]//Trace

{subsetQ[{a,b},{a,b,2}],{a,b}∪{a,b,2}===Sort[{a,b,2}],{{a,b}∪{a,b,2},{2,a,b}},{Sort[{a,b,2}],{2,a,b}},{2,a,b}==={2,a,b},True}

Greater (>) is a "non-Q" predicate so the last statement (lacking a return-suppressing semi-colon), returns True or False.

More on predicates: http://mathematica-guide.blogspot.com/2015/07/how-to-see-equivalence-of-select-and.html

Create Predicates to Test Strings vs Numerics: Test for File Extensions

Elsewhere I tell why predicates are important to use, show how to find all predicates in Mathematica including ones not suffixed with "Q", and create ones of your own to test numerical expressions. Here I cover predicates for testing Strings, and create some tests for file types, as judged by their extensions. Note that FileExtension does not capture the period before the extension.

Equal (==) and SameQ (===) are valid predicates for Strings as well as numerics, and work fine for simple applications.

In[206]:= uncusFileQ@fileName_String := FileExtension@fileName == "unc"

In[207]:= uncusFileQ@"afile.unc"

Out[207]= True

In[208]:= uncusFileQ@"afile.txt"

Out[208]= False

In[209]:= textFileQ@fileName_String := FileExtension@fileName == "txt"

In[215]:= textFileQ@"sampleFile.txt"

Out[215]= True

To test for two alternative file extensions is a little trickier. Since we are testing for String equivalence between String patterns, we use the general String predicate function StringMatchQ and Alternatives (|), not logical Or (||). An example is testing for HTML files, since their extensions come in .htm and .html flavors.

In[210]:= Clear@htmlFileQ;
htmlFileQ@fileName_String := StringMatchQ[FileExtension@fileName, "html" | "htm"]

In[212]:= htmlFileQ@"www.jognog.com/index.html"

Out[212]= True

In[213]:= htmlFileQ@"www.jognog.com/index.htm"

Out[213]= True

In[214]:= htmlFileQ@"www.jognog.com/sitemap.xml"

Out[214]= False

Friday, March 21, 2014

String Processing: Compose Last Name + First Initial and Create Email Address

String Processing: Compose First Name + First Initial and Create Email Addresses

I've imported a List of teacher's names. Now I need to create usernames and email addresses for them to upload to an app. Here is the List; note there are nested Lists within it.

In[226]:= teacherNames={{{"Bazan, Dorothy","Bleakney, Tom"},{"Calorio, Marie","Cardoso, Brian","Carpenito, Courtney","Carton, Jacqueline"},{"Coffey, Tara","Conaty, Thomas"}},{{"Davenport, Joseph","DiSabatino, Jennifer"},{"Doran, Robert","Dsida, Terri"},"Eramo, Paula","Needle, Julie",{"O'Connell, Marisa","O'Neil, Christopher"}}};

The nesting is just going to get in the way so let's remove extra braces with Flatten. Incidentally, you need not hesitate to build up deeply nested Lists in a function and then Flatten them all at once, which is very fast. To keep seeing these names as Strings, we Postfix InputForm to display the quotation marks.

In[227]:= teacherNamesFlattened=Flatten@teacherNames;teacherNamesFlattened//InputForm

Out[227]//InputForm=
{"Bazan, Dorothy", "Bleakney, Tom", "Calorio, Marie", "Cardoso, Brian",
"Carpenito, Courtney", "Carton, Jacqueline", "Coffey, Tara", "Conaty, Thomas",
"Davenport, Joseph", "DiSabatino, Jennifer", "Doran, Robert", "Dsida, Terri",
"Eramo, Paula", "Needle, Julie", "O'Connell, Marisa", "O'Neil, Christopher"}

Let's lose the apostrophes. It's as simple as using a replacement Rule sending apostrophe to no character ("" ' "" -> ""). Without InputForm we don't see the quotation marks, but unless we use ToExpression to convert it, a String remains a String forever.

In[228]:= teacherNamesFlattenedCleaned=StringReplace[teacherNamesFlattened,"'"->""]
Out[228]= {Bazan, Dorothy,Bleakney, Tom,Calorio, Marie,Cardoso, Brian,Carpenito, Courtney,Carton, Jacqueline,Coffey, Tara,Conaty, Thomas,Davenport, Joseph,DiSabatino, Jennifer,Doran, Robert,Dsida, Terri,Eramo, Paula,Needle, Julie,OConnell, Marisa,ONeil, Christopher}

To compose usernames, we'll use a standard format, for instance, lastName + firstInitial. We use StringReplace, which takes one or more replacement Rules as arguments. We construct a pattern using deconstructed pieces of the first and last name. Note that named Blank patterns work in this type of String pattern matching.

Since I'm now going to perform several operations on the source List, for clarity, I will use Postfix with Function syntax and avoid deeply nested, hard-to-read code. The rationale behind this syntax is in A New Mathematica Programming Style Part 2..

This code works, but what is going wrong?

In[229]:= userNames=teacherNamesFlattenedCleaned//StringReplace[#,lastName__<>", "<>firstName__:>lastName<>First@Characters@firstName]&

During evaluation of In[229]:= StringJoin::string: String expected at position 1 in lastName__<>, <>firstName__. >>
During evaluation of In[229]:= StringJoin::string: String expected at position 3 in lastName__<>, <>firstName__. >>

Out[229]= {BazanD,BleakneyT,CalorioM,CardosoB,CarpenitoC,CartonJ,CoffeyT,ConatyT,DavenportJ,DiSabatinoJ,DoranR,DsidaT,EramoP,NeedleJ,OConnellM,ONeilC}

The error is due to using StringJoin ("<>") correctly to compose the deconstructed Strings but incorrectly to deconstruct them with patterns. Using the correct operator, StringExpression ("~~") solves the problem. Also let's convert the usernames to all lower case. Since ToLowerCase is Listable, we don't need to Map it over all the usernames.

In[230]:= userNames=teacherNamesFlattenedCleaned//StringReplace[#,lastName__~~", "~~firstName__:>lastName<>First@Characters@firstName]&//ToLowerCase

Out[230]= {bazand,bleakneyt,caloriom,cardosob,carpenitoc,cartonj,coffeyt,conatyt,davenportj,disabatinoj,doranr,dsidat,eramop,needlej,oconnellm,oneilc}

We can just as easily compose usernames with a reversed form, firstInitial + lastName.

In[231]:= userNames2=teacherNamesFlattenedCleaned//StringReplace[#,lastName__~~", "~~firstName__:>First@Characters@firstName<>lastName]&//ToLowerCase

Out[231]= {dbazan,tbleakney,mcalorio,bcardoso,ccarpenito,jcarton,tcoffey,tconaty,jdavenport,jdisabatino,rdoran,tdsida,peramo,jneedle,moconnell,coneil}

To compose email addresses, in a similar fashion, we'll use a template form, firstName.lastName@schoolDomain.org.

In[232]:= emailAddresses=teacherNamesFlattenedCleaned//StringReplace[#,lastName__~~", "~~firstName__:>firstName<>"."<>lastName<>"@hometownschools.org"]&//ToLowerCase

Out[232]= {dorothy.bazan@hometownschools.org,tom.bleakney@hometownschools.org,marie.calorio@hometownschools.org,brian.cardoso@hometownschools.org,courtney.carpenito@hometownschools.org,jacqueline.carton@hometownschools.org,tara.coffey@hometownschools.org,thomas.conaty@hometownschools.org,joseph.davenport@hometownschools.org,jennifer.disabatino@hometownschools.org,robert.doran@hometownschools.org,terri.dsida@hometownschools.org,paula.eramo@hometownschools.org,julie.needle@hometownschools.org,marisa.oconnell@hometownschools.org,christopher.oneil@hometownschools.org}

Now I'd like to make one List with both username and email address. Here is a typical use of Thread, where using the abbreviated operator for List, instead of Thread[List[userNames2,emailAddresses]] makes the code simple to understand.

In[233]:= namesAndEmails=Thread[{userNames2,emailAddresses}]

Out[233]= {{dbazan,dorothy.bazan@hometownschools.org},{tbleakney,tom.bleakney@hometownschools.org},{mcalorio,marie.calorio@hometownschools.org},{bcardoso,brian.cardoso@hometownschools.org},{ccarpenito,courtney.carpenito@hometownschools.org},{jcarton,jacqueline.carton@hometownschools.org},{tcoffey,tara.coffey@hometownschools.org},{tconaty,thomas.conaty@hometownschools.org},{jdavenport,joseph.davenport@hometownschools.org},{jdisabatino,jennifer.disabatino@hometownschools.org},{rdoran,robert.doran@hometownschools.org},{tdsida,terri.dsida@hometownschools.org},{peramo,paula.eramo@hometownschools.org},{jneedle,julie.needle@hometownschools.org},{moconnell,marisa.oconnell@hometownschools.org},{coneil,christopher.oneil@hometownschools.org}}

Finally I'll Export the List to a file in my Documents directory. It is safest to use FileNameJoin to compose a filename, since it will catch syntax errors as well as make sure it's in the correct format for your operating system.

In[234]:= Export[FileNameJoin@{$UserDocumentsDirectory,"HomeTownUserNames.xlsx"}, namesAndEmails]

Out[234]= C:\Users\kwcarlso\Documents\HomeTownUserNames.xlsx

Sunday, March 9, 2014

Example of a File of Functions to Pre-Load at Startup

Here is a portion of my file of functions that I load on startup so they're available in every session. See also Load Functions at Startup Using Initialization Files.

Functions to Pre-Load

String Processing

Cleaners

trimWhiteSpace::usage="trimWhiteSpace is a simple stringtrim function that does a better job than StringTrim since it removes all WhiteSpaceCharacters from beginning, end or middle of Strings. It is Listable and is intended to Map over all fields in all records where all fields are Strings, no matter if they are wrapped in Lists at any level."

Clear@trimWhiteSpace;

SetAttributes[trimWhiteSpace,Listable];

trimWhiteSpace@stringFile_:=StringReplace[stringFile,WhitespaceCharacter->""];

cleanseText::usage="cleanseText is a moderately thorough stringtrim function. The regular expression removes whitespace characters from the beginning and end of the string, including non-breaking spaces. StringReplace removes apostrophes and commas. It is Listable and intended to automatically Map over all fields in all records where all fields are Strings, no matter if they are wrapped in Lists at any level. These rules should be expanded to include more data cleansing functions as needed."

Clear@cleanseText;

SetAttributes[cleanseText,Listable];

cleanseText@text_String:=StringReplace[text,RegularExpression["^(\\s)*|(\\s)*$"]->""]//StringReplace[#,{"'"->"",WhitespaceCharacter->" ","("|")"->""}]&;

Web Page Functions

Make a URL from a String

makeURL::usage="makeURL takes a String to use as the prefix of the URL, cleans it up (removes whitespace and non-breaking spaces (which are often invisible), puts the String in lower case, replaces single spaces with hyphens and finally appends '.html'. Example: 'My Home Page' -> my-home-page.html".

Clear@makeURL;

SetAttributes[makeURL,Listable];

makeURL@urlName_String:=ToLowerCase@cleanseText@urlName//StringReplace[#,Whitespace->"-"]&//StringReplace[#,"---"->"-"]&//FileNameJoin[{#<>".html"}]&;

Load Functions at Startup Using Initialization Files

See also Example of a File of Functions to Pre-Load at Startup.

When you quit the kernel, Mathematica forgets everything that has happened in a session, such as variable assignments and function definitions. But you can use init.m files to automatically load anything from simple functions, to arbitrarily complex programs, or even external program calls that you want it to know or do on startup. When you accumulate functions that you want to always be available, you can save them in a text file or Package and load them on startup.

There are init.m files in a number of locations—the two intended for user use are in your $UserBaseDirectory /Kernel and /FrontEnd folders. Kernel/init.m will run when you start the Kernel while FrontEnd/init.m will run when you first open a Notebook during a Mathematica session. I recommend using the Kernal init.m since the Front End init.m typically has a lot of stuff in it and since you may Quit the Kernel during a session and want to re-load what's in your init.m file when you re-start it.

First, locate your user init.m files. "Environment variables", which are parameters set for your local environment, i.e. your computer and user account, are one category of $commands. $UserBaseDirectory is for your personal use.

In[1]:= $UserBaseDirectory

Out[1]= "C:\\Users\\kwcarlso\\AppData\\Roaming\\Mathematica"

We can search for any init.m file in this Directory. Note the inclusion of Infinity or the search would leave out sub-directories.

In[2]:= initFiles = FileNames["init.m", $UserBaseDirectory, Infinity]

Out[2]= {"C:\\Users\\kwcarlso\\AppData\\Roaming\\Mathematica\\FrontEnd\\init.m", \
"C:\\Users\\kwcarlso\\AppData\\Roaming\\Mathematica\\Kernel\\init.m"}

Let's take a quick look at the contents of the Kernel init.m using FilePrint, which displays the contents of a text file like a text editor without executing any functions. Here is my init.m files after I modified it. You can see it will Print a Message saying hello and telling me its location, then load some functions that I want available in any Session.

In[3]:= FilePrint@Last@initFiles

(** User Mathematica initialization file **)

Print@"Hello World from your kernel init.m file! I'm located in:\n
C:\\Users\\kwcarlso\\AppData\\Roaming\\Mathematica\\Kernel."

<<"C:\\Users\\kwcarlso\\Dropbox\\Mathematica\\Initialization\\Pre-Load Functions.txt"(** User Mathematica initialization file **)

Print@"Hello World from your kernel init.m file! I'm located in:\n
C:\\Users\\kwcarlso\\AppData\\Roaming\\Mathematica\\Kernel."

<<"C:\\Users\\kwcarlso\\Dropbox\\Mathematica\\Initialization\\Pre-Load Functions.txt"

To set this up similarly for yourself:

Locate your Kernel/init.m file as shown above.
Copy and paste its filepath into your file manager (e.g. Windows Explorer) or otherwise navigate there.
Open the init.m file with a text editor (e.g. Notepad).
Add the Print statement with your init.m location.
Add the filepath to your pre-load functions file.

The easiest way to prepare functions for automatic re-use is to save them in a text file. Two other methods are using Initialization Cells or Packages. Functions stored in a text file or a Package are loaded with Get (<<), which does execute them as it reads them in. Initialization Cells (menu: Cell/Cell Properties/Initialization Cell or /Initialization Group) are the most common method used by beginners.

Saving functions in a text file is a good intermediate step for beginners who aren't ready to create Packages. Just take care to start your function names with small letters so you don't accidentally use the name of a built-in function, and Clear your definitions before defining them.

Clear@functionName; functionName[parameter1_,parameter2_,...]:=definition

Appendix: Locations of All Users' init.m Files

You can execute this cell and see where the initialization directories for all users are located:

In[4]:= {$BaseDirectory,$UserBaseDirectory}//TableForm

Out[4]
C:\ProgramData\Mathematica
C:\Users\kwcarlso\AppData\Roaming\Mathematica

There are four possible init.m files to modify, depending on whether you want to initialize on Kernel or Front End startup, and for all users or just the user who is logged in.