Friday, December 11, 2015

String Processing: Validate Email Address Format, Part 1

Here are three approaches to writing a predicate that checks whether an email address contains an "@" sign and a period ".". A true email address validator is much more complicated, so these functions merely illustrate some String processing techniques. That said, I did use the first approach to validate thousands of email addresses and flush out bad ones.

The first predicate uses StringMatchQ and the essential idea is to give StringMatchQ a String pattern representing:
  1. An "@" wildcard for at least one but an unknown number of characters
  2. Followed by the "@" sign as used in an email address to separate localpart from domain
  3. Followed by the wildcard again
  4. Followed by the period "." used to separate the mail server name from the top-level domain
  5. Followed by the wildcard
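Note that an asterisk wildcard ("*") would not be suitable: it matches zero or more characters, whereas each position in the pattern must have at least one character. Also be aware that the "@" wildcard does not match uppercase letters, so this pattern assumes lowercase addresses.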
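
A quick check of how the two StringMatchQ wildcards behave may make the choice clearer. This is only a sketch with made-up test strings; the outputs in the comments are what I would expect from the documented behavior of "@" and "*".

StringMatchQ["abc", "@"]   (* True: "@" matches one or more characters *)
StringMatchQ["", "@"]      (* False: "@" needs at least one character *)
StringMatchQ["", "*"]      (* True: "*" also matches the empty string *)
StringMatchQ["Abc", "@"]   (* False: "@" does not match uppercase letters *)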

The first variation illustrates the use of Verbatim, which tells Mathematica to treat the middle "@" sign as a literal "@" sign (read it "verbatim") and not as a pattern wildcard.

Clear@emailAddressQ1;emailAddressQ1@aString_String:=StringMatchQ[aString,"@"~~Verbatim@"@"~~"@.@"]
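
To see what Verbatim changes, it may help to compare the wildcard and the literal on single-character strings. A minimal sketch, with test strings chosen only for illustration:

StringMatchQ["@", Verbatim["@"]]   (* True: the string is the literal "@" character *)
StringMatchQ["a", Verbatim["@"]]   (* False: "a" is not the "@" character *)
StringMatchQ["a", "@"]             (* True: here "@" acts as the wildcard *)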

The second variation of StringMatchQ does the same thing but more concisely, using a double backslash. Inside a Mathematica string literal, "\\" stands for a single backslash character, so StringMatchQ receives \@, which tells it to treat the "@" sign as a literal "@" sign and not as a wildcard.

Clear@emailAddressQ1a;emailAddressQ1a@aString_String:=StringMatchQ[aString,"@\\@@.@"]
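
Since the "@" wildcard skips uppercase letters, a pattern built from it would reject an address such as "Abc@Example.com". One possible workaround, sketched here under the assumption that any character should be allowed in each position, is to use the general pattern __ (BlankSequence, one or more characters) instead. The name emailAddressQ1b and the sample strings are mine, for illustration only.

Clear@emailAddressQ1b;
(* __ matches one or more characters of any kind; Verbatim keeps "@" literal *)
emailAddressQ1b[aString_String] := StringMatchQ[aString, __ ~~ Verbatim["@"] ~~ __ ~~ "." ~~ __]

emailAddressQ1b /@ {"Abc@Example.com", "abc@dwxy.anything", "@.", "abcdwxy.anything"}
(* {True, True, False, False} *)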

The second basic approach uses StringContainsQ to detect the presence of the "@" sign and period "." in the address. Unlike the first approach, it requires neither additional characters around them nor a particular order, so it is not as good a validator.

Clear@emailAddressQ2;emailAddressQ2@aString_String:=StringContainsQ[aString,"@"]&&StringContainsQ[aString,"."]
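
The limitation is easy to demonstrate. The sketch below repeats the same test under a hypothetical name, containsBothQ, so that it runs on its own, and feeds it two strings that are clearly not email addresses.

Clear@containsBothQ;
(* hypothetical stand-in for the second approach: only checks that both characters occur somewhere *)
containsBothQ[aString_String] := StringContainsQ[aString, "@"] && StringContainsQ[aString, "."]

containsBothQ /@ {".@", "a.b@c"}
(* {True, True}: position and surrounding characters are ignored *)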

The third approach uses StringFreeQ to check that the address is missing neither the "@" sign nor the period ".", combined with some Boolean logic. It suffers the same limitation as the second approach.

Clear@emailAddressQ3;emailAddressQ3@aString_String:=Not[StringFreeQ[aString,"@"]&&StringFreeQ[aString,"."]]

Test Cases


The test cases are 1) valid email format, 2) missing "@" sign, 3) missing "." and 4) missing both "@" and ".".

testCases={"abc@dwxy.anything","abcdwxy.anything","abc@dwxyanything","abcdwxyanything"};

emailAddressQ1/@testCases

{True,False,False,False}

emailAddressQ1a/@testCases

{True,False,False,False}

emailAddressQ2/@testCases

{True,False,False,False}

emailAddressQ3/@testCases

{True,True,True,False}

Aha! The test cases flushed out a logic error. The way the emailAddressQ3 function is written, with the AND (&&) conjunction between the two clauses, AND returns TRUE only if both StringFreeQ tests are TRUE, that is, only if both "@" and "." are missing. Only then does NOT negate the result into FALSE, so any string containing at least one of the two characters passes. We need to change the AND to OR so that if either "@" or "." is missing, the OR returns TRUE and is negated by NOT into FALSE.

Clear@emailAddressQ4;emailAddressQ4@aString_String:=Not[StringFreeQ[aString,"@"]||StringFreeQ[aString,"."]]

emailAddressQ4/@testCases

{True,False,False,False}
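
The difference between the two versions can be tabulated directly. In this sketch, a and b stand for the results of the two StringFreeQ tests (TRUE meaning the character is missing), and each row lists {a, b, Not[a && b], Not[a || b]}.

(* enumerate all four combinations of the two StringFreeQ results *)
{#1, #2, Not[#1 && #2], Not[#1 || #2]} & @@@ Tuples[{True, False}, 2]
(* {{True,True,False,False},{True,False,True,False},{False,True,True,False},{False,False,True,True}} *)

The third column corresponds to emailAddressQ3 and rejects only the case where both characters are missing. The fourth corresponds to emailAddressQ4 and accepts only the case where neither is missing.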
