20090318 validating e mail address formatting - plembo/onemoretech GitHub Wiki

title: Validating e-mail address formatting
link: https://onemoretech.wordpress.com/2009/03/18/validating-e-mail-address-formatting/
author: lembobro
description:
post_id: 356
created: 2009/03/18 17:52:14
created_gmt: 2009/03/18 17:52:14
comment_status: open
post_name: validating-e-mail-address-formatting
status: publish
post_type: post

Validating e-mail address formatting

Erroneously formatted e-mail addresses are part of life. Most of the time they’re just an annoyance, but sometimes they can actually “gum up the works” as my grandfather used to say.

What a properly formatted e-mail address is supposed to look like is governed by RFC 5322, specifically section 3.4 and its subsections. The definition is more permissive than most would think. While the domain part of an address (the part after the “@” sign) must follow the pattern “domainname” dot (.) “topleveldomaincomponent”, e.g. “example.com”, in 7-bit ASCII, the “local-part” (the part before the “@” sign) can be almost any string of ASCII characters, including quotes. In practice the IETF cautions against the use of quoted strings and other innovations to avoid running afoul of other protocol specifications. The Wikipedia article on E-mail address states the rules in a more positive way:

The “local-part” of an e-mail address can be up to 64 characters and the domain name a maximum of 255 characters. Clients may attempt to use larger objects, but they must be prepared for the server to reject them if they cannot be handled by it.[1]

The local-part of the e-mail address may use any of these ASCII characters:

Uppercase and lowercase English letters (a-z, A-Z)
Digits 0 through 9
Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
Character . provided that it is not the first nor last character, nor may it appear two or more times consecutively.

Additionally, quoted-strings (e.g. "John Doe"@example.com) are permitted, thus allowing characters that would otherwise be prohibited; however, they do not appear in common practice. RFC 5321 also warns that “a host that expects to receive mail SHOULD avoid defining mailboxes where the Local-part requires (or uses) the Quoted-string form”.
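The rules quoted above for an unquoted local-part can be sketched as a small Perl check. This is only an illustration: the helper name is_valid_local_part is made up for this example, quoted-strings are deliberately not handled, and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $local is an acceptable *unquoted*
# local-part under the rules quoted above, 0 otherwise.
sub is_valid_local_part {
    my ($local) = @_;
    return 0 unless defined $local && length $local;
    return 0 if length($local) > 64;                 # 64-character limit
    return 0 if $local =~ /^\./ || $local =~ /\.$/;  # no leading or trailing dot
    return 0 if $local =~ /\.\./;                    # no consecutive dots
    # Letters, digits, the listed punctuation characters, and the dot.
    return $local =~ /\A[A-Za-z0-9!#\$%&'*+\-\/=?^_`{|}~.]+\z/ ? 1 : 0;
}

# Hypothetical sample values, not taken from any real directory.
for my $candidate ("john.doe", "o'brien", ".john", "john..doe", "john doe") {
    printf "%-12s %s\n", $candidate,
        is_valid_local_part($candidate) ? "ok" : "rejected";
}
```

Note that the apostrophe is on the permitted list, so a check written this way accepts it, while a space or a misplaced dot is rejected.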

After some experimentation I found that many regular expressions that purport to validate e-mail address formatting came up short. One of my acid test cases was the string “RD’[email protected]”. The apostrophe (’) caused most of them to flag this as an invalid address. Another problem was that the most popular Perl modules for e-mail validation, Mail::Verify and Email::Valid, either couldn’t handle simple, unescaped lists of e-mail addresses or failed to flag clearly bogus address formats.

I got lucky when I came across the help pages of www.lyris.com, the big e-mail marketing company. All the versions of their List Manager user’s guides contain an FAQ section that includes the important question “How do I validate Email Addresses in my Perl programs?” The core of the sample code provided is this regex:

/[ |\t|\r|\n]*"?([^"]+"?@[^ \t]+\.[^ \t][^ \t]+)[ |\t|\r|\n]*/

(note that the above should all be on one line, no linefeeds or carriage returns, please)
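To see how this pattern behaves next to a stricter one, here is a minimal sketch. It assumes the escape sequences in the Lyris pattern are \t, \r, \n and \. (the backslashes do not survive in every copy of this post), the stricter pattern is only an illustration of the kind that trips over apostrophes, and all sample addresses are hypothetical.

```perl
use strict;
use warnings;

# The Lyris pattern with its escapes written out (an assumption, see above).
# The "@" is escaped here only to rule out any array interpolation.
my $lyris  = qr/[ |\t|\r|\n]*"?([^"]+"?\@[^ \t]+\.[^ \t][^ \t]+)[ |\t|\r|\n]*/;

# A stricter, anchored pattern of the kind that rejects apostrophes.
my $strict = qr/^[\w.-]+\@[\w.-]+\.[a-zA-Z]{2,5}$/;

# Hypothetical samples; single quotes and q{} keep Perl from interpolating "@".
my @samples = (
    'jane.doe@example.com',
    q{o'brien@example.com},
    'jane.doe@localhost',
    'no-at-sign.example.com',
);

for my $addr (@samples) {
    printf "%-26s lyris=%-4s strict=%s\n", $addr,
        ($addr =~ $lyris  ? 'ok' : 'bad'),
        ($addr =~ $strict ? 'ok' : 'bad');
}
```

With these samples the apostrophe address passes the Lyris pattern but fails the stricter one, while the address with no dot after the “@” and the string with no “@” fail both. The Lyris pattern is unanchored and quite permissive, so it is better thought of as a sanity check on format than as full RFC 5322 validation.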

Armed with this, I was able to build a fairly simple script that checks whether the value of the mail attribute in each directory entry passes the test.

Here is the complete script:

#!/usr/bin/perl
# fixemails.pl Searches directory for malformed e-mail address values
# in entry mail attribute and creates LDIF to delete them.
# Created 3/18/09 by P Lembo
# Original regular expression used was:
# /^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,5}$/g
# This detected RD'[email protected]' as an illegal address, when it
# is not per RFC5322. As a result have changed to this:
# /[ |\t|\r|\n]*"?([^"]+"?@[^ \t]+\.[^ \t][^ \t]+)[ |\t|\r|\n]*/
# which does not report that as error.
# Other regexes found during research (all failed to meet the requirements):
# /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/g
# /(\w[-.\w]+@[-.\w]+\.\w{2,4})\W/g
# /^[A-z0-9_-]+[@][A-z0-9_-]+([.][A-z0-9_-]+)+[A-z]{2,4}$/g
use strict;
use warnings;
use Net::LDAP;
use Net::LDAP::Entry;
use Net::LDAP::LDIF;

my $HOME = $ENV{'HOME'};

# Connection settings ($dirHost, $dirUsr, $dirPass) come from the config file.
our ($dirHost, $dirUsr, $dirPass);
require "$HOME/etc/config.conf";

my $port    = "389";
my $inldif  = "$HOME/data/global/fixemails_raw.ldif";
my $outldif = "$HOME/data/global/fixemails_fixed.ldif";
my $logfile = "$HOME/data/logs/fixemails.log";
my $APPS    = "$HOME/bin";
my $basedn  = "dc=example,dc=com";
my $query   = "(mail=*)";
my $attrs   = "cn uid mail";
my $time    = localtime();
my $total    = 0;
my $badcount = 0;

open(my $log, '>', $logfile) or die "Cannot open $logfile: $!";
print $log "$time\t Starting e-mail format check\n";

# Dump every entry that has a mail attribute to an LDIF file.
my $cmd = "$APPS/ldapsearch -x -LLL -h $dirHost -p $port -D \"$dirUsr\" "
        . "-w $dirPass -b \"$basedn\" -s sub \"$query\" $attrs >$inldif";
system($cmd) == 0 or die "ldapsearch failed: $?";

my $ldif = Net::LDAP::LDIF->new($inldif, 'r') or die $!;
open(my $out, '>', $outldif) or die "Cannot open $outldif: $!";

while (not $ldif->eof()) {
    my $entry = $ldif->read_entry();
    if ($ldif->error()) {
        print "\tError! ", $ldif->error(), "\n";
    }
    else {
        my $dn = $entry->dn;
        $total++;
        my $mail = $entry->get_value('mail');

        if ($mail =~
            /[ |\t|\r|\n]*"?([^"]+"?@[^ \t]+\.[^ \t][^ \t]+)[ |\t|\r|\n]*/) {
            print $log "\"$dn\",Goodmail,\"$mail\"\n";
        }
        else {
            # Report the bad value and write an LDIF record that deletes it.
            print "\"$dn\",Badmail,\"$mail\"\n";
            print $log "\"$dn\",Badmail,\"$mail\"\n";
            print $out "dn: $dn\n";
            print $out "changetype: modify\n";
            print $out "delete: mail\n";
            print $out "\n";
            $badcount++;
        }
    }
}

print $log "Total entries: $total\n";
print $log "Total badmail: $badcount\n";
close $out;
$ldif->done;
close $log;
__END__