title: Using a Regex to Filter a String Value link: https://onemoretech.wordpress.com/2007/08/20/using-a-regex-to-filter-a-string-value/ author: lembobro description: post_id: 660 created: 2007/08/20 20:57:00 created_gmt: 2007/08/20 20:57:00 comment_status: open post_name: using-a-regex-to-filter-a-string-value status: publish post_type: post

Using a Regex to Filter a String Value

Recently I had to clean up a user ID value. Whitespace, tabs and all kinds of other non-visible characters were getting into the data because of poor validation in the application used for data entry. This prevented us from reading the “letter followed by 6 digits” ID (e.g. “Z123456”) as required. I used a regular expression match to get the job done. Here’s my code (the variable $uid has already received the raw value):

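for ($uid) {
    # Capture a letter followed by six digits; assign only when the
    # match succeeds, so $uid never picks up a stale or undefined $1.
    $uid = $1 if m/([A-Z]\d{6})/i;
}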

Notice the use of the [A-Z] character class for the leading letter and the \d metacharacter for digits, along with the {n} quantifier, which fixes the number of digits at exactly n. The /i modifier tells the regex engine to make a case-insensitive match.
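The same match can be wrapped in a small subroutine that reports a failed match explicitly. This is only a sketch: the name extract_uid and the sample inputs are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first "letter + 6 digits" ID found in
# the raw value, or undef when nothing matches.
sub extract_uid {
    my ($raw) = @_;
    return undef unless defined $raw;
    return $raw =~ /([A-Z]\d{6})/i ? $1 : undef;
}

# Illustrative inputs only: padded, lower-case with a line ending, and junk.
for my $raw ("  Z123456\t", "z123456\r\n", "no id here") {
    my $uid = extract_uid($raw);
    print defined $uid ? "ok: $uid\n" : "no match\n";
}
```

Because the pattern is not anchored, a value such as “Z1234567” would still yield “Z123456”. If that matters for your data, a stricter pattern may be needed.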

To filter out all non-digit characters in a less elegant way, you can use a simple search and replace operation, like this:

for ($string) {
    s/\D//g;    # delete every non-digit character
}

This leaves $string containing only digits. It is the method I now use for filtering out formatting characters (as well as stray whitespace, tabs and other such annoyances) from telephone numbers. In Perl, the \D metacharacter matches any non-digit character, while its little brother, \d, matches only a digit character. The /g modifier tells the regex engine to do a “global” match (it has nothing to do with greediness, which is a property of quantifiers). In a global substitution, the regex engine doesn’t stop with the first matched character, but keeps going to the end of the string until it has replaced every matching character.
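To see what /g changes, here is a small comparison of the same substitution with and without it. The phone number is a made-up sample value.

```perl
use strict;
use warnings;

my $phone = "(212) 555-0100";    # sample value, for illustration only

# Copy first, then substitute on the copy, so $phone stays untouched.
(my $first_only = $phone) =~ s/\D//;     # no /g: removes only the leading "("
(my $digits     = $phone) =~ s/\D//g;    # /g: removes every non-digit

print "$first_only\n";    # 212) 555-0100
print "$digits\n";        # 2125550100
```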

Additional Note: Just today I found I needed a way to efficiently filter out all but alphanumeric values in an attribute. This was easy, thanks to Perl’s \W metacharacter. Here’s how I used it:

for ($uid) {
    s/\W//g;    # delete every non-word character
}

The result is that anything but a letter, a digit or an underscore gets nuked (\W is the complement of \w, which for ASCII data is [A-Za-z0-9_]). Which is A Good Thing ™ when you’re talking about user ID values.
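One detail worth checking against your own data: \w counts the underscore as a word character, so s/\W//g leaves underscores in place. If underscores should go too, an explicit negated character class is one option. A minimal sketch, using a made-up sample value:

```perl
use strict;
use warnings;

my $raw = " z_123-456\t";    # sample value, for illustration only

(my $word_chars = $raw) =~ s/\W//g;              # keeps letters, digits and "_"
(my $alnum_only = $raw) =~ s/[^A-Za-z0-9]//g;    # keeps ASCII letters and digits only

print "$word_chars\n";    # z_123456
print "$alnum_only\n";    # z123456
```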