
title: Text::ParseWords and apostrophe characters
link: https://onemoretech.wordpress.com/2009/04/07/textparsewords-and-apostrophe-characters/
author: lembobro
description:
post_id: 345
created: 2009/04/07 16:33:42
created_gmt: 2009/04/07 16:33:42
comment_status: open
post_name: textparsewords-and-apostrophe-characters
status: publish
post_type: post

Text::ParseWords and apostrophe characters

I do a lot of CSV (comma-separated values) file processing. A lot. I also work in a global environment. As a result, the hardcoded default treatment of single quotes as quote delimiters in a Perl module like [Text::ParseWords](http://search.cpan.org/~chorny/Text-ParseWords-3.27/) can be a major problem. From a computer’s point of view, of course, single quotes are indistinguishable from apostrophes. Names like “D’ALLESIO” or “L’AVENUE” don’t get processed unless they’re enclosed in double quotes, and when your data is being generated by an underpaid file clerk using MS Excel that’s not going to happen.

As a result, I’ve been forced to hack Text::ParseWords to remove the ability to use single quotes as delimiters. This is acceptable in my environment because it’s standard practice to use only double quotes and commas as delimiters, just as God intended.

With these kinds of changes it’s my practice to store the modified module in a different hierarchy and to rename it, so there’s no confusion between the official code and what I’ve done. In the case of Text::ParseWords, what I’ve implemented is a new Custom::Text::ParseWords for this purpose.

My changes are below (this is based on v3.27 of ParseWords.pm):

sub parse_line {
    my($delimiter, $keep, $line) = @_;
    my($word, @pieces);

    no warnings 'uninitialized';        # we will be testing undef strings

    while (length($line)) {
        # This pattern is optimised to be stack conservative on older perls.
        # Do not refactor without being careful and testing it on very long strings.
        # See Perl bug #42980 for an example of a stack busting input.
        $line =~ s/^
                    (?: 
                        # double quoted string
                        (")                             # $quote
                        ((?>[^\\"]*(?:\\.[^\\"]*)*))"   # $quoted
                    |   # --OR--
                        # single quoted string (disabled)
                        # (')                             # $quote
                        # ((?>[^\\']*(?:\\.[^\\']*)*))'   # $quoted
                        (")                             # $quote (duplicate branch keeps $3,$4 in place)
                        ((?>[^\\"]*(?:\\.[^\\"]*)*))"   # $quoted
                    |   # --OR--
                        # unquoted string
                        (                               # $unquoted 
                            # (?:\\.|[^\\"'])*?
                            (?:\\.|[^\\"])*?
                        )               
                        # followed by
                        (                               # $delim
                            \Z(?!\n)                    # EOL
                        |   # --OR--
                            (?-x:$delimiter)            # delimiter
                        |   # --OR--                    
                            # (?!^)(?=["'])               # a quote
                            (?!^)(?=["])               # a quote
                        )  
                    )//xs or return;            # extended layout                  
        my ($quote, $quoted, $unquoted, $delim) = (($1 ? ($1,$2) : ($3,$4)), $5, $6);


        return() unless( defined($quote) || length($unquoted) || length($delim));

        if ($keep) {
            $quoted = "$quote$quoted$quote";
        }
        else {
            $unquoted =~ s/\\(.)/$1/sg;
            if (defined $quote) {
                $quoted =~ s/\\(.)/$1/sg if ($quote eq '"');
                # $quoted =~ s/\\([\\'])/$1/g if ( $PERL_SINGLE_QUOTE && $quote eq "'");
            }
        }
        $word .= substr($line, 0, 0);   # leave results tainted
        $word .= defined $quote ? $quoted : $unquoted;

        if (length($delim)) {
            push(@pieces, $word);
            push(@pieces, $delim) if ($keep eq 'delimiters');
            undef $word;
        }
        if (!length($line)) {
            push(@pieces, $word);
        }
    }
    return(@pieces);
}

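In my modification I commented out lines 75, 76, 82, 91 and 107, and then inserted lines 77, 78, 83 and 92 (line numbers as they appear in the modified ParseWords.pm). The inserted lines 77 and 78 simply repeat the double-quote branch, so the capture groups keep their original numbering, while lines 83 and 92 drop the single quote from the character classes.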
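To see the behaviour that motivated the change, here is a minimal sketch that runs the stock Text::ParseWords against a name containing an apostrophe. The sample rows are made up for illustration (only the surname comes from the text above), and the script assumes the unmodified module that ships with Perl.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical sample rows: the same record with and without double quotes.
my $bare   = q{D'ALLESIO,Mario};
my $quoted = q{"D'ALLESIO",Mario};

# parse_line($delimiter, $keep, $line); $keep = 0 strips quotes and backslashes.
my @bare_fields   = parse_line(',', 0, $bare);
my @quoted_fields = parse_line(',', 0, $quoted);

# The stock module reads the apostrophe as an opening single quote that is
# never closed, so the first call should come back as an empty list.
printf "bare:   %d field(s)\n", scalar(@bare_fields);
printf "quoted: %d field(s): %s\n",
    scalar(@quoted_fields), join(' | ', @quoted_fields);
```

With the single-quote handling removed as described above, the bare row should instead split on the comma like any other unquoted line.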

Copyright 2004-2019 Phil Lembo