perldoc -q string

Note

  This page is made up of entries from the standard perl FAQ, which is part
  of every Perl distribution. It was generated with the command:

  perldoc -q string

  See perldoc.perl.org FAQs for the most up to date version 
  of this and all the other standard Perl documentation.

Found in /usr/local/lib/perl5/5.8.8/pod/perlfaq4.pod

How can I take a string and turn it into epoch seconds?

If it's a regular enough string that it always has the same format, you can split it up and pass the parts to "timelocal" in the standard Time::Local module. Otherwise, you should look into the Date::Calc and Date::Manip modules from CPAN.
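
As a minimal sketch of the fixed-format case, assume timestamps that always look like "YYYY-MM-DD HH:MM:SS". Both that format and the helper name string_to_epoch are illustrative assumptions, not part of the FAQ.

```perl
use strict;
use warnings;
use Time::Local;

# Hypothetical helper: turn "YYYY-MM-DD HH:MM:SS" into epoch seconds,
# interpreting the timestamp in the local time zone.
sub string_to_epoch {
    my ($stamp) = @_;
    my ($year, $mon, $mday, $hour, $min, $sec) =
        $stamp =~ /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d)$/
        or return;    # the string did not have the expected format

    # timelocal() expects a zero-based month.
    return timelocal($sec, $min, $hour, $mday, $mon - 1, $year);
}

my $epoch = string_to_epoch("2001-06-15 12:30:45");
print "epoch seconds: $epoch\n";
```

The result depends on the local time zone; use timegm() from the same module if the string is in UTC.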

How do I unescape a string?

It depends on just what you mean by "escape". URL escapes are dealt with in perlfaq9. Shell escapes with the backslash ("\") character are removed with

        s/\\(.)/$1/g;

This won't expand "\n" or "\t" or any other special escapes.
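
A small illustration of that substitution on sample data (the sample string is made up for this sketch). Note how the trailing "\n" simply loses its backslash rather than turning into a newline.

```perl
use strict;
use warnings;

# Single quotes keep the backslashes literal in the sample data.
my $escaped = 'file\ name\*.txt and a literal \n';

# Copy the string, then strip one backslash from each escaped character.
(my $unescaped = $escaped) =~ s/\\(.)/$1/g;

print "$unescaped\n";    # file name*.txt and a literal n
```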

How do I expand function calls in a string?

(contributed by brian d foy)

This is documented in perlref, and although it's not the easiest thing to read, it does work. In each of these examples, we call the function inside the braces used to dereference a reference. If we have more than one return value, we can construct and dereference an anonymous array. In this case, we call the function in list context.

            print "The time values are @{ [localtime] }.\n";

If we want to call the function in scalar context, we have to do a bit more work. We can put any code we like inside the braces, as long as it ends by yielding a scalar reference; how you produce that reference is up to you.

            print "The time is ${\(scalar localtime)}.\n";

            print "The time is ${ my $x = localtime; \$x }.\n";

If your function already returns a reference, you don't need to create the reference yourself.

            sub timestamp { my $t = localtime; \$t }

            print "The time is ${ timestamp() }.\n";

The "Interpolation" module can also do a lot of magic for you. You can specify a variable name, in this case "E", to set up a tied hash that does the interpolation for you. It has several other methods to do this as well.

            use Interpolation E => 'eval';
            print "The time values are $E{localtime()}.\n";

In most cases, it is probably easier to simply use string concatenation, which also forces scalar context.

            print "The time is " . localtime . ".\n";
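
To see which context each technique imposes, here is a small sketch using a hypothetical function, which_context(), that reports how it was called (via wantarray) instead of returning the time.

```perl
use strict;
use warnings;

# Hypothetical function that reports the context it was called in.
sub which_context { return wantarray ? "list" : "scalar" }

# Anonymous array constructor: list context.
my $via_array  = "called in @{[ which_context() ]} context";

# Reference to a scalar, with an explicit scalar(): scalar context.
my $via_scalar = "called in ${\( scalar which_context() )} context";

# Concatenation: scalar context.
my $via_concat = "called in " . which_context() . " context";

print "$_\n" for $via_array, $via_scalar, $via_concat;
```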

How do I reverse a string?

Use reverse() in scalar context, as documented in "reverse" in perlfunc.

        $reversed = reverse $string;
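
The context matters: in list context, reverse() reverses the order of its arguments rather than the characters of a string. A short sketch of the difference (the variable names are illustrative):

```perl
use strict;
use warnings;

my $string = "hello";

# Assignment to a scalar supplies scalar context: characters are reversed.
my $reversed = reverse $string;

# List context reverses a one-element list, which changes nothing.
my @unchanged = reverse $string;

# print() supplies list context, so force scalar context explicitly.
print scalar reverse($string), "\n";    # olleh
```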

How do I expand tabs in a string?

You can do it yourself:

        1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

Or you can just use the Text::Tabs module (part of the standard Perl distribution).

        use Text::Tabs;
        @expanded_lines = expand(@lines_with_tabs);
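
A quick check of what expand() does with the default tab stop of 8, using made-up sample lines:

```perl
use strict;
use warnings;
use Text::Tabs;

my @lines_with_tabs = ("a\tb", "abcdefgh\tc");
my @expanded_lines  = expand(@lines_with_tabs);

# Show each expanded line in brackets so the padding is visible.
print "[$_] (", length($_), " characters)\n" for @expanded_lines;
```

The first line is padded out to column 8 before "b"; in the second, the tab starts exactly on a tab stop, so it expands to a full eight spaces.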

How can I access or change N characters of a string?

You can access the first characters of a string with substr(). To get the first character, for example, start at position 0 and grab the string of length 1.

        $string = "Just another Perl Hacker";
        $first_char = substr( $string, 0, 1 );  #  'J'

To change part of a string, you can use the optional fourth argument which is the replacement string.

        substr( $string, 13, 4, "Perl 5.8.0" );

You can also use substr() as an lvalue.

        substr( $string, 13, 4 ) =  "Perl 5.8.0";
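
Putting these together in one runnable sketch. The negative offset, which counts from the end of the string, is an addition not shown above.

```perl
use strict;
use warnings;

my $string = "Just another Perl Hacker";

my $first_char = substr( $string, 0, 1 );    # first character
my $last_word  = substr( $string, -6 );      # last six characters

# The four-argument form replaces in place and returns what was there.
my $replaced = substr( $string, 13, 4, "Perl 5.8.0" );

print "$first_char / $last_word / $replaced\n";
print "$string\n";
```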

How can I count the number of occurrences of a substring within a string?

There are a number of ways, with varying efficiency. If you want a count of a certain single character (X) within a string, you can use the "tr///" operator like so:

        $string = "ThisXlineXhasXsomeXx'sXinXit";
        $count = ($string =~ tr/X//);
        print "There are $count X characters in the string";

This is fine if you are just looking for a single character. However, if you are trying to count multiple character substrings within a larger string, "tr///" won't work. What you can do is wrap a while() loop around a global pattern match. For example, let's count negative integers:

        $string = "-9 55 48 -2 23 -76 4 14 -44";
        while ($string =~ /-\d+/g) { $count++ }
        print "There are $count negative numbers in the string";

Another version uses a global match in list context, then assigns the result to a scalar, producing a count of the number of matches.

            $count = () = $string =~ /-\d+/g;
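
The same techniques as a self-contained sketch. count_substr() is a hypothetical helper that counts non-overlapping occurrences of a literal substring; \Q...\E keeps any regex metacharacters in the substring from being interpreted.

```perl
use strict;
use warnings;

# Single character: tr/// returns the number of characters it matched.
my $x_string = "ThisXlineXhasXsomeXx'sXinXit";
my $x_count  = ( $x_string =~ tr/X// );

# Pattern: global match in list context, counted via the empty list.
my $numbers   = "-9 55 48 -2 23 -76 4 14 -44";
my $neg_count = () = $numbers =~ /-\d+/g;

# Hypothetical helper for literal, non-overlapping substrings.
sub count_substr {
    my ($haystack, $needle) = @_;
    my $count = () = $haystack =~ /\Q$needle\E/g;
    return $count;
}

print "$x_count X characters, $neg_count negative numbers\n";
print count_substr("abcabcab", "abc"), " occurrences of abc\n";
```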

How can I split a [character] delimited string except when inside [character]?

Several modules can handle this sort of parsing: Text::Balanced, Text::CSV, Text::CSV_XS, and Text::ParseWords, among others.

Take the example case of trying to split a string that is comma-separated into its different fields. You can't use "split(/,/)" because you shouldn't split if the comma is inside quotes. For example, take a data line like this:

        SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"

Due to the restriction of the quotes, this is a fairly complex problem. Thankfully, we have Jeffrey Friedl, author of *Mastering Regular Expressions*, to handle these for us. He suggests (assuming your string is contained in $text):

         @new = ();
         push(@new, $+) while $text =~ m{
             "([^\"\\]*(?:\\.[^\"\\]*)*)",?  # groups the phrase inside the quotes
           | ([^,]+),?
           | ,
         }gx;
         push(@new, undef) if substr($text,-1,1) eq ',';

If you want to represent quotation marks inside a quotation-mark-delimited field, escape them with backslashes (eg, "like \"this\"").

Alternatively, the Text::ParseWords module (part of the standard Perl distribution) lets you say:

        use Text::ParseWords;
        @new = quotewords(",", 0, $text);
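
Applied to the sample data line from above, a runnable sketch looks like this. With a false second argument, quotewords() strips the quotes from each field.

```perl
use strict;
use warnings;
use Text::ParseWords;

my $text = 'SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"';

# Split on commas, but not on commas inside quoted fields.
my @new = quotewords( ",", 0, $text );

printf "%2d: %s\n", $_, $new[$_] for 0 .. $#new;
```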

There's also a Text::CSV (Comma-Separated Values) module on CPAN.

How do I strip blank space from the beginning/end of a string?

(contributed by brian d foy)

A substitution can do this for you. For a single line, you want to replace all the leading or trailing whitespace with nothing. You can do that with a pair of substitutions.

            s/^\s+//;
            s/\s+$//;

You can also write that as a single substitution, although it turns out the combined statement is slower than the separate ones. That might not matter to you, though.

            s/^\s+|\s+$//g;

In this regular expression, the alternation matches either at the beginning or the end of the string since the alternation has a lower precedence than the anchors. With the "/g" flag, the substitution makes all possible matches, so it gets both. Remember, the trailing newline matches the "\s+", and the "$" anchor can match to the physical end of the string, so the newline disappears too. Just add the newline to the output, which has the added benefit of preserving "blank" (consisting entirely of whitespace) lines which the "^\s+" would remove all by itself.

            while( <> )
                    {
                    s/^\s+|\s+$//g;
                    print "$_\n";
                    }

For a multi-line string, you can apply the regular expression to each logical line in the string by adding the "/m" flag (for "multi-line"). With the "/m" flag, the "$" matches *before* an embedded newline, so it doesn't remove it. It still removes the newline at the end of the string.

        $string =~ s/^\s+|\s+$//gm;

Remember that lines consisting entirely of whitespace will disappear, since the first part of the alternation can match the entire string and replace it with nothing. If you need to keep embedded blank lines, you have to do a little more work. Instead of matching any whitespace (since that includes a newline), just match the other whitespace.

            $string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
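
A sketch that wraps the pair of substitutions in a hypothetical trim() function and applies the blank-line-preserving version to a small multi-line string (the sample data is made up):

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy with leading and trailing whitespace removed.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

print "[", trim("  hello world \n"), "]\n";

# Multi-line: strip blanks on each line but keep newlines and empty lines.
my $string = "  one  \n\n  two  \n";
$string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
print $string;
```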

How do I pad a string with blanks or pad a number with zeroes?

In the following examples, $pad_len is the length to which you wish to pad the string, $text or $num contains the string to be padded, and $pad_char contains the padding character. You can use a single character string constant instead of the $pad_char variable if you know what it is in advance. And in the same way you can use an integer in place of $pad_len if you know the pad length in advance.

The simplest method uses the "sprintf" function. It can pad on the left or right with blanks and on the left with zeroes and it will not truncate the result. The "pack" function can only pad strings on the right with blanks and it will truncate the result to a maximum length of $pad_len.

        # Left padding a string with blanks (no truncation):
            $padded = sprintf("%${pad_len}s", $text);
            $padded = sprintf("%*s", $pad_len, $text);  # same thing

        # Right padding a string with blanks (no truncation):
            $padded = sprintf("%-${pad_len}s", $text);
            $padded = sprintf("%-*s", $pad_len, $text); # same thing

        # Left padding a number with 0 (no truncation):
            $padded = sprintf("%0${pad_len}d", $num);
            $padded = sprintf("%0*d", $pad_len, $num); # same thing

        # Right padding a string with blanks using pack (will truncate):
        $padded = pack("A$pad_len",$text);

If you need to pad with a character other than blank or zero you can use one of the following methods. They all generate a pad string with the "x" operator and combine that with $text. These methods do not truncate $text.

Left and right padding with any character, creating a new string:

        $padded = $pad_char x ( $pad_len - length( $text ) ) . $text;
        $padded = $text . $pad_char x ( $pad_len - length( $text ) );

Left and right padding with any character, modifying $text directly:

        substr( $text, 0, 0 ) = $pad_char x ( $pad_len - length( $text ) );
        $text .= $pad_char x ( $pad_len - length( $text ) );
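
For reference, here is a sketch that runs several of these methods on concrete values. The pad length of 6 and the sample strings are arbitrary choices for illustration.

```perl
use strict;
use warnings;

my $pad_len  = 6;
my $pad_char = '*';
my $text     = "abc";
my $num      = 42;

my $left   = sprintf( "%${pad_len}s",  $text );    # blanks on the left
my $right  = sprintf( "%-${pad_len}s", $text );    # blanks on the right
my $zeroes = sprintf( "%0${pad_len}d", $num );     # zeroes on the left

# pack pads on the right, and truncates anything longer than $pad_len.
my $packed    = pack( "A$pad_len", $text );
my $truncated = pack( "A$pad_len", "abcdefgh" );

# Any padding character, built with the x operator.
my $stars = $pad_char x ( $pad_len - length( $text ) ) . $text;

print "[$_]\n" for $left, $right, $zeroes, $packed, $truncated, $stars;
```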

How do I extract selected columns from a string?

Use substr() or unpack(), both documented in perlfunc. If you prefer thinking in terms of columns instead of widths, you can use this kind of thing:

        # determine the unpack format needed to split Linux ps output
        # arguments are cut columns
        my $fmt = cut2fmt(8, 14, 20, 26, 30, 34, 41, 47, 59, 63, 67, 72);

        sub cut2fmt {
            my(@positions) = @_;
            my $template  = '';
            my $lastpos   = 1;
            for my $place (@positions) {
                $template .= "A" . ($place - $lastpos) . " ";
                $lastpos   = $place;
            }
            $template .= "A*";
            return $template;
        }
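
The template that cut2fmt() returns is then handed to unpack(). A small sketch with a made-up ten-character line and cut columns 4 and 8 (cut2fmt is repeated here so the example runs on its own):

```perl
use strict;
use warnings;

# Same function as above: convert cut columns into an unpack template.
sub cut2fmt {
    my (@positions) = @_;
    my $template = '';
    my $lastpos  = 1;
    for my $place (@positions) {
        $template .= "A" . ( $place - $lastpos ) . " ";
        $lastpos = $place;
    }
    $template .= "A*";
    return $template;
}

my $fmt    = cut2fmt( 4, 8 );                 # fields start at columns 1, 4, 8
my @fields = unpack( $fmt, "abcdefghij" );

print "template: $fmt\n";
print "fields:   @fields\n";
```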

How do I find the soundex value of a string?

(contributed by brian d foy)

You can use the Text::Soundex module. If you want to do fuzzy or close matching, you might also try the String::Approx, Text::Metaphone, and Text::DoubleMetaphone modules.

How can I expand variables in text strings?

Let's assume that you have a string that contains placeholder variables.

        $text = 'this has a $foo in it and a $bar';

You can use a substitution with a double evaluation. The first /e turns $1 into $foo, and the second /e turns $foo into its value. You may want to wrap this in an "eval": if you try to get the value of an undeclared variable while running under "use strict", you get a fatal error.

        eval { $text =~ s/(\$\w+)/$1/eeg };
        die if $@;

It's probably better in the general case to treat those variables as entries in some special hash. For example:

        %user_defs = (
            foo  => 23,
            bar  => 19,
        );
        $text =~ s/\$(\w+)/$user_defs{$1}/g;
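
One variation on the hash approach, sketched under the assumption that unknown placeholders should be left untouched rather than replaced with an empty string. The $baz placeholder is added here for illustration.

```perl
use strict;
use warnings;

my %user_defs = (
    foo => 23,
    bar => 19,
);

my $text = 'this has a $foo in it and a $bar and a $baz';

# Replace known placeholders; put unknown ones back as they were.
$text =~ s/\$(\w+)/exists $user_defs{$1} ? $user_defs{$1} : "\$$1"/ge;

print "$text\n";
```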

Found in /usr/local/lib/perl5/5.8.8/pod/perlfaq5.pod

How can I write() into a string?

See "Accessing Formatting Internals" in perlform for an swrite() function.
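
For orientation, a sketch along the lines of the swrite() described there: it feeds a picture line to formline() and returns the contents of the format accumulator $^A. Treat the details as an approximation and check perlform for the authoritative version.

```perl
use strict;
use warnings;
use Carp;

# Format the arguments according to a picture and return the result
# as a string instead of writing it to a filehandle.
sub swrite {
    croak "usage: swrite PICTURE ARGS" unless @_;
    my $format = shift;
    $^A = "";                   # clear the format accumulator
    formline( $format, @_ );    # append the formatted fields to $^A
    return $^A;
}

# "@>>>>" is a five-character, right-justified field.
my $line = swrite( "@>>>>\n", 42 );
print $line;
```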

Found in /usr/local/lib/perl5/5.8.8/pod/perlfaq6.pod

How can I match strings with multibyte characters?

Starting from Perl 5.6 Perl has had some level of multibyte character support. Perl 5.8 or later is recommended. Supported multibyte character repertoires include Unicode, and legacy encodings through the Encode module. See perluniintro, perlunicode, and Encode.

If you are stuck with older Perls, you can do Unicode with the "Unicode::String" module, and character conversions using the "Unicode::Map8" and "Unicode::Map" modules. If you are using Japanese encodings, you might try using the jperl 5.005_03.

Finally, the following set of approaches was offered by Jeffrey Friedl, whose article in issue #5 of The Perl Journal talks about this very matter.

Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase letters encode single Martian letters (i.e. the two bytes "CV" make a single Martian letter, as do the two bytes "SG", "VS", "XX", etc.). Other bytes represent single characters, just like ASCII.

So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character "/GX/". Perl doesn't know about Martian, so it'll find the two bytes "GX" in the "I am CVSGXX!" string, even though that character isn't there: it just looks like it is because "SG" is next to "XX", but there's no real "GX". This is a big problem.

Here are a few ways, all painful, to deal with it:

       $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent "martian"
                                          # bytes are no longer adjacent.
       print "found GX!\n" if $martian =~ /GX/;

Or like this:

       @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
       # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
       #
       foreach $char (@chars) {
           if ($char eq 'GX') {    # 'last' inside print's list would exit before printing
               print "found GX!\n";
               last;
           }
       }

Or like this:

       while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
           if ($1 eq 'GX') {    # 'last' inside print's list would exit before printing
               print "found GX!\n";
               last;
           }
       }

Here's another, slightly less painful, way to do it from Benjamin Goldberg, who uses a zero-width negative look-behind assertion.

            print "found GX!\n" if  $martian =~ m/
                       (?<![A-Z])
                       (?:[A-Z][A-Z])*?
                       GX
                    /x;

This succeeds if the "martian" character GX is in the string, and fails otherwise. If you don't like using (?<!), a zero-width negative look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

It does have the drawback of putting the wrong thing in $-[0] and $+[0], but this usually can be worked around.
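
To make the tokenizing approach concrete, here is a sketch that splits the sample string into Martian characters and then looks for "GX". The function names martian_chars() and has_martian_char() are hypothetical.

```perl
use strict;
use warnings;

# Split a Martian string into characters: a pair of uppercase ASCII
# letters is one character, any other single byte is one character.
sub martian_chars {
    my ($text) = @_;
    return $text =~ /\G([A-Z][A-Z]|.)/gs;
}

# True if the given Martian character occurs in the string.
sub has_martian_char {
    my ($text, $char) = @_;
    return scalar grep { $_ eq $char } martian_chars($text);
}

my @chars = martian_chars("I am CVSGXX!");
print scalar(@chars), " characters\n";

print "no GX in the first string\n" unless has_martian_char("I am CVSGXX!", "GX");
print "found GX in the second\n"    if     has_martian_char("I am CVGXSG!", "GX");
```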

Found in /usr/local/lib/perl5/5.8.8/pod/perlfaq7.pod

Do I always/never have to quote my strings or use semicolons and commas?

Normally, a bareword doesn't need to be quoted, but in most cases probably should be (and must be under "use strict"). But a hash key consisting of a simple word (that isn't the name of a defined subroutine) and the left-hand operand to the "=>" operator both count as though they were quoted:

        This                    is like this
        ------------            ---------------
        $foo{line}              $foo{'line'}
        bar => stuff            'bar' => stuff

The final semicolon in a block is optional, as is the final comma in a list. Good style (see perlstyle) says to put them in except for one-liners:

        if ($whoops) { exit 1 }
        @nums = (1, 2, 3);

        if ($whoops) {
            exit 1;
        }
        @lines = (
            "There Beren came from mountains cold",
            "And lost he wandered under leaves",
        );

Found in /usr/local/lib/perl5/5.8.8/pod/perlfaq9.pod

How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

Many folks attempt a simple-minded regular expression approach, like "s/<.*?>//g", but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like "&lt;" for example.

Here's one "simple-minded" approach, that works for most files:

        #!/usr/bin/perl -p0777
        s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

        <IMG SRC = "foo.gif" ALT = "A > B">

        <IMG SRC = "foo.gif"
             ALT = "A > B">

        <!-- <A comment> -->

        <script>if (a<b && a>c)</script>

        <# Just data #>

        <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

        <!-- This section commented out.
            <B>You can't see me!</B>
        -->
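
To see both the strengths and the limits of the "simple-minded" substitution shown earlier, this sketch wraps it in a hypothetical strip_tags() function and runs it on a few of the cases. It copes with a quoted ">" but leaves debris behind for a comment that contains a tag.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the "simple-minded" substitution.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $plain   = strip_tags('<b>bold</b> text');
my $quoted  = strip_tags('<IMG SRC = "foo.gif" ALT = "A > B">after');
my $comment = strip_tags('<!-- <A comment> -->');

print "[$_]\n" for $plain, $quoted, $comment;
```

The last result still contains " -->", which is why a real parser such as HTML::Parser is the safer choice.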

How do I decode a MIME/BASE64 string?

The MIME-Base64 package (available from CPAN) handles this as well as the MIME/QP encoding. Decoding BASE64 becomes as simple as:

        use MIME::Base64;
        $decoded = decode_base64($encoded);
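
A round-trip sketch with a sample string. encode_base64() takes an optional second argument for the line ending; passing an empty string suppresses the trailing newline it would otherwise append.

```perl
use strict;
use warnings;
use MIME::Base64;

my $original = "Hello, World!";

my $encoded = encode_base64( $original, "" );    # no trailing newline
my $decoded = decode_base64($encoded);

print "$encoded\n";    # SGVsbG8sIFdvcmxkIQ==
print "$decoded\n";    # Hello, World!
```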

The MIME-Tools package (available from CPAN) supports extraction with decoding of BASE64 encoded attachments and content directly from email messages.

If the string to decode is short (less than 84 bytes long) a more direct approach is to use the unpack() function's "u" format after minor transliterations:

        tr#A-Za-z0-9+/##cd;                   # remove non-base64 chars
        tr#A-Za-z0-9+/# -_#;                  # convert to uuencoded format
        $len = pack("c", 32 + 0.75*length);   # compute length byte
        print unpack("u", $len . $_);         # uudecode and print
