So What's A $#!%% Regular Expression, Anyway?!
    
    By Vikram Vaswani and Harish Kamath
    
    April 12, 2000
    
    
    Printed from DevShed.com
    
    URL: http://www.devshed.com/Server_Side/Administration/RegExp
    
    
    Introduction
    
    Ask any relatively-experienced *NIX user to list his top ten
    favorite things about the operating system, and you're almost
    certain to hear him mutter, somewhere between "99% uptime" and
    "remote system reboots", the phrase "regular expressions".
    
    
    Ask any relatively-experienced *NIX user to list the ten things he
    hates most about the operating system, and somewhere between
    "zombie processes" and "installation", he's sure to spit out the
    phrase "regular expressions".
    
    
    It's precisely this complex love-hate equation that spawned the
    idea for a tutorial on regular expressions - surely, went the
    reasoning, something that induced such strong emotions in normally
    hard-headed *NIX administrators was worthy of investigation. And
    so, regardless of whether you're new to regular expressions, or an
    old hand at putting them together, the next few pages should help
    you resolve your conflicted feelings on the subject. Hey - it *is*
    cheaper than therapy...
    
    
    And First There Was Love...
    
    Regular expressions, also known as "regex" by the geek community,
    are a powerful tool used in pattern-matching and substitution. They
    are commonly associated with almost all *NIX-based tools, including
    editors like vi, scripting languages like Perl and PHP, and shell
    programs like awk and sed. You'll even find them in client-side
    scripting languages like JavaScript - kinda like Madonna, their
    popularity cuts across languages and territorial boundaries...
    
    
    A regular expression lets you build patterns using a set of special
    characters; these patterns can then be compared with text in a
    file, data entered into an application, or input from a form filled
    up by users on a Web site. Depending on whether or not there's a
    match, appropriate action can be taken, and appropriate program
    code executed.
    
    
    For example, one of the most common applications of regular
    expressions is to check whether or not a user's email address, as
    entered into an online form, is in the correct format; if it is,
    the form is processed, whereas if it's not, a warning message pops
    up asking the user to correct the error. Regular expressions thus
    play an important role in the decision-making routines of Web
    applications - although, as you'll see, they can also be used to
    great effect in complex find-and-replace operations.
    
    
    A regular expression usually looks something like this:
    
      
      
/love/
      
    
    All this does is match the pattern "love" in the
    text it's applied to. Like many other things in life, it's simpler
    to get your mind around the pattern than the concept - but then,
    that's neither here nor there...
    
    
    How about something a little more complex? Try this:
    
      
      
/fo+/
      
    
    This would match the words "fool", "footsie" and
    "four-seater". And although it's a pretty silly example, you have
    to admit that there's truth to it - after all, who but fools in
    love would play footsie in a four-seater?
    
    
    The "+" that you see above is the first of what are called
    "meta-characters" - these are characters that have a special
    meaning when used within a pattern. The "+" meta-character is used
    to match one or more occurrences of the preceding character - in the
    example above, the letter "f" followed by one or more occurrences of
    the letter "o".
    
    
    Similar to the "+" meta-character, we have "*" and "?" - these are
    used to match zero or more occurrences of the preceding character,
    and zero or one occurrence of the preceding character,
    respectively. So,
    
      
      
/eg*/
      
    
    would match "easy", "egocentric" and "egg"
    
    
    while
    
      
      
/Wil?/
      
    
    would match "Winnie", "Wimpy", "Wilson" and
    "William", though not "Wendy" or "Wolf".
    
    
    In case all this seems a little too imprecise, you can also specify
    a range for the number of matches. For example, the regular
    expression
    
      
      
/jim{2,6}/
      
    
    would match "jimmy" and "jimmmmmy!", but not
    "jim". The numbers in the curly braces represent the lower and
    upper values of the range to match; you can leave out the upper
    limit for an open-ended range match.
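    
    
    To see these quantifiers in action, here's a minimal JavaScript
    sketch that runs each pattern against a handful of sample words.
    The matches() helper and the word lists are made up for this
    illustration; only test() and filter() are standard JavaScript.

```javascript
// Hypothetical helper: keep only the words in which the pattern is found.
function matches(pattern, words) {
  return words.filter(function (word) {
    return pattern.test(word);
  });
}

// "+" : one or more of the preceding character
const plus = matches(/fo+/, ["fool", "footsie", "four-seater", "fly"]);

// "*" : zero or more of the preceding character
const star = matches(/eg*/, ["easy", "egocentric", "egg", "by"]);

// "?" : zero or one of the preceding character
const optional = matches(/Wil?/, ["Winnie", "Wilson", "William", "Wendy", "Wolf"]);

// "{2,6}" : between two and six of the preceding character
const range = matches(/jim{2,6}/, ["jim", "jimmy", "jimmmmmy!"]);

// "{2,}" : two or more, with no upper limit
const openEnded = matches(/jim{2,}/, ["jim", "jimmmmmmmmy"]);

console.log(plus);      // fool, footsie, four-seater
console.log(star);      // easy, egocentric, egg
console.log(optional);  // Winnie, Wilson, William
console.log(range);     // jimmy, jimmmmmy!
console.log(openEnded); // jimmmmmmmmy
```

    Note that each quantifier applies only to the single character
    just before it - in /jim{2,6}/, it's the "m" that repeats, not
    the whole of "jim".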
    
    
    Of Carrots, Bombshells And Four-Figure Incomes
    
    Now that you've got the basics down, how about taking it to the
    next level? It's also possible to search for white space, numbers
    and alphabetic characters with a regular expression - and here's
    the merry gang of meta-characters that will help you do just that:
    
    
    \s = used to match a single white space character, including tabs
    and newline characters
    
    
    \S = used to match everything that is *not* a white space character
    
    
    \d = used to match a single digit from 0 to 9
    
    
    \w = used to match letters, numbers and underscores
    
    
    \W = used to match anything that does not match with \w
    
    
    . = used to match everything except the newline character
    
    
    Now, you're probably thinking, "Hey, that's great - but what does
    it all mean?!". Well, suppose you wanted to find all the white
    space in a document...
    
      
      
/\s+/
      
    
    Easy, isn't it? If you're looking only for
    numbers, try
    
      
      
/\d/
      
    
    So, if you had a complex financial spreadsheet in
    front of you, and you wanted to quickly find round-thousand amounts
    like "5000" or "12000" (for any figure of four or more digits,
    /\d{4,}/ is the better bet), you could use
    
      
      
/\d000/
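    
    
    Here's a rough JavaScript sketch of these meta-characters at work
    on a made-up scrap of spreadsheet text. The sample strings are
    invented for the example, and the "g" flag on each pattern simply
    asks match() to return every match rather than just the first.

```javascript
// Made-up sample text: two spaces, a tab and a newline are hidden in here.
const line = "Rent:  1500\tBonus: 5000\nTotal: 12000";

// \s+ finds each run of white space (spaces, tabs, newlines).
const gaps = line.match(/\s+/g);

// \d+ gathers each run of digits into one number.
const amounts = line.match(/\d+/g);

// \d000 needs one digit followed by three literal zeros, so "1500"
// is skipped and only the "2000" inside "12000" is picked up.
const roundThousands = line.match(/\d000/g);

// \d{4,} accepts any run of four or more digits.
const fourDigitsOrMore = line.match(/\d{4,}/g);

// \w+ collects runs of letters, numbers and underscores;
// \W picks out the single characters that are left over.
const words = "user_name 42!".match(/\w+/g);
const nonWords = "user_name 42!".match(/\W/g);

console.log(gaps.length);      // 5
console.log(amounts);          // 1500, 5000, 12000
console.log(roundThousands);   // 5000, 2000
console.log(fourDigitsOrMore); // 1500, 5000, 12000
console.log(words);            // user_name, 42
console.log(nonWords);         // a space and "!"
```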
      
    
    How about limiting your search to the beginning or
    end of a string? Well, that's why we have "pattern anchors" - these
    simply tie your regular expression to either the first or last
    character of the string, and come in very useful when you're
    looking for a way to filter through a mass of matches.
    
    
    There are two basic pattern anchors - the first one is represented
    by a caret [^], and is used to indicate that the expression should
    be matched only at the beginning of the string that it is applied
    to. For example, the expression
    
      
      
/^hell/
      
    
    will return a match only if the string it's applied to
    begins with "hell" - "hello" and "hellhound", but not "shell".
    
    
    And similarly, to match the end of a string, there's the "$"
    pattern anchor. So
    
      
      
/ar$/
      
    
    would match "scar", "car" and "bar", though not
    "art", "army" or "arrow".
    
    
    There's also a way to anchor your expression to the edges of
    individual words - the \b meta-character. This is used to check
    that the regex matches at a word boundary (the point where a \w
    character meets a \W character or the edge of the string), and it
    can be placed either at the beginning or end of the pattern to be
    matched - like this:
    
      
      
/\bbom/
      
    
    This would match both "bombay" and "bombshell",
    while
    
      
      
/man\b/
      
    
    would match "human", "woman" and "man", though not
    "manitou" or "mannequin". And the converse of this is \B, which
    matches everywhere but at a word boundary.
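    
    
    The difference between the anchors is easiest to see side by
    side. The following JavaScript sketch (the variable names and test
    strings are invented for the example) checks each one with test().

```javascript
const startsWithHell = /^hell/;  // "^" ties the match to the start of the string
const endsWithAr = /ar$/;        // "$" ties the match to the end of the string
const wordStartsBom = /\bbom/;   // "\b" ties the match to the edge of a word
const wordEndsMan = /man\b/;
const manInsideWord = /man\B/;   // "\B" insists there is NO word edge here

console.log(startsWithHell.test("hello"));         // true
console.log(startsWithHell.test("shell"));         // false

console.log(endsWithAr.test("scar"));              // true
console.log(endsWithAr.test("army"));              // false

// Unlike "^", "\b" is happy with a word in the middle of the string.
console.log(wordStartsBom.test("a blonde bombshell")); // true
console.log(/^bom/.test("a blonde bombshell"));        // false

console.log(wordEndsMan.test("woman"));            // true
console.log(wordEndsMan.test("mannequin"));        // false

console.log(manInsideWord.test("mannequin"));      // true
console.log(manInsideWord.test("woman"));          // false
```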
    
    
    Ranging Far And Wide...
    
    Just as you can specify a range for the number of characters to be
    matched, you can also specify a range of characters. For example,
    the range
    
      
      
/[A-Z]/
      
    
    would match any single upper-case
    alphabetic character, while
    
      
      
/[a-z]/
      
    
    would match any single lowercase letter, and
    
      
      
/[0-9]/
      
    
    would match any single number between 0 and 9.
    
    
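    Using these three ranges, it's pretty easy to create a regular
    expression to match a field made up of letters and numbers.
    
      
      
/([a-z][A-Z][0-9])+/
      
    
    would match a string containing one or more groups of a lowercase
    letter, an uppercase letter and a number, in that order, like
    "aB0" - although not "abc". (To accept any mix of letters and
    numbers, put all three ranges inside a single pair of brackets and
    anchor both ends, as in /^[a-zA-Z0-9]+$/.) Note the parentheses
    around the patterns - contrary to what you might think, these are
    not there purely to confuse you; they come in handy when grouping
    sections of a regular expression together.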
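    
    
    A quick JavaScript sketch shows how the grouped ranges behave, and
    how they differ from a single combined range. The second pattern,
    with all three ranges in one pair of brackets and an anchor at each
    end, is an illustrative variation rather than part of the example
    above.

```javascript
// One or more groups of: lowercase letter, uppercase letter, number.
const triplets = /([a-z][A-Z][0-9])+/;

// Illustrative variation: any mix of letters and numbers, start to finish.
const alphanumeric = /^[a-zA-Z0-9]+$/;

console.log(triplets.test("aB0"));         // true
console.log(triplets.test("abc"));         // false
console.log(triplets.test("abc123"));      // false - wrong order

console.log(alphanumeric.test("abc123"));  // true
console.log(alphanumeric.test("abc 123")); // false - the space is not allowed

// The parentheses also "capture" what the group matched; when a
// group repeats, it remembers the last repetition.
const found = "aB0cD1".match(triplets);
console.log(found[0]); // aB0cD1 (the whole match)
console.log(found[1]); // cD1    (the last group matched)
```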
    
    
    Choice is very important when building regular expressions - as in
    most other languages, it's possible to use the pipe [|] operator to
    indicate multiple options in a regex. For example,
    
      
      
/to|too|2/
      
    
    would match any one of the three strings "to",
    "too" and "2". As you can imagine, this comes in pretty useful when
    building expressions that have many possible variants.
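    
    
    One wrinkle worth seeing in code: the options are tried from left
    to right, so the order matters. In the JavaScript sketch below, the
    anchored and reordered patterns are illustrative variations on the
    example above.

```javascript
const loose = /to|too|2/;

// Any one of the three options is enough for a match.
console.log(loose.test("me 2"));   // true
console.log(loose.test("three"));  // false

// Options are tried left to right, so "to" wins before "too" gets a look in.
console.log("too".match(loose)[0]);        // to
console.log("too".match(/too|to|2/)[0]);   // too

// Grouping the options and anchoring both ends accepts the three
// strings and nothing else.
const strict = /^(to|too|2)$/;
console.log(strict.test("too"));   // true
console.log(strict.test("tool"));  // false
```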
    
    
    You can also invert the regular sense of a regular expression with
    the negation operator, represented by a caret [^] - so the
    pattern
    
      
      
/[^A-C]/
      
    
    would match everything but that which appears in
    the expression - namely, everything except the letters "A", "B" and
    "C". Note how the caret, when used in a bracketed expression, is
    used to invert the match; this behaviour is different from when it
    is used outside a bracketed expression, where it serves as a
    pattern anchor.
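    
    
    Here's a small JavaScript sketch of the two faces of the caret.
    The test strings are invented for the example.

```javascript
// Inside brackets, the caret inverts the range: any character except A, B or C.
const notAtoC = /[^A-C]/;

console.log(notAtoC.test("ABC"));   // false - nothing but A, B and C here
console.log(notAtoC.test("ABCD"));  // true  - the "D" qualifies
console.log(notAtoC.test("abc"));   // true  - lowercase letters are not A-C

// With the "g" flag, replace() strips out everything the negated range matches.
const onlyAtoC = "CAB 123".replace(/[^A-C]/g, "");
console.log(onlyAtoC);              // CAB

// Both uses together: the first caret anchors, the second negates.
const startsWithOther = /^[^A-C]/;
console.log(startsWithOther.test("Dog")); // true
console.log(startsWithOther.test("Cat")); // false
```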
    
    
    And finally, one important thing to remember - should you decide,
    for reasons best known to you and your mental health specialist, to
    add any of the meta-characters described above to your pattern and
    explicitly match them, you need to "escape" them with a backslash
    [\]. So, the pattern
    
      
      
/Th\*/
      
    
    would match "Th*" but not "The" - the \* ensures
    that the asterisk is matched as a literal character, not a
    meta-character.
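    
    
    The JavaScript sketch below compares the escaped and unescaped
    patterns. It also includes escapeRegExp(), a hypothetical helper
    (not a built-in) that escapes every meta-character in a piece of
    text, which is one way to go about building a pattern from text
    you don't control.

```javascript
// Escaped: the asterisk is matched literally.
console.log(/Th\*/.test("Th*"));  // true
console.log(/Th\*/.test("The"));  // false

// Unescaped: "h*" means zero or more "h", so "The" matches as well.
console.log(/Th*/.test("The"));   // true

// Hypothetical helper: put a backslash in front of every meta-character.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const raw = new RegExp("1+1=2");                // "+" acts as a quantifier here
const safe = new RegExp(escapeRegExp("1+1=2")); // "+" is now a literal plus sign

console.log(raw.test("1+1=2"));   // false
console.log(safe.test("1+1=2"));  // true
```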
    
    
    How To Say "Ummmm...." In Three Different Languages
    
    Now that we've got all that out of the way, let's take a closer
    look at some examples of how regular expressions are used in Perl,
    PHP and JavaScript. In Perl, for example, you can perform some
    pretty advanced pattern matching using both the rules you've
    already learnt, and some Perl-specific additions.
    
    
    A pattern-matching command in Perl usually looks like this:
    
      
      
operator / regular-expression / string-to-replace / modifiers
      
    
    Let's take a closer look at each of these
    components.
    
    
    The operator can either be an "m" or an "s", depending on the
    purpose of the regular expression - "m" is used for "match"
    operations only, while "s" is used for "substitution" operations.
    
    
    The regular expression is the pattern that is to be matched. This
    pattern can be constructed using a variety of characters,
    meta-characters and pattern anchors.
    
    
    The string to replace is...well, the string that replaces whatever
    the pattern matched in a find-and-replace operation (it's left out
    of a plain "m" match). Yeah, every once in a while, we slip
    you an easy one.
    
    
    Finally, the modifiers are used to control the manner in which a
    particular regex is applied. There are a whole bunch of modifiers,
    some of them with pretty exotic names; unfortunately, none of them
    are single, or interested in going out to dinner with you.
    
    
    So, the statement
    
      
      
s/love/lust/
      
    
    would replace the first occurrence of the word
    "love" with "lust". And if you wanted to perform a global
    search-and-replace operation, you'd use the "g" modifier, like
    this
    
      
      
s/love/lust/g
      
    
    And they say romance is dead!
    
    
    You can also use case-insensitive pattern matching - simply add the
    "i" modifier, as in the following example, and watch in awe as Perl
    matches "jewel", "Jewel" and "JEWEL".
    
      
      
m/JewEL/i
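    
    
    The "g" and "i" modifiers work the same way in JavaScript regex
    literals, with the string replace() method standing in for Perl's
    "s" operator. Here's a minimal sketch, with sample sentences
    invented for the purpose:

```javascript
const vow = "love me, love my dog";

// Without a modifier, only the first occurrence is replaced.
const once = vow.replace(/love/, "lust");

// With "g", every occurrence is replaced.
const everywhere = vow.replace(/love/g, "lust");

console.log(once);        // lust me, love my dog
console.log(everywhere);  // lust me, lust my dog

// With "i", upper and lower case are treated alike.
console.log(/JewEL/i.test("jewel"));  // true
console.log(/JewEL/i.test("JEWEL"));  // true
console.log(/JewEL/.test("jewel"));   // false - case matters without "i"
```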
      
    
    In Perl, a regular expression is applied to a variable
    via the binding operator, represented by =~; this is
    used as follows.
    
      
      
$flag =~ m/abc/
      
    
    returns true if $flag contains "abc", while
    
      
      
$flag =~ s/abc/ABC/
      
    
    replaces the first "abc" in the variable $flag with "ABC".
    
    
    And here's an example of a simple Perl program which asks for your
    email address, and compares it with a regex to verify whether or
    not it's in the correct format.
    
      
      
#!/usr/bin/perl
# get input
print "So what's your email address, anyway?\n";
$email = <STDIN>;
chomp($email);
# match and display result
if($email =~ /^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+/)
        {
     print("Ummmmm....that sounds good!\n");
        }
else
        {
     print("Hey - who do you think you're kidding?\n");
        }
      
    
    As you can see, the most important part of this
    program is the regular expression - it's been dissected
    below:
    
      
      
^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+
      
    
    The first part
    
      
      
^([a-zA-Z0-9_-])
      
    
    matches a single character of the username part of the email
    address - a letter, a number, an underscore or a hyphen; the "+"
    that follows it in the full expression allows one or more of these
    characters.
    
    
    This is followed by an @ symbol, which is followed by the domain
    part of the address; this could again include letters or numbers,
    and uses a period as a delimiter - note our usage of an "escaped"
    period and the "+" meta-character to represent these conditions in
    the second half of the expression
    
      
      
([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+
      
    
    Obviously, this is simply an illustrative example
    - if you're planning to use it on your Web site, you need to refine
    it a bit. For example, the script above won't accept email
    addresses of the form firstname.lastname@somedomain.com - although
    such addresses are also pretty common on the Web. And since the
    pattern isn't anchored at the end with "$", it will happily accept
    an address followed by trailing junk. You have been warned!
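    
    
    To make those limits concrete, here's a JavaScript sketch that
    runs the pattern against a few made-up addresses, next to one
    possible refinement. The "stricter" pattern is an assumption for
    illustration only - it allows periods in the username and anchors
    the end of the string, but it is still nowhere near a complete
    email validator.

```javascript
// The pattern from the example above, unchanged.
const original = /^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+/;

// Illustrative refinement: "." allowed in the username, whole labels
// matched after each period, and "$" to rule out trailing junk.
const stricter = /^[a-zA-Z0-9_.-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$/;

const samples = [
  "john@example.com",
  "firstname.lastname@somedomain.com",
  "john@example",
  "john@example.c!!!"
];

// Collect one true/false result per sample for each pattern.
const originalResults = samples.map(function (s) { return original.test(s); });
const stricterResults = samples.map(function (s) { return stricter.test(s); });

console.log(originalResults); // true, false, false, true
console.log(stricterResults); // true, true, false, false
```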
    
    
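    If you prefer PHP to Perl, you can use the ereg() function for
    pattern matching operations (ereg() was removed in PHP 7; on
    current versions, preg_match() with a delimited pattern does the
    same job). This usually takes the
    format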
    
      
      
ereg(pattern, string)
      
    
    where "pattern" is the pattern to be matched, and
    "string" is the character string to be searched for the pattern.
    The next example should illustrate this a little more
    clearly:
    
      
      
<?php
// $email is assumed to hold the address submitted from the form
if (ereg("^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+",$email))
{
        echo "Ummmmm....that sounds good!";
}
else
{
        echo "Hey - who do you think you're kidding?";
}
?>
      
    
    And finally, JavaScript. JavaScript 1.2 comes with
    a powerful RegExp() object, which can be used to match patterns in
    strings and variables. The important thing here is the test()
    method, which searches for a pattern in a string or variable, and
    returns either true or false - it's illustrated in the example
    below.
    
      
      
<html>
<head>
<script language="Javascript1.2">
<!-- start hiding
function verifyAddress(obj)
{
// obtain form value into variable
        var email = obj.email.value;
// define regex
        var pattern = /^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+/;
// test for pattern
        var flag = pattern.test(email);
        if(flag)
        {
                alert("Ummmmm....that sounds good!");
                return true;
        }
        else
        {
                alert("Hey - who do you think you're kidding?");
                return false;
        }
}
// stop hiding -->
</script>
</head>
<body>
<form onSubmit="return verifyAddress(this);">
<input name="email" type="text">
<input type="submit">
</form>
</body>
</html>
      
    
    Obviously, there's a whole lot more that you can
    do with regular expressions - checking email addresses is just the
    tip of the iceberg. You can use regular expressions to validate
    phone numbers, currency figures, Web site URLs, and a whole lot
    more - all you need is a little bit of creativity and patience, a
    few slices of leftover pizza...and a therapist who cares.
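    
    
    By way of a parting sketch, here's how two such checks might look
    in JavaScript. Both formats are assumptions picked purely for
    illustration - a phone number written as 555-867-5309, and a
    dollar amount with comma separators and optional cents - so adjust
    the patterns to whatever your own data actually looks like.

```javascript
// Assumed format: three digits, three digits, four digits, hyphen-separated.
const phone = /^\d{3}-\d{3}-\d{4}$/;

// Assumed format: "$", one to three digits, any number of ",ddd" groups,
// and an optional ".dd" for the cents.
const currency = /^\$\d{1,3}(,\d{3})*(\.\d{2})?$/;

console.log(phone.test("555-867-5309"));  // true
console.log(phone.test("5558675309"));    // false - no hyphens

console.log(currency.test("$1,250.00"));  // true
console.log(currency.test("$12"));        // true
console.log(currency.test("$1250.00"));   // false - comma missing
console.log(currency.test("1,250.00"));   // false - no dollar sign
```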
    
    
    
    Note: All program code and examples in this article have been
    tested on Linux 2.2.13/i386 with Perl 5.004, PHP 3.0.9 and
    Javascript 1.2.
    
    
    This article copyright Melonfire 2000-2002. All rights
    reserved.