# Regular Expressions

### Objectives

• Why does Unix use Text files?
• Regular Expressions
• Regular Expression Definitions
• Regular Expression Usage
• Text processing tools
• Homework

### Why does Unix use Text files?

I would love to give you an authoritative answer, but I can’t, since I don’t know one. But I will give you my best guess.

The reason is that text files are easy to understand and easy to manipulate. Contrary to what people think, programmers want people to understand how to use their tools, and the easiest way to make something understood is to use text. Remember that programs are written as text before the compiler works on them. You need editors to create programs and editors to document programs, so why not use the same tools to configure the programs?

Since people communicate in words, why not let programs read text as well? It is easier to understand what is going on, and it allows many different types of input and output devices.

So the creators of Unix built many tools to work with text. The first thing, of course, was an editor. The granddaddy of the editors was ed. The man page for ed starts out saying:

    ed is a line-oriented text editor.  It is used to create,
    display, modify and otherwise manipulate text files.  red is a
    restricted ed: it can only edit files in the current directory
    and cannot execute shell commands.

    If invoked with a file argument, then a copy of file is read
    into the editor's buffer.  Changes are made to this copy and
    not directly to file itself.  Upon quitting ed, any changes not
    explicitly saved with a `w' command are lost.

    Editing is done in two distinct modes: command and input.
    When first invoked, ed is in command mode.  In this mode
    commands are read from the standard input and executed to
    manipulate the contents of the editor buffer.

OK, does this sound familiar to anyone? I hope so; these are the modes used in vi. Now, you might wonder why anyone would use a line editor. The answer lies in the early terminals. They were basically printers with a keyboard attached. You didn’t move around on the printed page; that would have been absurd. Now, don’t get me wrong, I do not want to edit files using ed. It can be done, but it is not easy.
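For the curious, here is a minimal sketch of driving ed non-interactively from the shell. The filename demo.txt is made up for the example:

```shell
# Command mode: `a' appends text, `.' ends input mode, `s' substitutes
# on the current line, `w' writes the buffer to a file, `q' quits.
printf '%s\n' a 'hello world' . 's/world/unix/' 'w demo.txt' q | ed -s
cat demo.txt    # prints: hello unix
```

The -s flag suppresses ed's byte-count diagnostics, which is handy in scripts.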

Ed included a long list of commands for moving around, finding words, replacing one string with another, and so on. The syntax looks something like this section of the manual:

    (.,.)s/re/replacement/
    (.,.)s/re/replacement/g
    (.,.)s/re/replacement/n
            Replaces text in the addressed lines matching a regular
            expression re with replacement.  By default, only the
            first match in each line is replaced.  If the `g'
            (global) suffix is given, then every match is replaced.
            The `n' suffix, where n is a positive number, causes
            only the nth match to be replaced.  It is an error if
            no substitutions are performed on any of the lines
            affected.

            re and replacement may be delimited by any character
            other than space and newline (see the `s' command
            below).  If one or two of the last delimiters is
            omitted, then the last line affected is printed as
            though the print suffix `p' were specified.

            An unescaped `&' in replacement is replaced by the
            currently matched text.  The character sequence `\m',
            where m is a number in the range [1,9], is replaced by
            the mth backreference expression of the matched text.
            If replacement consists of a single `%', then the
            replacement from the last substitution is used.
            Newlines may be embedded in replacement if they are
            escaped with a backslash (\).
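Since sed (introduced below) borrows ed's s/re/replacement/ syntax, the suffixes and the `&' metacharacter can be sketched with a few one-liners:

```shell
echo 'one two two' | sed 's/two/2/'     # first match only: one 2 two
echo 'one two two' | sed 's/two/2/g'    # `g' suffix, every match: one 2 2
echo 'hello'       | sed 's/ell/[&]/'   # & is the matched text: h[ell]o
```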

    (.,.)s  Repeats the last substitution.  This form of the `s'
            command accepts a count suffix `n', or any combination
            of the characters `r', `g', and `p'.  If a count suffix
            `n' is given, then only the nth match is replaced.  The
            `r' suffix causes the regular expression of the last
            search to be used instead of that of the last
            substitution.  The `g' suffix toggles the global suffix
            of the last substitution.  The `p' suffix toggles the
            print suffix of the last substitution.  The current
            address is set to the last line affected.

    (.,.)t(.)
            Copies (i.e., transfers) the addressed lines to after
            the right-hand destination address, which may be the
            address 0 (zero).  The current address is set to the
            last line copied.

    u       Undoes the last command and restores the current
            address to what it was before the command.  The global
            commands `g', `G', `v', and `V' are treated as a single
            command by undo.  `u' is its own inverse.

    (1,$)v/re/command-list
            Applies command-list to each of the addressed lines not
            matching a regular expression re.  This is similar to
            the `g' command.

    (1,$)V/re/
            Interactively edits the addressed lines not matching a
            regular expression re.  This is similar to the `G'
            command.

    (1,$)w file
            Writes the addressed lines to file.  Any previous
            contents of file is lost without warning.  If there is
            no default filename, then the default filename is set
            to file, otherwise it is unchanged.  If no filename is
            specified, then the default filename is used.  The
            current address is unchanged.

Not the easiest commands to remember, but it works. The functionality led to the creation of the line editor ex. Ex is an extremely powerful editor, provided you remember the commands. The next editor to come along was vi. Vi is in fact the visual mode of the ex editor.

Now, ed was never designed to deal with large files. Remember, the original computers did not have a lot of RAM. So a stream editor called sed was created. Sed uses the same commands as ed, but it does not read the entire file into memory at any one time. Instead, it works on the file line by line. This allows it to run on a computer with 8 megabytes of memory and edit a file 100 megabytes in size.

All of these editors share a common trait: they use regular expressions to give them the ability to understand more complex patterns.

### Regular Expressions

First of all, let's recognize that Regular Expression patterns are one of the things which turn people off to Unix. When you see an expression like:

    DISPLAY=`who am i | perl -ne '/\((.+)\)/ ; print $1'`":0.0"

Most people’s eyes start to cross. But let’s take it apart to see what it really means.

1. We are using the program who, which displays who is logged onto the computer. The who man page starts with:

    /usr/bin/who [ -abdHlmpqrstTu ] [ file ]
    /usr/bin/who -q [ -n x ] [ file ]
    /usr/bin/who am i
    /usr/bin/who am I

    The who utility can list the user's name, terminal line,
    login time, elapsed time since activity occurred on the
    line, and the process-ID of the command interpreter (shell)
    for each current UNIX system user.

We are using the who command with the am i arguments.

2. Next, the output of the who command is piped into perl to do some string handling. The expression /\((.+)\)/ grabs the information within the parentheses. Or, in somewhat more verbose English:
• The beginning and ending / start and end the regular expression.
• Then we look for a single opening parenthesis with \(.
• Then we look for one or more characters with (.+).
• Lastly, we look for a closing parenthesis with \).
3. The string is then printed with the statement print $1.
4. This is enclosed in backticks to cause it to be executed and the output saved as a value.
5. We then append the string :0.0 onto the end of the string.
6. Lastly, we assign the value to the variable DISPLAY.

Now, isn’t that as clear as mud!?!

OK, let’s take a look at a man page describing regular expressions.

source: perldoc perlre

    Regular Expressions

    The patterns used in Perl pattern matching derive from those
    supplied in the Version 8 regex routines.  (The routines are
    derived (distantly) from Henry Spencer's freely
    redistributable reimplementation of the V8 routines.)  See
    the Version 8 Regular Expressions entry elsewhere in this
    document for details.

    In particular the following metacharacters have their
    standard egrep-ish meanings:

        \   Quote the next metacharacter
        ^   Match the beginning of the line
        .   Match any character (except newline)
        $   Match the end of the line (or before newline at the end)
        |   Alternation
        ()  Grouping
        []  Character class

    By default, the "^" character is guaranteed to match only
    the beginning of the string, the "$" character only the end
    (or before the newline at the end), and Perl does certain
    optimizations with the assumption that the string contains
    only one line.  Embedded newlines will not be matched by
    "^" or "$".  You may, however, wish to treat a string as a
    multi-line buffer, such that the "^" will match after any
    newline within the string, and "$" will match before any
    newline.  At the cost of a little more overhead, you can do
    this by using the /m modifier on the pattern match
    operator.  (Older programs did this by setting "$*", but
    this practice is now deprecated.)

    To simplify multi-line substitutions, the "." character
    never matches a newline unless you use the "/s" modifier,
    which in effect tells Perl to pretend the string is a
    single line--even if it isn't.  The "/s" modifier also
    overrides the setting of "$*", in case you have some (badly
    behaved) older code that sets it in another module.

    The following standard quantifiers are recognized:

        *      Match 0 or more times
        +      Match 1 or more times
        ?      Match 1 or 0 times
        {n}    Match exactly n times
        {n,}   Match at least n times
        {n,m}  Match at least n but not more than m times

    (If a curly bracket occurs in any other context, it is
    treated as a regular character.)  The "*" modifier is
    equivalent to "{0,}", the "+" modifier to "{1,}", and the
    "?" modifier to "{0,1}".  n and m are limited to integral
    values less than a preset limit defined when perl is
    built.  This is usually 32766 on the most common platforms.

Another piece that is commonly used is the square bracket [ ]. It is defined as follows in the man page for regex.

source: man 7 regex

    A bracket expression is a list of characters enclosed in
    `[]'.  It normally matches any single character from the
    list (but see below).  If the list begins with `^', it
    matches any single character (but see below) not from the
    rest of the list.  If two characters in the list are
    separated by `-', this is shorthand for the full range of
    characters between those two (inclusive) in the collating
    sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
    It is illegal for two ranges to share an endpoint, e.g.
    `a-c-e'.  Ranges are very collating-sequence-dependent, and
    portable programs should avoid relying on them.

    To include a literal `]' in the list, make it the first
    character (following a possible `^').  To include a literal
    `-', make it the first or last character, or the second
    endpoint of a range.  To use a literal `-' as the first
    endpoint of a range, enclose it in `[.' and `.]' to make it
    a collating element (see below).  With the exception of
    these and some combinations using `[' (see next
    paragraphs), all other special characters, including `\',
    lose their special significance within a bracket
    expression.
    Within a bracket expression, a collating element (a
    character, a multi-character sequence that collates as if
    it were a single character, or a collating-sequence name
    for either) enclosed in `[.' and `.]' stands for the
    sequence of characters of that collating element.  The
    sequence is a single element of the bracket expression's
    list.  A bracket expression containing a multi-character
    collating element can thus match more than one character,
    e.g. if the collating sequence includes a `ch' collating
    element, then the RE `[[.ch.]]*c' matches the first five
    characters of `chchcc'.

### Regular Expression Definitions

OK, let’s start by looking at definitions for the terms used in regular expressions. Here is a web page which does a good job of defining the terms.

Regular expressions is the term used for a codified method of searching ‘invented’ or ‘defined’ by the American mathematician Stephen Kleene.

What is the definition of regular expressions? Let’s look at a formal definition: regex(7)

### Regular Expression Usage

Now that we have played with some of the definitions, let’s get down to seeing how to apply some of these ideas. Here is an article I found on the web about using Regular Expressions with Web pages.

So What’s A $#!%% Regular Expression, Anyway?!
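To see a few of the metacharacters described above in action, here is a small sketch using egrep on a made-up word list:

```shell
printf 'cat\ncot\ncut\ncart\n' > words.txt   # hypothetical word list

egrep 'c[ao]t'  words.txt   # character class: matches cat, cot
egrep '^c.t$'   words.txt   # anchors plus dot: matches cat, cot, cut
egrep 'ca?rt'   words.txt   # optional 'a' with ?: matches cart
```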

OK, now that we have seen these uses, let’s try looking at some tutorial information on regular expressions.

This comes from the Rute tutorial, which I highly recommend: Chapter 5, Regular Expressions.

### Text processing tools

The Unix system includes many text processing tools. A few of them are diff, aspell, indent, less, cat, cut, sort, cksum, comm, csplit, expand, fmt, fold, head, join, md5sum, nl, od, paste, pr, ptx, sha1sum, split, sum, tac, tail, tr, tsort, unexpand, uniq, wc, sed, awk, egrep, and perl. Let’s take a look at what these tools are good at.
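Here is a quick, hedged tour of a few of these tools on a throwaway file; the filename fruit.txt and its contents are made up:

```shell
printf 'banana\napple\nbanana\ncherry\n' > fruit.txt   # made-up sample data

sort fruit.txt | uniq -c        # sort, then count duplicate lines
cut -c1-3 fruit.txt             # first three characters of each line
wc -l < fruit.txt               # count lines: 4
tr 'a-z' 'A-Z' < fruit.txt      # translate to upper case
```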

    #
    # tac: cat backwards; reverse the order of lines in a file
    #
    awk '{print NR "#" $0}' "$@" |
    sort -t '#' -k1,1nr |
    sed 's/^[0-9]*#//'
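To check the pipeline’s behavior, run it on three sample lines. Note the modern `-k1,1nr` form of the sort key is assumed here; the historic `+0nr -1` syntax is no longer accepted by current GNU sort:

```shell
# Number each line, sort the numbers in reverse, then strip the numbers.
printf 'first\nsecond\nthird\n' |
awk '{print NR "#" $0}' |
sort -t '#' -k1,1nr |
sed 's/^[0-9]*#//'
# prints: third, second, first -- one per line
```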

### Homework

I would like you to take a look at the man pages for egrep, sed, and awk. Create at least two shell scripts for each tool demonstrating how to use them on these text files: comic.txt or baseball.txt.

Written by John F. Moore

Last Revised: Mon Jan 16 15:37:34 EST 2017