------------------------------------------------------------------------
Lesson 1: Introduction to grep
------------------------------------------------------------------------


I. BRIEF BACKGROUND

In UNIX, all files (regardless of filetype, ASCII or binary, for
instance) consist of lines and are terminated by an End-of-File (EOF).
Each line is terminated by “\n” (ASCII value 10 or 0A in hex). In DOS,
lines of text files end with “\r\n” where “\r” is ASCII 13 or 0D.
While \n contributes to the byte count of the file it is not a part
of any line.
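
If you want to see these terminating bytes for yourself, the short
sketch below may help. It relies only on the standard utilities
printf, od, wc and tr; the variable names are my own invention.

```shell
# Inspect the bytes that terminate a line.
lf_hex=$(printf '\n' | od -An -tx1 | tr -d ' \n')     # line feed: the UNIX line ending
cr_hex=$(printf '\r' | od -An -tx1 | tr -d ' \n')     # carriage return: first half of the DOS ending
unix_bytes=$(printf 'hello\n' | wc -c | tr -d ' ')    # 5 letters + \n
dos_bytes=$(printf 'hello\r\n' | wc -c | tr -d ' ')   # 5 letters + \r + \n
echo "LF=$lf_hex CR=$cr_hex unix=$unix_bytes dos=$dos_bytes"
```

The DOS version of the same line is one byte longer because of the
extra “\r”.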

If you inspect "bash" (or csh, korn or any other shell) scripts you
will find a liberal use of "grep", "sed" and "awk". A good bash
programmer needs to be an expert with these utilities. Separately,
they are fun to use: terse and curiously addictive.  We start off
by learning how to use "grep". The simplest use of grep is to search
for a fixed pattern.

grep arose from "ed", the first line editor in UNIX. The name comes
from the ed command "g/re/p": "Globally search for a Regular
Expression and Print matching lines". Before proceeding further,
recall that a UNIX file consists of lines, and such files are well
suited to grep, sed and awk. In contrast, your Windows emailer works
best with a line per paragraph.  grep, sed and awk are not terribly
useful if your file consists of one long line!

Before proceeding I have to tell you that grep and sed cannot,
ordinarily, search for \n. It is worth remembering this basic point.

It is my experience that the fastest way to learn a programming
language is via examples. Therefore, starting with the next section
I urge you NOT to read but "do it".

Note that below you type everything except the leading "$", which is
the prompt supplied by bash. The text following "#" is usually an
explanation of a result and is not part of the output of any program.
Occasionally lines starting with "#" are pure comment lines.  Also,
the end-of-file signal is given by "control D" (spelt as cntrl D).

NB: I will assume that you are familiar with the following UNIX
features: > (redirection), | (pipe), control D (end of file), “ls”,
“cat” and the “echo” commands.

------------------------------------------------------------------------
IB. First example
------------------------------------------------------------------------

The simplest format of grep is
	grep pattern inputstream
where pattern can be unquoted (i.e. the search pattern spelt in the
usual fashion) or quoted weakly ("...") or quoted strongly ('...').
For now we will start with no quoting and then gradually learn when
quoting is necessary (and when a particular type of quoting is
necessary).

Below we use the UNIX echo command to feed an input line to grep.

$ echo hello 
hello

grep reads input and returns the line if a match is found. No line
is returned if no match is found.

$ echo hello | grep e -   #formally, "-" stands for standard input
hello                      #match

$ echo hello | grep e      #this informal approach also works 
hello                      #match

$ echo hello | grep z
		           #no match, no line returned

However, typically grep is used with an input file rather than
a stream of characters coming from a keyboard. [All input files
used in this lesson can be found in this sub-directory].

$ cat > SimpleInput.dat       
1 hello, parvi rabbit
2 there are many types of rabbits, ordinary rabbits
3 and rabbits that have been bred over centuries. 
4 Rabbits have been around, apparently, for at least
5 three million years. 
cntrl D                        


$ grep rabbit SimpleInput.dat
1 hello, parvi rabbit                                 #exact match
2 there are many types of rabbits, ordinary rabbits   #match, even if plural
3 and rabbits that have been bred over centuries.   

Before I explain the result I remark that we are used to thinking
in terms of "words". A word is (usually) a run of letters preceded
by a "blank" (which is usually a space character or perhaps a tab)
and succeeded by a blank or a punctuation character (.;,?"'). When
writing or thinking about UNIX, abandon your training in English.

grep (and other UNIX utilities) operate on a literal basis.  In
the example above we asked grep to search for "rabbit". It found
matches to "rabbit" on lines 1, 2 and 3. You are comfortable with
the match on line 1 but are likely uncomfortable with (or puzzled
by) the matches reported on lines 2 and 3.

The fact is that "rabbits" is also a match to "rabbit" in a literal
sense. If you wanted to search exclusively for the word rabbit then
you should say "find a pattern which is 'rabbit' AND has a space
preceding it AND a space or punctuation succeeding it". Even this
may not be an air-tight prescription since it will miss the sentence
which starts off with "rabbit" (in this case nothing precedes the
word: the start of a line is a position, not a character, and so has
no ASCII representation). Another possibility is that the word is
enclosed by "()" or double quotes or single quotes. So you can see
that an accurate and comprehensive specification to search for the
word "rabbit" is not trivial.
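
Many versions of grep offer the option "-w" which attempts exactly
this kind of whole-word search. The sketch below assumes that your
grep supports "-w" (GNU and BSD grep do); the temporary directory and
the variable names are merely illustrative.

```shell
# Compare a literal (substring) search with a whole-word search.
workdir=$(mktemp -d)
cat > "$workdir/SimpleInput.dat" <<'EOF'
1 hello, parvi rabbit
2 there are many types of rabbits, ordinary rabbits
3 and rabbits that have been bred over centuries.
4 Rabbits have been around, apparently, for at least
5 three million years.
EOF
substring_hits=$(grep -c rabbit "$workdir/SimpleInput.dat")    # "rabbits" counts too
word_hits=$(grep -c -w rabbit "$workdir/SimpleInput.dat")      # only the word "rabbit"
echo "substring=$substring_hits word=$word_hits"
rm -rf "$workdir"
```

The option "-c" prints the number of matching lines instead of the
lines themselves.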

Next, note that grep searches a line and upon finding the first match
stops the search and prints the line. It then proceeds to the next
line. For this reason, in the above example, line 2 is printed only
once.
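
The difference between "lines that match" and "matches" can be seen
with the small sketch below. It assumes that your grep has the option
"-o" (print each match on its own line), which GNU and BSD grep
provide; the variable names are mine.

```shell
# One line, two occurrences of "rabbit".
line='2 there are many types of rabbits, ordinary rabbits'
lines_matched=$(echo "$line" | grep -c rabbit)                      # counts lines
matches_found=$(echo "$line" | grep -o rabbit | wc -l | tr -d ' ')  # counts matches
echo "lines=$lines_matched matches=$matches_found"
```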

The example above was perhaps the simplest possible search. grep
has powerful search capability. You can specify a pattern via a
framework and grep will find all patterns that match the framework.
For that you have to learn "regular" expressions.

------------------------------------------------------------------------
II. ANCHORS: ^, $
------------------------------------------------------------------------

Regular expressions (like pipes and the universal file construct) lie
at the heart of UNIX. There are many families of regular expressions
("regexp"). We will start with the earliest and most basic family,
the basic regular expressions (BRE).

As noted above all UNIX files consist of lines followed by an EOF.
Lines have a beginning (null) and an end (\n which has an ASCII
representation). Clearly, it makes sense to start with markers for
the beginning and end of the line.

The characters  “^” and “$” are anchors (positional indicators) for
the start and end of line. These two characters are called
"meta-characters" because they have extraordinary power (unless
"escaped"; more on this later).

I have to warn you that the rest of this section is not easy. You
will find that your intuition will, almost always, be wrong. It is
worth doing the exercises several times whilst synthesizing the
results.  Ideally, you should grasp this section fully before
proceeding. However, you may well find that this is not possible. If
so, simply go ahead and keep revisiting this section!


$ cat InputRegEx.dat
hello, parvi
% hello, other rabbits
.hello, kitty, &
good morning, world #
sayonara wall street$

	#Now we undertake a variety of searches

$ grep ^h InputRegEx.dat       #identify lines that begin with "h"
hello, parvi


$ grep ^% InputRegEx.dat       #identify lines that begin with "%"
% hello, other rabbits

$ grep parvi$ InputRegEx.dat  #find lines ending with parvi
hello, parvi


	#For pedagogical purpose I list some constructs which 
	#are useless because they list all lines in the input file!
	#Understand why this is the case
$ grep $ filename   #Why? 'cause all lines have an ending 
$ grep ^ filename   #Why? 'cause all lines have a beginning 

	#the following is nearly useless: it lists a line only if the
	#line contains the literal text "A^"
$ grep A^ filename    # if A is any ordinary character

Please ponder the above command and convince yourself why there is
no match in InputRegEx.dat. (Hint: in a BRE, "^" is an anchor only
at the start of the pattern.) If you are not able to figure it out
then talk to me.
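
The sketch below lets you check your reasoning about anchors by
counting matches. The file is recreated in a temporary directory so
that the example stands on its own; the variable names are mine. The
last line reflects the behaviour of the grep versions I am familiar
with (GNU and BSD), where a "^" that is not at the start of a basic
regular expression is treated as an ordinary character.

```shell
workdir=$(mktemp -d)
cat > "$workdir/InputRegEx.dat" <<'EOF'
hello, parvi
% hello, other rabbits
.hello, kitty, &
good morning, world #
sayonara wall street$
EOF
total=$(wc -l < "$workdir/InputRegEx.dat" | tr -d ' ')    # number of lines in the file
with_start=$(grep -c '^' "$workdir/InputRegEx.dat")       # every line has a beginning
with_end=$(grep -c '$' "$workdir/InputRegEx.dat")         # every line has an ending
starts_h=$(grep -c '^h' "$workdir/InputRegEx.dat")        # lines beginning with "h"
ends_parvi=$(grep -c 'parvi$' "$workdir/InputRegEx.dat")  # lines ending with "parvi"
literal_caret=$(printf 'A^B\n' | grep -c 'A^')            # "^" after "A" is just a caret
echo "$total $with_start $with_end $starts_h $ends_parvi $literal_caret"
rm -rf "$workdir"
```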

Additional Literature: Wikipedia has an excellent & comprehensive
review of regular expressions:
https://en.wikipedia.org/wiki/Regular_expression
The article may be appreciated after you finish taking this class.


------------------------------------------------------------------------
IIB. First introduction to Quoting 
------------------------------------------------------------------------

So far our patterns have not been “quoted”. We will find that
sometimes you HAVE to quote. To illustrate this let us say that our
goal is to find lines ending with "#".

$ grep #$  InputRegEx.dat    #command fails 
			     # ... grep is left without any pattern

The reason this command failed is that "#" is a special character
for the shell (“comment”).  Characters succeeding # are not read
by the shell.  The “do not read” feature lasts for the rest of the
line. Quoting informs the shell that "#" should not be interpreted
in the usual fashion.

		#Either quoting works
$ grep '#$' InputRegEx.dat
$ grep "#$" InputRegEx.dat

For this reason, most users simply quote all patterns. Later there
will be a more detailed and finer distinction between double and
single quotes.
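
To see what the shell actually hands over to a command, it helps to
count the arguments. In the sketch below "count_args" is a
hypothetical helper of my own (it is not a UNIX utility); it merely
reports how many arguments it received. The file need not exist since
nothing is read.

```shell
# Hypothetical helper: remember and print the number of arguments received.
count_args() { last_count=$#; echo "$#"; }

count_args '#$' InputRegEx.dat     # quoted: pattern and file name arrive
quoted=$last_count
count_args #$ InputRegEx.dat       # unquoted: the shell discards everything from "#" on
unquoted=$last_count
echo "quoted=$quoted unquoted=$unquoted"
```

In the unquoted case the command receives nothing at all, which is
why grep complains.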


Curious? Try

$ grep '$A' InputFile   #No hit unless a line contains the literal text "$A". Why?


------------------------------------------------------------------------
IIC. Converting meta-characters to literals
------------------------------------------------------------------------

		#Goal: you want to list lines beginning with "."

$ grep '^.' InputRegEx.dat    #Naively you try the obvious approach  
hello, parvi
% hello, other rabbits
.hello, kitty, &                
good morning, world #           
sayonara wall street$

Why did all lines get listed? All lines have a beginning, “^”. Most
lines have at least one character. In regexp, as explained below (IV) 
“.” is a meta-character. It denotes one character (any character). 
Thus the pattern ^. is satisfied by lines which have one
or more characters.  In the above example, all lines satisfy this
criterion.

Clearly, we need to inform grep that we are genuinely looking for
the period or “.” character. So we need to strip “.” of its special
meaning. We do so by “escaping” it, that is, by writing “\.”.


$ grep '^\.' InputRegEx.dat 	
.hello, kitty, &        #It worked!

Do not fret too much if you do not understand this example. Below
we have an entire section (VII) devoted to this painful issue.


Curious? Try

$ grep ^\. InputRegEx.dat   #will print the entire file. Why?
hello, parvi
% hello, other rabbits
.hello, kitty, &
good morning, world #
sayonara wall street$			
	
[This is an idiosyncrasy that one should get used to! 
Lesson: quote the pattern].
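
You can make the shell reveal what it passes on. In the sketch below
"show_pattern" is a hypothetical helper of my own which prints its
first argument verbatim, i.e. exactly what grep would receive as the
pattern.

```shell
# Hypothetical helper: print the first argument exactly as received.
show_pattern() { printf '%s\n' "$1"; }

unquoted=$(show_pattern ^\.)       # the shell consumes the backslash
quoted=$(show_pattern '^\.')       # single quotes preserve the backslash
echo "unquoted pattern: $unquoted"
echo "quoted pattern:   $quoted"
```

Without quotes grep never sees the backslash, so the period keeps its
special meaning.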

------------------------------------------------------------------------
III.  INTERLUDE: Line numbering, showing non-printable characters
------------------------------------------------------------------------

Since the UNIX framework for files is a series of lines it is only
natural to have many utilities number each line. Numbering lines
is extremely useful for locating errors and identifying lines of
interest when using line oriented editors like vi (or emacs for
that matter).

$ cat  Lines.dat
a
aa
b
ba
       #To make this “line with zero characters” simply hit return

$ cat -e Lines.dat        #with option "-e" control chars are revealed
a$                        #"\n" is displayed as "$"
aa$
b$
ba$
$                         #Notice this line. It is a null line (no chars)

$ cat -e -n Lines.dat     # the option "-n" results in printing of input
     1	a$                 #line number 
     2	aa$
     3	b$
     4	ba$
     5	$                   #make sure that there is no space preceding $. 

Incidentally, if you actually wanted to insert line numbers into a
file that you have already constructed then you can do so in at
least a dozen ways.  The classic method is

$ nl Lines.dat > LinesNumbered.dat   

Note that ordinarily nl does not number empty lines. If you want to
number all lines then

$ nl -ba Lines.dat > LinesNumbered.dat


------------------------------------------------------------------------
IIIB. Options
------------------------------------------------------------------------

All UNIX utilities have "options". grep has many options. It is worth 
reading the “man” pages (just to appreciate the richness of grep).
Below we will use the option “-n” (for numbering output lines using input
line number).

$ grep -n '' Lines.dat   #the empty pattern matches every line
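
The three numbering tools can be compared side by side. The sketch
below rebuilds Lines.dat in a temporary directory (an assumption made
so that the example is self-contained) and counts how many lines each
tool actually numbers; the variable names are mine.

```shell
workdir=$(mktemp -d)
printf 'a\naa\nb\nba\n\n' > "$workdir/Lines.dat"          # four text lines plus one null line
nl_default=$(nl "$workdir/Lines.dat" | grep -c '[0-9]')   # nl leaves the null line unnumbered
nl_all=$(nl -ba "$workdir/Lines.dat" | grep -c '[0-9]')   # -ba numbers every line
grep_last=$(grep -n '' "$workdir/Lines.dat" | tail -n 1)  # last line as numbered by grep
echo "nl=$nl_default nl_ba=$nl_all grep_last=$grep_last"
rm -rf "$workdir"
```

Since the file itself contains no digits, counting output lines which
contain a digit is the same as counting numbered lines.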


------------------------------------------------------------------------
IV. Quantification: *  .  \{ \}
------------------------------------------------------------------------

The regular expression "."  means match one character (any character)

	
$ grep -n "." Lines.dat    #list all lines which contain at least one character
1:a                       
2:aa                     
3:b
4:ba

As you can see, line 5 is not in the output (because it is a null line;
just \n).

$ grep -n "a" Lines.dat     #Search for lines containing "a"
1:a                         
2:aa
4:ba

$ grep -n "^a" Lines.dat        #Search lines starting with "a"
1:a
2:aa


The regular expression “C*” maps to (1) one or more of the character
“C” (where C can be any character) OR (2) zero occurrences of “C”,
i.e. a match of zero length. The rookie mistake is to forget (2). Note
that “*” applies to the single character preceding it. Thus “ba*” maps
to “b”, “ba”, “baa” and so on.


$ grep -n 'a*' Lines.dat   #Any such command will match ALL lines
1:a                        #Why? (Think hard and only proceed after you have
2:aa                       #figured out the answer)
3:b
4:ba
5:                              #Notice even this null line is also matched

$ grep -n 'aa*' Lines.dat       #matches one or more "a"
1:a
2:aa
4:ba

$ grep -n 'a\{2\}' Lines.dat     #Match two “a” in succession
2:aa

$ grep -n 'a\{1,2\}' Lines.dat   #Match "aa" or "a"
1:a
2:aa
4:ba

$ grep -n 'a\{1,\}' Lines.dat     #Matches “a”, “aa”, “aaa” and so on
1:a
2:aa
4:ba
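
Note that the counts in \{ \} describe the piece that is matched and
not the whole line. If you want the whole line to obey the count then
combine the interval with the anchors of section II. The sketch below
uses a small sample of my own making.

```shell
sample=$(printf 'a\naa\naaa\nb\n')
unanchored=$(echo "$sample" | grep -c 'a\{1,2\}')       # lines containing one or two "a" in a row
anchored=$(echo "$sample" | grep -c '^a\{1,2\}$')       # the whole line must be "a" or "aa"
exactly_three=$(echo "$sample" | grep -c '^a\{3\}$')    # the whole line must be "aaa"
echo "unanchored=$unanchored anchored=$anchored exactly_three=$exactly_three"
```

The line "aaa" is counted by the unanchored search because it
contains "a" (and also "aa").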

ROOKIE MISTAKE GIVES GREAT UNDERSTANDING:
Now I address the usual rookie mistake. Review the command below.
Do not execute the command.  First mentally compute your answer. Then
execute the command and explain the rather surprising answer. 
Commit this example to memory.

$ echo "abc"  | grep 'z*'
abc

------------------------------------------------------------------------
IVB. Introduction to sed
------------------------------------------------------------------------
I take the occasion to introduce "sed". No need to get stressed out! 
We will have two classes on sed. The usual invocation of sed is

$ sed action InputFile

The most famous action is to "identify and replace" (substitution).
Say you have written a story (file: Storybook) in which the main 
protagonist is a dog. After finishing and upon reflection you decide 
to change the protagonist to a rabbit (naturally).

$ sed 's/dog/rabbit/' Storybook  #this is not a foolproof substitution.
                                 #“dogged” becomes “rabbitged”.
                                 #There is a simple way to fix such errors.


$ echo "abc" | sed 's/a/A/'   #will substitute A for a, but only for the first match
Abc

$ echo "abca" | sed 's/a/A/g'   #will substitute A for a, for all matches
AbcA                            #the “flag” “g” stands for “global” (all matches)

Having learnt one (but powerful) command in sed let us redo the rookie
mistake exercise.

$ echo "abc" | sed 's/b*/1/'
1abc

$ echo "abc" | sed 's/b*/1/g'
1a1c1

Please ponder the outcomes!

------------------------------------------------------------------------
V. CHARACTER SET:
------------------------------------------------------------------------

A character set allows for a set of characters to be probed instead of 
only one. 

[abc] would match "a", "b" or "c"

>> Remember a character set is still only one character <<
[If you want more than one character then you need to concatenate 
as in “[abc][xz]”. This pattern is matched by six combinations:
ax, az, bx, bz, cx and cz.]

Ranges are allowed (they map to a contiguous range of the underlying
ASCII values; if this statement is not clear look up the ASCII table):
 
For instance [0-9] is a character set which includes all digits
from 0 to 9.  By induction
[a-z]  a through z (lower case)
[A-Z]  A through Z (upper case)
[a-d]  a through d 

Ranges can be combined:

[0-9a-z] This pattern is matched by digits and lower case letters.
[a-zA-Z] This pattern is matched by letters, independent of case.

There is no ambiguity in the above construction. The first “-”
specifies the range a through z and the second dash specifies the
range A through Z.

Adding “^” as the first character in a set negates the entire set.
For example, [^0-9] means all characters BUT the 10 digits. [It is
unfortunate that “^” has another meaning; see II].
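
The claims above are easily checked by counting. In the sketch below
the candidate strings are my own; the anchors of section II are used
so that the whole line must consist of exactly two characters.

```shell
pairs=$(printf '%s\n' ax az bx bz cx cz ay dx abx)
six=$(echo "$pairs" | grep -c '^[abc][xz]$')              # the six combinations
others=$(echo "$pairs" | grep -c -v '^[abc][xz]$')        # -v inverts: ay, dx, abx
non_digit=$(printf '%s\n' 123 12a 4 | grep -c '[^0-9]')   # lines with at least one non-digit
echo "six=$six others=$others non_digit=$non_digit"
```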

$ cat  Lines2.dat
a
b
bc
def
1
345
&%^?-+

$ grep '[a-z]' Lines2.dat
a
b
bc
def

$ grep '[0-9]' Lines2.dat
1
345

$ grep '[A-Z]' Lines2.dat

$ grep '[^a-z]' Lines2.dat
1
345
&%^?-+

$ grep '[0-9a-z]' Lines2.dat        
a
b
bc
def
1
345

$ grep '[^0-9a-z]' Lines2.dat        #example of negating two ranges
&%^?-+


------------------------------------------------------------------------
VB.  POSIX Character classes
------------------------------------------------------------------------

Rather than write [0-9] or [a-zA-Z] we can use POSIX character classes
[:digit:] any digit, 0-9
[:alpha:] any letter (upper & lower), A-Z & a-z
[:alnum:] any letter or digit
[:cntrl:] control characters
[:print:] any printable character (i.e. [^[:cntrl:]])
[:blank:] a space or tab
[:graph:] any visible character (printable, but not a space)

POSIX classes can be used only within the character set framework.
Thus the correct usage is always “[[:digit:]]”. 
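
The classes dealing with spaces and tabs are easily confused. The
sketch below probes a lone space and a lone tab. I set LC_ALL=C (an
assumption on my part) so that the classes follow plain ASCII; the
"|| true" is there only because grep reports failure when nothing
matches.

```shell
space_graph=$(printf ' \n' | LC_ALL=C grep -c '[[:graph:]]' || true)   # a space is not visible
space_print=$(printf ' \n' | LC_ALL=C grep -c '[[:print:]]')           # but it is printable
tab_blank=$(printf '\t\n' | LC_ALL=C grep -c '[[:blank:]]')            # a tab is a blank
tab_print=$(printf '\t\n' | LC_ALL=C grep -c '[[:print:]]' || true)    # a tab is a control char
echo "space: graph=$space_graph print=$space_print"
echo "tab:   blank=$tab_blank print=$tab_print"
```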

$ grep '[[:digit:]]' Lines2.dat    #selects lines with digits 
1
345

$ grep '[^[:digit:][:alpha:]]' Lines2.dat   #selects lines with non-alphanumeric chars
&%^?-+

COMMON ROOKIE MISTAKE:
It is a common mistake to use “[:digit:]” instead of “[[:digit:]]”.
Study (and understand) the rather curious outcome

$ grep '[:digit:]' Lines2.dat     #some versions (e.g. GNU grep) refuse
def                               #this pattern with an error instead


------------------------------------------------------------------------
VI. Back-referencing & Extracting Tokens
------------------------------------------------------------------------

The construct "\(RegularExpression\)" yields a token. The text thus
captured is available in “\1”. Nine such tokens can be defined (in
sequential order).  Thus for example "\(.\)" captures a character.

The simplest use of back-referencing is to search for a repeated
pattern in the same line. Using back-reference saves retyping the 
pattern again. This is not the best use of back-references 
(actually it slows down the performance).

$ grep '\(sky is blue\).*\1' BlueSky.dat
The sky is blue at 10 am. The sky is blue at noon.

A better example is to extract quoted words
$ echo 'Hello, my name is "Kitty"' | grep "\([\"']\).*\1"

PAUSE: You should pause here and understand the quoting construction
used here. The basic rule, when you wish to use both kinds of quotes,
is to build from the inside out.  Start with the quoting for echo ==
'Hello, my name is "Kitty"'.  Here I wanted to quote Kitty with double
quotes. Ergo, the outer quoting has to be single quotes.

Next the pattern for grep. Here I wanted to include both " and '
inside the pattern. Given that we want both quotes I arbitrarily
chose " as the outer quote. Ergo, the double quote " must be escaped
inside the pattern.

Having understood the quoting let us study the pattern "\([\"']\).*\1"
The pattern is designed to find all characters enclosed between 
double quotes or single quotes. Say a double quote is found as grep scans from
left to right. Then grep looks for a closing double quote and not
a single quote. The same argument applies for a single quote. 
This is indeed a clever use of back-referencing. 
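
You can verify that the closing quote must be the same as the opening
quote with the sketch below. The three test sentences are variations
of my own on the example above; the "|| true" is there only because
grep reports failure when nothing matches.

```shell
pattern="\([\"']\).*\1"     # an opening quote, any text, then the SAME quote again
double=$(printf '%s\n' 'Hello, my name is "Kitty"' | grep -c "$pattern")
single=$(printf '%s\n' "Hello, my name is 'Kitty'" | grep -c "$pattern")
mixed=$(printf '%s\n' "Hello, my name is \"Kitty'" | grep -c "$pattern" || true)
echo "double=$double single=$single mixed=$mixed"
```

The sentence with mismatched quotes is not reported.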


A fun use of back-reference is to identify palindromes. In this
case the pattern to be identified is unknown. The expression "\(.\).\1"
accurately describes a 3-character palindrome, e.g. “ANA”, “dud”, “eve”
and so on. Armed with tokens let us search for five character palindromes!

$ cat Palindrome.dat
1:In the city of GadaG there lived two dogs whose owner was on Mr. TenteT. 
2:One was called RotoR and the other SoloS. 
3:One day one dog ran away and other dog got hit by a bus.
4:SoloS and the owner became very sad.

$ grep -n '\(.\)\(.\).\2\1' Palindrome.dat
1:In the city of GadaG there lived two dogs whose owner was on Mr. TenteT. 
2:One was called RotoR and the other SoloS. 
3:One day one dog ran away and other dog got hit by a bus.
4:SoloS and the owner became very sad.

The identification of GadaG and RotoR is clear. Incidentally, SoloS
on line 2 was not separately identified by grep. The search stopped
when a match was found. grep printed the line and went to the next line.
(Note that "TenteT", as spelt in the file, is not a palindrome;
"TeneT" would be one.) [More below].

>> You may wish to figure out why line 3 is reported <<

Our third class will be on “sed”. However, modest use of sed will give
insight into the above exercise. Here is a brief summary of how
the substitution by sed works.

$ sed 's;Reg;Replace;' InFile

replaces text found by regular expression, Reg, by text given by Replace.

Below we will identify palindromes and replace them by “Z” (as a marker).

$ sed 's;\(.\)\(.\).\2\1;Z;' Palindrome.dat 
In the city of Z there lived two dogs whose owner was on Mr. TenteT. 
One was called Z and the other SoloS. 
One day one dog ran away and other dZt hit by a bus.
Z and the owner became very sad.
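
Only the first palindrome on each line was replaced. With the flag
“g” (see IVB) every palindrome on the line is marked. The sketch
below uses a short test phrase of my own rather than Palindrome.dat.

```shell
phrase='RotoR and SoloS'
marked_first=$(echo "$phrase" | sed 's;\(.\)\(.\).\2\1;Z;')    # first match only
marked_all=$(echo "$phrase" | sed 's;\(.\)\(.\).\2\1;Z;g')     # all matches
echo "first: $marked_first"
echo "all:   $marked_all"
```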


------------------------------------------------------------------------
VII. How to search for meta-characters
------------------------------------------------------------------------

The meta-characters for basic regular expressions (BRE) are
	^
	$
	.
	*
	[
	]
	-	(special only inside a character set)
	&	(special only in the replacement text of sed)
A meta-character is downgraded to an ordinary character by escaping
it.  That is when you wish to use any of the above characters
literally then precede it by “\”. We have one such example in IIC.
However, not all meta-characters need to be escaped (for sound
reasons). Though it is a technical point it is not prudent to simply
escape ALL meta-characters (though you can get away by escaping all
of them; it is not POSIX compliant).


$ cat Metacharacter.dat
We explore meta-characters here: *, ., ^, $, [, ] & -.
We have not introduce & so far.
                        #this is a null line (see III)
.Some UNIX tools have line starting with a .
In regular English no sentence ends in $


$ grep '\$' Metacharacter.dat    #Identify lines with a “$” character
We explore meta-characters here: *, ., ^, $, [, ] & -.
In regular English no sentence ends in $

$ grep ']' Metacharacter.dat 
We explore meta-characters here: *, ., ^, $, [, ] & -.   #surprised?

$ grep '[' Metacharacter.dat
grep: brackets ([ ]) not balanced                      #hmm

$ grep '\[' Metacharacter.dat                          #solution 1
We explore meta-characters here: *, ., ^, $, [, ] & -.

$ grep '\^' Metacharacter.dat      #find lines with “^” character 
We explore meta-characters here: *, ., ^, $, [, ] & -.

You can deepen your understanding of the regular expression framework
by exploring character sets containing meta-characters.  To start
with, meta-characters lose their special status when inside a
character set. However, how you place them inside “[  ]” does matter.

$ grep '[]]' Metacharacter.dat                      
We explore meta-characters here: *, ., ^, $, [, ] & -.

$ grep '[^]' Metacharacter.dat                #hmmm
grep: brackets ([ ]) not balanced

$ grep  '[]*]' Metacharacter.dat
We explore meta-characters here: *, ., ^, $, [, ] & -.

$ grep '[*]]' Metacharacter.dat
                                   #does not return any line. Why?
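
The placement rules can be probed one character at a time. In the
sketch below each test line consists of a single character of my own
choosing; the "|| true" is there only because grep reports failure
when nothing matches.

```shell
dot_literal=$(echo 'a' | grep -c '[.]' || true)   # inside a set "." is only a period
dot_found=$(echo '.' | grep -c '[.]')             # and a period is matched
bracket_first=$(echo ']' | grep -c '[]x]')        # "]" placed first belongs to the set
caret_later=$(echo '^' | grep -c '[x^]')          # "^" not placed first belongs to the set
dash_last=$(echo '-' | grep -c '[x-]')            # "-" placed last belongs to the set
echo "$dot_literal $dot_found $bracket_first $caret_later $dash_last"
```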

------------------------------------------------------------------------

HOMEWORK:

[1] Look at the file "BlankLines.dat". It has a variety of blank lines.
Use "grep -n" and identify (and count) all blank lines. Then do the
same with a finer search (space, tab, null).

Hint: "tab" is a control character and the way you pass this character
to a pattern is "cntrl V tab".

[2] Devise a way to search for words defined as follows: a word
is defined to consist only of lower case letters except the
first letter which can be a capital. The word can be preceded
by a blank, tab or null (start of a line). It can end with a blank,
tab or punctuation (,;.?:!). The word can be enclosed in single
or double quotes.

[3] The utility “nl” has some interesting options. 

$ nl -bp'Regexp' Infile

will number only those lines with regular expression specified by Regexp.
Compare that with “grep -n”