------------------------------------------------------------------------
Lesson 1: Introduction to grep
------------------------------------------------------------------------

I. BRIEF BACKGROUND

In UNIX, all files (regardless of file type, ASCII or binary, for
instance) consist of lines and are terminated by an End-of-File (EOF).
Each line is terminated by "\n" (ASCII value 10, or 0A in hexadecimal).
In DOS, lines in text files end with "\r\n", where "\r" is ASCII 13 or
0D. While \n contributes to the byte count of the file it is not a part
of any line.

If you inspect "bash" (or csh, korn or any other shell) scripts you will
find a liberal use of "grep", "sed" and "awk". A good bash programmer
needs to be an expert with these utilities. Separately, they are fun to
use: terse and curiously addictive.

We start off by learning how to use "grep". The simplest use of grep is
to search for a fixed pattern. grep arose from "ed", the first line
editor in UNIX. The name comes from the ed command g/re/p: "Globally
search for a Regular Expression and Print matching lines".

As I noted above, a UNIX file consists of lines, and such files are well
suited to grep, sed and awk. In contrast, your Windows emailer works
best with a line per paragraph. grep, sed and awk are not terribly
useful if your file consists of one long line! I also have to tell you
that grep and sed cannot, ordinarily, search for \n. It is worth
remembering this basic point.

It is my experience that the fastest way to learn a programming language
is via examples. Therefore, starting with the next section I urge you
NOT to read but to "do it". Note that below you type everything except
the leading "$"; that "$" is the prompt supplied by bash. The text
following "#" is usually an explanation of a result and is not part of
the output of any program. Occasionally lines starting with "#" are pure
comment lines.
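Returning to the line terminators described in the background: the byte values of "\n" and "\r" are easy to check for yourself. The following is a minimal sketch, assuming a POSIX shell with the standard "od", "tr" and "wc" utilities; the function name ascii_code is a hypothetical helper invented for this illustration, not a standard command.

```shell
# ascii_code is a hypothetical helper (not a standard utility): it prints
# the decimal byte value of the single character produced by a printf escape.
ascii_code() {
    printf "$1" | od -An -tu1 | tr -d ' \n'
}

nl_code=$(ascii_code '\n')    # newline, the UNIX line terminator
cr_code=$(ascii_code '\r')    # carriage return, used by DOS before \n

# "hello" is 5 characters; its terminating \n adds one byte to the count.
byte_count=$(printf 'hello\n' | wc -c | tr -d ' ')

echo "newline=$nl_code carriage-return=$cr_code bytes=$byte_count"
```

The byte count illustrates the point made above: the terminator is counted in the size of the file even though it is not part of the text of the line.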
Also, the end-of-file signal is given by "control D" (spelt as cntrl D).

NB: I will assume that you are familiar with the following UNIX
features: > (redirection), | (pipe), control D (end of file), and the
"ls", "cat" and "echo" commands.

------------------------------------------------------------------------
IB. First example
------------------------------------------------------------------------

The simplest format of grep is

grep pattern inputstream

where pattern can be unquoted (i.e. the search pattern spelt in the
usual fashion) or quoted weakly ("...") or quoted strongly ('...'). For
now we will start with no quoting and then gradually learn when quoting
is necessary (and when a particular type of quoting is necessary).

Below we use the UNIX echo command to feed an input line to grep.

$ echo hello
hello

grep reads its input and returns the line if a match is found. No line
is returned if no match is found.

$ echo hello | grep e -    #formally, "-" names standard input (here, the pipe)
hello                      #match
$ echo hello | grep e      #this informal approach also works
hello                      #match
$ echo hello | grep z      #no match, no line returned

However, typically grep is used with an input file rather than a stream
of characters coming from a keyboard. [All input files used in this
lesson can be found in this sub-directory].

$ cat > SimpleInput.dat
1 hello, parvi rabbit
2 there are many types of rabbits, ordinary rabbits
3 and rabbits that have been bred over centuries.
4 Rabbits have been around, apparently, for at least
5 three million years.
cntrl D

$ grep rabbit SimpleInput.dat
1 hello, parvi rabbit                                  #exact match
2 there are many types of rabbits, ordinary rabbits    #match, even if plural
3 and rabbits that have been bred over centuries.

Before I explain the result I remark that we are used to thinking in
terms of "words".
A word is (usually) a collection of purely alphabetic characters
preceded by a "blank" (which is usually a space character or perhaps a
tab) and succeeded by a blank or a character from the set of
punctuation marks (.;,?"'). When writing or thinking about UNIX, abandon
your training in English. grep (and other UNIX utilities) operate on a
literal basis.

In the example above we asked grep to search for "rabbit". It found
matches to "rabbit" on lines 1, 2 and 3. You are comfortable with the
match on line 1 but are likely uncomfortable with (or puzzled by) the
matches reported on lines 2 and 3. The fact is that "rabbits" is also a
match to "rabbit" in a literal sense. Note also that line 4 is not
reported: the search is case sensitive, so "Rabbits" does not match
"rabbit".

If you wanted to search exclusively for the word rabbit then you should
say "find a pattern which is 'rabbit' AND has a space preceding it AND a
space or punctuation succeeding it". Even this may not be an air-tight
prescription since it will miss the sentence which starts off with
"rabbit" (in this case there is no preceding character at all; the start
of a line has no ASCII representation). Another possibility is that the
word is enclosed by "()" or double quotes or single quotes. So you can
see that an accurate and comprehensive specification to search for the
word "rabbit" is not trivial.

Next, note that grep searches a line and, upon finding the first match,
stops the search and prints the line. It then proceeds to the next line.
For this reason, in the above example, line 2 is printed only once.

The example above was perhaps the simplest possible search. grep has
powerful search capability. You can specify a pattern via a framework
and grep will find all patterns that match the framework. For that you
have to learn "regular" expressions.

------------------------------------------------------------------------
II. ANCHORS: ^, $
------------------------------------------------------------------------

Regular expressions (like pipes and the universal file construct) lie at
the heart of UNIX.
There are many families of regular expression ("regexp"). We will start
with the earliest and most basic regular expression (BRE).

As noted above, all UNIX files consist of lines followed by an EOF.
Lines have a beginning (null) and an end (\n, which has an ASCII
representation). Clearly, it makes sense to start with markers for the
beginning and end of the line. The characters "^" and "$" are anchors
(positional indicators) for the start and end of a line. These two
characters are called "meta-characters" because they have extraordinary
power (unless "escaped"; more on this later).

I have to warn you that the rest of this section is not easy. You will
find that your intuition will, almost always, be wrong. It is worth
doing the exercises several times whilst synthesizing the results.
Ideally, you should grasp this section fully before proceeding ahead.
However, you may well find that this is not possible. If so, simply go
ahead and keep revisiting this section!

$ cat InputRegEx.dat
hello, parvi
% hello, other rabbits
.hello, kitty,
& good morning, world #
sayonara wall street$

#Now we undertake a variety of searches

$ grep ^h InputRegEx.dat        #identify lines that begin with "h"
hello, parvi
$ grep ^% InputRegEx.dat        #identify lines that begin with "%"
% hello, other rabbits
$ grep parvi$ InputRegEx.dat    #find lines ending with parvi
hello, parvi

#For pedagogical purposes I list some constructs which
#are useless because they list all lines in the input file!
#Understand why this is the case

$ grep $ filename     #Why? 'cause all lines have an ending
$ grep ^ filename     #Why? 'cause all lines have a beginning

#the following is useless since it does not list any line!

$ grep A^ filename    # if A is any character other than "["

Please ponder the above command and convince yourself why there is no
match. If you are not able to figure it out then talk to me.

Additional Literature: Wikipedia has an excellent & comprehensive review
of regular expressions.
https://en.wikipedia.org/wiki/Regular_expression
The article may be better appreciated after you finish taking this
class.

------------------------------------------------------------------------
IIB. First introduction to Quoting
------------------------------------------------------------------------

So far our patterns have not been "quoted". We will find that sometimes
you HAVE to quote. To illustrate this let us say that our goal is to
find lines ending with "#".

$ grep #$ InputRegEx.dat        #command fails
# ... "GREP1/grep does not exist"

The reason this command failed is that "#" is a special character for
the shell ("comment"). Characters succeeding # are not read by the
shell. The "do not read" feature lasts for the rest of the line, so grep
never receives the pattern or the file name. Quoting informs the shell
that "#" should not be interpreted in the usual fashion.

#Either quoting works
$ grep '#$' InputRegEx.dat
$ grep "#$" InputRegEx.dat

For this reason, most users simply quote all patterns. Later there will
be a more detailed and finer distinction between double and single
quotes.

Curious? Try
$ grep '$A' InputFile    #Will not return any hit regardless of what A is

------------------------------------------------------------------------
IIC. Converting meta-characters to literals
------------------------------------------------------------------------

#Goal: you want to list lines beginning with "."

$ grep '^.' InputRegEx.dat    #Naively you try the obvious approach
hello, parvi
% hello, other rabbits
.hello, kitty,
& good morning, world #
sayonara wall street$

Why did all lines get listed? All lines have a beginning, "^". Most
lines have at least one character. In regexp, as explained below (IV),
"." is a meta-character. It denotes one character (any character). Thus
the pattern ^. is satisfied by lines which have one or more characters.
In the above example, all lines satisfy this criterion.

Clearly, we need to inform grep that we are genuinely looking for the
period or "." character.
So we need to strip "." of its special meaning. We do so by "escaping"
it, that is, by writing "\.".

$ grep '^\.' InputRegEx.dat
.hello, kitty,    #It worked!

Do not fret too much if you do not understand this example. Below we
have an entire section (VII) devoted to this painful issue.

Curious? Try
$ grep ^\. InputRegEx.dat    #will print the entire file. Why?
hello, parvi
% hello, other rabbits
.hello, kitty,
& good morning, world #
sayonara wall street$

[This is an idiosyncrasy that one should get used to! Lesson: quote the
pattern].

------------------------------------------------------------------------
III. INTERLUDE: Line numbering, showing non-printable characters
------------------------------------------------------------------------

Since the UNIX framework for files is a series of lines it is only
natural that many utilities can number each line. Numbering lines is
extremely useful for locating errors when debugging and for identifying
lines of interest when using line oriented editors like vi (or emacs for
that matter).

$ cat Lines.dat
a
aa
b
ba
                  #To make this "line with zero characters" simply hit return

$ cat -e Lines.dat       #with option "-e" control chars are revealed
a$
aa$                      #"\n" is displayed as "$"
b$
ba$
$                        #Notice this line. It is a null line (no chars)

$ cat -e -n Lines.dat    #the option "-n" results in printing of the input
1 a$                     #line number
2 aa$
3 b$
4 ba$
5 $                      #make sure that there is no space preceding $.

Incidentally, if you actually wanted to insert line numbers into a file
that you have already constructed then you can do so in at least a dozen
ways. The classic method is

$ nl Lines.dat > LinesNumbered.dat

Note that ordinarily nl does not number empty lines. If you want to
number all lines then

$ nl -ba Lines.dat > LinesNumbered.dat

------------------------------------------------------------------------
IIIb. Options
------------------------------------------------------------------------

All UNIX utilities have "options". grep has many options.
It is worth reading the "man" pages (just to appreciate the richness of
grep). Below we will use the option "-n" (for numbering output lines
using the input line number).

$ grep -n '' Lines.dat

------------------------------------------------------------------------
IV. Quantification: * . \{ \}
------------------------------------------------------------------------

The regular expression "." means match one character (any character).

$ grep -n "." Lines.dat    #list all lines which contain at least one character
1:a
2:aa
3:b
4:ba

As you can see, line 5 is not in the output (because it is a null line;
just \n).

$ grep -n "a" Lines.dat     #Search for lines containing "a"
1:a
2:aa
4:ba
$ grep -n "^a" Lines.dat    #Search lines starting with "a"
1:a
2:aa

The regular expression "C*" maps to (1) one or more of the character "C"
(where C can be any character) OR (2) zero occurrences of "C", i.e. a
string of zero length. The rookie mistake is to forget (2). Note that
"*" applies to the single character preceding it. Thus "ba*" maps to
"b", "ba", "baa" and so on.

$ grep -n 'a*' Lines.dat    #Any such command will match ALL lines
1:a                         #Why? (Think hard and only proceed after you have
2:aa                        #figured out the answer)
3:b
4:ba
5:                          #Notice even this null line is also matched

$ grep -n 'aa*' Lines.dat   #matches one or more "a"
1:a
2:aa
4:ba

$ grep -n 'a\{2\}' Lines.dat      #Match two "a" in succession
2:aa
$ grep -n 'a\{1,2\}' Lines.dat    #Match "aa" or "a"
1:a
2:aa
4:ba
$ grep -n 'a\{1,\}' Lines.dat     #Matches "a", "aa", "aaa" and so on
1:a
2:aa
4:ba

ROOKIE MISTAKE GIVES GREAT UNDERSTANDING: Now I address the usual rookie
mistake. Review the command below. Do not execute the command. First
mentally compute your answer. Then execute the command and explain the
rather surprising answer. Commit this example to memory.

$ echo "abc" | grep 'z*'
abc

------------------------------------------------------------------------
IVB.
Introduction to sed
------------------------------------------------------------------------

I take the occasion to introduce "sed". No need to get stressed out! We
will have two classes on sed. The usual invocation of sed is

$ sed action InputFile

The most famous action is to "identify and replace" (substitution). Say
you have written a story (file: Storybook) in which the main protagonist
is a dog. After finishing and upon reflection you decide to change the
protagonist to a rabbit (naturally).

$ sed 's/dog/rabbit/' Storybook    #this is not a foolproof substitution.
                                   #"dogged" becomes "rabbitged".

There is a simple way to fix such errors.

$ echo "abc" | sed 's/a/A/'      #will substitute A for a, but only for the first match
Abc
$ echo "abca" | sed 's/a/A/g'    #will substitute A for a, for all matches
AbcA                             #the "flag" "g" stands for "global" (all matches)

Having learnt one (but powerful) command in sed let us redo the rookie
mistake exercise.

$ echo "abc" | sed 's/b*/1/'
1abc
$ echo "abc" | sed 's/b*/1/g'
1a1c1

Please ponder the outcomes!

------------------------------------------------------------------------
V. CHARACTER SET:
------------------------------------------------------------------------

A character set allows for a set of characters to be probed instead of
only one.

[abc] would match "a", "b" or "c"

>> Remember a character set is still only one character <<

[If you want more than one character then you need to concatenate as in
"[abc][xz]". This pattern is matched by six combinations: ax, az, bx,
bz, cx and cz.]

Ranges are allowed (they map to the underlying contiguous range of ASCII
values; if this statement is not clear look up the ASCII table). For
instance [0-9] is a character set which includes all digits from 0 to 9.
By induction

[a-z]    a through z (lower case)
[A-Z]    A through Z (upper case)
[a-d]    a through d

Ranges can be combined:

[0-9a-z]    This pattern is matched by digits and lower case alphabets.
[a-zA-Z]    This pattern is matched by alphabets, independent of case

There is no ambiguity in the above construction. The first "-" specifies
the range a through z and the second dash specifies the range A through
Z.

Adding "^" as the first character in a set negates the entire set. For
example, [^0-9] means all characters BUT the 10 digits. [It is
unfortunate that "^" has another meaning; see II].

$ cat Lines2.dat
a
b
bc
def
1
345
&%^?-+

$ grep '[a-z]' Lines2.dat
a
b
bc
def
$ grep '[0-9]' Lines2.dat
1
345
$ grep '[A-Z]' Lines2.dat
$ grep '[^a-z]' Lines2.dat
1
345
&%^?-+
$ grep '[0-9a-z]' Lines2.dat
a
b
bc
def
1
345
$ grep '[^0-9a-z]' Lines2.dat    #example of negating two ranges
&%^?-+

------------------------------------------------------------------------
VB. POSIX Character classes
------------------------------------------------------------------------

Rather than write [0-9] or [a-zA-Z] we can use POSIX character classes

[:digit:]    any digit, 0-9
[:alpha:]    any alphabet (upper & lower), A-Z & a-z
[:alnum:]    any alpha or numeric character
[:cntrl:]    control characters
[:print:]    any printable character (i.e. [^[:cntrl:]])
[:blank:]    a space or tab
[:graph:]    any printable character except the space

POSIX classes can be used only within the character set framework. Thus
the correct usage is always "[[:digit:]]".

$ grep '[[:digit:]]' Lines2.dat              #selects lines with digits
1
345
$ grep '[^[:digit:][:alpha:]]' Lines2.dat    #selects non-alphanumeric chars
&%^?-+

COMMON ROOKIE MISTAKE: It is a common mistake to use "[:digit:]" instead
of "[[:digit:]]". Study (and understand) the rather curious outcome.
[Recent versions of GNU grep refuse this pattern with an error message
instead; the outcome below is what a literal reading of the pattern
gives.]

$ grep '[:digit:]' Lines2.dat
def

------------------------------------------------------------------------
VI. Back-referencing & Extracting Tokens
------------------------------------------------------------------------

The construct "\(RegularExpression\)" yields a token. The text thus
captured is available in "\1". Nine such tokens can be defined (in
sequential order).
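To make the numbering of tokens concrete, here is a minimal sketch using only grep and the sed substitution introduced in IVB. The input strings and the variable names (doubled, swapped) are made up for this illustration.

```shell
# \(.\) captures one character; \1 then demands that very same character
# again, so the pattern matches only lines containing a doubled character.
doubled=$(printf 'a\naa\nb\nba\n' | grep '\(.\)\1')

# Tokens are numbered left to right: \1 is "hello" and \2 is "world".
# The replacement text writes them back in the opposite order.
swapped=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

echo "$doubled"    # aa
echo "$swapped"    # world hello
```

The second command shows why the order of definition matters: the number refers to the position of the opening "\(" in the pattern, not to the order in which the tokens are used.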
Thus for example "\(.\)" captures a character.

The simplest use of back-referencing is to search for a repeated pattern
in the same line. Using a back-reference saves retyping the pattern.
This is not the best use of back-references (actually it slows down the
performance).

$ grep '\(sky is blue\).*\1' BlueSky.dat
The sky is blue at 10 am. The sky is blue at noon.

A better example is to extract quoted words

$ echo 'Hello, my name is "Kitty"' | grep "\([\"']\).*\1"

PAUSE: You should pause here and understand the quoting construction.
The basic rule, when you need both kinds of quotes, is to build from the
inside out. Start with the quoting for echo: 'Hello, my name is
"Kitty"'. Here I wanted to quote Kitty with double quotes. Ergo, the
outer quoting has to be single quotes. Next, the pattern for grep. Here
I wanted to include both " and ' inside the pattern. Given that we want
both quotes I arbitrarily chose " as the outer quote. Ergo, the double
quote " must be escaped inside the pattern.

Having understood the quoting let us study the pattern

"\([\"']\).*\1"

The pattern is designed to find all characters enclosed between double
quotes or single quotes. Say a double quote is found as grep scans from
left to right. Then grep looks for a closing double quote and not a
single quote. The same argument applies for a single quote. This is
indeed a clever use of back-referencing.

A fun use of back-references is to identify palindromes. In this case
the pattern to be identified is unknown. The expression "\(.\).\1"
accurately describes a 3-character palindrome, e.g. "ANA", "dud", "eve"
and so on. Armed with tokens let us search for five character
palindromes!

$ cat Palindrome.dat
In the city of GadaG there lived two dogs whose owner was on Mr. TenteT.
One was called RotoR and the other SoloS.
One day one dog ran away and other dog got hit by a bus.
SoloS and the owner became very sad.
$ grep -n '\(.\)\(.\).\2\1' Palindrome.dat
1:In the city of GadaG there lived two dogs whose owner was on Mr. TenteT.
2:One was called RotoR and the other SoloS.
3:One day one dog ran away and other dog got hit by a bus.
4:SoloS and the owner became very sad.

The identification of GadaG and RotoR is clear. Incidentally, TeneT and
SoloS were not identified by grep. The search stopped when a match was
found. grep printed the line and went to the next line. [More below].

>> You may wish to figure out why line 3 is reported <<

Our third class will be on "sed". However, modest use of sed will give
insight into the above exercise. Here is a brief summary of how
substitution by sed works.

$ sed 's;Reg;Replace;' InFile

replaces the text found by the regular expression, Reg, by the text
given by Replace. Below we will identify palindromes and replace them by
"Z" (as a marker).

$ sed 's;\(.\)\(.\).\2\1;Z;' Palindrome.dat
In the city of Z there lived two dogs whose owner was on Mr. TenteT.
One was called Z and the other SoloS.
One day one dog ran away and other dZt hit by a bus.
Z and the owner became very sad.

------------------------------------------------------------------------
VII. How to search for meta-characters
------------------------------------------------------------------------

The meta-characters for basic regular expressions (BRE) are

^ $ . * [ ] - &

A meta-character is downgraded to an ordinary character by escaping it.
That is, when you wish to use any of the above characters literally,
precede it by "\". We have one such example in IIC. However, not all
meta-characters need to be escaped (for sound reasons). Though it is a
technical point, it is not prudent to simply escape ALL meta-characters
(you can often get away with escaping all of them, but it is not POSIX
compliant).

$ cat Metacharacter.dat
We explore meta-characters here: *, ., ^, $, [, ] & -.
We have not introduce & so far.
                                  #this is a null line (see III)
.Some UNIX tools have line starting with a .
In regular English no sentence ends in $

$ grep '\$' Metacharacter.dat    #Identify lines with a "$" character
We explore meta-characters here: *, ., ^, $, [, ] & -.
In regular English no sentence ends in $

$ grep ']' Metacharacter.dat
We explore meta-characters here: *, ., ^, $, [, ] & -.    #surprised?

$ grep '[' Metacharacter.dat
grep: brackets ([ ]) not balanced                         #hmm

$ grep '\[' Metacharacter.dat    #solution 1
We explore meta-characters here: *, ., ^, $, [, ] & -.

$ grep '\^' Metacharacter.dat    #find lines with "^" character
We explore meta-characters here: *, ., ^, $, [, ] & -.

You can deepen your understanding of the regular expression framework by
exploring character sets containing meta-characters. To start with,
meta-characters lose their special status when inside a character set.
However, how you place them inside "[ ]" does matter.

$ grep '[]]' Metacharacter.dat
We explore meta-characters here: *, ., ^, $, [, ] & -.

$ grep '[^]' Metacharacter.dat    #hmmm
grep: brackets ([ ]) not balanced

$ grep '[]*]' Metacharacter.dat
We explore meta-characters here: *, ., ^, $, [, ] & -.

$ grep '[*]]' Metacharacter.dat
does not return any line. Why?

------------------------------------------------------------------------
HOMEWORK:

[1] Look at the file "BlankLines.dat". It has a variety of blank lines.
Use "grep -n" to identify (and count) all blank lines. Then do the same
with a finer search (space, tab, null). Hint: "tab" is a control
character and the way you pass this character to a pattern is "cntrl V
tab".

[2] Devise a way to search for words defined as follows: a word consists
only of lower case alphabets, except the first alphabet, which can be a
capital. The word can be preceded by a blank, tab or null (start of a
line). It can end with a blank (space or tab) or punctuation (,;.?:!).
The word can be enclosed in single or double quotes.

[3] The utility "nl" has some interesting options.
$ nl -bp'Regexp' Infile

will number only those lines that match the regular expression specified
by Regexp. Compare that with "grep -n".
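As a starting point for the comparison in [3], the two commands can be run side by side on the five lines of Lines.dat from section III. This is only a sketch: it assumes the "mktemp" utility is available to create a scratch file, and the variable names (tmp, grep_out, nl_out) are invented for the illustration.

```shell
# Scratch copy of Lines.dat (a, aa, b, ba and one null line).
# The file name is whatever mktemp hands back; nothing else is touched.
tmp=$(mktemp)
printf 'a\naa\nb\nba\n\n' > "$tmp"

# grep -n prints only the matching lines, each tagged with its INPUT
# line number.
grep_out=$(grep -n 'a' "$tmp")

# nl -bp prints every line but numbers only those matching the pattern;
# its counter advances only when a line is numbered.
nl_out=$(nl -bp'a' "$tmp")

echo "--- grep -n 'a'"
echo "$grep_out"
echo "--- nl -bp'a'"
echo "$nl_out"

rm -f "$tmp"
```

The line "ba" is the one to watch: grep -n reports it as 4 (its position in the input) whereas nl labels it 3 (the third line that matched).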