--------------------------------------------------------------------------------
Histogramming signature list
--------------------------------------------------------------------------------

Occasionally you see a petition coming from students and alums of a college.
Usually the letter ends with the names and the year they graduated. I am always
curious about the distribution of the ages (well, really the years post
college). I copy the text from the web and save the list of signatures to a
file.

$ cat Middlebury.txt
Sam Catlin ‘14.5
Kate McCreary ‘15
Alexandria Jackman ‘14
...

Our goal here is seemingly simple: for each line just retain the graduation
year. A further goal is to get rid of the 0.5 and either round up or down.
However, the file was generated with Windows and has many weird characters. I
could not delete these characters directly with sed or tr. Understanding how to
deal with such characters is in fact the point of this example.

[The weird characters are normal in the sense that most of the world is now
using the "UTF-8" (or UTF8) coding. This coding includes basic ASCII (the first
128 values) and then much, much more.]

I. WEIRD CHARACTERS

For the purpose of the exercise I will save the first line as 1b.txt

$ gcat -A 1b.txt     #GNU cat is a simple way to see cntrl or upper ASCII
Sam Catlin M-bM-^@M-^X14.5$

So indeed, we have some weird characters between the last space and the end of
the line. In the "cat -A" notation "M-" marks a byte with the high bit set, so
M-b, M-^@ and M-^X are three non-ASCII bytes. We will use "od" (octal dump) to
see these characters

$ od -c -b -a 1b.txt
0000000    S   a   m       C   a   t   l   i   n       ‘  **  **   1   4
        123 141 155 040 103 141 164 154 151 156 040 342 200 230 061 064
          S   a   m  sp   C   a   t   l   i   n  sp   ?  80  98   1   4
0000020    .   5  \n
        056 065 012
          .   5  nl

So we see that the offending bytes are \342, \200 and \230 (these are octal
representations) and predictably they are >127 (upper ASCII). Together the
three bytes are the UTF-8 encoding of a single character, the left single
quotation mark (U+2018).

IIA. HOW TO DEAL WITH WEIRD CHARACTERS: iconv

#iconv converts text from one character coding to another
#iconv -f inputfile-encoding -t desired-coding infile > outfile
#the switch "-c" means delete all chars that are not present in the
#output coding

$ iconv -t ASCII -c 1b.txt | gcat -A
Sam Catlin 14.5$

IIB. HOW TO DEAL WITH WEIRD CHARACTERS: use classes to retain what you want

# "c" stands for complement, "d" for delete
# so retain alphanumeric, period, newline and space
# (tr takes a plain set of characters; the class name [:alnum:] carries its
# own brackets, so no outer brackets or escaped period are needed)

$ tr -cd '[:alnum:] .\n' < 1b.txt | gcat -A
Sam Catlin 14.5$

IIC. HOW TO DEAL WITH WEIRD CHARACTERS: LOCALE Setting

The "locale" setting tells UNIX which collating sequence and which mapping
between characters and byte values to use. We tend to assume that there are
only 128 ASCII characters, but with files from Windows we get all sorts of
other characters. In the "C" locale tr works byte by byte, so the upper bytes
can be deleted as a range.

$ echo $LANG     #check the locale setting
en_US.UTF-8
$ lang=$LANG     #save the setting
$ LANG="C"       #use the simplest collation sequence etc: standard ASCII
                 #(has no effect if LC_ALL is set, since LC_ALL overrides LANG)
$ echo $LANG
C
$ tr -d "\200-\377" < 1b.txt | gcat -A   #delete all "upper ascii" char (128-255 decimal)
Sam Catlin 14.5$
$ LANG=$lang     #restore the saved setting

III. THE FINALE

We need to retain only the last set of characters and then get rid of the ".5"
(that is, round down)

$ tr -cd '[:alnum:] .\n' < 1b.txt | sed 's/^.* \([0-9.]*\)$/\1/;s/\.5$//'
14

$ iconv -t ASCII -c Middlebury.txt | sed 's/^.* \([0-9.]*\)$/\1/;s/\.5$//' | sort | uniq -c
   1 01
   1 03
   3 05
   1 06
   2 07
   5 08
...

The first column is the number of students and the second column is the year of
graduation (limited to the last two digits of the graduation year). It is not a
nice display. We would like to convert the truncated graduation year to the
full year. Furthermore the years are incorrectly ordered (for example, 01 sorts
before 71, although 2001 comes after 1971). The awk step below treats two-digit
years below 20 as 20xx and all others as 19xx.

$ iconv -t ASCII -c Middlebury.txt | sed 's/^.* \([0-9.]*\)$/\1/;s/\.5$//' | sort | uniq -c | awk '{b=$2+0; a=(b<20 ? 2000 : 1900)+b; print a,$1}' | sort -n
1971 1
1979 1
1987 1
...

Voilà! You are done.

IV. APPLICATION

I applied this to the recent petition by Middlebury College students.

https://middleburycampus.com/article/charles-murray-at-middlebury-unacceptable-and-unethical-say-over-450-alumni/

Conclusion: Activism has a duration of about 6 years.

% year of graduation & number of students graduating in that year
1971   1
1979   1
1987   1
1991   1
1994   1
1999   1
2001   1
2003   1
2005   3
2006   1
2007   2
2008   5
2009  22
2010  33
2011  46
2012  59
2013  71
2014  84
2015 100
2016  66
2018   1
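
The whole pipeline can be tried without the Middlebury file. What follows is
only a sketch under stated assumptions, not part of the original session: the
function names year_histogram and sample are made up for illustration, and so
is the fourth sample line. It assumes a POSIX shell with tr, sed, sort, uniq
and awk, and it takes the tr route of section IIB (run under LC_ALL=C so that
tr works byte by byte) rather than iconv.

```shell
#!/bin/sh
# year_histogram: hypothetical helper, not from the original notes.
# Reads signature lines on stdin, prints "full-year count", oldest class first.
#   1. tr   : delete every byte that is not a letter, digit, space, period or newline
#   2. sed  : keep only the last field, then drop a trailing ".5" (round down)
#   3. sort | uniq -c : count signatures per two-digit year
#   4. awk  : assumed pivot at 20 (00-19 become 20xx, everything else 19xx)
#   5. sort -n : order by full year
year_histogram() {
    LC_ALL=C tr -cd '[:alnum:] .\n' |
    sed 's/^.* \([0-9.]*\)$/\1/; s/\.5$//' |
    sort | uniq -c |
    awk '{ y = $2 + 0; full = (y < 20) ? 2000 + y : 1900 + y; print full, $1 }' |
    sort -n
}

# sample: made-up test input. \342\200\230 is the left single quotation mark
# written as its three UTF-8 bytes in octal, so this script itself stays ASCII.
sample() {
    printf 'Sam Catlin \342\200\23014.5\n'
    printf 'Kate McCreary \342\200\23015\n'
    printf 'Alexandria Jackman \342\200\23014\n'
    printf 'Pat Example \342\200\23071\n'
}

sample | year_histogram
```

With these four lines the script prints "1971 1", "2014 2" and "2015 1", one
pair per line. To run it on a real list, replace the last line with
year_histogram < Middlebury.txt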