Histogramming signature list

Occasionally you see a petition from the students and alumni of
a college.  Usually the letter ends with the signers' names and the
year they graduated. I am always curious about the distribution of
their ages (really, the number of years since college).  I pull the
text off the web and save the list of signatures to a file.

$ cat Middlebury.txt 
Sam Catlin ‘14.5
Kate McCreary ‘15
Alexandria Jackman ‘14

Our goal here is seemingly simple: for each line just retain the
graduation year.  A further goal is to get rid of the 0.5 and either
round up or down.

However, the file was generated on Windows and has many weird
characters.  I could not delete these characters with sed or tr.
Understanding how to deal with such characters is in fact the point of
this example. [The weird characters are normal in the sense that most
of the world now uses the "UTF-8" (or UTF8) encoding. This encoding
includes basic ASCII (the first 128 values) and much, much more.]

For the purpose of the exercise I will save the first line as 1b.txt

$ gcat -A 1b.txt       #GNU cat -A is a simple way to see control or upper-ASCII characters
Sam Catlin M-bM-^@M-^X14.5$

So indeed, we have some weird characters between the last space and
the end of the line.

We will use "od" (octal dump) to see these characters:
$ od -c -b -a 1b.txt
0000000    S   a   m       C   a   t   l   i   n       ‘  **  **   1   4
          123 141 155 040 103 141 164 154 151 156 040 342 200 230 061 064
           S   a   m  sp   C   a   t   l   i   n  sp   ?  80  98   1   4
0000020    .   5  \n                                                    
          056 065 012                                                    
           .   5  nl                                                    

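So we see that the offending bytes are \342, \200 and \230 (these
are octal representations) and, predictably, they are all >127 (upper
ASCII). Together they form the three-byte UTF-8 encoding of a single
character, the left curly quote ‘ (U+2018) that precedes the year.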
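
As a cross-check on the dump, the three bytes can be decoded by hand. The
sketch below assumes they form one three-byte UTF-8 sequence (bit layout
1110xxxx 10xxxxxx 10xxxxxx); the function name decode_utf8_3 is made up
for this note.

```shell
# decode_utf8_3 B1 B2 B3: print the code point encoded by a three-byte
# UTF-8 sequence. The bytes are given as shell integers; a leading 0
# means octal, which matches the od output.
decode_utf8_3() {
    # keep the low 4 bits of the lead byte and the low 6 bits of each
    # continuation byte, then concatenate them
    cp=$(( (($1 & 017) << 12) | (($2 & 077) << 6) | ($3 & 077) ))
    printf 'U+%04X\n' "$cp"
}
```

$ decode_utf8_3 0342 0200 0230
U+2018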


	#iconv converts text from one encoding to another
	#iconv -f input-encoding -t output-encoding infile > outfile
	#the switch "-c" means delete all chars that are not present in the
	#output encoding
$ iconv -t ASCII -c 1b.txt | gcat -A
Sam Catlin 14.5$
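
When -f is omitted, iconv takes the input encoding from the current
locale, so it is safer to name it. A minimal sketch, assuming the file
really is UTF-8 (the od dump suggests so); the wrapper name to_ascii is
hypothetical.

```shell
# to_ascii: read UTF-8 text on stdin and write plain ASCII, silently
# dropping anything that has no ASCII equivalent (-c).
# Assumption: the input is UTF-8. Some iconv builds may exit non-zero
# when -c had to drop characters, so do not rely on the exit status.
to_ascii() {
    iconv -f UTF-8 -t ASCII -c
}
```

$ to_ascii < 1b.txt
Sam Catlin 14.5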

IIB. HOW TO DEAL WITH WEIRD CHARACTERS: use classes to retain what you want

		# "c" stands for complement, "d" for delete
		# so retain alphanumeric, period, newline and space
$ tr -cd '[:alnum:] .\n' < 1b.txt | gcat -A
Sam Catlin 14.5$
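
A detail worth checking in the character list: tr takes a set of
characters, not a regular expression, so a POSIX class is written as
[:alnum:] by itself, and any additional [ or ] in the list are ordinary
characters that would also be kept. A sketch with a hypothetical function
name; LC_ALL=C is an assumption made here so that tr works byte by byte.

```shell
# keep_basic: retain only ASCII letters, digits, space, period and
# newline from stdin; every other byte is deleted.
keep_basic() {
    # -c complements the set, -d deletes everything in the complement
    LC_ALL=C tr -cd '[:alnum:] .\n'
}
```

$ keep_basic < 1b.txt
Sam Catlin 14.5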


The "locale" setting tells UNIX which collating sequence to use, i.e. the mapping between characters & values.
We tend to assume that there are only 128 ASCII characters, but files from Windows contain all sorts
of weird characters.

$ echo $LANG	#check the locale setting

$ lang=$LANG	#save the setting
$ LANG="C"      #use the collation sequence etc that is the simplest: standard ASCII
$ echo $LANG

$ tr -d "\200-\377" < 1b.txt | gcat -A    #delete all "upper ASCII" chars (128-255 decimal)
Sam Catlin 14.5$
$ LANG=$lang    #restore the saved setting
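
Instead of changing LANG for the whole session, the locale can be set for
a single command. The sketch below uses LC_ALL, which takes precedence
over LANG and the other LC_* variables; strip_high_bytes is a name
invented here.

```shell
# strip_high_bytes: delete every byte in the range 128-255 from stdin.
# LC_ALL=C applies to this one tr invocation only, so the session
# locale is left untouched.
strip_high_bytes() {
    LC_ALL=C tr -d '\200-\377'
}
```

$ strip_high_bytes < 1b.txt
Sam Catlin 14.5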


We need to retain only the last set of characters (the year) and then get rid of the ".5".

$ tr -cd '[:alnum:] .\n' < 1b.txt | sed 's/\(^.* \)\([0-9\.]*$\)/\2/;s/\.5//'
14
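
Deleting the .5 rounds the half year down. If rounding up is preferred,
awk can do the arithmetic. A sketch, assuming the input has already been
reduced to a bare two-digit year with an optional .5; round_up_year is a
hypothetical name.

```shell
# round_up_year: read two-digit years (e.g. 14, 14.5, 08) on stdin and
# print them with any half year rounded up.
round_up_year() {
    # int(x + 0.5) rounds the half year up; % 100 wraps 99.5 to 00;
    # %02d keeps the leading zero of years such as 08
    awk '{ printf "%02d\n", int($1 + 0.5) % 100 }'
}
```

$ echo 14.5 | round_up_year
15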

$  iconv -t ASCII -c  Middlebury.txt | sed 's/\(^.* \)\([0-9\.]*$\)/\2/;s/\.5//' | sort|uniq -c 
   1 01
   1 03
   3 05
   1 06
   2 07
   5 08

The first column is the number of students and the second column
is the year of graduation (limited to the last two digits of the
year). It is not a nice display. We would like to convert the
truncated graduation year to the full year.  Furthermore, the years
are incorrectly ordered (01 sorts ahead of 71, although 2001 comes after 1971).

$ iconv -t ASCII -c  Middlebury.txt | sed 's/\(^.* \)\([0-9\.]*$\)/\2/;s/\.5//' | sort | uniq -c | awk '{b=$2; a=b+(b<20)*2000+(b>=20)*1900; print a,$1}' | sort -n
1971 1
1979 1
1987 1
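
The awk arithmetic relies on a pivot around 20: smaller two-digit years
are taken as 20xx, larger ones as 19xx. The sketch below makes the pivot
an explicit parameter. The function name and the default of 20 are
assumptions and should be adjusted to the data at hand.

```shell
# expand_years [PIVOT]: read "count yy" lines (uniq -c output) on stdin
# and print "yyyy count", sorted by year. Two-digit years below PIVOT
# (default 20, an assumption) become 20yy, all others 19yy.
expand_years() {
    awk -v pivot="${1:-20}" '{
        yy = $2 + 0                          # force numeric: "08" becomes 8
        print yy + (yy < pivot ? 2000 : 1900), $1
    }' | sort -n
}
```

$ printf '   3 05\n   1 71\n' | expand_years
1971 1
2005 3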

Voila! You are done

I applied this to the recent petition by Middlebury College students.


Conclusion: Activism has a duration of about 6 years. 

% year of graduation & number of students graduating in that year
1971 1
1979 1
1987 1
1991 1
1994 1
1999 1
2001 1
2003 1
2005 3
2006 1
2007 2
2008 5
2009 22
2010 33
2011 46
2012 59
2013 71
2014 84
2015 100
2016 66
2018 1
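
For a quick look at the shape of the distribution, the year/count pairs
can be drawn as a text histogram. A sketch, assuming the two-column
"year count" format of the table; the function name text_hist and the
scale of one # per 5 signatures are arbitrary choices.

```shell
# text_hist [PER]: read "year count" lines on stdin and print one bar
# per year, with one # for every PER signatures (default 5, rounded up
# so that small counts still show). Non-numeric lines are skipped.
text_hist() {
    awk -v per="${1:-5}" '$1 ~ /^[0-9]+$/ {
        bar = ""
        n = int(($2 + per - 1) / per)        # ceiling of count / per
        for (i = 0; i < n; i++) bar = bar "#"
        printf "%s %3d %s\n", $1, $2, bar
    }'
}
```

$ printf '2008 5\n2009 22\n' | text_hist
2008   5 #
2009  22 #####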