------------------------------------------------------------------------
participants at a conference
------------------------------------------------------------------------
You are at a conference and are curious about the geographical
distribution of the delegates. From the conference website you extract
the participant list as a text file and delete the headers and
footers to obtain "Participants.txt" (see end of this writeup).

$ cat Participants.txt
Awni Al-Khasawneh	Jordan
Ulisses Barres de Almeida	Brazil
Scott Barthelmy	USA
Varun Bhalerao	India
John Blakeslee	USA
Patrick Brady	USA
Federica Bianco	USA
Robert Braun	UK
David Buckley	South Africa
Laura Cadonati	USA
Manuella Campanelli	USA
John Carpenter	Chile
Alberto Castro-Tirado	Spain
Brad Cenko	USA
Valerie Connaughton	USA
Roger Davies	UK
Rob Fender	UK
Anna Franckowiak	Germany
Avishay Gal-yam	Israel
Ranpal Gill	USA
Evgeny Gorbovskoy	Russia
Paul Groot	Netherlands
Liz Hays	USA
Rob Ivison	Germany
Mansi Kasliwal	USA
Shri Kulkarni	USA
Michelle Lochner	South Africa
Christopher Martin	USA
Jamal Mimouni	Algeria
Kavilan Moodley	South Africa
Tara Murphy	Australia
Ada Nebot	France
Samaya Nissanke	Netherlands
John O'Meara	USA
Steve Potter	South Africa
Elena Rossi	Netherlands
Somaya Saad	Egypt
Re’em Sari	Israel
Roberto Soria	China
Sarah Burke Spolaor	USA
Ben Stappers	UK
Lisa Storrie-Lombardi	USA
Ignacio Taboada	USA
Pietro Ubertini	Italy
Johannes van den Brand	Netherlands
Ewine van Dishoek	Netherlands
Patricia Whitelock	South Africa
Patrick Woudt	South Africa
Masayuki Yamanaka	Japan
Binbin Zhang	China
Shuang-Nan Zhang	China


You will notice some complications: most names are two words, but
Dutch and Latino names can be three words (Ewine van Dishoek) or four
(Ulisses Barres de Almeida).  Let us assume that the participant
name and the country name are separated by a tab.
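
Since everything that follows relies on the tab assumption, it may be
worth checking it first. A minimal sketch: "sample.txt" is a made-up
stand-in for Participants.txt (its third line deliberately uses blanks
instead of a tab) and "check_tabs" is a hypothetical helper name.

```shell
# Hypothetical sample; the last line has a blank where a tab should be.
printf 'Ewine van Dishoek\tNetherlands\n'     > sample.txt
printf 'Ulisses Barres de Almeida\tBrazil\n' >> sample.txt
printf 'David Buckley South Africa\n'        >> sample.txt

# Report every record that does not have exactly two tab-separated fields.
check_tabs() {
    awk -F"\t" 'NF != 2 {print FILENAME ": line " NR ": " $0}' "$1"
}

check_tabs sample.txt
```

No output means that every record has exactly one tab, so $2 can be
trusted to hold the whole country name.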

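AWK's associative arrays are extremely good at histogramming data:
the array is indexed by the country name and each record increments
the corresponding entry.  The counts are piped through "sort -nr" to
order them by descending number of participants, and the "column"
command nicely formats the output.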

$ awk -F"\t" '{a[$2]++}END{for (i in a){print a[i] "\t" i}}' Participants.txt \
	| sort -nr | column -t -s "$(printf '\t')"
17  USA
6   South Africa
5   Netherlands
4   UK
3   China
2   Israel
2   Germany
1   Spain
1   Russia
1   Jordan
1   Japan
1   Italy
1   India
1   France
1   Egypt
1   Chile
1   Brazil
1   Australia
1   Algeria
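
The histogram can be cross-checked without associative arrays by
using cut, sort and "uniq -c" (a different technique, shown here only
as a sanity check). This is a sketch on a made-up four-line sample;
the file names sample.txt, by_awk.txt and by_uniq.txt are arbitrary.

```shell
# Hypothetical four-line sample in the same tab-separated layout.
printf 'David Buckley\tSouth Africa\n'   > sample.txt
printf 'Paul Groot\tNetherlands\n'      >> sample.txt
printf 'Steve Potter\tSouth Africa\n'   >> sample.txt
printf 'Patrick Woudt\tSouth Africa\n'  >> sample.txt

# AWK histogram: a[] is indexed by the country string in field 2.
awk -F"\t" '{a[$2]++} END{for (i in a) print a[i] "\t" i}' sample.txt \
    | sort -nr > by_awk.txt

# Cross-check: cut keeps field 2, uniq -c counts adjacent duplicates,
# and the small awk step rewrites "   3 South Africa" as "3<TAB>South Africa".
cut -f2 sample.txt | sort | uniq -c \
    | awk '{n = $1; sub(/^ *[0-9]+ +/, ""); print n "\t" $0}' \
    | sort -nr > by_uniq.txt

# Compare the two results.
if [ "$(cat by_awk.txt)" = "$(cat by_uniq.txt)" ]; then
    echo "histograms agree"
fi
```

The AWK version needs a single pass over the file and no initial
sort, which is one reason to prefer it for larger lists.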


Files separated by blanks rather than tabs. At the next conference
it may well be the case that the participant names and the country
names are separated by a single blank character. In this case,
two-word countries (e.g. South Africa) pose a problem: the country
no longer occupies a single field.
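
To see the problem concretely, here is a sketch on a made-up
two-line file; "blank_sample.txt" and the helper name "last_field"
are assumptions for illustration only.

```shell
# Hypothetical two-line sample with single blanks as separators.
printf 'David Buckley South Africa\n'  > blank_sample.txt
printf 'Paul Groot Netherlands\n'     >> blank_sample.txt

# Naive approach: treat the last whitespace-separated field as the country.
last_field() {
    awk '{print $NF}' "$1"
}

last_field blank_sample.txt
```

The first record yields "Africa" rather than "South Africa", so a
histogram built on $NF alone would file those delegates under the
wrong key.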

The key to making such a list "regular" is to identify which part
requires the least "fixup". There are fewer than 200 countries,
whereas the number of possible names is much larger, so we fix the
countries.  We convert two-word countries to the form "A.B", so
"South Africa" becomes "S.Africa". With this done, the country is
always the last field of every record.

$ awk '/South Africa/{$NF="S.Africa"}
       /Saudi Arabia/{$NF="S.Arabia"}
       {a[$NF]++}
       END{for (i in a){print i, a[i]}}' Participants.txt \
	| sort -k2 -nr | column -t
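
If more than a couple of two-word countries turn up, the per-country
patterns can be replaced by a small lookup table. The following is a
sketch under stated assumptions: the sample file, the delegate "Jane
Doe", the helper name "count_countries" and the contents of the
"fix" table are all illustrative.

```shell
# Hypothetical blank-separated sample.
cat > sample_blank.txt <<'EOF'
David Buckley South Africa
Ewine van Dishoek Netherlands
Steve Potter South Africa
Jane Doe Saudi Arabia
EOF

# Histogram by country, abbreviating two-word countries via a lookup table.
count_countries() {
    awk 'BEGIN { fix["South Africa"] = "S.Africa"
                 fix["Saudi Arabia"] = "S.Arabia" }
         {
             country = $NF
             if (NF >= 2) {
                 two = $(NF-1) " " $NF          # last two words of the record
                 if (two in fix) country = fix[two]
             }
             a[country]++
         }
         END { for (i in a) print i, a[i] }' "$1" \
        | sort -k2,2nr -k1,1                    # by count, then by name
}

count_countries sample_blank.txt
```

Adding a country then means adding one line to the BEGIN block, and
the result can still be piped through "column -t" for display.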