------------------------------------------------------------------------
participants at a conference
------------------------------------------------------------------------

You are at a conference and you are curious to know the geographical
distribution of the delegates. From the website you extract a txt file
and delete the headers and footers to obtain "Participants.txt" (see
end of this writeup).

$ cat Participants.txt
Awni Al-Khasawneh	Jordan
Ulisses Barres de Almeida	Brazil
Scott Barthelmy	USA
Varun Bhalerao	India
John Blakeslee	USA
Patrick Brady	USA
Federica Bianco	USA
Robert Braun	UK
David Buckley	South Africa
Laura Cadonati	USA
Manuella Campanelli	USA
John Carpenter	Chile
Alberto Castro-Tirado	Spain
Brad Cenko	USA
Valerie Connaughton	USA
Roger Davies	UK
Rob Fender	UK
Anna Franckowiak	Germany
Avishay Gal-yam	Israel
Ranpal Gill	USA
Evgeny Gorbovskoy	Russia
Paul Groot	Netherlands
Liz Hays	USA
Rob Ivison	Germany
Mansi Kasliwal	USA
Shri Kulkarni	USA
Michelle Lochner	South Africa
Christopher Martin	USA
Jamal Mimouni	Algeria
Kavilan Moodley	South Africa
Tara Murphy	Australia
Ada Nebot	France
Samaya Nissanke	Netherlands
John O'Meara	USA
Steve Potter	South Africa
Elena Rossi	Netherlands
Somaya Saad	Egypt
Re’em Sari	Israel
Roberto Soria	China
Sarah Burke Spolaor	USA
Ben Stappers	UK
Lisa Storrie-Lombardi	USA
Ignacio Taboada	USA
Pietro Ubertini	Italy
Johannes van den Brand	Netherlands
Ewine van Dishoek	Netherlands
Patricia Whitelock	South Africa
Patrick Woudt	South Africa
Masayuki Yamanaka	Japan
Binbin Zhang	China
Shuang-Nan Zhang	China

You will notice some complications: most names are two words but Dutch
and Latino names can be three (Ewine van Dishoek) or four (Ulisses
Barres de Almeida) words long. Let us assume that the participant names
and the country names are separated by a tab.

AWK uses associative arrays, which are extremely good at histogramming
data. The output is sorted in descending order of the number of
participants.
The "column" command nicely formats the output.

$ awk -F"\t" '{a[$2]++}END{for (i in a){print a[i],i}}' Participants.txt \
    | sort -nr | column -t

Each output line gives the number of participants followed by the
country.

File in which spaces are used (non-tab separation).

At the next conference it may well be the case that the participant
names and the country names are separated by a simple blank character.
In this case, countries with two words (e.g. South Africa) pose a
problem. The key to making such a list "regular" is to identify which
fields require the least "fixup". The number of countries is fewer than
200 whereas the number of names is much larger. So we fix the countries.
We convert two-word countries to "A.B", so "South Africa" becomes
"S.Africa". With this done, the country is always the last field in any
record.

$ awk '/South Africa/{$NF="S.Africa"}
       /Saudi Arabia/{$NF="S.Arabia"}
       {a[$NF]++}
       END{for (i in a){print i, a[i]}}' Participants.txt | sort -k2 -nr | column -t
USA          17
S.Africa     6
Netherlands  5
UK           4
China        3
Israel       2
Germany      2
Spain        1
Russia       1
Jordan       1
Japan        1
Italy        1
India        1
France       1
Egypt        1
Chile        1
Brazil       1
Australia    1
Algeria      1
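
If more two-word countries turn up, one pattern per country becomes tedious. A possible variation, sketched below, keeps the fixups in a small lookup table and rewrites the end of each record before counting. The array name "fix" and the six-line sample are made up for illustration and are not the full participant list. The temporary file stands in for a blank-separated Participants.txt.

```shell
# Hypothetical blank-separated sample (not the full participant list)
sample=$(mktemp)
cat > "$sample" <<'EOF'
David Buckley South Africa
Paul Groot Netherlands
Michelle Lochner South Africa
Ewine van Dishoek Netherlands
Steve Potter South Africa
Tara Murphy Australia
EOF

counts=$(awk '
BEGIN {
    # two-word countries and their one-word replacements (extend as needed)
    fix["South Africa"] = "S.Africa"
    fix["Saudi Arabia"] = "S.Arabia"
}
{
    for (c in fix) {
        n = length(c)
        # rewrite only when the record ends with " <country>"
        if (length($0) > n && substr($0, length($0) - n) == " " c)
            $0 = substr($0, 1, length($0) - n) fix[c]
    }
    a[$NF]++    # after the fixup the country is always the last field
}
END { for (i in a) print i, a[i] }' "$sample" | sort -k2 -nr)

rm -f "$sample"
echo "$counts"
```

As before, the result can be piped through "column -t" for display. Assigning to $0 makes awk split the record again, which is why $NF picks up the abbreviated country.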