------------------------------------------------------------------------
I. Country Codes & Country Populations
------------------------------------------------------------------------
In this exercise, we will first extract a comma-separated table of
the population of countries around the world. Next, we will extact
a a table of country codes (2-letter or 3-letter). Finally, we will
"join" the two tables.
------------------------------------------------------------------------
II. Population of countries
------------------------------------------------------------------------
Our starting point is the Worldometer site
https://www.worldometers.info/world-population/population-by-country/
We save the ouput using "text" option to file "pop_html.txt". The
data are found between lines starting "Share" and "Source". We
extract those lines Notice the elegant way by which we get rid of
the input file once the data are extracted.
$ infile=pop_html.txt
$ gsed '1,/^Share/d;/^Source:/,$d;/^$/d' $infile > pop.out && rm $infile
$ cat pop.out
1 China
1,439,323,776 0.39 % 5,540,090 153 9,388,211 -348,399 1.7 38
61 % 18.47 %
2 India
1,380,004,385 0.99 % 13,586,631 464 2,973,190 -532,687 2.2 28
35 % 17.70 %
.
.
Notice that each country's information starts with a regular pattern
/^[1-9][0-9]* \t/. Thus, unlike the usual records, the Record
Separator (RS) is not at the end of the text but at the beginning.
I will use the term Beginning Record Separator (BRS) to distinguish
this term from RS. Furthermore, the text for each country is spread
over the next two or three lines (it varies). So our task is to
create a record by grouping all the text between two BRSs.
The solution here is to use the "multi-line" feature of awk.
To this end we first do the following
$ gsed '/gsed '1!{/^[1-9][0-9]* \t/{x;p;x}};$G' pop.out
1 China
1,439,323,776 0.39 % 5,540,090 153 9,388,211 -348,399 1.7 38
61 % 18.47 %
2 India
1,380,004,385 0.99 % 13,586,631 464 2,973,190 -532,687 2.2 28
35 % 17.70 %
3 United States
Notice the line preceding each country is a blank line (/^$/), except for the
first country (China) and the last line of the output is a blank line.
We can now use some curious but powerful features of awk to read
such multi-line records. The input field separator (IFS) is now set to
"\n" and the record separator, RS="" (which is almost the same as
"\n\n+").
$ gsed '/gsed '1!{/^[1-9][0-9]* \t/{x;p;x}};$G' pop.out | \
awk -F"\n" '{gsub("/\n/"," ")}' RS=""
1 China 1,439,323,776 0.39 % 5,540,090 153 9,388,211 -348,399 1.7 38 61 % 18.47 %
2 India 1,380,004,385 0.99 % 13,586,631 464 2,973,190 -532,687 2.2 28 35 % 17.70 %
3 United States 331,002,651 0.59 % 1,937,734 36 9,147,420 954,806 1.8 38 83 % 4.25 %
Now the data for each country is one line. We will filter out the inessential parts.
$ gsed '/gsed '1!{/^[1-9][0-9]* \t/{x;p;x}};$G' pop.out | \
awk -F"\n" '{gsub("/\n/"," ")}' RS="" | \
gsed 's/^[1-9][0-9]* \t/"/;s/,//g;s/ <.*>/", /;s/ %/%,/g;s/[0-9] /&,/g;s/,$//' > Worldometer_WP.csv
$ head -n 5 Worldometer_WP.csv
"China", 1439323776 , 0.39%, 5540090 , 153 , 9388211 , -348399 , 1.7 , 38 ,61%, 18.47%
"India", 1380004385 , 0.99%, 13586631 , 464 , 2973190 , -532687 , 2.2 , 28 ,35%, 17.70%
"United States", 331002651 , 0.59%, 1937734 , 36 , 9147420 , 954806 , 1.8 , 38 , 83%, 4.25%
"Indonesia", 273523615 , 1.07%, 2898047 , 151 , 1811570 , -98955 , 2.3 , 30 , 56%, 3.51%
"Pakistan", 220892340 , 2.00%, 4327022 , 287 , 770880 , -233379 , 3.6 , 23 , 35%, 2.83%
Now with the csv file in hand we are ready to rock-n-roll.
------------------------------------------------------------------------
Country 2-letter and 3-letter codes
------------------------------------------------------------------------
ISO provides a 3-letter country code
https://en.wikipedia.org/wiki/ISO_3166-1_numeric
We will use the country code list provided by the International Banking
Association which can be obtained at
$ URL="https://www.iban.com/country-codes"
$ curl -o a.txt $URL
Review file a.txt
$ curl $URL | sed -n '/
/,/<\/tbody>/p' | gsed -E '1d;$d;s:?t[dr]>::g;/^\t*$/d;s/\t//g' | paste - - - - | datamash reverse > ISO_country_codes.dat
Most of the sed stuff is cleaning up. However, notice the clever
use of paste to create a table with four columns (from four successive
lines of data) and also the use of datamash to reverse the ordering
of the table. [In a table it helps to have strings as the last
column].
$ head ISO_country_codes.dat; echo ".\n."; tail ISO_country_codes.dat
004 AFG AF Afghanistan
248 ALA AX Åland Islands
008 ALB AL Albania
012 DZA DZ Algeria
016 ASM AS American Samoa
020 AND AD Andorra
024 AGO AO Angola
660 AIA AI Anguilla
010 ATA AQ Antarctica
028 ATG AG Antigua and Barbuda
.
.
548 VUT VU Vanuatu
862 VEN VE Venezuela (Bolivarian Republic of)
704 VNM VN Viet Nam
092 VGB VG Virgin Islands (British)
850 VIR VI Virgin Islands (U.S.)
876 WLF WF Wallis and Futuna
732 ESH EH Western Sahara
887 YEM YE Yemen
894 ZMB ZM Zambia
716 ZWE ZW Zimbabwe