------------------------------------------------------------------------
I. Country Codes & Country Populations
------------------------------------------------------------------------
In this exercise, we will first extract a comma-separated table of
the population of countries around the world. Next, we will extact
a a table of country codes (2-letter or 3-letter). Finally, we will
"join" the two tables.

------------------------------------------------------------------------
II. Population of countries
------------------------------------------------------------------------

Our starting point is the Worldometer site
	https://www.worldometers.info/world-population/population-by-country/
We save the ouput using "text" option to file "pop_html.txt".  The
data are found between lines starting "Share" and "Source". We
extract those lines Notice the elegant way by which we get rid of
the input file once the data are extracted.

$ infile=pop_html.txt
$ gsed '1,/^Share/d;/^Source:/,$d;/^$/d' $infile > pop.out && rm $infile

$ cat pop.out 
1       China
<https://www.worldometers.info/world-population/china-population/>
1,439,323,776   0.39 %  5,540,090       153     9,388,211       -348,399        1.7     38
61 %    18.47 %
2       India
<https://www.worldometers.info/world-population/india-population/>
1,380,004,385   0.99 %  13,586,631      464     2,973,190       -532,687        2.2     28
35 %    17.70 %
.
.

Notice that each country's information starts with a regular pattern
/^[1-9][0-9]* \t/. Thus, unlike the usual records, the Record
Separator (RS) is not at the end of the text but at the beginning.
I will use the term Beginning Record Separator (BRS) to distinguish
this term from RS. Furthermore, the text for each country is spread
over the next two or three lines (it varies). So our task is to
create a record by grouping all the text between two BRSs.

The solution here is to use the "multi-line" feature of awk.
To this end we first do the following

$ gsed '/gsed '1!{/^[1-9][0-9]* \t/{x;p;x}};$G' pop.out
1       China
<https://www.worldometers.info/world-population/china-population/>
1,439,323,776   0.39 %  5,540,090       153     9,388,211       -348,399        1.7     38
61 %    18.47 %

2       India
<https://www.worldometers.info/world-population/india-population/>
1,380,004,385   0.99 %  13,586,631      464     2,973,190       -532,687        2.2     28
35 %    17.70 %

3       United States

Notice the line preceding each country is a blank line (/^$/), except for the
first country (China) and the last line of the output is a blank line.

We can now use some curious but powerful features of awk to read
such multi-line records. The input field separator (IFS) is now set to
"\n" and the record separator, RS="" (which is almost the same as
"\n\n+"). 

$ gsed '/gsed '1!{/^[1-9][0-9]* \t/{x;p;x}};$G' pop.out | \
awk -F"\n" '{gsub("/\n/"," ")}' RS="" 
1       China <https://www.worldometers.info/world-population/china-population/> 1,439,323,776  0.39 %  5,540,090       153     9,388,211       -348,399        1.7     38 61 %         18.47 %
2       India <https://www.worldometers.info/world-population/india-population/> 1,380,004,385  0.99 %  13,586,631      464     2,973,190       -532,687        2.2     28 35 %         17.70 %
3       United States <https://www.worldometers.info/world-population/us-population/> 331,002,651       0.59 %  1,937,734       36      9,147,420       954,806         1.8     38      83 % 4.25 %

Now the data for each country is one line. We will filter out the inessential parts.

$ gsed '/gsed '1!{/^[1-9][0-9]* \t/{x;p;x}};$G' pop.out | \
awk -F"\n" '{gsub("/\n/"," ")}' RS=""  | \
gsed 's/^[1-9][0-9]* \t/"/;s/,//g;s/ <.*>/", /;s/ %/%,/g;s/[0-9] /&,/g;s/,$//' > Worldometer_WP.csv

$ head -n 5 Worldometer_WP.csv
"China",  1439323776 ,	0.39%, 	5540090 ,	153 ,	9388211 ,	-348399 ,	1.7 ,	38 ,61%, 	18.47%
"India",  1380004385 ,	0.99%, 	13586631 ,	464 ,	2973190 ,	-532687 ,	2.2 ,	28 ,35%, 	17.70%
"United States",  331002651 ,	0.59%, 	1937734 ,	36 ,	9147420 ,	954806 ,	1.8 ,	38 ,	83%, 4.25%
"Indonesia",  273523615 ,	1.07%, 	2898047 ,	151 ,	1811570 ,	-98955 ,	2.3 ,	30 ,	56%, 3.51%
"Pakistan",  220892340 ,	2.00%, 	4327022 ,	287 ,	770880 ,	-233379 ,	3.6 ,	23 ,	35%, 2.83%

Now with the csv file in hand we are ready to rock-n-roll.

------------------------------------------------------------------------
Country 2-letter and 3-letter codes
------------------------------------------------------------------------
ISO provides a 3-letter country code
	https://en.wikipedia.org/wiki/ISO_3166-1_numeric

We will use the country code list provided by the International Banking
Association which can be obtained at 
$ URL="https://www.iban.com/country-codes"

$ curl -o  a.txt $URL
Review file a.txt

$ curl $URL | sed -n '/<tbody>/,/<\/tbody>/p' | gsed -E '1d;$d;s:</?t[dr]>::g;/^\t*$/d;s/\t//g' | paste - - - - | datamash reverse > ISO_country_codes.dat

Most of the sed stuff is cleaning up. However, notice the clever
use of paste to create a table with four columns (from four successive
lines of data) and also the use of datamash to reverse the ordering
of the table. [In a table it helps to have strings as the last
column].

$ head ISO_country_codes.dat; echo ".\n."; tail ISO_country_codes.dat
004	AFG	AF	Afghanistan
248	ALA	AX	&Aring;land Islands
008	ALB	AL	Albania
012	DZA	DZ	Algeria
016	ASM	AS	American Samoa
020	AND	AD	Andorra
024	AGO	AO	Angola
660	AIA	AI	Anguilla
010	ATA	AQ	Antarctica
028	ATG	AG	Antigua and Barbuda
.
.
548	VUT	VU	Vanuatu
862	VEN	VE	Venezuela (Bolivarian Republic of)
704	VNM	VN	Viet Nam
092	VGB	VG	Virgin Islands (British)
850	VIR	VI	Virgin Islands (U.S.)
876	WLF	WF	Wallis and Futuna
732	ESH	EH	Western Sahara
887	YEM	YE	Yemen
894	ZMB	ZM	Zambia
716	ZWE	ZW	Zimbabwe