------------------------------------------------------------------------ The world's greatest astronomer(s) ------------------------------------------------------------------------ Recently my colleague T. Padmanabhan ("Paddy") of IUCAA (also the author of a well-known series of textbooks for graduate students in Astronomy) became interested in cataloging the world's greatest scientists in XX field. He sent me an Excel spreadsheet "WorldScientist2percent.xlsx" produced by the greatest brains at Stanford which ostensibly lists the top 2% of scientists across humanity (in present times). [This file can be found in ../../LargeFiles/WorldScientist2percent.xlsx ] IUCAA, Paddy's home institution, is primarily an astronomy place. So I thought it would be useful to produce such a list for Paddy using the Stanford database. Paddy is a high quality scientist well known for his efforts in Quantizing Gravity, arguably the foremost problem in modern physics. I did not want him to waste time in printing the catalog and manually going through the catalog. Doing so would have set back progress in Quantum Gravity by a year (at least). Given the additional problems we are now facing (COVID, climate change, Trump & polarized US) it would be tragic for any such delays to occur. Fortunately, the intended task is trivial for an amateur of the Unix OS. Equally, the solution provides a useful lesson for a beginner. So I added this solution to my Unix exercise list (which I use to teach the occasional Caltech student who is interested in Unix). Step 1: I saved the Excel spreadsheet as a "tab separated format" (tsv), "S.tsv". A visual review of the file shows presence of non-ASCII characters. Some of the classic Unix tasks (e.g. tr, sort) cannot cope up with non-ASCII characters. So let us get rid of the umlauts etc. [This means the Feryal Oz\"el -> Feryal Ozl]. $ iconv -c -t UTF-8 S.tsv > Stanford.tsv The file "Stanford.tsv" is now our basic file. ------------------------------------------------------------------------ I. A Reconnaissance of the data base ------------------------------------------------------------------------ Step 2: Figure out which column is what. $ head -1 Stanford.tsv | tr '\t' '\n' | nl 1 Author Name 2 Institute Name 3 Country 4 Number of Papers 5 firstyr 6 lastyr 7 c score 8 Subject Field 9 Rank within field 10 Total authors within field [Fewer keystrokes $ csvcut -n -t Stanford.tsv $ csvcut -nt Stanford.tsv #!! ] It is useful to look at a typical entry. What better than my own. $ grep Shrinivas Stanford.tsv "Kulkarni, Shrinivas R." California Institute of Technology usa 577 1982 2020 4.021087902 Astronomy & Astrophysics 236 42624 So it appears that my first paper was published in 1982 and my last paper in 2020. I have apparently been involved in 577 papers and and a resident of "usa". According to the Stanford database there are 42,624 scientists in the field of "Astronomy & Astrophysics" and "c-score" is 4.021 and my rank is 236. Exercise: Find Paddy's entry in the data base but in one shot (problem: there are several other Padmanabhans). "Padmanabhan, T." Inter-University Centre for Astronomy and Astrophysics India ind 324 1977 2020 4.534670853 Nuclear & Particle Physics 24 110499 Not surprisingly we find Paddy up at the top of the Particle Physics group. The next step is to understand the dimensions and integrity of the database. To this end we use the Swiss-army knife, "datamash". $ datamash check < Stanford.tsv 159684 lines, 10 fields So the number of records is 159,684 and the database is uniform (every record has the same number of fields). This means that our one-liner need not worry about making exceptions. Finally, out of curiosity let us look at the fields (classifications) and the corresponding fraction but expressed as ppt (parts per thousand). So we need to bin the data by an attribute (the field) -- a task that is elegantly suited to associative arrays of awk. $ sed 1d Stanford.tsv | \ awk -F"\t" '{a[$8]++; ntot=NR} \ END{for (i in a) \ {printf "%-45s\t%d\t%-0.2f\n", i,a[i],a[i]/ntot*1000}}' | sort | nl # Field Number ppt 1 751 4.70 2 "Art Practice, History & Theory" 33 0.21 3 "History of Science, Technology & Medicine" 49 0.31 4 "Sport, Leisure & Tourism" 132 0.83 5 "Strategic, Defense & Security Studies" 360 2.25 6 Accounting 95 0.59 7 Acoustics 599 3.75 8 Aerospace & Aeronautics 979 6.13 . . The database admits of 175 fields. Apparently, the Stanford scientists were not able to classify the field for 751 people. There are 1658 astronomers who amount to 10.38 ppt (parts per thousand) or 1.038% of the total number of entries in the database. This fraction can be compared with 3% for Applied Physics & 4.3% for Neurology and 0.04% for Folklore & 0.17% for Geology (!). ------------------------------------------------------------------------ II. Analysis: Astronomy ------------------------------------------------------------------------ Rather than deal with the full database let us now focus on astronomers. $ grep "Astronomy & Astrophysics" Stanford.tsv > Astro_Stanford.tsv $ wc -l !$ 1658 Astro_Stanford.tsv There are 1658 entries for Astronomy. Earlier, we had learnt that according to the database the number of astronomers world wide is 42,624 - a rather staggering number for professional star-gazers). Thus, it appears that the database has 3.8% of the top ranked astronomers (as opposed to what Paddy informed me - 2%). Exercise: Using datamash determine min, max, q1, q2, q3 of the scores for Astronomy. Moving on we wish to construct a rank-ordered list of astronomers $ sort -t $'\t' -k9n Astro_Stanford.tsv|sed 's/"//g' | \ awk -F"\t" '{print $9,$1,$2,$3}' OFS="\t" > paddy.tsv As all aficionados know, the key metric in a one-liner is the (smallest) number of key strokes. This exercise is instructive for observers who have to deal with real-life data. Motivated thus, I break it into steps and offer an explanation for each step: $ sort -t $'\t' -k9n | #1 sed 's/"//g' | #2 awk -F"\t" '{print $9,$1,$2,$3}' OFS="\t" #3 > paddy.tsv #4 #1 We wish to sort on column 9 (rank). We need to inform sort that the file a tab separated file. The character following the option -t should be a tab. However, sort is an old program and so does not accept the modern representation for tab "\t". We have two choices: -t $'\t' (modern bash, ksh, zsh) or going back to the basics (Unix arcana) use -t "cntrl+v tab" (control+v followed by tab; as in vi). This is the sort of arcana that gives Unix a bad name. #2 Get rid of annoying double quotes around the names #3 rearrange the four columns for printing using "\t" as the output field separator (OFS="\t") #4 in honor of Paddy, the output file is named "paddy.tsv" "paddy.tsv" is a handy file and you are now ready to explore the world of the greatest astronomers. It can be found in the Appendix. ------------------------------------------------------------------------ III. Explore the greats of modern astronomy ------------------------------------------------------------------------ Let us first histogram astronomer by their country of residence. $ awk -F"\t" '{print $NF}' paddy.tsv | sort | uniq -c | sort -k1n 1 hrv 1 prt 1 ukr 2 are 2 cze 2 grc 2 hun 3 arg 3 aut 3 irl 3 nor 4 fin 4 hkg 4 ind 4 twn 5 bra 6 bel 7 kor 7 mex 11 chl 11 pol 11 swe 12 dnk 12 zaf 16 chn 17 rus 22 che 24 esp 28 isr 36 jpn 40 nld 50 aus 55 can 73 ita 78 fra 110 deu 160 gbr 828 usa [These are ISO-standard 3-letter does for country. See https://www.iban.com/country-codes ] [How about this instead (saves key strokes!) $ csvcut -H -t -c4 paddy.tsv | sort | uniq -c | sort -k1n $ csvcut -Htc4 paddy.tsv | sort | uniq -c | sort -k1n #? no comprehendo!? I was not aware of astronomy in Finland. So $ grep 'fin$' paddy.tsv 343 Usoskin, Ilya, Oulun Yliopisto, fin 820 Salo, H., Oulun Yliopisto, fin 832 Muinonen, Karri, Helsingin Yliopisto, fin 1151 Juvela, Mika, Helsingin Yliopisto, fin Perhaps you are visiting China next month. You want to be tactful or devilishly clever. $ grep ' chn$' paddy.tsv 16 Ho, L. C. Peking University chn 340 Jing, Y. P. Shanghai Jiao Tong University chn 612 Han, Zhan Wen National Astronomical Observatories Chinese Academy of Sciences chn 696 Dai, Z. G. Nanjing University chn 842 Hobbs, George National Astronomical Observatories Chinese Academy of Sciences chn 960 Yuan, F. Shanghai Astronomical Observatory Chinese Academy of Sciences chn 1003 Li, Hui Jiangxi Science and Technology Normal University chn 1006 Ma, Jun Ling Beijing Cancer Hospital chn 1017 Zhao, Gong Bo National Astronomical Observatories Chinese Academy of Sciences chn 1056 Li, Li Xin Peking University chn 1098 Cui, W. Tsinghua University chn 1117 Qian, Shengbang Chinese Academy of Sciences chn 1132 Bai, Xue Ning Tsinghua University chn 1344 Zhu, Zong Hong Beijing Normal University chn 1430 Zhang, Shuang Nan Institute of High Energy Physics Chinese Academy of Science chn 1463 Chen, P. F. Nanjing University chn As you can see I am already reaping the benefits of my modest investment of time. Mark Twain, were he to be alive today, would certainly have said "There is something fascinating about Unix. One gets such wholesale returns of conjecture out of such a trifling investment of key strokes." A histogram of the scores would give the reader some insight into the precision of the scores. $ cut -f7 paddy.tsv | datamash --full bin:0.05 1 | cut -f2 | sort -k1nr | uniq -c | awk '{a+=$1; print $2 "\t" $1 "\t" a}' score # cumulative 4.85 1 1