------------------------------------------------------------------------
Co-authors: a histogram
------------------------------------------------------------------------

It was almost six years ago that I wrote an NSF proposal (it was the proposal that led to ZTF). When preparing the NSF "pre-proposal" for ZTF Phase II I came to learn that NSF now demands that the bibliography list ALL authors. Presumably this helps them to strictly address a possible nepotism issue, namely, that your co-authors are likely to be kinder to your proposal than to other proposals. This is reasonable in principle, but given that LIGO papers list the entire astronomical community as authors (and beyond - as in some authors who have gone to heaven), the nepotism rule will exclude all astronomers from refereeing any astronomy proposal.

In any case, this new rule made me think: what is the distribution (histogram) of my co-authors? I am writing up this exercise (1) to demonstrate that Unix is very well suited to such problems and (2) to again highlight that ADS is a marvelous database.

----------------------------------------
Getting the data
----------------------------------------

With the new ADS I formed a list of my papers and stored it as a private library. It is not useful to include community papers (e.g. the LSST definition paper, the LIGO paper on GW170817 etc.). Otherwise your author list will grow to the size of a country. ADS offers an option to rank order the papers by the number of authors. Do so and then lop off papers with more than xx authors (xx is your choice).

I excluded the following papers of which I am a co-author:

1. LSST: From Science Drivers to Reference Design and Anticipated Data Products (327 authors, 1101 citations)
2. LSST Science Book, Version 2.0 (247, 1348)
3. Multi-messenger Observations of a Binary Neutron Star Merger (3674, 1176)
4. The Swift Gamma-Ray Burst Mission (71, 2447)
5. The Detection of a Type IIn Supernova in Optical Follow-up Observations of IceCube Neutrino Events (38, 33)

The first four are certainly gratuitous.

I exported my library using the following "Custom Format"

    %l (%Y), %j, %V, %p.\n

[In order to understand this format you need to read the ADS help pages]. The result is one line per paper. I saved the results in "SRK_authors.txt". An example line is

    Backer, D. C., Kulkarni, S. R., Heiles, C., Davis, M. M., & Goss, W. M. (1982), \nat, 300, 615.

You then execute

    $ ./authorsphere SRK_authors.txt > SRK_histogram

[where "authorsphere" can be found at the end of this article]

For the purpose of explanation I have numbered each line of the program.

    $ sed 's/ ([12][0-9][0-9][0-9]).*$//;s/&//' SRK_authors.txt |   #1
      awk -F "[.]," '{for (i=1;i<=NF;i++){print $i}}' |             #2
      sed 's/^ *//;s/\..*//' |                                      #3
      sort -t "," |                                                 #4
      tee authors.txt |                                             #5
      uniq -c |                                                     #6
      sort -nr > SRK_histogram                                      #7

STEP 1: Our goal is, for each line, to retain only the authors. To achieve this we delete all characters from the year of publication to the end of the line. The regular expression " ([12][0-9][0-9][0-9])" precisely identifies the year. The regular expression ".*$" stands for all characters to the end of the line. I also deleted "&". The output for the sample line is then

    Backer, D. C., Kulkarni, S. R., Heiles, C., Davis, M. M.,  Goss, W. M.

STEP 2: Next, we need to "extract" the authors from each line. We notice that ".," separates the authors from one another. In Unix jargon, the field separator, "F", is ".,". However, since "." is a meta-character in regular expressions you have to express the field separator as -F "[.]," (this is Unix arcana, sorry). The separator itself is consumed, so the output is

    Backer, D. C
     Kulkarni, S. R
     Heiles, C
     Davis, M. M
      Goss, W. M.

STEP 3. For analysis of words or names, it is important that the list of words follow a fixed rule. Notice that all names after the first author start with a blank or blanks. Next, only the name of the last author still ends with a period ("."); the others lost theirs to the field separator. We fix these two inconsistencies in one step: leading blanks are stripped and everything from the first period onward is deleted, which leaves the last name and the first initial. (Notice how "." is escaped this time by "\"). The output is

    Backer, D
    Kulkarni, S
    Heiles, C
    Davis, M
    Goss, W

STEP 4. Now, we are finally ready for the analysis. A simple sort (sort default: alphabetical ordering) is applied. However some last names contain blank characters (e.g. van Kerkwijk). So the sort specifies "," as the field separator, hence 'sort -t ","'. (Since no "-k" key is given, sort in fact compares the whole line; identical names still end up on adjacent lines, which is all that the later steps need.) The output is

    ...
    Yan, L
    Yan, L
    Yan, L
    Yan, L
    Yang, C
    Yang, M
    Yang, T
    Yang, T
    Yao, Y
    Yao, Y
    Yao, Y
    ...

[NOTE: Here, you should review the list and see if there are some categories you would like to exclude. For instance, some papers have authors such as XYZ "Team", "Consortium" or "Collaboration". Perhaps you wish to get rid of them. You can do that easily by

    sed '/Team/d;/Consortium/d;/Collaboration/d'

]

STEP 5. We use "tee" to store the master list of all co-authors in the file "authors.txt" and pass the stream on to the next utility.

STEP 6. "uniq -c" counts the occurrences of each run of identical lines. The output is

    ...
    23 Yan, L
     1 Yang, C
     1 Yang, M
     2 Yang, T
     3 Yao, Y
    ...

STEP 7. It would be useful to organize the histogram in descending order. To this end we sort on the first column (default) but undertake a numerical sort (hence, "-n") and also in descending order ("-r"). Voila, the output is

    573 Kulkarni, S
    157 Frail, D
    152 Cenko, S
    151 Kasliwal, M
    143 Ofek, E
    124 Gal-Yam, A
    117 Bloom, J
    115 Nugent, P
     80 Laher, R
     78 Fox, D
     75 Berger, E
     72 Quimby, R
     59 Arcavi, I
     57 Sullivan, M
     56 Law, N
     53 Masci, F
     52 Filippenko, A
     51 Prince, T
     50 Surace, J
    ...
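
Before running the pipeline on a full library, it may help to try it on a couple of lines where the answer can be checked by eye. The following is only a sketch under stated assumptions: the function name "author_histogram" is mine, "LC_ALL=C" is added so that the sort order does not depend on the locale, the "tee" of step 5 is left out so that no file is written, and the second input line is a made-up entry that exists only to make two names occur twice.

```shell
#!/bin/bash
# Sketch: the pipeline from the article wrapped in a function that reads
# the exported library on standard input.
# Assumptions (not in the original): the function name, LC_ALL=C, and
# the omission of "tee authors.txt" (step 5).
author_histogram() {
    sed 's/ ([12][0-9][0-9][0-9]).*$//;s/&//' |        # step 1: drop the year onward, drop "&"
    awk -F "[.]," '{for (i=1;i<=NF;i++){print $i}}' |   # step 2: one author per line
    sed 's/^ *//;s/\..*//' |                            # step 3: trim blanks, keep first initial
    LC_ALL=C sort -t "," |                              # step 4: bring identical names together
    uniq -c |                                           # step 6: count each name
    LC_ALL=C sort -nr                                   # step 7: largest count first
}

# Line 1 is the sample line from the article.
# Line 2 is a made-up entry, used only so that two names repeat.
author_histogram <<'EOF'
Backer, D. C., Kulkarni, S. R., Heiles, C., Davis, M. M., & Goss, W. M. (1982), \nat, 300, 615.
Kulkarni, S. R., & Heiles, C. (1900), \nat, 1, 1.
EOF
```

Run on these two lines, it should report a count of 2 for "Kulkarni, S" and for "Heiles, C" and a count of 1 for the other three names; the exact padding in front of the counts differs between uniq implementations. For the full library, the same pipeline produces the histogram listed above.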
The histogram is written to the file "SRK_histogram".

------------------------------------------------------------------------
$ cat authorsphere
------------------------------------------------------------------------

#!/bin/bash
# usage: ./authorsphere input.txt

# Refuse to run without the name of the exported paper list.
if [[ -z $1 ]]; then
    echo "exit: need list of papers file"
    exit 1
fi

infile=$1

sed 's/ ([12][0-9][0-9][0-9]).*$//;s/&//' "$infile" | \
awk -F "[.]," '{for (i=1;i<=NF;i++){print $i}}' | \
sed 's/^ *//;s/\..*//' | \
sort -t "," | \
tee authors.txt | \
uniq -c | \