------------------------------------------------------------------------
Co-authors: a histogram
------------------------------------------------------------------------

Almost six years ago I wrote an NSF proposal (the one that led to
ZTF). When preparing the NSF "pre-proposal" for ZTF Phase II I
learned that NSF now demands that the bibliography list ALL authors.
Presumably this helps them address a possible nepotism issue, namely,
that your co-authors are likely to be kinder to your proposal than
to other proposals.  This is reasonable in principle, but given that
LIGO papers list the entire astronomical community as authors (and
beyond, as in some authors who have gone to heaven), the nepotism
rule would exclude all astronomers from refereeing any astronomy
proposal.

In any case, this new rule made me think: what is the distribution
(histogram) of my co-authors?

I am writing up this exercise (1) to demonstrate that Unix is very
well suited to such problems and (2) to highlight, once again, that
ADS is a marvelous database.


----------------------------------------
Getting the data
----------------------------------------

With the new ADS I formed a list of my papers and stored it as a
private library.  It is not useful to include community papers (e.g.
the LSST definition paper, the LIGO paper on GW170817, etc.).
Otherwise your co-author list will grow to the size of a country.

ADS offers an option to rank order the papers by the number of
authors. Do so and then lop off papers with more than xx authors
(xx is your choice).

I excluded the following papers of which I am a co-author:

1. LSST: From Science Drivers to Reference Design and Anticipated
   Data Products (327 authors, 1101 citations)
2. LSST Science Book, Version 2.0 (247, 1348)
3. Multi-messenger Observations of a Binary Neutron Star Merger
   (3674, 1176)
4. The Swift Gamma-Ray Burst Mission (71, 2447)
5. The Detection of a Type IIn Supernova in Optical Follow-up
   Observations of IceCube Neutrino Events (38, 33)

The first four are certainly gratuitous.

I exported my library using the following "Custom Format"
	%l (%Y), %j, %V, %p.\n
[To understand this format you need to read the ADS help pages.]
The result is one line per paper.  I saved the results in
"SRK_authors.txt". An example line is

Backer, D. C., Kulkarni, S. R., Heiles, C., Davis, M. M., & Goss, W. M. (1982), \nat, 300, 615.

You then execute
$ ./authorsphere SRK_authors.txt > SRK_histogram
[where "authorsphere" can be found at the end of this article]

For the purpose of explanation I go through the program step by
step. The number at the end of each line below is the step number.

$ sed 's/ ([12][0-9][0-9][0-9]).*$//;s/&//' SRK_authors.txt |    #1
   awk -F "[.]," '{for (i=1;i<=NF;i++){print $i}}' |             #2
   sed 's/^  *//;s/\..*//'  |                                    #3
   sort -t ","             |                                     #4
   tee authors.txt        |                                      #5
   uniq -c               |                                       #6
   sort -nr   > SRK_histogram                                    #7



STEP 1: Our goal is, for each line, to retain only the authors. To
achieve this we delete all characters from the year of publication
to the end of the line. The regular expression
" ([12][0-9][0-9][0-9])" identifies the year: a blank followed by a
four-digit year in parentheses. The regular expression ".*$" stands
for all characters to the end of the line. I also delete "&". The
output for the sample line is then

Backer, D. C., Kulkarni, S. R., Heiles, C., Davis, M. M.,  Goss, W. M.
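
Step 1 can be tried on the sample line alone. The following is a
minimal sketch; the variable names "line" and "step1" are mine and
are not part of authorsphere.

```shell
# Sample line from the exported library (single quotes keep "\nat" literal).
line='Backer, D. C., Kulkarni, S. R., Heiles, C., Davis, M. M., & Goss, W. M. (1982), \nat, 300, 615.'

# Step 1: delete " (year)" and everything after it, then delete the "&".
step1=$(printf '%s\n' "$line" | sed 's/ ([12][0-9][0-9][0-9]).*$//;s/&//')

printf '%s\n' "$step1"
```

Deleting "&" leaves two blanks in front of the last author. That is
harmless because leading blanks are stripped in step 3.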

STEP 2: Next, we need to "extract" the authors from each line.  We
notice that ".," separates the authors from one another.  In Unix
jargon, the field separator (the "-F" option of awk) is ".,".
However, since "." is a meta-character in regular expressions you
have to express the field separator as -F "[.]," (this is Unix
arcana, sorry).  awk consumes the separator, so only the last name
keeps its trailing period.  The output is

Backer, D. C
 Kulkarni, S. R
 Heiles, C
 Davis, M. M
  Goss, W. M.
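
To see the field splitting in isolation, here is a small sketch that
feeds the step 1 output for the sample line to the same awk command.
The variable names "step1" and "step2" are my own.

```shell
# Output of step 1 for the sample line (two blanks precede "Goss").
step1='Backer, D. C., Kulkarni, S. R., Heiles, C., Davis, M. M.,  Goss, W. M.'

# Step 2: split on the literal two-character separator ".,"
# and print every field on a line of its own.
# "[.]" is a bracket expression that matches only a literal period.
step2=$(printf '%s\n' "$step1" | awk -F "[.]," '{for (i=1;i<=NF;i++){print $i}}')

printf '%s\n' "$step2"
```

The initials "D. C" stay together because the period inside them is
followed by a blank, not by a comma.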

STEP 3: For analysis of words or names, it is important that the
list of words follow a fixed rule.  Notice that all names after the
first author start with a blank or blanks. Next, not every name
ends with a period ("."). We fix these two inconsistencies in one
step: the first substitution strips the leading blanks and the
second deletes everything from the first period onward, leaving the
last name and the first initial. (Notice how "." is escaped this
time by "\".)  The output is

Backer, D
Kulkarni, S
Heiles, C
Davis, M
Goss, W
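
The clean-up can be checked on the five sample names. This is a
sketch only: "step2", "step3" and "step3_full" are names I made up,
and the second sed command is a variant of my own (not part of
authorsphere) for readers who prefer to keep all the initials.

```shell
# Output of step 2 for the sample line, one author per line.
step2='Backer, D. C
 Kulkarni, S. R
 Heiles, C
 Davis, M. M
  Goss, W. M.'

# Step 3 as used in the article: strip leading blanks, then delete
# everything from the first period on (last name + first initial).
step3=$(printf '%s\n' "$step2" | sed 's/^  *//;s/\..*//')

# Variant (an assumption of mine): strip leading blanks and only a
# trailing period, so that all initials survive.
step3_full=$(printf '%s\n' "$step2" | sed 's/^  *//;s/\.$//')

printf '%s\n' "$step3"
printf '%s\n' "$step3_full"
```

Keeping only the first initial is forgiving of papers that list an
author with and without a middle initial, at the price of merging
distinct people who share a last name and first initial. The variant
makes the opposite trade.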

STEP 4: Now, we are finally ready for the analysis.  A simple
alphabetical sort is applied so that identical names end up on
adjacent lines.  Some last names contain a blank (e.g. van
Kerkwijk), so I declare "," as the field separator, hence
"sort -t ","".  (Strictly speaking, "-t" matters only when a key is
selected with "-k"; without one, sort compares whole lines, which is
all we need here.)  The output is

...
Yan, L
Yan, L
Yan, L
Yan, L
Yang, C
Yang, M
Yang, T
Yang, T
Yao, Y
Yao, Y
Yao, Y
...

[NOTE: Here, you should review the list and see if there are some
categories you would like to exclude.  For instance, some papers
have authors such as XYZ "Team", "Consortium" or "Collaboration".
Perhaps you wish to get rid of them. You can do that easily by
	sed '/Team/d;/Consortium/d;/Collaboration/d'
]
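
As an illustration of the filter in the note above, here is a sketch
run on a made-up list. The entries "ZTF Collaboration", "XYZ Team"
and "Some Consortium" are hypothetical, as are the variable names. In
the pipeline the filter would fit anywhere after step 3 and before
"uniq -c".

```shell
# Hypothetical name list, as it might look after step 3.
names='Kulkarni, S
ZTF Collaboration
Frail, D
XYZ Team
Some Consortium'

# Delete every line that mentions Team, Consortium or Collaboration.
people=$(printf '%s\n' "$names" | sed '/Team/d;/Consortium/d;/Collaboration/d')

printf '%s\n' "$people"
```

The patterns match anywhere in the line, so a person whose name
happens to contain one of these words would be dropped as well.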


STEP 5: We use "tee" to store the master list of all co-authors in
the file "authors.txt" while passing the stream on to the next
utility.

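STEP 6: "uniq -c" collapses each run of identical adjacent lines
into a single line, prefixed by the number of occurrences. This is
why the list had to be sorted first. The output is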

...
  23 Yan, L
   1 Yang, C
   1 Yang, M
   2 Yang, T
   3 Yao, Y
...
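
The dependence of "uniq -c" on sorted input is easy to demonstrate.
The sketch below uses a short, deliberately shuffled list; the
variable names are my own.

```shell
# A short, deliberately unsorted list of names.
names='Yao, Y
Yan, L
Yao, Y
Yan, L
Yan, L'

# Without sorting, uniq -c merges only adjacent duplicates,
# so the same name shows up on several lines.
unsorted=$(printf '%s\n' "$names" | uniq -c)

# After sorting, identical names are adjacent and each name
# gets exactly one line with its total count.
sorted=$(printf '%s\n' "$names" | sort | uniq -c)

printf 'unsorted:\n%s\n' "$unsorted"
printf 'sorted:\n%s\n' "$sorted"
```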

STEP 7: It would be useful to organize the histogram in descending
order.  To this end we sort on the first column (default) but
undertake a numerical sort (hence, "-n") and also in descending
order ("-r").  Voila, the output is

 573 Kulkarni, S
 157 Frail, D
 152 Cenko, S
 151 Kasliwal, M
 143 Ofek, E
 124 Gal-Yam, A
 117 Bloom, J
 115 Nugent, P
  80 Laher, R
  78 Fox, D
  75 Berger, E
  72 Quimby, R
  59 Arcavi, I
  57 Sullivan, M
  56 Law, N
  53 Masci, F
  52 Filippenko, A
  51 Prince, T
  50 Surace, J
...
This file is written to "SRK_histogram".
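
For a quick look at only the most frequent co-authors, the final
sort can be followed by "head". A minimal sketch, using the counts
from the step 6 excerpt above; the variable names and the choice of
the top two are arbitrary.

```shell
# Counts as produced by "uniq -c" (taken from the step 6 excerpt).
counts='  23 Yan, L
   1 Yang, C
   1 Yang, M
   2 Yang, T
   3 Yao, Y'

# Numerical sort, descending, then keep the first two lines.
top=$(printf '%s\n' "$counts" | sort -nr | head -n 2)

printf '%s\n' "$top"
```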

------------------------------------------------------------------------
$ cat authorsphere
------------------------------------------------------------------------
#!/bin/bash

# Usage: ./authorsphere input.txt > histogram

# Require the name of the file holding the exported list of papers.
if [[ -z $1 ]]; then
    echo "exit: need list of papers file" >&2
    exit 1
fi
infile=$1


sed 's/ ([12][0-9][0-9][0-9]).*$//;s/&//' "$infile" |    \
   awk -F "[.]," '{for (i=1;i<=NF;i++){print $i}}' |     \
   sed 's/^  *//;s/\..*//'                         |     \
   sort -t ","                                     |     \
   tee authors.txt                                 |     \
   uniq -c                                         |     \