------------------------------------------------------------------------
Query Gaia (also how to reduce XML files to tsv)
------------------------------------------------------------------------

Our goal is to have a simple command line utility that does a cone
search of the Gaia data base and returns a tsv file.  Why tsv? (or
csv?) Because they are simple to ease (especially if you are dealing
with a single object). Also you have a full toolkit "csvkit" to
manipulate and display tsv/csv files.


HISTORY:   First optical imaging and then optical spectroscopy took
off at the end of the nineteenth century. Unfortunately, astrometry
remained at the arcsecond (absolute) and milli-arcsecond (relative)
through the twentieth century. Accurate and precision astrometry
became synonymous with radio astronomy, thanks to VLA and VLBI.  In
fact, the astronomical coordinate system ("International Celestial
Reference System", ICRS) is rooted in radio astronomy with 3C273
providing the absolute zero point for the Right Ascension.

It has been clear all along along that great advances in optical
astrometry required going to space. ESA's Hipparcos was the first
space mission, operating from 1989 to 1993. The transfer to
geo-stationary orbit failed and operations and data analysis were
reconfigured to carry out astrometry with the satellite in the
transfer orbit. The final catalog consisted of over 100,000 stars
with an accuracy of one mas (or slightly better) and one million
stars with an accuracy of 25 mas ("Tycho"; about 11 mag).  Overall,
the true cost of HIPPARCOS in 2000 base was probably $1B.  Despite
spending considerable funds, the US missions (FAME, SIM), were not
funded beyond phase B.

GAIA. ESA, in its usual disciplined way, proceeded with Gaia
(2013-present).  This mission, providing stunning astrometry for
over a million objects (mainly stars; 21 mag), single handedly
revolutionized astrometry. Only with Gaia can the optical astronomers
claim an absolute reference frame (which necessarily requires a
suitably dense grid of extra-galactic sources) of their own.  Here,
I provide a simple command line interface to obtain Gaia data for
a simple cone search. Unfortunately, the output of Gaia data is in
XML (one of the many highly verbose formats).

XML.  The structure of a simple XML files is like Matryoshka (Russian)
dolls.  The header block is the set of lines that follow <TABLE>
and the entries are <FIELD name=...". The data block follows <DATA>.
Within <DATA> there can be different types of data. Relevant to the
Gaia cone search the relevant data is <TABLEDAT> Each data record
is enclosed by two bookends: <TR> and </TR>.  See Appendix for a
sample file.

------------------------------------------------------------------------
The Mechanics of getting Gaia data (single or few objects)
------------------------------------------------------------------------

Fortunately, AIP (Germany) offers a simple old-fashioned API
interface.
	https://gaia.aip.de/cms/documentation/cone-search/

$ IFILE="gaia.xml"
$ curl "https://gaia.aip.de/conesearch/api/gaia_source/?RA=310.827805&DEC=44.638884&SR=0.002&VERB=3" \
  > $IFILE

Apart from RA and DEC the choices are search radius (in degrees)
and the level of output (VERB=1 .. select source_id  for every
source; VERB=2, for selected columns and VERB=3 for all columns).

What follows is a swashbuckler's approach.
Step 1: Let extract the header information

First, we extract the header.

$ DFS=","    #Choose "," or "\t" to get csv or tsv format

$ gsed -n '/<FIELD /{s/\(^.* name=\)\("[^"]*\)\(.*$\)/\2"/;H;d;};/<DATA>/{x;s/\n/'$DFS'/gp;q;}' $IFILE | gsed 's/^'"$DFS"'//' >hdr.dat

A brute force approach would have been to use grep to extract lines
containing "<FIELD ". [I probably would have used it when I was a
novice].  However, that approach is computationally inefficient
since it will consider every line of the XML file.  This one liner
was designed to do the least amount of computation. It does not go
through the whole xml file to find the header fields.  It is worth
analyzing this code.

$ gsed -n '/<FIELD /{s/\(^.* name=\)\("[^"]*\)\(.*$\)/\2"/;H;d;};       #1
          /<DATA>/{x;s/\n/'$DFS'/gp;q;}' | gsed 's/^"$DFS"'//' >hdr.dat #2

From the XML file you can see that each entry of the header block
has "<FIELD " key word and ends with "<DATA>". In #1 we inspect
lines with this key word and isolate the key word. We store that
key word in the hold buffer ("H"). The clever part is to use "d"
which returns the control to the top (since there is nothing more
to be done for that line). The line with "<DATA>" marks the end of
header block. So we now get back the collected key words and
substitute $DFS for "\n" and print the pattern buffer. The next
clever bit is that we are now done and so we quit.  [Note we use
"gsed" instead of sed because sed does not correctly interpret \n
or \t characters].


Let us review the columns 
$ tr "$DFS" '\n'< hdr.dat | nl
 1	"solution_id"
     2	"designation"
     3	"source_id"
     4	"random_index"
     5	"ref_epoch"
     6	"ra"
     7	"ra_error"
     8	"dec"
     9	"dec_error"
    10	"parallax"
    11	"parallax_error"
    ...
    95	"pos"

Next we extract the data.

$ sed -n '/<TABLEDATA>/,/<\/TABLEDATA>/{s/<TD>//;/<TR>/d;/TABLEDATA/d;s/^  *//;;p;}' gaia.xml\
         awk '{$1=$1;sub(/'"$DFS"'$/,"");print}' RS="</TR>\n" FS="</TD>\n" OFS=$DFS

The sed portion of the above command produces an output of the sort given below.

1635721458409799680</TD>
Gaia DR2 2070085317968809216</TD>
..
False</TD>
..
</TD>
..
(5.4249206668417 , 0.779079113160346)</TD>
</TR>

The "</TR>" marks the end of each data record. Our task is to stitch
this into a single line with the values separated by commas or
better still tabs (see last entry). This can be done in many ways
but I think the approach I have used here is not only elegant but
also minimalistic. I make use of awk's ability to define arbitrary
combination of characters as Record Separator (RS) or [input] Field
Separators (FS) and also to define Output Field Separator (OFS).
The strange statement "$1=$1" forces awk to rewrite the current
record with OFS. The "sub" statement gets rids of OFS at the end
of the line.


See attached program for a command line utility

------------------------------------------------------------------------
Structure of VO (XML) file
------------------------------------------------------------------------
<?xml version="1.0"?>
<VOTABLE version="1.3"
	...
    <RESOURCE type="results">
        <TABLE>
            <FIELD name="solution_id" ucd="meta.version" datatype="long" />
            <FIELD name="designation" ucd="meta.id;meta.main" arraysize="4000" datatype="char" />
	    ...
            <FIELD name="b" unit="Angle[deg]" ucd="pos.galactic.lat" datatype="double" />
            <DATA>
                <TABLEDATA>
                    <TR>
                        <TD>1635721458409799680</TD>
                        <TD>Gaia DR2 6045465918541491584</TD>
	                ...
                        <TD>15.9741115548567</TD>
                    </TR>
                    <TR>
                    ...
		    </TR>
		    <TR>
                     ...
                        <TD>350.970455302769</TD>
                        <TD>15.9710069015852</TD>
                    </TR>
                </TABLEDATA>
            </DATA>
        </TABLE>
    </RESOURCE>
</VOTABLE>