------------------------------------------------------------------------ Query Gaia (also how to reduce XML files to tsv) ------------------------------------------------------------------------ Our goal is to have a simple command line utility that does a cone search of the Gaia data base and returns a tsv file. Why tsv? (or csv?) Because they are simple to ease (especially if you are dealing with a single object). Also you have a full toolkit "csvkit" to manipulate and display tsv/csv files. HISTORY: First optical imaging and then optical spectroscopy took off at the end of the nineteenth century. Unfortunately, astrometry remained at the arcsecond (absolute) and milli-arcsecond (relative) through the twentieth century. Accurate and precision astrometry became synonymous with radio astronomy, thanks to VLA and VLBI. In fact, the astronomical coordinate system ("International Celestial Reference System", ICRS) is rooted in radio astronomy with 3C273 providing the absolute zero point for the Right Ascension. It has been clear all along along that great advances in optical astrometry required going to space. ESA's Hipparcos was the first space mission, operating from 1989 to 1993. The transfer to geo-stationary orbit failed and operations and data analysis were reconfigured to carry out astrometry with the satellite in the transfer orbit. The final catalog consisted of over 100,000 stars with an accuracy of one mas (or slightly better) and one million stars with an accuracy of 25 mas ("Tycho"; about 11 mag). Overall, the true cost of HIPPARCOS in 2000 base was probably $1B. Despite spending considerable funds, the US missions (FAME, SIM), were not funded beyond phase B. GAIA. ESA, in its usual disciplined way, proceeded with Gaia (2013-present). This mission, providing stunning astrometry for over a million objects (mainly stars; 21 mag), single handedly revolutionized astrometry. Only with Gaia can the optical astronomers claim an absolute reference frame (which necessarily requires a suitably dense grid of extra-galactic sources) of their own. Here, I provide a simple command line interface to obtain Gaia data for a simple cone search. Unfortunately, the output of Gaia data is in XML (one of the many highly verbose formats). XML. The structure of a simple XML files is like Matryoshka (Russian) dolls. The header block is the set of lines that follow and the entries are . Within there can be different types of data. Relevant to the Gaia cone search the relevant data is Each data record is enclosed by two bookends: and . See Appendix for a sample file. ------------------------------------------------------------------------ The Mechanics of getting Gaia data (single or few objects) ------------------------------------------------------------------------ Fortunately, AIP (Germany) offers a simple old-fashioned API interface. https://gaia.aip.de/cms/documentation/cone-search/ $ IFILE="gaia.xml" $ curl "https://gaia.aip.de/conesearch/api/gaia_source/?RA=310.827805&DEC=44.638884&SR=0.002&VERB=3" \ > $IFILE Apart from RA and DEC the choices are search radius (in degrees) and the level of output (VERB=1 .. select source_id for every source; VERB=2, for selected columns and VERB=3 for all columns). What follows is a swashbuckler's approach. Step 1: Let extract the header information First, we extract the header. $ DFS="," #Choose "," or "\t" to get csv or tsv format $ gsed -n '//{x;s/\n/'$DFS'/gp;q;}' $IFILE | gsed 's/^'"$DFS"'//' >hdr.dat A brute force approach would have been to use grep to extract lines containing "/{x;s/\n/'$DFS'/gp;q;}' | gsed 's/^"$DFS"'//' >hdr.dat #2 From the XML file you can see that each entry of the header block has "". In #1 we inspect lines with this key word and isolate the key word. We store that key word in the hold buffer ("H"). The clever part is to use "d" which returns the control to the top (since there is nothing more to be done for that line). The line with "" marks the end of header block. So we now get back the collected key words and substitute $DFS for "\n" and print the pattern buffer. The next clever bit is that we are now done and so we quit. [Note we use "gsed" instead of sed because sed does not correctly interpret \n or \t characters]. Let us review the columns $ tr "$DFS" '\n'< hdr.dat | nl 1 "solution_id" 2 "designation" 3 "source_id" 4 "random_index" 5 "ref_epoch" 6 "ra" 7 "ra_error" 8 "dec" 9 "dec_error" 10 "parallax" 11 "parallax_error" ... 95 "pos" Next we extract the data. $ sed -n '//,/<\/TABLEDATA>/{s//d;/TABLEDATA/d;s/^ *//;;p;}' gaia.xml\ awk '{$1=$1;sub(/'"$DFS"'$/,"");print}' RS="\n" FS="\n" OFS=$DFS The sed portion of the above command produces an output of the sort given below. 1635721458409799680 Gaia DR2 2070085317968809216 .. False .. .. (5.4249206668417 , 0.779079113160346) The "" marks the end of each data record. Our task is to stitch this into a single line with the values separated by commas or better still tabs (see last entry). This can be done in many ways but I think the approach I have used here is not only elegant but also minimalistic. I make use of awk's ability to define arbitrary combination of characters as Record Separator (RS) or [input] Field Separators (FS) and also to define Output Field Separator (OFS). The strange statement "$1=$1" forces awk to rewrite the current record with OFS. The "sub" statement gets rids of OFS at the end of the line. See attached program for a command line utility ------------------------------------------------------------------------ Structure of VO (XML) file ------------------------------------------------------------------------
//;/
... ... ... ...
1635721458409799680 Gaia DR2 604546591854149158415.9741115548567
350.970455302769 15.9710069015852