------------------------------------------------------------------------ IPAC Table format ------------------------------------------------------------------------ Recently I wanted to obtain a list of IPAC database holdings (specifically lists of catalogs). After some effort I found this statement "... see the 'description' and 'catname' columns of irsa_catalogs.tbl" "allwise_p3as_psd" So I dutifully searched this file and found it $ curl -o irsa_catalogs.tbl "https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-scan?mode=ascii" This is an unwieldy file as can be gathered from this one liner. $ head -n 5 irsa_catalogs.tbl | awk '{print length}' 80 11 34 3704 3704 This file uses "IPAC Table Format" (ITF) which is not well suited to modern tools. After a day of work I developed a Unix based approach to read ITF files. The task was made enjoyable by the forays I undertook as I tried to understand the origin of this format. I learnt the history of Unix, IRAS and so on. As usual, knowledge is highly interconnected. If you are the sort who likes broad gathering of knowledge and synthesis then do read the rest of this article. You will both enjoy and learn new things. HISTORY: UNIX. Unix is now the standard operating system (OS). It triumphed over OS for Data General, DEC and IBM. The standard languages in the sixties through eighties was Fortran (for scientists) and COBOL (for business). Unix was released to Universities in 1973 and it ran only on DEC equipment. In 1978 Unix was used on 600 DEC machines in some form. The GNU project was born in 1983. The computer giant, IBM, started offering Unix in 1986. However, the installed based of Unix was small. Unix took off with the launch of Sun workstation (SPARC; 1987) and the famed Intel 30286 (1989). Unix had two distinct features. First the OS was not written in assembly language but in C and second it treated everything as a file or file system. It is the latter that gives awesome power (via pipelines). In contrast to the prevailing standards of fixed format files, Unix works on files that are organized by a simple algorithm (e.g. tab separated variable file) and it is this generality that makes it possible to use pipelines in Unix. HISTORY: IPAC (& IPAC Table Format). IPAC was created to support IRAS (launched in early 1983; operated until the end of the year). The results from IRAS were still reverberating in 1985 when I joined Caltech. IPAC created an internal standard --the IPAC Table Format (ITF) -- to support IRAS data and over time ITF was used for 2MASS (1997-2001). It is now a fixture in IRSA. ITF uses "fixed width" fields which belies its deep connection to Fortran. The fields are separated by "|" (pipe). The widths can be inferred from the first line in the descriptor block. An ITF has a pleasing appearance when printed on a 132-column printer. However, files with longer records are decidedly ugly to view. MODERN TIMES: ITF is a weird hybrid with the underlying spirit derived from Fortran but with flexibility in defining the fields (widths, number of fields, type of variable) which is reminiscent of Unix. Here, I show how IPAC Table format files can be read with Unix tools. Along the way we will discover a (obscure) feature of awk which was designed explicitly to read Fortran files! ------------------------------------------------------------------------ I. The IPAC Table Format ------------------------------------------------------------------------ The IPAC Table format has three blocks. Header & comments (which start off with "\" and "\ ", respectively. The descriptor block, with "|" serving as the bookends for each line, follows the header & comment block. The descriptor to the table requires a minimum of two lines. The first line has the names of the column and the next line the type of the value in each table (double, int, real, char, logical..) There could up two additional descriptor block lines. The data lines have no "|" whatsoever and begin after descriptor block. The key to IPAC Table format is that values for each column must lie between adjacent "|" (whose column numbers are fixed). The standard example for the IPAC Table format can be found from the IRSA homepage. $cat IPAC.tbl \catalog = 'Sample Catalog' \ A sample table comment | ra | dec | n_obs | V | SpType | | double | double | int | real | char | | deg | deg | | mag | | | null | null | null | null | null | 165.466279 -34.704730 5 11.27 K6Ve There are two solutions to analyze ITF files. The first is to substitute "|" at the appropriate locations in the data portion of the file. The resulting ".csv" file (pipe separated value file) is essentially a csv file and well suited to reading by awk or csvkit etc. The second is an arcane solution and takes advantage of "FIELDWIDTH" feature of awk (explicitly designed to read FORTRAN files). Nota bene: I found that using "|" resulted in erratic behavior. I decided to stick with ";" which seems acceptable ------------------------------------------------------------------------ IIa Convert IPAC Table to csv format ------------------------------------------------------------------------ For this solution we have to replace blank character by ";" in appropriate columns of the data file. The locations of "|" in the input file are obtained from the first line of the descriptor block. | ra | dec | n_obs | V | SpType | We assign index 1 to first character "|", index 2 to the next and so on. The last character is a "|" and has index 59. We determine the index of each "|" as follows $ PIPES=$(sed -n '3p' IPAC.tbl | gsed 's/./&\n/g' | nl | sed -n 's/|//p' | xargs) $ echo $PIPES 1 13 25 35 46 59 This piece of code does the task we set out to do $ awk -v PIPES="$PIPES" 'BEGIN{split(PIPES,ind)} \ /^ /{n=split($0,a,""); delete a[1]; \ for (i in ind){a[ind[i]]=";"}; delete a[1]; delete a[n]; \ for (i in a){b=b a[i]};print b \ }' IPAC.tbl line 1: pass the bash variable "$PIPES" to awk. In the BEGIN block, extract the values into "ind" line 2: convert input line into an array of characters using split line 3: effect the ";" substitution. delete the first and last characters of the record (which are "|", in effect) line 4: rejoin the characters of array into a single line and print line This logic forms the basis of the stand alone utility "ipac2csv" (attached). APPLICATION. Recall my starting point was to interpret an IRSA ITF file "irsa_catalog.tbl" which apparently lists all the catalog files of IRSA. $ curl -o irsa_catalogs.tbl "https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-scan?mode=ascii" Let us decode the first line of the descriptor block, using awk $ sed -n '4{p;q;}' $IRSA_FILE | tr '|' '\n' | sed 's/ //g' | nl 1 cntr 2 description 3 subtitle 4 ack 5 infourl 6 server 7 database 8 catname 9 ddname 10 ddlink 11 xpffile 12 cols 13 nrows 14 ext 15 key 16 pos 17 rel 18 coneradius 19 uploadradius 20 projectshort 21 missionid 22 groupid 23 access_flag 24 alg 25 algv 26 ex 27 where1 28 where2 29 mojoin 30 modataset 31 toperiod 32 periodresult 33 sptlevel 34 visibility 35 dbms Let us only print out "description" (2), "catname" (7), "database" (8). There are lots of blanks. Also, let us restrict to the WISE mission (for now). $ awk -F";" '{print $7,$8,$2}' OFS="\t" irsa_catalogs.csv | sed 's/ */ /g' | grep WISE | less -S wise_allwise allwise_p3as_psd AllWISE Source Catalog wise_allwise allwise_p3as_mep AllWISE Multiepoch Photometry Table wise_allwise allwise_p3as_psr AllWISE Reject Table wise_allwise allwise_p3as_cdd AllWISE Atlas Metadata Table wise_allwise allwise_p3am_xrf AllWISE Frame Cross-Reference Table wise_allwise allwise_p3al_lod AllWISE Atlas Inventory Table wise_allwise allwise_p3am_cdd AllWISE Atlas Image Inventory Table wise_allwise allwise_mfpos AllWISE Refined Pointing Information for the Single- neowiser neowiser_p1bs_psd NEOWISE-R Single Exposure (L1b) Source Table neowiser neowiser_p1ba_mch NEOWISE-R Known Solar System Object Possible neowiser neowiser_p1bs_frm NEOWISE-R Single Exposure (L1b) Frame Metada neowiser neowiser_p1bl_lod NEOWISE-R Single Exposure (L1b) Scan Invento neowiser neowiser_p1bm_frm NEOWISE-R Single Exposure (L1b) Image Inven .... ------------------------------------------------------------------------ IIb. Using FIELDWIDTH feature of awk ------------------------------------------------------------------------ We deduce the field widths from the first line of the description block $ IRSA_FILE=irsa_catalogs.tbl $ FW=$(sed -n '/^|/{s:|:& :g;s:^|: :;p;q;}' $IRSA_FILE | tr '|' '\n' | awk '{print length}' | sed '$d' | xargs | sed 's/.*/"&"/') The field widths are captured into $FW. $ echo $FW "42 303 103 83 203 33 33 43 63 103 103 13 22 13 13 5 353 12 14 43 13 13 13 43 83 203 303 303 503 43 503 43 13 12 13" Having obtained the field widths we use awk $ sed '/^\\/d' $IRSA_FILE | awk 'BEGIN{FIELDWIDTHS='$FW'}{print $7,$8,$2}' OFS="\t" | sed 's/ //g' | column -t | grep WISE Restricting to WISE we find wise_allwise allwise_p3as_psd AllWISESourceCatalog wise_allwise allwise_p3as_mep AllWISEMultiepochPhotometryTable wise_allwise allwise_p3as_psr AllWISERejectTable wise_allwise allwise_p3as_cdd AllWISEAtlasMetadataTable wise_allwise allwise_p3am_xrf AllWISEFrameCross-ReferenceTable wise_allwise allwise_p3al_lod AllWISEAtlasInventoryTable wise_allwise allwise_p3am_cdd AllWISEAtlasImageInventoryTable wise_allwise allwise_mfpos AllWISERefinedPointingInformationfortheSingle-exposureImages neowiser neowiser_p1bs_psd NEOWISE-RSingleExposure(L1b)SourceTable neowiser neowiser_p1ba_mch NEOWISE-RKnownSolarSystemObjectPossibleAssociationList(Caution) neowiser neowiser_p1bs_frm NEOWISE-RSingleExposure(L1b)FrameMetadataTable neowiser neowiser_p1bl_lod NEOWISE-RSingleExposure(L1b)ScanInventoryTable neowiser neowiser_p1bm_frm NEOWISE-RSingleExposure(L1b)ImageInventoryTable wise_allsky allsky_4band_p3as_psd WISEAll-SkySourceCatalog ....