------------------------------------------------------------------------
lexical collation of LaTeX .bib file
------------------------------------------------------------------------

This problem was posed by Yuhan Yao, Caltech:

"Could you please sort my bib file alphabetically by the last name of
the first author?"

The bib file in question is at2019dge.bib. In the file supplied by
Yuhan, each reference is a bibtex entry ending with a zero-character
line ("blank line"). The first line of each entry carries a key in
which the first author's name is neatly captured (by Yuhan). The goal
is to sort this file alphabetically, using the first author's last
name as the sorting key.

The file at2019dge.bib can be found at
http://www.astro.caltech.edu/~srk/SRKUnix/Examples/at2019dge.bib

This file contains multi-line records. The record separator, RS, is ""
("blank line").

$ cat at2019dge.bib
@ARTICLE{Zou2017,
       author = {{Zou}, Hu and {Zhang}, Tianmeng and {Zhou}, Zhimin and
         {Nie}, Jundan and {Peng}, Xiyan and {Zhou}, Xu and {Jiang}, Linhua and
         {Cai}, Zheng and {Dey}, Arjun and {Fan}, Xiaohui and {Fan}, Dongwei and
         {Guo}, Yucheng and {He}, Boliang and {Jiang}, Zhaoji and
         {Lang}, Dustin and {Lesser}, Michael and {Li}, Zefeng and {Ma}, Jun and
         {Mao}, Shude and {McGreer}, Ian and {Schlegel}, David and
         {Shao}, Yali and {Wang}, Jiali and {Wang}, Shu and {Wu}, Jin and
         {Wu}, Xiaohan and {Yang}, Qian and {Yue}, Minghao},
        title = "{The First Data Release of the Beijing-Arizona Sky Survey}",
      journal = {\aj},
..
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

@MISC{Bellm2016,
       author = {{Bellm}, Eric C. and {Sesar}, Branimir},
..
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

@ARTICLE{Arcavi2017,
       author = {{Arcavi}, Iair and {Hosseinzadeh}, Griffin and {Brown}, Peter J. and
..
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
..

There are two subtleties:

1. In any alphabetical sort, capital versus lower case matters (because
   A-Z and a-z occupy different locations in the character set). So it
   is best to make a "case insensitive" sort.

2. Yuhan's file has a not-uncommon problem: the last record does not
   end with an RS. I added a blank line.

------------------------------------------------------------------------
Solution I: Use Unix tools to create index file & awk to write output
------------------------------------------------------------------------

This solution requires two passes over the data. Our first job is to
create a list of keywords, one for each paper.

$ sed -n 's/^@.*{//p' at2019dge.bib | nl
     1  Zou2017,
     2  Bellm2016,
     3  Arcavi2017,
     4  Arnett1982,
     5  Astropy-Collaboration2013,
     6  BC03,

We sort on the second column ("-f" is "fold", which is the same as
"case insensitive") and store the record numbers in a file, index.list.

$ sed -n 's/^@.*{//p' at2019dge.bib | nl | sort -k2 -f | cut -f1 | tee index.list
     3
     4
     5
     6
     2
...

We construct an awk program, bib.awk, to read in the index file, read
the bib file, and write out the bib references in the order given by
the index file.

Note some awk arcana: in bib.awk, "ind[i]+0" is needed to ensure that
ind[i] is interpreted as an integer (nl pads the record numbers with
leading blanks, so the raw string would not match the array subscript).
$ awk -f bib.awk at2019dge.bib

where

$ cat bib.awk
BEGIN{while((getline<"index.list")>0){ind[++i]=$0};RS="";ORS="\n\n"}  #read in index.list
{rec[++j]=$0}                                                         #read bib file
END{for (i=1;i<=length(ind);i++){print rec[ind[i]+0]}}                #print in order of index.list

An alternative which does the same job is as follows (and is shorter!):

$ awk -f bib2.awk index.list at2019dge.bib

$ cat bib2.awk
BEGIN{RS="";FS=","; ORS="\n\n"}
FNR==NR{for (i=1;i<=NF;i++) {ind[i]=$i+0}}   #make clever use of RS="" to read entire file
FNR!=NR{rec[++j]=$0}
END{for (i=1;i<=length(ind);i++) {print rec[ind[i]]}}

------------------------------------------------------------------------
Solution II: Only one pass but no unix tools, only awk
------------------------------------------------------------------------

$ awk -f sortbib.awk at2019dge.bib

where

$ cat sortbib.awk
BEGIN {RS="";FS=","}
{rec[++i]=$0
 a=$1; sub("^@.*{","",a)       # a=Zou2017
 b[i]=a", "i                   # b[1]=Zou2017, 1
}
END {
 n=asort(b)                    #sort b[] alphabetically (asort is a gawk extension)
 for (k=1;k<=n;k++) {
    m=split(b[k],outb,",");    #outb[m]=3 (corresponding to Arcavi2017)
    ind=outb[m];
    print rec[ind+0] "\n"      #subtlety: coerce ind to behave as integer
 }
}

------------------------------------------------------------------------
Solution III: A solely Unix solution (a triumph!)
------------------------------------------------------------------------

$ gsed -e 's/\(^@.*{\)\(.*$\)/\2 \1\2/' \
       -e 's/^$/\x0/' at2019dge.bib |       #1a (first -e), 1b (second -e)
  sort -z |                                 #2
  sed '/@/s/^[^ ]*, //' |                   #3
  tr -d '\000' |                            #4
  sed 1d > alpha_sort.bib                   #5

Step #1a  Using the "playback" (back-reference) feature of sed, extract
          the key word and place it at the start of the first line of
          each bib record.

Step #1b  In the input file a blank line "^$" separates bibliographic
          records from each other. Replace this line with a NUL
          character (\x00, which is visually displayed as ^@).
          IMPORTANT: must use "gsed" (GNU sed) since "sed" does not
          deal with NUL.

Bellm2016, @MISC{Bellm2016,
       author = {{Bellm}, Eric C. and {Sesar}, Branimir},
        title = "{pyraf-dbsp: Reduction pipeline for the Palomar Double Beam Spectrograph}",
     keywords = {Software},
         year = "2016",
        month = "Feb",
          eid = {ascl:1602.002},
        pages = {ascl:1602.002},
archivePrefix = {ascl},
       eprint = {1602.002},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2016ascl.soft02002B},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
^@
Zou2017, @ARTICLE{Zou2017,
       author = {{Zou}, Hu and {Zhang}, Tianmeng and {Zhou}, Zhimin and {Nie}, Jundan and
...

Step #2   Sort on the first word (the key word). The flag "-z" forces
          sort to consider \x00 as the record separator instead of the
          usual newline (\n).

Step #3   Now that the file has been sorted, delete the key word that
          was inserted in step 1a.

Step #4   Delete the NULs. What remains of each separator line is an
          empty line ("^$"), i.e. the original blank-line separator.

Step #5   It appears that an extra line gets inserted at the top.
          Delete it.
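
A quick way to convince oneself that the two-pass approach of Solution I
behaves as described is to run it on a tiny file. The sketch below is
only an illustration, not part of the original recipe: the three keys
(one deliberately in lower case), the file name toy.bib and the
temporary directory are all made up, and the awk program is a compact
variant of bib.awk that takes the index file name through a variable
(idx) instead of hard-coding it.

```shell
# Toy input: three records separated by blank lines. Keys and file names are
# hypothetical; "bellm2016" is lower case on purpose to exercise "sort -f".
tmp=$(mktemp -d)
cat > "$tmp/toy.bib" <<'EOF'
@ARTICLE{Zou2017,
  title = "Z"
}

@MISC{bellm2016,
  title = "B"
}

@ARTICLE{Arcavi2017,
  title = "A"
}

EOF

# Pass 1: number the keys, sort them case-insensitively, keep the record numbers.
sed -n 's/^@.*{//p' "$tmp/toy.bib" | nl | sort -k2 -f | cut -f1 > "$tmp/index.list"

# Pass 2: read the index, then the records (RS=""), and print in index order.
sorted=$(awk -v idx="$tmp/index.list" '
  BEGIN { while ((getline line < idx) > 0) ind[++n] = line + 0   # +0 strips the nl padding
          RS = ""; ORS = "\n\n" }
  { rec[NR] = $0 }
  END { for (i = 1; i <= n; i++) print rec[ind[i]] }
' "$tmp/toy.bib")

# Collect the keys of the sorted output on one line for inspection.
keys=$(printf '%s\n' "$sorted" | sed -n 's/^@.*{\(.*\),$/\1/p' | paste -sd' ' -)
echo "$keys"
```

With the case-insensitive sort the lower-case key lands between the
other two; dropping "-f" in the C locale would instead push it to the
end, since a-z sort after A-Z there.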