Creating my perfect citation system using LaTeX

It's not Word and EndNote.

I'd write my article in a text editor (I like vim). When I needed to cite a reference I'd write something like [PMID:12345]. This would mean "cite the PubMed article with ID 12345 and add it to the reference list".

Then I'd get citations in the right format and a reference list in the correct journal style, and there would be no more to it than that.

If for some reason I needed a searchable library of references (as opposed to just using PubMed, Google or Google Scholar), I could build one by running grep -R PMID: my_workspace/ | xargs convert_refs_to_library and then grep within the output.

No need for elaborate reference manager functionality. I don't have time to spend all day tagging articles and arranging them into neat categories. Isn't that someone else's job?

Compare this process with using EndNote X. Pay £100. Create a library, "connect to PubMed", search for the PMID - remembering to set the drop-down to the correct field each time - incredibly annoying. Add citation to library. Click on the reference. Flip back to Word and click to insert. Pray that the OLE automation doesn't crash again. Gah!!

And before anyone tells me that Zotero, Mendeley, Reference Manager, etc. etc. are any better - I've tried them all and none can leave me alone enough to simply write papers.

I'd already decided to dump Word with all its little annoyances. I think in mark-up, so let me write in mark-up. But journals don't accept HTML or XML or Markdown; it has to be PDF, RTF or Word.

LaTeX is the obvious way to proceed, although I'd always thought it was just for mathematicians. In desperation I downloaded MiKTeX and gave it a try.

A mark-up language in theory works so much better for writing. Text files are diff-able, and you don't have to think about formatting - what typeface or font size or margins to use. There's no problem writing large documents, the kind that can cause Word to choke.

And LaTeX comes with a reference system called BibTeX, which is extremely simple. You define references in a standard format in a .bib file. Each reference has a key, which is what you use to cite it. The key can be something memorable like "Loman10" or, in my case, "pmid12345".
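To make the format concrete, here is a minimal sketch that renders one BibTeX entry with a pmid-style key. The function name make_bibtex_entry and all the field values are placeholders of mine, not real reference data, and a real entry would carry whichever fields the journal style needs.

```python
def make_bibtex_entry(key, fields, entry_type="article"):
    """Render one BibTeX entry as text; fields is a list of (name, value) pairs."""
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields:
        # Braces delimit each field value.
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Placeholder values only, to show the shape of an entry.
entry = make_bibtex_entry("pmid12345", [
    ("author", "Surname, A. and Other, B."),
    ("title", "An example title"),
    ("journal", "J Example"),
    ("year", "2010"),
])
print(entry)
```

An entry like this in the .bib file is what \cite{pmid12345} in the tex file resolves to.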

Journals such as PLoS ONE have a template file to make this come out nice when you render to PDF.

So, easy: all I have to do is cite PubMed IDs in my tex file and write a little script to pull out the citations and retrieve the references from Entrez. How hard can that be?

So easy in fact, there's already a Python class that does the heavy lifting: PyP2B.

So, I should be able to do something like:

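import re
from pyP2B import pyP2B  # assumption: the class lives in a module of the same name

myref = pyP2B()
for m in re.findall(r"cite\{pmid(\d+)\}", tex):  # tex holds the text of the .tex file
    print >>bibfile, myref.getPubmedReference(m)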
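The one-line regex only catches a single key per \cite. As a rough sketch of a slightly more forgiving version, the snippet below also handles comma-separated keys, skips non-pmid keys, drops repeats, and builds a single Entrez EFetch URL for the lot. The helper names find_pmids and efetch_url are my own inventions, and this is not necessarily how the script on GitHub does it; it does not perform the network request itself.

```python
import re

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def find_pmids(tex):
    """Return the PMIDs cited as pmidNNN keys, in first-seen order, no repeats."""
    pmids = []
    # Match \cite{...} and simple variants such as \citep{...} (no optional args).
    for keys in re.findall(r"\\cite[a-z]*\{([^}]*)\}", tex):
        for key in keys.split(","):
            m = re.match(r"pmid(\d+)$", key.strip())
            if m and m.group(1) not in pmids:
                pmids.append(m.group(1))
    return pmids


def efetch_url(pmids):
    """Build one EFetch request URL covering all the given PMIDs."""
    query = urlencode([("db", "pubmed"), ("id", ",".join(pmids)), ("retmode", "xml")])
    return EFETCH_URL + "?" + query


sample = r"Seen before \cite{pmid12345, Loman10} and again \cite{pmid678,pmid12345}."
print(find_pmids(sample))
print(efetch_url(find_pmids(sample)))
```

Fetching all the IDs in one request keeps the number of calls to Entrez down, which matters once a paper has more than a handful of references.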

Inevitably, there are a few issues before citation nirvana. PyP2B uses lxml for parsing, but I use Python under Cygwin and I can't get lxml to build there.

No matter: I convert it to use xml.dom.minidom and py-dom-xpath (http://code.google.com/p/py-dom-xpath/) instead.

Another problem - journal titles come out in a strange sentence-case format, with only the first word capitalised, such as "Journal of hospital infection". This seems to be a peculiarity of the Entrez API. I decide to use /PubmedArticleSet/PubmedArticle/MedlineCitation/MedlineJournalInfo/MedlineTA instead. This field is slightly abbreviated, but I can sort that out later if necessary, perhaps by looking up the ISSN of the journal.
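For illustration, here is roughly what pulling out that field looks like. Note that this sketch uses the standard library's xml.etree.ElementTree rather than the minidom and py-dom-xpath combination I actually switched to, and the XML is a cut-down stand-in for an EFetch response with placeholder values. The function name journal_names is made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for an EFetch response; the values are placeholders.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal><Title>Journal of example studies</Title></Journal>
      </Article>
      <MedlineJournalInfo>
        <MedlineTA>J Example Stud</MedlineTA>
      </MedlineJournalInfo>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def journal_names(xml_text):
    """Map each PMID in the response to its MedlineTA journal abbreviation."""
    root = ET.fromstring(xml_text)  # the root element is PubmedArticleSet
    names = {}
    for citation in root.findall("PubmedArticle/MedlineCitation"):
        pmid = citation.findtext("PMID")
        abbrev = citation.findtext("MedlineJournalInfo/MedlineTA")
        if abbrev is None:
            # Fall back to the full journal title when no abbreviation is present.
            abbrev = citation.findtext("Article/Journal/Title")
        names[pmid] = abbrev
    return names


print(journal_names(SAMPLE_XML))
```

Because the root element is already PubmedArticleSet, the paths passed to findall and findtext start one level below the full path quoted above.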

Another problem - non-ASCII characters seem to cause PyP2B to choke. I convert PyP2B to be unicode-safe and spit out UTF-8. Still issues: BibTeX doesn't like UTF-8 characters. So I convert commonly-seen non-ASCII characters into their LaTeX symbols using the table posted on Stack Exchange.
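The idea can be sketched as below. To be clear, this is not the Stack Exchange table: as a stand-in, it decomposes accented letters with unicodedata and maps a handful of combining accents to LaTeX accent commands, plus a tiny hand-written table for characters that don't decompose. The names ACCENTS, SPECIALS and to_latex are hypothetical and the coverage is deliberately partial.

```python
import unicodedata

# Combining accent -> LaTeX accent command (a small, partial table).
ACCENTS = {
    u"\u0300": "`",   # grave
    u"\u0301": "'",   # acute
    u"\u0302": "^",   # circumflex
    u"\u0303": "~",   # tilde
    u"\u0308": '"',   # diaeresis / umlaut
}

# Characters with no accent decomposition need their own entries.
SPECIALS = {
    u"\u00df": "{\\ss}",  # sharp s
    u"\u00f8": "{\\o}",   # o with stroke
    u"\u00e6": "{\\ae}",  # ae ligature
}


def to_latex(text):
    """Replace the non-ASCII characters we know about with LaTeX commands."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in SPECIALS:
            out.append(SPECIALS[ch])
        else:
            # Split e.g. u-umlaut into "u" plus a combining diaeresis.
            parts = unicodedata.normalize("NFD", ch)
            if len(parts) == 2 and ord(parts[0]) < 128 and parts[1] in ACCENTS:
                out.append("{\\%s%s}" % (ACCENTS[parts[1]], parts[0]))
            else:
                out.append(ch)  # unknown character: leave it untouched
    return "".join(out)


print(to_latex(u"M\u00fcller, Caf\u00e9, Stra\u00dfe"))
```

Anything the tables don't cover passes through unchanged, so it is easy to spot in the .bib file and add by hand.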

Nearly there, now.

Add \bibliography{myrefs} to my tex doc.

My Makefile looks like:

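paper.pdf: paper.tex
        # (in a real Makefile these recipe lines must start with a tab)
        # pull the pmid citations out of paper.tex and write myrefs.bib
        python gorefs.py paper.tex myrefs.bib
        rm -f paper.aux
        # first pdflatex run writes the citation keys to paper.aux for bibtex
        pdflatex paper
        bibtex paper
        pdflatex paper
        pdflatex paper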

Typing make gives me a PDF with nicely formatted citations and references, and no stupid reference manager system needed.

Now I've just got to finish the paper ;)

Source code on GitHub.