Scientific writing in Latex

11.03.2019

Latex, Writing


The question of “What is the best writing tool for academic writing?” pops up quite a lot, at least in the crowd I hang out with. People seem to have strong opinions on the subject, and it stirs up the same kind of fervor as questions like “What is the best operating system?” or “What is the best editor?”.

I’ve done my fair share of sermonesque proclamation regarding software choices, so I’ll try to leave that out of this post. Instead, I hope to illustrate how adopting a writing toolchain based on a plain text markup language with typesetting may be more favourable in scientific writing than using a word processor. I’ll focus on the Latex & MS Word juxtaposition here, but many of the arguments apply to other markup languages (e.g. markdown, asciidoc or restructured text) and word processors (e.g. Libre Office). Since Latex is a bit archaic, some of the new kids on the block are pretty appealing, especially markdown+pandoc based workflows. But that is beyond the scope of this post.

The usual arguments

The first arguments spouted out by Latex-zealots are

While I do agree with all of these points, I think people clinging to Latex should sometimes try recent versions of MS Word. It does a fairly decent job at all the aforementioned tasks. The template argument is also losing its validity since some journals seem to prefer .docx. It also seems that some publishers flat out refuse to work with anything except MS Word (discussed in the comments of the “Writing a Book with Unix” post on Hacker News). Furthermore, people who truly know how to use MS Word (macro-magic, Track Changes etc.) can be amazingly effective. Consequently, the three arguments listed above are not compelling enough to persuade someone over to the «Lah-tech» side. Given that the learning curve of Latex is steeper than that of MS Word, the arguments for using it need to be more convincing. For some further arguments both for and against Latex, I think this reddit post contains most of the common ones. I also agree with this highly scientific comparison done by Dheepak Krishnamurthy.

Learning curve

Like any change, adopting a non-word-processor-based workflow requires quite a bit more time and effort than sticking with your favorite word processor. The workflow is also initially somewhat off-putting because on the surface it appears more complex and requires a shift in the way one thinks about documents. However, traditional word processors are in fact also complex. The complexity is just hidden away from the user. When you need to modify, extend or augment your workflow, being locked into a specific word processor can be restrictive. Especially if you are inclined to or interested in writing your own tools and automation routines. If you feel that you will never have such an inclination or interest, then sticking with MS Word is probably the way to go for you.

Modularity

In order to build a stronger case for Latex, I think we first have to sell the idea of modularizing documents. Modularity is an established idea in systems engineering. In many (or even most) situations it’s favourable to design systems composed of interchangeable and somewhat independent modules. In software development the choice of adopting modularity is usually a no-brainer. For example, it’s one of the big selling points for microservice-based architectures.

So how does modularity apply to scientific writing? Well, if you have played around with markup languages you know that figures and tables are usually in separate files, and they are pulled in using commands like \input or \includegraphics. This encourages dividing the document into smaller chunks (or modules), which is more manageable, especially when working with longer documents. While word processors can be used in a modular way to an extent, they primarily encourage the writing of monolithic documents. If you’ve ever had to compile a report from several .docx files into one main document, you know that this can be a nightmare in MS Word. I guess it is somewhat ironic that, if you take a look inside a .docx, it is a modularized XML based structure.

In the Latex workflow, we can start to extend modularity. In practical terms, documents are composed of different categories of modules. In my view there are three categories. Firstly, documents naturally contain content such as text, figures and tables. I’ll call these content modules. Secondly, there are generative modules, which are basically scripts and other pieces of software that produce the markup for the text, figures and tables. For example, the .awk script I wrote about for generating tables out of .csv files is a generative module. I prefer to keep the software related to analysis separate from the documents. I usually use .csv as an intermediate file format since it is easy to work with. You could draw the line elsewhere and, for example, generate the .tex or .tikz directly as the output of your analysis. With figures, I tend to prefer using .pdf as it is accepted by most journals. Sometimes using a .csv based “recipe” for figures is a good idea, but this depends on the complexity of the figure and analysis etc. Usually I try to configure my figure generation so that the figure has the same font as the manuscript text.
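To make the idea of a generative module concrete, here is a minimal sketch in Python (swapped in for the .awk script mentioned above, which is not reproduced here). It turns .csv text into a plain tabular environment. The function names and the sample data are hypothetical, and the column layout (all left-aligned, first row treated as the header) is an assumption.

```python
import csv
import io

# Characters with special meaning in Latex, mapped to their escaped forms.
# This is a deliberately small subset, not a complete escaping table.
LATEX_ESCAPES = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape_latex(text):
    """Escape a handful of Latex special characters in a table cell."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def csv_to_tabular(csv_text):
    """Convert CSV text into a tabular environment (first row = header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("empty CSV input")
    n_cols = len(rows[0])
    lines = [r"\begin{tabular}{" + "l" * n_cols + "}", r"\hline"]
    for i, row in enumerate(rows):
        lines.append(" & ".join(escape_latex(cell) for cell in row) + r" \\")
        if i == 0:
            # Rule between the header row and the body.
            lines.append(r"\hline")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data; in practice this would be read from a file.
    print(csv_to_tabular("sample,mass_g\nA,1.5\nB,2.0\n"))
```

The output can be written to a separate .tex file and pulled into the manuscript with \input, so the table is regenerated whenever the .csv changes.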

The final category is meta-modules, which contain things like references, notes and comments. These are the modules that are not strictly speaking content but appear alongside it. References are wonderfully easy to work with in Latex. I use Zotero with the better-bibtex plugin. Notes are also fairly easy to work with and add. I usually have the notes in the .tex source file as comments. The ability to comment and uncomment sections of the .tex file has become invaluable to me. Some prefer to have notes in a separate folder, for example to keep the literature review apart, but I don’t usually bother with this.

The trickiest type of meta-module is comments. Here I’m referring to comments from your co-authors. While you could just write the comments into the .tex source file, this can get messy. Luckily, Git and hosting services for Git repositories help us here. I’ll discuss this in further detail in the section on version control.

When all the content, generative and meta modules are interchangeable and independent they can be easily worked on and improved individually. This has some similarities to software development methodology in the sense that it facilitates separation of concerns. While figures and tables are nearly always developed and generated separately from the rest of the document, using Latex means there’s little to no effort needed when integrating new changes from different sources into new revisions of the main document. The automation potential means that tedious manual work can be minimized.
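One way to picture the automation mentioned above is a small check that lists which generated outputs are older than the data they are built from, so only those generative modules need to be rerun. This is a sketch under assumptions: the helper names and the idea of pairing each source file with one output file are illustrative, and a tool like make does the same job more thoroughly.

```python
from pathlib import Path


def is_stale(source, target):
    """Return True if target is missing or older than source."""
    source, target = Path(source), Path(target)
    if not target.exists():
        return True
    return target.stat().st_mtime < source.stat().st_mtime


def stale_targets(pairs):
    """Given (source, target) pairs, return the targets that need rebuilding."""
    return [target for source, target in pairs if is_stale(source, target)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    pairs = [
        (Path("data/results.csv"), Path("tables/results.tex")),
        (Path("data/results.csv"), Path("figures/results.pdf")),
    ]
    # Only check pairs whose source file actually exists.
    existing = [(s, t) for s, t in pairs if s.exists()]
    for target in stale_targets(existing):
        print(f"needs rebuild: {target}")
```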

Version control

One major benefit of working with plain text documents is that they work with Git. While MS Word has fairly sophisticated version control features like Track Changes, I prefer Git (I’m biased here). The distributed version control style really works quite well with writing documents.

There are a couple of positive byproducts that come out of using the git-add-commit-push flow. The first one is that in order for Git to work well you should write one sentence per line. For me this is great since it caps the max length of a sentence and thus boosts readability. Sometimes longer sentences bleed over, but in general I manage to stick to one-liners. Of course, this rule might break between different editor setups and line wrapping configurations.
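For text that was not written this way from the start, the one-sentence-per-line rule can be approximated with a short script. The following is a rough sketch, not a robust sentence splitter: it assumes a sentence ends with ".", "!" or "?" followed by whitespace and an uppercase letter, so abbreviations such as "e.g. Figure" will be split incorrectly and need manual cleanup.

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace and an
# uppercase letter. Abbreviations before a capitalised word will be
# split by mistake.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def one_sentence_per_line(paragraph):
    """Reflow a paragraph so that each sentence sits on its own line."""
    # Collapse existing line breaks and repeated spaces first.
    text = " ".join(paragraph.split())
    return "\n".join(_SENTENCE_END.split(text))


if __name__ == "__main__":
    sample = (
        "Git likes short lines. One sentence\n"
        "per line helps! Diffs stay small."
    )
    print(one_sentence_per_line(sample))
```

With this layout a change to one sentence shows up in the diff as a change to one line, rather than to a whole re-wrapped paragraph.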

The second positive byproduct is that using Git encourages focusing on smaller sections at a time and giving each increment of work a title (in the form of a commit message). I have the negative habit of pointlessly scrolling up and down a document instead of focusing on a single paragraph or sentence. At least for me, the target of a commit helps to focus my efforts on a single topic. This approach has some similarities to the Pomodoro technique, but I don’t usually use a timer. Additionally, it helps to keep motivation up when you see the git log of what you (or your co-authors) have been doing.

So I mentioned that a functionality that replaces the MS Word comment sidebar can be achieved with Git. Well, if we have our repository on a hosting service like Github or Gitlab (or even Gitea) we can harness the power of merge requests and the snazzy UI to merge proposed changes. You can also use something like latexdiff or git-latexdiff to visualize the diff between revisions. Furthermore, modern online repo hosting platforms offer a lot of functionality for discussion and collaboration around issues, merges and commits. I think this is superior to the “docx as attachment” approach, since with an online git repo everybody can see and discuss the changes and thus avoid duplicate or contradicting change proposals. Of course, Track Changes can handle many of the version conflicts and it’s GUI-y by default. I won’t propose a specific communication strategy since the platforms support a variety of approaches. I recommend outlining “How to contribute” in the README.

Some scenarios

So how does my proposed workflow trump MS Word in practice? I’ll try to describe three scenarios.


1. Need to update multiple figures/tables

MS Word: You need to individually update each table and figure. You can make your life a tad bit easier by linking to figures instead of embedding them.

Latex+Git: If you’ve set up your generative modules well, you just need to rerun them in order to generate the new figures and tables. You can even automate the regeneration.


2. Co-author #1 says this way and co-author #2 says that way

MS Word: You have to mediate between the two (either in person or via email). The two might discuss between themselves, but if you are the first author you might want to be privy to these conversations. This can lead to really long email threads. In the end you have to resolve the conflicts, and unless you have enabled Track Changes this can be fairly tedious.

Latex+Git: The two co-authors can work on their own forks via the repo hosting service. They see each other’s edits, and they see which changes you merge with master. The changes can be discussed on a per revision basis. Everybody sees the comments and is able to discuss them (and the discussion is not locked to an awkward side bar or email). Issues can be assigned to different authors if necessary.


3. You have a large research group and you need to produce a final report

NOTE: I have never been able to convince a large research group to adopt a Git+Latex workflow, so this scenario has only played out in my imagination.

MS Word: Everyone writes their own .docx files. They can be hosted on some cloud service but they are still separate entities. Some poor soul has to combine these into one report and try to maintain consistency.

Latex+Git: The repository for the project is visible to everyone on the hosting service. Ideally the main final report can be built (to pdf / html) between each commit. Something similar to Readthedocs would be really interesting. A hyperlinked online report is, in my opinion, a better format for such documents than a single .pdf. With this workflow it is trivial to generate both.


These scenarios are simplified and one could go into lengthy analysis about how to achieve better functionality with either approach or variants thereof. I guess for me at least, the singular main advantage is that Latex integrates into the *nix-verse way better than MS Word. This opens up a treasure trove of possibilities regarding automation and expandability.

The non-Latex, non-Git co-author dilemma

I saved this one for last because it is a killer. It is inevitable that you will work with people who refuse to learn the workflow and tools proposed above. While excluding these people entirely from being your co-authors is seductive, it is probably not a sustainable strategy. So for people “who don’t have time for our shit” we have to have a way of converting back and forth between .docx and our main format .tex. I’m afraid that for this there is only a spectrum of suboptimal solutions and nothing that really hits the mark (that I’ve found thus far at least).

First of all, when you send your manuscript to your co-author you should create a git branch for that revision / comment round. That way it is easier to merge later and you can keep working on your master branch.

For conversion there are a couple of options that I’ve found produce moderately workable outputs. The first option is to simply open the pdf output of pdflatex in MS Word. I know this sounds horrible, but MS Word can actually convert the .pdf to .docx reasonably well. You might need to do some clean up (especially if you have math expressions) but it works ok. The road back to .tex is more difficult and usually requires a bunch of manual editing.

The second option is to use pandoc to convert between .tex and .docx. Raphael Kabo wrote a nice outline on such a conversion workflow (based on markdown instead of Latex). Using a reference .docx and citeproc with a csl file are good practices here. I don’t have access to a Windows installation right now, so I can’t play around with these. In my own experience, unless your .tex is really simple the conversion will not be great. Markdown seems to be a better intermediary format than .tex with pandoc. Taking advantage of Track Changes on these comment rounds is useful. I recommend “digesting” the revisions inside MS Word. I usually pretty much accept all revisions and then drop the ones I don’t agree with during merging. That way they propagate to Git. After applying the changes I convert to .tex, clean up, apply the changes onto the branch I created when sending the document and merge.
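As a rough illustration of the pandoc route, the sketch below assembles a .tex to .docx command with a reference .docx and citeproc with a csl file. All file names are hypothetical, and the helper function is mine. The flags are standard pandoc options, but note that --citeproc is the spelling used by newer pandoc releases; older releases used the separate pandoc-citeproc filter instead, so check which one your installation expects.

```python
import shlex


def pandoc_docx_command(tex_path, docx_path, reference_docx=None,
                        bibliography=None, csl=None):
    """Build a pandoc command line for converting .tex to .docx."""
    cmd = ["pandoc", tex_path, "--from=latex", "--to=docx",
           "--output", docx_path]
    if reference_docx:
        # Take styles (fonts, headings, margins) from an existing .docx.
        cmd.append(f"--reference-doc={reference_docx}")
    if bibliography:
        # Resolve citations during conversion.
        cmd += ["--citeproc", f"--bibliography={bibliography}"]
        if csl:
            cmd.append(f"--csl={csl}")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = pandoc_docx_command(
        "manuscript.tex", "manuscript.docx",
        reference_docx="reference.docx",
        bibliography="references.bib",
        csl="journal.csl",
    )
    # Print the command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```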

Having to accommodate .docx conversions is definitely the biggest source of hassle. I think there might be a need for a tool here, maybe something similar to what J. Alexander Branham described in his post on importing from word with pandoc: a utility that would produce a patch that could be applied directly to the .tex files, again following the one sentence per line rule. This utility would also have to collect and place the comments from the .docx in the right sections of the text.
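The patch-producing part of such a utility could look something like the sketch below. It covers only the easy half of the problem: it assumes the co-author's revised text has already been converted back to .tex with one sentence per line, and it does nothing about .docx comments. The function name and the file name are hypothetical.

```python
import difflib


def tex_patch(original_tex, revised_tex, filename="manuscript.tex"):
    """Return a unified diff that turns original_tex into revised_tex."""
    diff = difflib.unified_diff(
        original_tex.splitlines(keepends=True),
        revised_tex.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff)


if __name__ == "__main__":
    original = "Latex is archaic.\nGit works with plain text.\n"
    revised = "Latex is a bit archaic.\nGit works with plain text.\n"
    # An empty string means the two versions are identical.
    print(tex_patch(original, revised), end="")
```

Because each sentence is its own line, every hunk in the resulting diff corresponds to a sentence the co-author actually touched, which keeps the review on the Git side manageable.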

What about the editor?

One important thing I haven’t mentioned thus far is editor-related advantages. I use vim and I have a strong preference for it. Because of this infatuation, I would recommend against both off- and online Latex editors such as TexStudio or Overleaf. While I’m no vim master, I think it boosts my text production and editing capability. The vimtex-plugin is really great. I think most people would greatly benefit from picking a performant editor (e.g. Vim, Emacs, or even Kakoune) and becoming really good at using it.