Evidence for an early prokaryotic endosymbiosis: I don't believe it!

I am thankful to Nick for the previous post, which eerily mirrors our own experience a year or two ago when trying to get a comment published on a problematic paper on flagellar evolution. Our experience—and his post—explains Nick's negative response the other day, when I suggested we marshall evidence for a "comment" on a paper by James Lake published this week in Nature!! So, let us be thankful for the power of blogging to get comments out (see Matzke on the flagellar paper) without having to negotiate a barrage of editorial and other hurdles. Here I outline a few problems I have with Lake's paper, in the hope that others flesh them out by adding comments here and even take up the challenge of submitting a formal response to the paper to Nature (I don't have time or energy for this).

The paper in question is Evidence for an early prokaryotic endosymbiosis. In this paper, Lake claims to have evidence for an ancient endosymbiotic origin of Gram-negative bacteria from the fusion of two Gram-positive lineages, the Clostridia and the Actinobacteria.

Lake starts his argument by lining up a list of what he sees as five "natural taxa": the Archaea (R), the Actinobacteria (A), the Bacilli and relatives (B), the Clostridia and relatives (C), and the double-membrane prokaryotes (D). To me, this choice of high-level taxa seems anything but "natural", because to the jobbing microbiologist, Clostridia and Bacilli belong together as spore-forming Gram-positives and, according to traditional taxonomies, belong together in the phylum Firmicutes. In a quick and dirty look, I cannot find any published work evaluating in detail whether the sporulation apparatus of these two groups is likely to date from their common ancestor or whether one group acquired it by horizontal gene transfer, but my intuition would be for monophyly of both taxa and apparatus, with Firmicutes as a "natural taxon".

In one confusing paragraph, Lake makes much of the similarities between the inner membrane of Gram-negatives and what he calls "the outer membrane" of Gram-positives. In fact, given that Gram-positives have only one membrane, calling it an "outer membrane" is misleading. It is more usually called the "cell membrane" or "plasma membrane". And no one doubts that this Gram-positive cell membrane is homologous to the inner membrane from Gram-negatives—it is the outer membrane of Gram-negatives that needs explaining!

Although he is rather obscure on this point, Lake seems to be claiming that the outer membrane originates from the cell membrane of one or other of his two symbiotic partners, even though there is no evidence that the Gram-negative outer membrane (or its proteins) resembles the cell membranes from either of these Gram-positive taxa (where is the LPS in the cell membranes of Actinobacteria or Clostridia?).

But it is the evidence from analysis of sequences that Lake puts forward for his hypothesis that troubles me most. Rather than look at individual protein sequences in individual genomes, Lake chooses to look simply at the presence or absence of protein domains within his chosen taxa, gleaning his data from the Pfam database. He uses seductively simple binary patterns of presence (+) and absence (-) as the basis for his phylogenetic analyses.

But even a cursory glance at the Supplementary Data reveals two problems:

  1. Double counting of some domains. Treating presence or absence of domains as independent character states might be acceptable if there were no stable links between domains, but when two domains commonly occur together in the same protein or pathway or macromolecular complex, it strikes me that this should be treated as one piece of data not two, e.g. MAAL_C PF07476 Methylaspartate ammonia-lyase C-terminus and MAAL_N PF05034 Methylaspartate ammonia-lyase N-terminus which are listed in the first table in the Supplementary Data among "the 99 Pfams selected for (A,B,C,D,R) equal (-,-,+,+,+)." A quick glance suggests that there are many such examples in the tables, but whether this alone is enough to undermine the conclusions is unclear to me.
  2. Asymmetry in what counts as presence in one taxon versus what counts as presence in another—not all + signs are equal! Let's take a quick look at the first entries in each of Lake's tables.
    Table S2A. The 99 Pfams selected for (A,B,C,D,R) equal (-,-,+,+,+). A2M_N PF01835 Alpha-2-macroglobulin family N-terminal region. Click on the link to access the Pfam entry and then click on the "species" link in the menu on the left hand side to see the phylogenetic distribution of the domain: hundreds of species of double-membraned bacteria have the domain, but only two species of Archaea and two species from Clostridia (in Lake's sensu lato).
    Table S2B. The 15 Pfams selected for (A,B,C,D,R) equals (-,+,-,+,+). Dna2 PF08696 DNA replication factor Dna2. Only one example of this domain in the Bacteria (in Pseudoalteromonas haloplanktis strain TAC 125)! And four in a halophilic group of Archaea, but none among the Bacilli, so I cannot work out how Lake arrives at his -,+,-,+,+ designation!
    Table S2C. The 8 Pfams selected for (A,B,C,D,R) equals (-,+,+,-,+). DUF1002 PF06207 Protein of unknown function (DUF1002). A couple of dozen examples each in Bacillus and Clostridia, but just four in the Archaea.
    Table S2D. The 174 Pfams selected for (A,B,C,D,R) equals (-,+,+,+,-). 3D PF06725 3D domain. Does indeed have dozens of examples in each of the three phyla.
    Table S2E. The 62 Pfams selected for (A,B,C,D,R) equals (+,-,-,+,+). Adenosine_kin PF04008 Adenosine specific kinase. Dozens of examples in D and R, but just a handful in A, largely confined to Mycobacterium tuberculosis complex.
    Table S2G. The 73 Pfams selected for (A,B,C,D,R) equals (+,-,+,+,-). ABC_membrane_2 PF06472 ABC transporter transmembrane region 2. Hundreds in D, dozens in A, but only one in C!
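
The double-counting problem in point 1 is easy to check mechanically. Here is a minimal Python sketch, assuming that Pfam names differing only by an _N or _C terminus suffix usually describe two halves of the same protein. That suffix rule and the function name are my own illustrative assumptions, not anything Lake or Pfam defines; a proper check would look at domain co-occurrence in the actual proteins. The three domain names are ones quoted above from Table S2A.

```python
from collections import defaultdict

# Hypothetical heuristic: treat FOO_N and FOO_C as two parts of one protein family.
TERMINUS_SUFFIXES = ("_N", "_C")

def group_linked_domains(names):
    """Group domain names that differ only by an _N/_C terminus suffix."""
    groups = defaultdict(list)
    for name in names:
        stem = name
        for suffix in TERMINUS_SUFFIXES:
            if name.endswith(suffix):
                stem = name[: -len(suffix)]  # strip the terminus suffix
                break
        groups[stem].append(name)
    return dict(groups)

# Domain names quoted in this post from Table S2A.
domains = ["MAAL_C", "MAAL_N", "A2M_N"]
groups = group_linked_domains(domains)
print(len(domains), "table rows ->", len(groups), "independent characters")
# prints: 3 table rows -> 2 independent characters
```

Run over a whole table, the gap between the number of rows and the number of groups would give a rough idea of how much the character count is inflated.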

OK, so, I haven't got to the end of the tables, but already it is clear that these tables contain mistakes and gross asymmetries in domain distributions.

Why do these asymmetries matter? Well, Lake uses his patterns of + and - designations to build his hypothesis. But inherent in Lake's + (present) and - (absent) designations for domains in taxa is the assumption that these patterns represent the ancestral state for the taxa, i.e. more or less reflect the domain distributions in the last common ancestor of each taxon.

Yet descent from that common ancestor is not the only explanation for the presence of a domain in a taxon—horizontal gene transfer might also account for the presence of the domain.

Now, I guess a working assumption of vertical descent might be justified if each taxon that had a domain contained large numbers of examples of that domain scattered across many species. BUT this is not true of the data Lake is presenting here: in some cases, just one example of a domain occurs in one protein in just one species in a whole "natural taxon". In such circumstances, one's working assumption has to be that this domain has been acquired by that one species through horizontal gene transfer, rather than inherited from the whole taxon's last common ancestor and then lost by all species in it bar one!
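
To see how fragile a + designation can be, here is a minimal Python sketch that re-calls a presence/absence pattern from per-taxon species counts, once with "any occurrence counts" and once with a stricter cut-off. The function name, the cut-off of five species and the figure of 300 (standing in for "hundreds") are illustrative assumptions of mine; the two counts of 2 are the ones quoted above for A2M_N. None of this is Lake's actual method.

```python
TAXA = ("A", "B", "C", "D", "R")  # Lake's five taxa, in his order

def call_pattern(species_counts, min_species=1):
    """Return a (+/-) pattern; a taxon is '+' only if enough species carry the domain."""
    return tuple(
        "+" if species_counts.get(taxon, 0) >= min_species else "-"
        for taxon in TAXA
    )

# A2M_N (PF01835): 2 species in C and 2 in R as quoted above;
# 300 is a placeholder for "hundreds" in D, not a real Pfam count.
a2m_n = {"A": 0, "B": 0, "C": 2, "D": 300, "R": 2}

print(call_pattern(a2m_n))                 # ('-', '-', '+', '+', '+')
print(call_pattern(a2m_n, min_species=5))  # ('-', '-', '-', '+', '-')
```

With any occurrence counted as presence, the domain reproduces the (-,-,+,+,+) pattern from Table S2A; ask for even a handful of species per taxon and it collapses to a domain found only in D. Where exactly the cut-off should sit is a judgment call, but a pattern that flips this easily is weak evidence about a last common ancestor.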

And, as with many broad-reaching sequence-based analyses, there is a house-of-cards structure to the argument here, so if these tables in the supplementary data cannot be trusted, nor can the final conclusions, which now have to be regarded as "unsafe".

Anyhow, the above counts as just a quick and dirty look at this paper, and I may have missed some important points or got my wires crossed somewhere along the line. But I hope that others will now take an even closer look at it.

But for me, what I have seen already evokes a well-known response from my grumpy alter ego, Victor Meldrew: "I don't believe it!", even though it has supposedly passed peer review in Nature!