Molecular Epidemiology of Spike Protein Sequences in 2019-nCoV: Origin Still Uncertain and Transparency Needed

Here are the stats for readership of this blog for the last two weeks. Nearly 200,000 hits.

OUR INITIAL ASSESSMENT that the available 2019-nCoV sequences contained an inserted stretch of nucleotide sequence, upstream from the canonical position of the Spike (or Crown) Protein Sequence in the human samples, that was similar to pShuttle-SN has been under useful and productive scrutiny since we first published that we, unlike other labs, were in fact able to find a match between the “middle fragment” and sequences in non-viridae databases. The match to a pShuttle-SN vector technology, which led to the assessment that perhaps the sequence was the product of an attempt to modify a bat coronavirus in the lab, has raised controversy, but please note that it was not the only evidence of interest. We know of viruses to which the SARS protein gene sequence has in fact been added to study the transmission of SARS virus; it has also been added to adenovirus to create a hoped-for vaccine. So it is not beyond reason to consider whether the virus currently estimated to be infecting >200,000 people in China might be a product of laboratory manipulation. The reporting of the odd out-of-place sequence in the study that proposed recombination was also important, and the divergence of the nCoV Spike protein compared to the rest of nCoV and the bat coronaviruses was also compelling.

The specific mechanism by which those factors could have come about is unclear. They could also have been due to unwitting recombination between a SARS virus being studied in a lab that was also studying or housing animals with bat coronaviruses, or to recombination in a human infected with both. The scientific community ruled out the possibility of natural recombination in the wild, whereas I preferred to leave a 5% chance that it might have been caused by a recombination event in the wild. Importantly, I still have not ruled that out.

The official Chinese position, in an article published by Dr. Shi, a “Chinese Academy of Sciences researcher in the field of bioinformatics”, is that the viruses are too different in comparison to other bat coronaviruses across the genome, with random, non-patterned changes, and that there are no endonuclease sites in bat coronaviruses and thus pShuttle-SN or other endonuclease technologies could not have been used, supporting recombination in the wild. The latter statement is demonstrably incorrect: there are many endonuclease sites in bat coronavirus sequences, as determined using a bat coronavirus most similar to the sequence clade in question (trees below).
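Whether a given sequence contains endonuclease recognition sites is easy for any reader to check. The following is only a minimal sketch, not the procedure used for this post: the three enzymes are a small hand-picked set of well-known examples, and the toy sequence is made up for illustration rather than taken from any deposited genome.

```python
# Recognition motifs of three common restriction endonucleases.
# This short list is an assumption of the sketch; real scans use far larger enzyme sets.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}

def find_sites(sequence, sites=RECOGNITION_SITES):
    """Return {enzyme: [0-based start positions]} for every motif found."""
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA letters
    hits = {}
    for enzyme, motif in sites.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start)
            start = seq.find(motif, start + 1)  # step by 1 so overlapping hits are kept
        if positions:
            hits[enzyme] = positions
    return hits

# Hypothetical toy sequence, for illustration only.
toy = "ATGGAATTCCGTGGATCCAAAGAATTCTT"
print(find_sites(toy))
```

Run against a real FASTA record, the same function would list where each motif occurs; the presence of such sites by itself says nothing about how a sequence arose.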

Dr. Shi is correct that there are scattered differences, but evolution need not pattern anything in an orderly manner in RNA viruses, which are fast evolving, and we do not yet know whether a recombination event might have occurred or been used, or whether any such recombination occurred outside the Spike protein coding sequence, within the Spike protein coding sequence, or perhaps a combination of both. Evolution does not care to “pattern” things for human consumption; evolution brings forth viruses that work to create more viruses. When recombination is suspected, artificial or real, we study evolution at the sequence level best by inheritance patterns of motifs; overall rates are, as we have seen and will see below, frustratingly limiting.

Before we look at the evolutionary trees, I want to stress that I have published, and will repeat, that given the mass casualties in China and the prospect of such events around the world, keeping the possibility of one or more recombination events on the table, or even a laboratory origin of nCoV2019, is important specifically and exclusively for scientific and humanitarian purposes. As human societies often do, people will want to rush to point fingers of blame, and my position is that this holds whether it’s a vaccine type gone wrong, or even a bioweapon that backfired. Let’s not be hypocritical; the US and many other countries of course have been studying the SARS spike protein for vaccines and have of course been conducting research on bioweapons. That’s no longer the point. The point, and the only point that matters, is that we have a massive humanitarian crisis in China and therefore (1) any available data on the pathophysiology of this virus, man-made in part or in toto or not, must be brought forward, (2) China needs aid to bring R0 down below 1.0, (3) the rest of the world needs to act now to stop the spread of the virus by behavioral changes including a routine mass effort for sanitization of common surfaces and self-social isolation (don’t touch other people or your face in public). The world cares, and should care, most about putting the fire out, not who or what started it.

However, Science might help provide clues for treatments and perhaps for rapid diagnosis. As promised to many who have contacted me, we have completed a more thorough (but unfortunately not exhaustive) analysis of Spike protein sequences with sequences that are at hand. A note of caution: some of the sequences may be different from the ones we analyzed due to what we were informed to be “database update errors”, whatever those include. We were contacted by a person who evidently knew about a “natural” bat coronavirus isolated in July 2013 at the Institute of Virology in Wuhan, China, who pointed us to a sequence available in NCBI’s Nucleotide database but that had been uploaded only in January 2020. We do not know when the sequencing was done, whether it was a frozen sample just recently sequenced, or whether it was isolated from laboratory-propagated viral lines in human cell lines. Oddly, the original gap that we found, and that was reported in the peer-reviewed study that pointed to potential recombination in snakes, and that a second, independent peer-reviewed study also found and could not match (and called a “middle fragment”), can no longer be found using the same accession numbers. I am not certain of who curates NCBI’s databases at this time, but NCBI should have a record of any evidence of un-annotated updates and I leave it to them to sort this issue out.

To help elucidate possible relationships among the available Spike proteins, based on current sequences, including the ins1378 segment, I present two phylogenetic trees, derived at and rendered using The tree-generating algorithm was Neighbor Joining (NJ), invented by my postdoctoral mentor Dr. Masatoshi Nei, and Dr. Naruya Saitou in 1987. The tree was estimated using all variable positions and raw differences. The Jukes-Cantor model was not used because it treats all nucleotide substitutions as equally frequent, and so may overweight some substitutions and underweight others relative to how often they actually occur during RNA virus evolution. N=1,000 bootstrap iterations were used to assess the confidence of the placement. Caveat: any within-sequence recombination is masked by the assumption that the process is tree-like; gappy areas were retained and not force-aligned. The full alignment is available here.

Size = 23 sequences × 1087 sites
Method = Neighbor-Joining
Distance = Raw difference
Bootstrap resampling = 1000
Alignment id = .200206212318090CHjzDGoNBMcL1glmdiZ7Plsfnormal
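For readers unfamiliar with the “raw difference” setting: it is the proportion of aligned sites at which two sequences differ (the p-distance), whereas the Jukes-Cantor correction rescales that proportion as d = -(3/4) ln(1 - 4p/3). The sketch below shows both on two made-up sequences. It is not the code behind the trees in this post; in particular, skipping gapped sites pair by pair is an assumption of the sketch, and the tree-building tool may handle gaps differently.

```python
import math

def p_distance(a, b, gap="-"):
    """Raw difference: fraction of compared sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # pairwise deletion of gapped sites (an assumption of this sketch)
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differing / compared

def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined once p reaches 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical toy alignment: 20 columns, one gap, two mismatches.
s1 = "ATGCTAGCTA-GCTAGGATC"
s2 = "ATGCTTGCTAAGCTAGCATC"
p = p_distance(s1, s2)
print(round(p, 4), round(jukes_cantor(p), 4))
```

The corrected distance is always at least as large as the raw one, and the gap between them grows as sequences diverge, which is one reason the two settings can give different branch lengths and bootstrap support.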

Tree 1. Bootstrap values.

Tree 2. Same estimated tree with branch lengths.

Clearly we see that the Spike protein from the 2019-nCoV human sequence is most similar to the sequence isolated from bat feces at the Wuhan Institute of Virology in 2013, deposited in January 2020. The next closest sequence is from the Institute of Military Medicine, Nanjing Command.

Looking at the raw distances, compared to those for the overall genome, the Spike protein appears to be more evolutionarily labile; that is, there are more variable sites and the evolutionary distance is greater in the Spike protein-encoding sequences. The great distance between the Wuhan sequences and the other bat-like coronaviruses is distinct for the Spike protein, and contrasts with other published results. There are plenty of variable sites to have moderate confidence in this result (BSV = 82). However, compared to most bootstrap values published for coronavirus sequences and most of those in this tree, the value of 82 points to some signal other than inherited variation that covaries well with the rest of the inherited variation, just as in the original analysis with the low bootstrap value. (The higher bootstrap value here compared to the full genomic analysis placing 2019-nCoV more within the bat coronaviruses, albeit with lower bootstrap values, is likely due to a number of factors, including the use of a Jukes-Cantor constraint on the model of evolution in the original analysis.)

The data do not support a 1:1 relationship between pShuttle-SN (as published in 2005) and the SARS-like spike protein in 2019-nCoV, and I never posited that relationship. I merely pointed out that it was similar to it, when no one else could match the middle fragment to anything. But the pShuttle-SN has ALSO been evolving in the lab, no doubt, and I would like to see a newly deposited sequence in NCBI’s Nucleotide database. Other vector tech has no doubt been used by other labs working with the SARS spike protein.

Parsimony (Occam’s Razor) would, with the existing sequences, tend to lead to the conclusion that the Spike protein is there simply because bat coronaviruses have Spike proteins. Does that mere fact lead to the assumption that a bat coronavirus never underwent recombination in nature or in the lab, whether experimentally or accidentally? No, it does not. The oddities in the behavior of the sequence data “updates” deserve further scrutiny. Why would two peer-reviewed publications, one from China, mention a middle fragment that could not be aligned? These questions require transparency.

Nevertheless, the Spike protein relationships still seem to tell a different story of the relationship of 2019-nCoV and related coronaviruses in Wuhan compared to published full genomic analyses, and that difference might be very important. They stand out as different, distinct. Some important limitations here are (1) all trees are estimates, not observations, and should not be used as “proof” of anything (science does not deal with “proof”); (2) the method assumes a tree-like relationship and cannot rule out recombination origins; (3) the method is based on limited data in terms of samples (taxon sampling). So, as always, more data may clarify.
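Limitation (2) can be probed with a simple exploratory check: compare a query sequence to two candidate relatives window by window and see whether the closer relative switches along the alignment. The sketch below is illustrative only. The window and step sizes are arbitrary, the three sequences are artificial, and a switch in similarity is a hint worth following up with dedicated recombination tests, not evidence on its own.

```python
def window_identity(a, b, window=10, step=5, gap="-"):
    """Fraction of identical sites between two aligned sequences in sliding windows.

    Returns a list of (window_start, identity) tuples; identity is None when
    a window contains no ungapped sites.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(a) - window + 1, step):
        pairs = [
            (x, y)
            for x, y in zip(a[start:start + window], b[start:start + window])
            if x != gap and y != gap
        ]
        identity = sum(x == y for x, y in pairs) / len(pairs) if pairs else None
        results.append((start, identity))
    return results

# Artificial data: the query matches parent_a in its first half
# and parent_b in its second half.
parent_a = "AAAAAAAAAAAAAAAAAAAA"
parent_b = "CCCCCCCCCCCCCCCCCCCC"
query = "AAAAAAAAAACCCCCCCCCC"

vs_a = window_identity(query, parent_a)
vs_b = window_identity(query, parent_b)
for (start, ia), (_, ib) in zip(vs_a, vs_b):
    print(start, ia, ib)
```

In this toy case the query is identical to one relative at the start and to the other at the end, which is the kind of crossover pattern a single tree cannot represent.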

Calls to Action for Scientists

I strongly encourage those who can to post their own analyses of the fasta file, or their own alignments, etc., to the comments, especially if they are more relevant to the questions of recombination. Please understand if we cannot comment on each and every post given the flurry of activities ongoing at IPAK about this and other pressing issues.

  1. Detailed sequence-level analyses are needed to determine if there has been recombination or other editing of these sequences both inside and outside of the Spike protein region. Is the number of synonymous and non-synonymous substitutions what would be expected under a natural inheritance model? Is there any test that could be done to detect very important changes that might be adaptive to 2019-nCoV?
  2. Analyses capable of detecting recombination, or ruling it out, should be applied and published ASAP. Feel free to post links to any such analyses in the comments.
  3. Also, please post your own interpretation and comments, and reference other information as may be relevant.
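As a starting point for item 1, here is a rough sketch of tallying synonymous versus non-synonymous codon differences between two aligned, in-frame coding sequences using the standard genetic code. It is deliberately crude: it counts whole codons rather than individual substitutions and does not normalize by the number of synonymous and non-synonymous sites, as established methods do. The two short sequences are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def count_codon_changes(a, b):
    """Count differing codons as (synonymous, nonsynonymous).

    Both inputs must be in-frame, aligned coding sequences of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(a) != len(b) or len(a) % 3:
        raise ValueError("need equal-length, in-frame sequences")
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    syn = nonsyn = 0
    for i in range(0, len(a), 3):
        ca, cb = a[i:i + 3], b[i:i + 3]
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1  # different codon, same amino acid
        else:
            nonsyn += 1  # different codon, different amino acid
    return syn, nonsyn

# Hypothetical toy sequences.
ref = "ATGGCTCTGAAA"  # Met Ala Leu Lys
alt = "ATGGCCCTGAGA"  # Met Ala Leu Arg
print(count_codon_changes(ref, alt))
```

A serious analysis would use a site-normalized estimator and compare the resulting ratio across the Spike region and the rest of the genome.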

All deposited sequences will continue to be analyzed as we monitor the situation.



13 thoughts on “Molecular Epidemiology of Spike Protein Sequences in 2019-nCoV: Origin Still Uncertain and Transparency Needed”

  1. Thanks so much for all the incredible work doc, in case it’s helpful I tie a lot of the technical elements – including your work – and the specific personalities involved here:

    “Simply and horribly, this is likely to be yet another Chernobyl or Fukushima – a catastrophic illustration of mankind’s hubris and intransigence clashing with Nature, and fate once again reaping the once unimaginably tragic toll.”


  2. Oh and if this is the 2013 CV that was a match, this is certainly an interesting bread crumb:

    “Another Chinese virologist, Xing-Yi Ge, appears as an author on the 2016 UNC paper on the engineered hyper-virulent bat coronavirus and is also now attached to the lab in Wuhan. Previously in 2013, he’d successfully isolated a SARS-like coronavirus from bats which targets the ACE2 receptor, just like our present virus”

  3. So far, the nCoV-2019 has been reported to share 96.3% overall genome sequence identity to the Bat RaTG13 genome. They have confirmed that this novel CoV uses the same cell entry receptor, ACE2, as SARS-CoV. However, the S1 Receptor Binding Domain (RBD) of the nCoV-2019 genome was noticeably divergent between the two at amino acid residues 350 to 550. We aimed to identify coronaviruses related to nCoV-2019 in viral metagenomics datasets available in the public domain. In a recently published dataset describing viral diversity in Malayan pangolins (PRJNA573298 16), we used VirMAP 15 to reconstruct a coronavirus genome (approximately 84% complete from samples SRR10168377 5 and SRR10168378 1) that shared 97% amino acid identity across the same RBD segment. This result indicates a potential recombination event for nCoV-2019.

    pangolins were sold in the market as well

    The receptor binding protein spike (S) gene was highly divergent to other CoVs with less than 75% nt sequence identity to all previously described SARSr-CoVs except a 93.1% nt identity to RaTG13. The S genes of 2019-nCoV and RaTG13 are longer than those of other SARSr-CoVs. The major differences in 2019-nCoV are the three short insertions in the N-terminal domain, and four out of five key residue changes in the receptor-binding motif, in comparison with SARS-CoV.

  4. Question: could the divergence with pShuttle-SN be explained by a much earlier lab escape of a non-pathogenic virus, which subsequently picked up mutations, and then recombined with the bat CoV recently to produce 2019-nCoV?

    1. Most would say it’s likely that the similarity is merely due to the fact that pShuttle-SN has a coronavirus Spike protein embedded in it.

      I would say go back and read that last part of that sentence again with the question: WHY? in your mind.

      Clearly recombinant research with Spike protein has been ongoing.

      Current sequences of pShuttle versions from labs all around China and the world would be interesting to see.

  5. I was wondering if you had viewed this web page:

    This page references a pdf file for download:

    This pdf contains photos of screenshots from the Genbank of NIH that leads one to believe the Wuhan Coronavirus has a strong link to a novel ZS bat-CoV that is owned by Chinese military labs.

    I do not know if this information is real or not, but think you may be in a position to clarify matters. If this is “fake-news” so be it. If these screenshots are real, we may have a smoking gun.

  6. i’m not a scientist, so just sharing my small loophole theory in the hopes that it may trigger another thought process or…? ~ Amy Darian Ramsey ******
    Since the Coronavirus began making the news, at the beginning of January, I have been studying the biochemical and medical treatment of it, specifically in regards to herbal medicine and which methods of herbal medicine would best treat it. While preparing my herbs, and in reading other herbalist’s notes and then reading more in-depth medical articles, an idea came into my head that I wanted to share.
    The last true Pandemic was the Flu of 1918, following directly on the heels of WWI. WWI was 1914-1918. The Spanish Flu of 1918 began in January 1918.
    What occurred during WWI? A failure to ration meat at the beginning of the war, causing a severe shortage of meat, and dairy products, for the duration of the war.
    What have we witnessed over the last few years worldwide (and don’t beat me up here, I’m not advocating or denigrating either or, I love my vegetables…) but a surge in the number of practicing vegans and non meat-eaters. Before you begin to tune me out, read what I have to say about the science of this in the following paragraphs.
    When a virus such as H1N1, SARS or the current Coronavirus infects a host, it is a Viral infection which stimulates the body’s immune system to react. Following the initial immune reaction, the body then experiences a SECONDARY bacterial infection which feeds (macrophage) on the initial virus. This can be helpful in most healthy individuals. However, in SOME individuals, the secondary bacterial infection and macrophage triggers what is called a Cytokine Storm, in which the immune system overreacts, becomes overloaded, and storms the body systems with too much immune response, clogging the pathways and shutting down the systems (lungs, kidneys, arterial, adrenal, etc.).
    What causes the differentiation in normal healthy individuals between cytokine storm and no cytokine storm? The de-regulation of the cytokine pathways. When the natural cytokine pathways are ‘disrupted’, they experience a ‘misflow’ and a traffic jam occurs. Very similar to electrical signals in the brain and epileptic seizures.
    I recalled reading about Sphingolipids which are in all animal and plant cell membranes. Ceramides comprise Sphingolipids. And Ceramides and Sphingolipids are affected directly by our diet. Higher levels occur in animal based foods. Increased levels of both increase inflammatory responses in the body and also create many other metabolic issues.
    I started to wonder if, somehow, there was a flipped scenario with this New Coronavirus that interfered with the inflammatory Ceramide Cytokine pathway during the secondary bacterial infection. So I started doing a little hunting around.
    I found research detailing that the Cytokine levels of IL-1β (Interleukin-1 beta Cytokine – airway smooth muscle cell) were elevated for study participants who followed a vegetarian diet (see Cytokine Levels Table 2 NCBI Vegetarian Article attached).
    Interestingly enough, this specific cytokine is the one for which, during host-pathogen defense mechanisms, the receptors (NLRP3) are activated to INDUCE further inflammation. That is the opposite of the result normally seen with vegan or vegetarian diets and inflammatory responses/cytokine responses and cytokine receptors.
    “The NLRP3 inflammasome is activated by diverse non-microbial danger-associated molecular patterns (DAMPs) derived from damaged cells, and induces inflammation by increasing IL-1β and IL-18 secretion.”
    Is there a possible dietary correlation between some of the Coronavirus fatalities? Were the Wuhan hospital patients fed a similar diet prior to onset of the secondary bacterial infection? Were they scant meat eaters? Is there a dietary correlation somehow? I know that everyone is on a witch hunt because of the disgusting practice of the wild food market which proliferated the Virus in the first place, but the inflammatory Cytokine vegetarian Ceramide research is very interesting to me.
    I am completely and totally aware that a diet rich in vegetables and fruits is the most healthy way to protect the immune system and to stave off disease and heal disease, but might we be overlooking something here?
    Is our current trend toward a Vegan/less meat diet unknowingly turning on the switch for this Coronavirus to induce cytokine storms in otherwise healthy individuals? Where all vegetable diets are proven to turn the switch off for all other metabolic disease and help heal and cure, could this virus be taking advantage of this one little loophole?
    Just a thought…

  7. Re: Yunnan Bat is the original host for 2019-nCov?

    Many, if not tens of thousands, of Chinese have questioned: We, in different locations around China, have eaten bats for thousands of years without a SARS/nCov problem. Tell us, please, why, suddenly, the virus comes to Yunnan bats in 2003 and 2019, but people in Yunnan have had much less of a problem with the virus? The US, Europe, Australia, etc., all have bats but are not the origin of the SARS/nCov virus. Why?

    The PRC state-run media called such questions [rumors of conspiracy theory] and shut them down.

    Please email me, if interested, for I have other noteworthy points which I hope to discuss privately.

