The origin of SARS-CoV-2 is a riddle: meet the Twitter detectives who aim to solve it

If you have read my previous article, you probably already know most of the mysteries that surround the origins of SARS-CoV-2. If you haven’t read it, I strongly recommend doing so before reading on. In fact, the present article is a tribute to the people who helped unravel those mysteries. You may have already heard about Alina Chan, but there are many other people, out of the spotlight, who deserve credit. They have been working tirelessly this year: reading academic papers, exploring barely known databases, sifting through Chinese media outlets. Their workplace is Twitter. Some of them are professional scientists, others have different backgrounds, and many use pseudonyms. They have a tendency to ask uncomfortable questions, and probably for this reason they are often blocked by other scientists and accused of promoting conspiracy theories. You may disagree with their unconventional approach, but the truth is that these people behave, to all intents and purposes, like a small scientific community: they search for and analyze data, they share and discuss their findings and, more importantly, they make discoveries.

Guess who?

The first mystery that the Twitter detectives had to solve was the identity of RaTG13. This name first appeared in the scientific literature in January, in a preprint by scientists from the Wuhan Institute of Virology, where it indicated the bat coronavirus most similar to SARS-CoV-2. Later on, on 3rd February, the same article (with some not-so-minor modifications) was published in Nature (Zhou et al., 2020). At that time we knew almost nothing about RaTG13, although this virus might once have been present in Chinese databases that were later taken down, as subsequently discovered by Yuri Deigin and @BillyBostickson. Anyway, two days later another paper was published by scientists from Wuhan University in the journal Emerging Microbes and Infections (Chen et al., 2020). The authors write:

Phylogenetic analysis indicates that 2019-nCoV is close to coronaviruses (CoVs) circulating in Rhinolophus (Horseshoe bats), such as 98.7% nucleotide identity to partial RdRp gene of bat coronavirus strain BtCoV/4991 (GenBank KP876546, 370 nt sequence of RdRp and lack of other genome sequence) and 87.9% nucleotide identity to bat coronavirus strain bat-SL-CoVZC45 and bat-SL-CoVZXC21.
Chen et al., 2020 Emerging Microbes and Infections

When they wrote the manuscript, the most similar bat viruses were in fact ZC45 and ZXC21. They didn’t mention RaTG13 because that name was unknown back then; however, they did notice a very strong similarity (98.7% nucleotide identity) with a partial RdRp gene of coronavirus BtCoV/4991, which was present in GenBank and linked to a publication from 2016 (Ge et al., 2016).
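For readers wondering what a figure like "98.7% nucleotide identity" means in practice, here is a minimal sketch of how such a number can be computed from two sequences that have already been aligned. The function name `percent_identity` and the two toy sequences are made up for illustration (they are not the real BtCoV/4991 or SARS-CoV-2 RdRp fragments), and real tools may handle gaps differently, so take it as an approximation of the idea rather than the method used by Chen et al.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical bases over the aligned columns
    where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0  # columns without a gap in either sequence
    matches = 0   # columns where both bases are the same
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # assumption: gap columns are simply skipped
        compared += 1
        if base_a == base_b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy sequences, invented for this example: 20 bases, one mismatch at the end.
toy_a = "ATGGCTAGCTAGGATCCGTA"
toy_b = "ATGGCTAGCTAGGATCCGTT"
print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```

On a 370-nucleotide fragment like the one deposited for BtCoV/4991, a value close to 99% corresponds to only a handful of mismatching bases.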

Emerging Microbes and Infections is a minor journal and this paper was somewhat eclipsed by the Nature article; I guess that few people noticed it. One of them was certainly Rossana Segreto (Twitter account @Rossana38510044), a biologist at the University of Innsbruck, who wrote a comment on Virology Blog on 16th March. Other people had probably read both papers and tried to align the BtCoV/4991 sequence from Chen et al. with RaTG13 from Zhou et al., but Rossana’s comment is the first public statement that I found suggesting the connection between the two names:

Comment by Rossana Segreto on Virology blog, 16th March 2020

The “same-virus theory” was confirmed on 9th May by Peter Daszak, a close collaborator of Wuhan scientist Shi Zhengli. This is how Daszak replied to a Twitter user who had asked him about the connection:

Tweet by Peter Daszak, 9th May 2020

That Twitter user was @schnufi666, who had just discovered that in a Chinese database specialized in bat viruses, the entry of BtCoV/4991 had been modified on 7th March to include a reference to RaTG13. Later on, it was Shi Zhengli herself who confirmed the link between those names in an interview with Science published in July. Finally, after tweets, interviews and edits to little-known Chinese databases, WIV scientists updated the RaTG13 entry on GenBank, adding the sentence “former lab designation: Bat coronavirus Ra4991”. They did this on 24th November, but hey, better late than never.

Treasure hunt

Back in April, almost everybody in this small Twitter community was convinced that RaTG13 was in fact BtCoV/4991, although the connection was not stated explicitly in any paper. People then began to wonder why the authors of the Nature paper didn’t cite their own paper from 2016, where RaTG13 was first described under a different name. Were they hiding something? To answer this question, the Twitter detectives started exploring the internet for any relevant information about a mineshaft in Mojiang county (Yunnan), the place where RaTG13 was sampled in 2013, as noticed by Yuri Deigin (Twitter account @ydeigin). At that time, the discussions were often led by @luigi_warren, who used to summarize the main discoveries in clear threads such as this one.

The first clue for the investigation was in the 2016 paper itself. On 15th May, @_coltseavers noticed a sentence citing an earlier paper (Wu et al., 2014): apparently, the same mineshaft had been sampled in 2012, in search of henipaviruses (another genus of viruses that includes Nipah and Hendra). Reading the 2014 paper, it was immediately clear why that mineshaft was so interesting for the Chinese virus hunters.

In June 2012, in Mojiang Hani Autonomous County, Yunnan Province, China, severe pneumonia without a known cause was diagnosed in 3 persons who had been working in an abandoned mine; all 3 patients died. Half a year later, we investigated the presence of novel zoonotic pathogens in natural hosts in this cave. For the investigation, we collected anal swab samples from 20 bats (Rhinolophus ferrumequinum), 9 rats (R. flavipectus), and 5 musk shrews (Crocidura dracula) from the mine for virome analysis.
Wu et al., 2014 Emerging Infectious Diseases

So, the bat virus most similar to SARS-CoV-2 had been sampled in 2013 in a mineshaft where, the year before, three people had died of severe pneumonia of unknown cause. Quite interesting, no? In fact, the news was reported on other websites. Roland Baker (Twitter account @RolandBakerIII) found an article in Chinese describing the incident (“In June 2012, three men removing slag from a derelict copper mine in southwestern China fell ill with severe pneumonia and died.”). He also found a news item in Science, where scientists explained that the cause of those deaths was yet to be discovered. Interestingly, Scientific American also mentions those events in a recent interview with Shi Zhengli, as noted by @luigi_warren. Oddly enough, this article blames a fungus for the pneumonia, which was quite surprising because, according to the papers, the Chinese scientists were searching for a virus.

Excerpt from “How China’s ‘Bat Woman’ Hunted Down Viruses from SARS to the New Coronavirus”, Scientific American 2020

At that point, the exact location of the mineshaft was still unknown, although Antonio Duarte (Twitter account @AntGDuarte) had noticed on 11th May that the red dot on an unlabelled map included in Ge et al. 2016 was very close to Tongguanzhen, which is the administrative town of Tongguan township: maybe that’s where RaTG13 got its “TG” from!

Tweet by Antonio Duarte, 11th May 2020

Interestingly, while all these things were happening on Twitter, an Indian microbiologist independently reached the same conclusion: her name was Monali Rahalkar (Twitter account @MonaRahalkar), and her preprint was the first attempt to share the mineshaft story with a broader audience. Then, on 18th May, came the plot twist. @TheSeeker268 found a Master’s thesis dealing with the pneumonia outbreak: “The Analysis of 6 Patients with Severe Pneumonia Caused by Unknown Viruses” (by Li Xu, supervisor Prof. Qian Chuan Yun, published in 2014). And a few days later, on 29th May, the same Twitter user posted a PhD thesis on the same topic: “Novel Virus Discovery in Bat and the Exploration of Receptor of Bat Coronavirus HKU9” (by Canping Huang, supervisor Gao Fu, published in 2018). The two documents, mentioned in a preprint posted by Segreto and Deigin, held a bonanza of information about the Tongguan mineshaft and the pneumonia cases. For instance, the two theses precisely describe the six patients as well as their symptoms, which were strikingly similar to those of COVID-19, as reported by Monali Rahalkar in a paper published in Frontiers in Public Health (Rahalkar and Bahulikar, 2020).

Summary of the six pneumonia patients (Rahalkar and Bahulikar, 2020)
Common features observed in the six pneumonia patients and COVID-19 (Rahalkar and Bahulikar, 2020)

An important part of the work was carried out by @franciscodeasis, who produced the first good translation of parts of the Master’s and PhD theses. The PhD thesis was particularly important because it revealed the GPS coordinates of the mineshaft, as noted by its discoverer The Seeker: 23°10'36.00"N, 101°21'28.00"E. Actually, those were the coordinates of a village named Danaoshan. The mineshaft itself may be quite close: according to Francisco De Asis, it is located about 1.4 km away, in an area accessible by dirt road from Danaoshan.
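If you want to look up that spot on a map yourself, most online tools expect decimal degrees rather than degrees, minutes and seconds. Here is a small sketch of the conversion applied to the coordinates reported in the PhD thesis. The helper name `dms_to_decimal` is my own, introduced only for illustration.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds to signed decimal degrees.
    South and West are returned as negative values."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


# Coordinates quoted from the PhD thesis (the village of Danaoshan).
latitude = dms_to_decimal(23, 10, 36.00, "N")
longitude = dms_to_decimal(101, 21, 28.00, "E")
print(f"{latitude:.6f}, {longitude:.6f}")  # prints 23.176667, 101.357778
```

Keep in mind that this only locates the village: the mineshaft itself is thought to lie some distance away, as explained above.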

Probable location of the Mojiang mineshaft where six people fell sick in 2012, according to Francisco De Asis (snapshot from Google Earth, 2011)

Our Twitter detectives got it right. On 17th November, Zhou et al. published an addendum to their original paper that confirmed the mineshaft story: in 2012 the Wuhan Institute of Virology had analyzed serum samples from 4 patients with pneumonia, who fell sick after visiting a mine cave in Tongguan town. For that reason, the Chinese scientists made 1–2 trips each year to the mine, looking for SARS-related viruses that might explain the disease. Some of the details in the addendum do not fully match the Master’s and PhD theses, but it’s still impressive how a group of Twitter users managed to bring to the surface facts and events that would otherwise probably have remained hidden from the scientific community. In this regard, Rossana Segreto deserves particular credit for relentlessly pressing the Nature editors, although several other scientists likely asked the journal for explanations as well.

The final riddle

Consider the information available in January: the closest relative of SARS-CoV-2 was a bat coronavirus, sampled in Yunnan in 2013, that we knew nothing about. Thanks to the research of tireless Twitter users, we now know that SARS-CoV-2 may be linked to an old story of unexplained pneumonia, a story that scientists in Wuhan, apparently, were not eager to share with the world. We also know that RaTG13 was sequenced in 2018, and not after the COVID outbreak as stated in the Nature paper. It was Francisco De Asis who first noticed the dates in the names of sequences uploaded to NCBI in May, a finding later confirmed by Shi Zhengli in her answers to Science. But the quest is not over!

The addendum recently published in Nature revealed, in fact, that WIV had found 8 other SARS-related coronaviruses in the Tongguan mine, besides RaTG13. What do we know about those viruses? Basically nothing: the article doesn’t even mention their names, let alone their genomic sequences. But they may be somewhere, buried in minor papers or lesser-known databases: this is the final riddle that the Twitter detectives have to solve. For months, Francisco De Asis has been tracking all the information available on this topic: in his huge Excel files he annotates publications, sequence entries, even the time and location of the sampling expeditions. He believes the 8 viruses might belong to the so-called “7896 clade”, a group of novel viruses that recently appeared in a paper published in Nature Communications (Latinne et al., 2020).

Tweet by Francisco De Asis, 18th November 2020

Actually, these viruses are never mentioned in the paper, but the GenBank IDs are reported in the Supplementary PDF, and their RdRp sequences (the only public sequences currently available) look quite similar to those of SARS-CoV-2 and its closest cousins RaTG13 and RmYN02 (see this tree by @babarlelephant). Moreover, some of the RaTG13 sequences uploaded in May have a mysterious “7896” label. Will the detectives get it right this time too, like they did before? To find out, just follow them on Twitter: here’s a list of the most active users. You will probably come across tons of tweets by @billybostickson, who has been the aggregator and motivator of this group since the very beginning of this pandemic: to get an idea of his amazing work, have a look at his “260 questions for scientists and the WHO on the origin of SARS-CoV-2” (Part 1, 2, 3).

If social media is not your thing and you prefer traditional scientific articles, no worries. Below is a list of works authored by the people mentioned above, as well as by others that I didn’t mention: for instance, you will find the first peer-reviewed paper speculating on the lab origin of SARS-CoV-2, by Dan Sirotkin (Twitter account @Harvard2H) and his father Karl, who designed dbSNP; the fascinating Bayesian analysis by Gilles Demaneuf and Rodolphe de Maistre; the preprint by Daoyu Zhang (Twitter account @flavinkins), who first questioned the data on pangolin coronaviruses; the lab leak hypothesis proposed by Anon, which fits perfectly with this detective saga; and last but not least, the memorable Medium article by Yuri Deigin on the gain-of-function experiments that were carried out in Wuhan. So many people were involved, as you can see, but this should not come as a surprise: science is a collective enterprise, even when you do it on Twitter.

Peer-reviewed papers


Medium articles and blog posts

Bioinformatician, data scientist, science writer
