I’m obsessed with genomics. As you might imagine, it’s not unusual for genomes to come up in conversation with me. I’m frequently asked what a genome actually is. According to the Oxford English Dictionary, a genome is “the complete set of genes or genetic material present in a cell or organism”. The gene part is straightforward; a gene is a region of DNA that ultimately results in a protein. We call this coding DNA because the genes serve as templates and code for specific proteins. A gene is a familiar thing to most of us. You don’t have to work in a lab to know about genes. We hear about them all the time in the news when scientists discover that a specific gene could be important for cancer research, or when people compliment each other citing their “good genes”.
It turns out that we have about 20,000 genes in our (haploid) genome. That’s a big number but those genes only account for 1.8% of the DNA in our genome. More than 98% is noncoding DNA. That doesn’t mean it’s all junk. Some regions of DNA don’t code for proteins but are still transcribed as important RNA molecules. Some of the noncoding DNA is regulatory, meaning it affects when and how much specific genes are expressed.
When I first found out that most of our genome is noncoding, it made me think about how big the genome is. It’s a difficult thing to visualise. Various textbooks and documentaries have used analogies to help me visualise the amount of data. They’ve used measurements such as number of encyclopedias, number of CD-ROMs, or even the number of floppy discs. While these things have some meaning for me (I like books and computers and yes I used floppy discs), it’s hard to appreciate the amount of information. “A genome would fill a few squintillion books.” What books? How big are the books? How thick? How many pages? I can’t visualise it because I’ve never seen that many books at once. Actually, is squintillion a thing? I think of books and discs as whole objects. I don’t think of bytes or characters when I look at a book. However, there is one place where I do constantly think of characters…
Twitter is something people love or hate. It’s the internet service nobody knew we needed until we had it. If you haven’t used it yet, let me explain. Twitter lets users type short status messages (tweets), reply to others, and keep an eye on what people all over the world are talking about thanks to trending topics. What makes Twitter a bit different is that a tweet can only contain a maximum of 140 characters. Everyone who uses Twitter is aware of this character limit, so it’s easy for them to visualise just how much data is in a tweet. If they’ve tweeted 10,000 times, they have a rough idea just how much content exists in 10,000 tweets because they know how long it has taken them to generate that many arguments and pictures of cats.
Because users of Twitter appreciate the amount of information in a tweet, I always thought it would be cool to tweet an entire genome. I thought this would be a great way to visualise just how much DNA is stored in a genome but it had never been done before. When I first decided to do this, I approached various programmers to help me since I’m no programming expert. I needed a program to obtain a genome; prepare it in such a way that it could be ready for Twitter; and then to tweet it, 140 nucleotides at a time, automatically. I wasn’t able to find anyone willing to help, though I did find plenty of people who were interested in the project. I decided to teach myself some languages and just get it done myself. I learned Perl (which is old but commonly used in bioinformatics) and Python (which is modern, generally useful, and relatively easy to learn). I chose to run the entire project through a £30 Raspberry Pi computer because it’s awesome.
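For the curious, the "prepare it for Twitter" step could be sketched roughly like this in Python. This is not the actual GenomeTweet code, just a minimal illustration. It assumes the genome arrives as FASTA-formatted text (header lines starting with ">"), and the function names are made up for the example. The posting step is left out because it depends on whichever Twitter client library is used.

```python
def clean_sequence(raw_text):
    """Drop FASTA header lines and whitespace; return the bases in upper case."""
    lines = [line.strip() for line in raw_text.splitlines()]
    return "".join(
        line for line in lines if line and not line.startswith(">")
    ).upper()


def split_into_tweets(sequence, tweet_length=140):
    """Cut the sequence into consecutive pieces of at most tweet_length bases."""
    return [
        sequence[start:start + tweet_length]
        for start in range(0, len(sequence), tweet_length)
    ]


# Tiny made-up input, with a short tweet length so the result is easy to read.
example_fasta = ">example record\nACGT\nacgt\n"
tweets = split_into_tweets(clean_sequence(example_fasta), tweet_length=3)
print(tweets)
```

Every piece except possibly the last is exactly `tweet_length` characters long, so the number of tweets is the genome length divided by 140, rounded up.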
I decided to start with the HIV-1 virus as a test because it has a tiny genome. If it worked, my next attempt would be the genome of E. coli because its genome is average-sized for bacteria but still manageable compared to our own genome. I would be thrown in Twitter jail if I tweeted too often so I had to limit the project to one tweet every 90 seconds. HIV would only take 70 tweets and be completed in around 2 hours. I started it, went for a run, and came back to find I was the first person to ever tweet an entire genome. It was time to get bigger. The E. coli genome is about 4.6 million base pairs long including over 4200 protein-coding genes. It was going to take about a month to tweet. I got it ready and running then started to think about what else I could do. I was surprised by the attention the project was receiving.
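The arithmetic behind those estimates is simple enough to sketch in Python. The 140-character limit, the 90-second gap and the E. coli length come from the figures above; the HIV-1 length of roughly 9,700 bp is an assumption added for the example, chosen because it is consistent with the 70 tweets mentioned.

```python
import math
from datetime import timedelta

TWEET_LENGTH = 140           # nucleotides per tweet
SECONDS_BETWEEN_TWEETS = 90  # gap chosen to stay out of Twitter jail


def tweets_needed(genome_length_bp):
    """Number of tweets needed to cover a genome of the given length."""
    return math.ceil(genome_length_bp / TWEET_LENGTH)


def time_to_tweet(genome_length_bp):
    """Total posting time at one tweet every SECONDS_BETWEEN_TWEETS seconds."""
    return timedelta(
        seconds=tweets_needed(genome_length_bp) * SECONDS_BETWEEN_TWEETS
    )


HIV1_BP = 9_700       # assumption: approximate HIV-1 genome length
ECOLI_BP = 4_600_000  # from the text: about 4.6 million base pairs

print(tweets_needed(HIV1_BP), time_to_tweet(HIV1_BP))
print(tweets_needed(ECOLI_BP), time_to_tweet(ECOLI_BP))
```

With these inputs HIV comes out at 70 tweets over 1 hour 45 minutes, and E. coli at a little over 34 days, assuming no interruptions.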
— Ed Yong (@edyong209) November 6, 2013
Having worked with a well-known virus and bacterium, the next logical step was to tweet a famous fungus before graduating to animals or plants. HIV took just under 2 hours and E. coli ran for over 34 days before it was complete. The genomes were going to get a lot bigger as I worked with different organisms. I created this crude table to compare potential genomes for the project:
My heart sank when I calculated how long the human genome would take. Please don’t fall into the trap of thinking humans are special because they have a huge genome compared to the others. If I were feeling particularly ambitious and optimistic, I could tweet a fish genome such as Protopterus aethiopicus (the marbled lungfish). It would take 2652 years to tweet its 130,000,000,000 bp genome. Note that the genomes of animals aren’t necessarily large compared to non-animals. Paris japonica (キヌガサソウ, “the canopy plant”) has a 150,000,000,000 bp genome that would take over 3060 years to tweet. There’s huge variability in genome size even among animals or plants, and the range is wider still across all of life. Nasuia deltocephalinicola, a bacterium that lives inside leafhopper insects, has one of the smallest known genomes of any cellular organism at only 112,000 bp. It would only take 20 hours to tweet. Some viruses have bigger genomes than this bacterium! Among animals, extremely small genomes are frequently found in highly derived species occupying very peculiar ecological niches; most of the smallest animal genomes belong to parasites.
The human genome is the ultimate goal, consisting of approximately 3,200,000,000 nucleotide base pairs (slightly fewer are actually sequenced, the point is it’s a big number). How long would that take to tweet? At my current rate of just under a thousand tweets per day, and assuming no major delays for technical reasons, the human genome would take approximately 65 years to tweet. Doable? Definitely. Practical? Probably not. I hope Twitter is still around in 65 years’ time! Rather than admit defeat and say it’s not possible, I chose a compromise. Instead of one account tweeting the entire human genome, I’ve started 24 accounts tweeting chromosomes 1-22 and the two sex chromosomes (X and Y). Because they all start at the same time, the genome will be complete once the largest chromosome has finished. Chromosome 1 is the largest at approximately 249,000,000 bp, which will take 5 years to tweet! This single chromosome will take twice as long to tweet as the entire fruit fly genome. Some may think 5 years isn’t practical but it’s better than 65!
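The difference between one account and 24 parallel accounts can be checked with a few lines of Python. This is a rough sketch using only the figures quoted above; it assumes one tweet every 90 seconds around the clock (960 tweets per day) and a 365.25-day year. When every account starts together, the finishing time is set by the longest chromosome alone.

```python
import math

TWEET_LENGTH = 140
SECONDS_BETWEEN_TWEETS = 90
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length


def years_to_tweet(base_pairs):
    """Years needed for one account to tweet the given number of base pairs."""
    tweets = math.ceil(base_pairs / TWEET_LENGTH)
    return tweets * SECONDS_BETWEEN_TWEETS / SECONDS_PER_YEAR


WHOLE_GENOME_BP = 3_200_000_000  # approximate human genome size from the text
CHROMOSOME_1_BP = 249_000_000    # the largest chromosome, from the text

single_account_years = years_to_tweet(WHOLE_GENOME_BP)
# With all 24 accounts running in parallel, the slowest (largest) one decides.
parallel_years = years_to_tweet(CHROMOSOME_1_BP)

print(f"One account: about {single_account_years:.1f} years")
print(f"24 accounts in parallel: about {parallel_years:.1f} years")
```

Under these assumptions the single account needs a little over 65 years and the parallel approach a little over 5.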
The goal of GenomeTweet was to allow Twitter users to relate genome size to number of tweets and really see the differences in a weird, fun, personal way. Sure it’s more of an art project than science, but it was a great way for me to learn to code and it helped communicate the diversity of genome sizes among organisms. It got quite a lot of people talking about genome size diversity on Twitter. The project had always been a thought experiment for a few people but nobody had made it happen. How could I resist? Once I get an idea in my head I usually make it happen. Case in point, I was amused by this retweet:
*Bass Solo* RT @GenomeTweet: GGATGAGATCCGCGACTGGTGGCAGCAAATTGAACAGTGGCGCGCTCGTCAGTGCCTGAAATATGACACTCACAGTGAAAAGATTAAACCGCAGGCGGTGATCGAGACTCT
— Dave Briggs 🦉 (@xtaldave) October 31, 2013
Could the different base pairs represent musical notes? Could I write a script that took the HIV genome and turned it into music we could actually listen to? It turns out yes. Never underestimate the things I will do to procrastinate.
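One way such a script could work, sketched in Python, is to give each base a fixed pitch. The mapping below is a hypothetical choice made for this example, not necessarily the one used for the HIV music: A, C and G become the notes of the same name, and T, which has no namesake, is arbitrarily assigned E. Turning the resulting pitches into audio is left out.

```python
# Hypothetical mapping from bases to MIDI note numbers (an assumption for
# illustration): A4 = 69, C4 = 60, G4 = 67, and T arbitrarily set to E4 = 64.
BASE_TO_MIDI = {"A": 69, "C": 60, "G": 67, "T": 64}


def bases_to_midi(sequence):
    """Convert a DNA string to MIDI note numbers, skipping unknown symbols."""
    return [BASE_TO_MIDI[base] for base in sequence.upper() if base in BASE_TO_MIDI]


def midi_to_frequency(note):
    """Equal-temperament frequency in hertz, with A4 (note 69) at 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


# First ten bases of the retweeted sequence quoted above.
melody = bases_to_midi("GGATGAGATC")
frequencies = [round(midi_to_frequency(note), 2) for note in melody]
print(melody)
print(frequencies)
```

With only four pitches the tune is repetitive, so a real script would probably vary note length or read the bases in groups to get more variety.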
I’ve started the human chromosomes as well as a few others. This year I’ve been creating a robot (alongside a butler and parody headline generator) that uses my Raspberry Pi as a brain, so it will soon resume GenomeTweet and hopefully our entire human genome will be on Twitter by 2021. Some of you might ask why? I say why not?