30 April 2012

Playing with genes

Some people play sport, others play Xbox. And now there is a growing group of people who play with DNA sequences.
Studying DNA as a hobby

I know it’s bad coffee-shop etiquette, but I keep peeking at the laptop computer screen of the young woman beside me. I see long stretches of A’s, T’s, G’s and C’s and a large colorful map of chromosomes. She’s definitely a geneticist.

“Sorry to disturb you,” I say, tilting my own laptop screen towards her, revealing a genome diagram, “but it looks like we’re both working on similar things.” Soon I’m telling her about my postdoctoral research on jellyfish genomes and she’s describing to me her work on salmon genetics. I ask her if she’s a PhD student. She laughs and says, “I’m actually interning as an investment analyst.”

She then explains how a few years ago a friend taught her some basic bioinformatics skills and how to download DNA sequences from the Internet. “Ever since, I’ve been assembling and analyzing genomes in my spare time. I have no formal training in biology, but I’ve learned the basics through a few introductory textbooks that I ordered online. It’s a weird hobby, but a great way to unwind from work and learn about evolution.”

She’s not alone. Across the world, hobby geneticists are exploring the huge number of DNA sequences that are freely and publicly available on the Internet at websites like GenBank, which is supported by the U.S. National Institutes of Health, and EMBL-Bank, which is funded by various European member states. These online genetic storehouses have everything from the human genome to the smallpox genome, as well as woolly mammoth and Neanderthal DNA sequences. They are easy to navigate, with user-friendly interfaces and simple search menus. A search of GenBank using the keyword “dog” recovers more than 200,000 entries, including complete genomes of the North American coyote, gray wolf, and domestic dog. These genomes can be downloaded in minutes by anyone with a computer and an Internet connection.

I ask my new acquaintance at the coffee shop what types of software she uses to analyze her DNA sequences. She recites a long and impressive list of computer programs for geneticists, many of which I employ in my own research. “I’ve tried a lot of the different freeware programs,” she says. “Some are great, but given my limited knowledge of computers, I avoid those that need to be run from a command line. Last Christmas, I convinced my parents to buy me an easy-to-use, all-in-one bioinformatics software suite with a graphical interface. They were reluctant to get it for me at first because it cost four hundred dollars and I’m not a biologist. But I argued that it was comparable to the price of an Xbox or an iPad – items that they’d already purchased for my brother. I also showed them some of the cool things that I could do with the program, like constructing an evolutionary tree of honeybees or looking at the sex chromosomes in frogs. They couldn’t believe that the DNA sequences from all of these different species were on the Internet, for everyone to explore.”

The number of DNA sequences at websites like GenBank is growing exponentially. In the year 2000, GenBank contained around one million DNA entries. Now it boasts more than 150 million, amounting to over 100 billion base pairs of DNA. This rapid increase in genetic data is a reflection of recent advancements in DNA sequencing technologies (often called “next-generation” sequencing techniques), which have made it cheap, easy, and fast to generate millions of nucleotides of genomic data. Moreover, when scientists publish in academic journals they are required to submit all of the DNA data that they describe in their articles to online sequence repositories.

Many scientists also submit the raw data that come directly from next-generation sequencing machines: millions of short, unassembled nucleotide sequences, stored in a single file. These types of data, which are housed in a special section of GenBank called the Sequence Read Archive, are great for both researchers and hobby geneticists because they often contain information that was ignored or overlooked by the primary investigators. For example, a file of next-generation DNA sequences derived from a green alga could also contain DNA sequences from the different viruses and bacteria that live in or around that alga, meaning that a hobby geneticist could use these data to assemble novel viral and bacterial genomes.

My hobby geneticist friend tells me how she is using next-generation DNA sequences to assemble the genomes of Atlantic salmon and the sea lice that parasitize them. She got the idea for this research from watching a documentary film on British Columbia’s salmon fishery. “One part of the movie focused on genetics,” she says, “and how biologists are using DNA sequences to study the impact of sea lice on the salmon farming industry. On the Internet, I located next-generation sequencing projects for both salmon and their sea lice ectoparasites, and now I’m testing to see if there has been any lateral exchange of DNA between these two species.” I ask her if she plans to publish any of her results. “I mostly do this for my own enjoyment,” she explains. “However, I did email some of my findings to a professor at the University of British Columbia. He was surprised to hear from me, but has been very helpful and encouraging. He’s even asked if he could use some of my analyses in a paper that he’s writing. If it works out, I’ll have a scientific publication to my name – not that it will help me much in the world of investment banking.”

I return to my work on jellyfish genomes. But soon I’m distracted by a nagging image: I keep picturing an army of hobby geneticists fervidly descending upon my hard-earned data – all of which I sent to GenBank last week!

David Smith is a postdoctoral fellow in the Botany department at the University of British Columbia, in Vancouver, Canada.

