We now have 11x coverage of the H. melpomene genome, all in 454 shotgun sequence reads. A preliminary assembly gives an average contig size of 2662 bp and an N50 contig size of 8207 bp. So there is a bit of a way to go until we have a full assembly, but this is good progress. If anyone out there is interested in having a look at the data with a mind to contributing to its assembly/annotation, or for whatever purpose, please let me know.
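For readers less familiar with assembly statistics, here is a minimal sketch of how an N50 contig size is usually computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembled bases. The contig lengths below are a made-up toy example, not the real H. melpomene contigs, and the function name is just for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp).

    N50 is the length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (lengths in bp), for illustration only.
toy_contigs = [100, 200, 300, 400, 1000]

print("mean contig size:", sum(toy_contigs) / len(toy_contigs))
print("N50 contig size:", n50(toy_contigs))
```

Because a few long contigs hold most of the sequence, the N50 is typically well above the mean contig size, which is the pattern seen in the preliminary assembly figures.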
We have just heard that the first data is now available from Baylor for the Heliconius melpomene genome project. Three full runs of 454, giving approximately 4X coverage of the whole genome. We will be aligning this against existing genome sequence to assess coverage and quality as soon as we can get hold of it!
This is an article I wrote for the Research Horizons magazine in Cambridge. I thought it might be interesting as a bit of a review of some of the areas of research underway at the moment among members of the consortium. It was written for a Darwin special issue – hence the quote at the start.
On the wings of a butterfly
Since Darwin’s time, Amazonian butterflies have fascinated evolutionary biologists as examples of evolution in action.
On reading Henry Walter Bates’ 1862 account of his travels in the Amazon, Charles Darwin was captivated not only by Bates’ description of the stunning diversity of butterfly species and wing patterns found in the Amazonian jungle, but also by the impressive mimicry between unrelated species. He wrote: ‘It is hardly an exaggeration to say, that whilst reading and reflecting on the various facts given in this Memoir, we feel to be as near witnesses, as we can ever hope to be, of the creation of a new species on this earth.’1
Bates hypothesised that mimicry evolved to confuse predators. Edible butterflies, for instance, copied the wing patterns of toxic species so that predators would avoid eating them. He also described what looked like evolution in action: he observed a continuum, from variable species, in which different wing patterns were found together in the same locality, through to related species with different wing patterns. Now, 150 years later, modern science has taken this to another level, with new research that attempts to uncover the genetic predictability of evolution by identifying the genetic basis of wing pattern mimicry.
The importance of pattern
We now recognise that not only do edible species mimic nasty ones (today called Batesian mimicry), but that several nasty species can also benefit from mimicking one another (Müllerian mimicry) – bees and wasps being a familiar example. Many of the Amazonian butterflies described by Bates are in fact Müllerian mimics, and the best-studied group are the genus Heliconius, the passion vine butterflies. Recent work has focused on the Heliconius butterflies as a case study in evolutionary biology.
Studies of Heliconius wing patterns in the wild have confirmed Bates’ hunch: changes in wing pattern play a big role in determining how successful the butterflies are in both mating and avoiding being eaten. Using flapping models with different patterns, the researchers have shown that the butterflies choose to mate with individuals that look the same as themselves; because of this, over time, different patterns are likely to split into new species. In addition, hybrids between populations with different patterns have intermediate patterns that are not recognised by predators as harmful and therefore suffer disproportionately from attacks, reinforcing the split into new species.
This dual role of wing patterns in signalling both to predators and to potential mates makes pattern a ‘key trait’ for speciation. As Bates suggested, shifts in wing patterns do indeed lead to the evolution of new species.
Signatures of selection
One of the current hot topics in evolutionary biology is to what extent we can predict the path of evolution. One particular Heliconius species (Heliconius melpomene) is an ideal system in which to address this question because it has many geographic populations with very different colour patterns. A major collaborative project focusing on the genetic basis of wing patterns is underway with funding from the Biotechnology and Biological Sciences Research Council (BBSRC), Royal Society, Leverhulme Trust and Natural Environment Research Council (NERC).
Over the past decade, the researchers have been collecting different forms of H. melpomene from around South America, carrying out genetic crosses at a field station in Panama. These crosses have shown that dramatic differences in colour pattern are controlled by just a handful of genes, and that these genes are clustered together on four out of the 21 Heliconius chromosomes. The genes act as wing pattern ‘switches’, turning on and off the presence of major pattern elements, such as a large red forewing band. The challenge is to find out precisely what these genes are and how they work.
In collaboration with the Wellcome Trust Sanger Institute, regions of the butterfly genome are being sequenced to try to identify the specific nature of the pattern switches. The expectation was that the switches would correspond to well-known genes, perhaps controlling wing development or colour pigments. In fact the two genomic regions studied so far each contain around 20 genes, none of which is known for its involvement in these processes. This is in itself exciting as it implies that novel mechanisms of pattern determination are operating; current research is focused on determining which, of all these genes, are having an effect in the butterfly.
Genetics of mimicry
What attracted Darwin and others to mimicry as a case study in evolution is its repeatability – the same patterns evolve in distantly related species. A key question for an evolutionary geneticist is therefore whether the patterns are generated by the same genetic mechanisms, or different ones. Again, Heliconius butterflies are a good system to study this.
Heliconius melpomene co-mimics another species, Heliconius erato, all over the neotropics – in any location you care to look you will find that the two species have evolved identical patterns. Recently, in collaboration with research groups in the USA, it has been shown that pattern switches in the two species are controlled by the same regions of DNA, such that genes at identical locations in the genome code for either a red forewing band or a yellow hindwing bar. This implies that evolution of the same mimicry patterns in the two species has been made easier by a shared genetic system. While predation against abnormal wing patterns drives the evolution of mimicry through Darwinian natural selection, a shared developmental system may bias the raw materials in favour of certain kinds of patterns.
Of course, the link between wing pattern adaptation and speciation requires changes in behaviour. The mating preferences of divergent populations need to evolve in order to match their wing patterns. Remarkably, crossing experiments currently being carried out in Panama show that the genes underlying these changes in behaviour are closely associated with colour pattern genes. It seems that there are ‘hotspots’ in the genome for evolutionary change, influencing traits as diverse as wing patterns and mating preference.
An enduring example
It is an exciting time to be studying butterfly mimicry. The combination of population genetic, developmental and behavioural approaches is starting to answer the issues Darwin and Bates themselves debated; questions which were posed at the very dawn of evolutionary biology. Over the last 150 years, Heliconius butterflies have persisted as an example of evolution in action. With the imminent sequencing of the Heliconius melpomene genome, they will no doubt continue to be so for some time yet. Charles Darwin would surely have approved.
1[Darwin, C.R.] 1863. [Review of] Contributions to an insect fauna of the Amazon Valley. By Henry Walter Bates, Esq. Transact. Linnean Soc. Vol. XXIII. 1862, p. 495. Natural History Review 3: 219–224.
Thanks to those in my lab who helped with the text, Laura Ferguson in particular. If anyone is interested in reading more about the idea of genomic ‘hotspots’ for evolution, there is a nice recent review of the evidence in Heliconius by Riccardo Papa and others, and a more general overview in Science magazine.
The Heliconius Genome Consortium has been awarded a BBSRC ‘USA Partnering Award’ worth almost £40k over four years. This will fund meetings to bring the consortium members together for genome annotation and analysis. In addition a number of lab exchange visits for postdocs and students will also be funded. The aim is to promote collaboration and interaction between consortium members – and in particular between the UK and US. We are also keen to collaborate with other labs working on insect genomes who may be interested in being involved in annotation of particular gene families.
We have also applied to NESCent to fund HGC meetings based at Duke University in North Carolina.
January 8-10, 2009 * Harvard University
1. Genome Sequencing Strategy & Details
2. Bioinformatics, Databasing, & Tool building
3. Major Tasks & Working Groups
Summary by Jamie Walters, Sean Mullen, Chris Jiggins and Owen McMillan
Genome Sequencing Effort:
This community project has arisen through the collaboration and ‘federation’ of several labs each promising to contribute $15,000 towards a “Heliconius Genome Project”. The sequencing will be done in collaboration with the Baylor Genome Center.
Contributing members include (alphabetically):
ffrench-Constant, Kronforst, Jiggins, Joron, Mallet, Mavarez, McMillan, Mullen, Reed
Those who attended the meeting:
Stephen Richards (via Skype)
Durrell D. Kapan
Others who have played a major role in the consortium but couldn’t make it to Harvard:
The current plan is to focus the majority of sequencing efforts on a Panamanian (Darien) H. melpomene melpomene ‘reference sequence’. At the same time, lower coverage sequences will be generated for H. melpomene rosina and H. cydno (galanthus or chioneus), both also from Panama/Costa Rica. H. melpomene (genome = 295 Mb) was chosen over H. erato (~400 Mb) because of its smaller genome size. This strategy is based on the idea of using ~$100,000 initially to see how far we get with coverage, assembly, and comparative genomics, reserving the remainder in case manual assembly or further sequencing is required.
There is general agreement that it is important to keep a major ‘comparative’ angle within the project and to focus on ‘pushing the limits’ in whatever ways seem reasonable and convenient, so long as the major goal of obtaining a good genome sequence is achieved.
- Inbreeding of H. melpomene melpomene stock is ongoing in Panama; 5th generation currently ‘in production’
- Other lines not yet established, but Fringy suggested inbreeding wasn’t as important for the rosina and cydno as we will resequence from a single individual
- Sequencing methods
- H. melpomene melpomene reference sequence
- The question of male vs. female was raised. Several suggestions were made and the relative merits discussed:
- Males would give equal coverage of each chromosome, but we’d lose the W. Other suggestions included microdissection & isolation of the W chromosome for independent sequencing.
- Ultimately Fringy told us that the W was such a small portion of the genome that it didn’t matter, so the decision was made to sequence females in the hope of getting some W sequence.
- Multiple (6?) 454 runs involving several different library preparations (each 454 run should give 500 Mb in 250-500 bp reads).
• Fragment library (5 μg DNA required)
• 2 Kb library (5 μg DNA, 10 μg preferred)
• 20 Kb library (30μg DNA required – includes some for ‘backup’ library)
• 30-40% of data generated will come from paired-end reads, while 60-70% will be randomly cut shotgun reads
• Overall minimum 50 μg DNA needed for H. melpomene melpomene sequencing
DNA preparation: Fringy suggested a Qiagen kit, though many suggested Phenol-Chloroform might be better. In any case, the extraction needs to show no degradation and a 260/280 of 1.8-2.0, with quantity checked on an agarose gel against a size standard of known concentration.
• 12x 454 runs to get 20X coverage
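The run numbers above follow from simple arithmetic: expected coverage is total sequenced bases divided by genome size. The sketch below uses the figures quoted in these notes (roughly 500 Mb per 454 run, a 295 Mb genome). It is an idealised estimate that ignores read filtering, contamination and uneven coverage, and the function names are just for illustration.

```python
import math


def expected_coverage(runs, yield_mb_per_run, genome_mb):
    """Idealised fold-coverage from a number of sequencing runs."""
    return runs * yield_mb_per_run / genome_mb


def runs_needed(target_coverage, yield_mb_per_run, genome_mb):
    """Smallest whole number of runs that reaches the target coverage."""
    return math.ceil(target_coverage * genome_mb / yield_mb_per_run)


# Figures taken from the meeting notes; real yields per run will vary.
YIELD_MB = 500   # approximate output of one 454 run
GENOME_MB = 295  # estimated H. melpomene genome size

print(f"6 runs  -> {expected_coverage(6, YIELD_MB, GENOME_MB):.1f}x")
print(f"12 runs -> {expected_coverage(12, YIELD_MB, GENOME_MB):.1f}x")
print("runs for 20x:", runs_needed(20, YIELD_MB, GENOME_MB))
```

Under these assumptions six runs give roughly 10x and twelve runs just over 20x, consistent with the estimate in the notes.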
Other Races/Species: The H. melpomene rosina and H. cydno (galanthus) data will come from deep short-read sequencing, either Illumina or SOLiD, whichever makes the most sense at the time.
H. cydno is ~ 2.5% mtDNA divergent from the ‘reference’. The thought is we would comfortably align genes & coding regions, but that many intergenic regions might not assemble.
Use ½ of an Illumina run (3 lanes) for each of these additional lineages.
Following from Jim’s presentation of his project, we also decided to collect DNA for a rayed Amazonian race, H. m. aglaope, which might be sequenced in addition to H. m. rosina.
Assembly: This is potentially a very problematic issue. The best-case scenario is that we can feed all the 454 data into the Newbler assembly software, wait a day, and have our scaffolds. A Newbler assembly is effectively free and there is a clear precedent of success for de novo assembly for organisms with genomes up to ~200 Mb (e.g. D. melanogaster) with an N50 of 20-50 kb for contigs and 3 Mb for scaffolds. The scary part is that Newbler currently chokes on the Helicoverpa data (~500 Mb genome @ 12x coverage): it will only assemble up to 6x and then dies. Fringy et al. are actively working on this. It is currently unclear how our data will assemble using Newbler…
Alternatives include: Arachne (Broad Institute), Jazz (JGI), MIRA
Manual assembly the ‘old fashioned’ way: three levels i) atlas assembly ii) overlap graph [one massive contig; massage to get rid of repeat joining; “lots of tuning”; Phrap] iii) join to data [mapping/FISH on chromosomes]
• The issue is that this manual approach is not scalable, takes months of work, and will likely cost ~$25,000 if it is needed.
Mapping Scaffolds: Once we get a good set of scaffolds, the next step will be to order, orient, and assign them to linkage groups. There was a clear consensus that linkage mapping with RAD-tag markers would be the best way to do this.
Mapping family: Luana and Chris will aim to obtain a large mapping family from a cross between the reference strain of H. m. melpomene (if possible the inbred line) and H. m. rosina. This will be used for RAD genotype mapping giving a high density SNP map with reference sequence tags at each marker, many of which we hope to align to genome contigs allowing long-range scaffolding of the finished assembly.
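To make the scaffold-anchoring idea concrete, here is a minimal sketch of how mapped markers could place scaffolds on linkage groups: each scaffold goes to the linkage group that most of its markers map to, and scaffolds within a group are ordered by the median map position of those markers. The marker records, scaffold names and the function name are all hypothetical, and a real pipeline would also need to orient scaffolds and deal with conflicting markers more carefully.

```python
from collections import Counter, defaultdict
from statistics import median


def anchor_scaffolds(markers):
    """Assign scaffolds to linkage groups and order them along each group.

    markers: iterable of (scaffold, linkage_group, position_cM) tuples,
    one per mapped marker whose sequence tag aligns to that scaffold.
    Returns {linkage_group: [scaffolds ordered by median map position]}.
    """
    hits_by_scaffold = defaultdict(list)
    for scaffold, group, position in markers:
        hits_by_scaffold[scaffold].append((group, position))

    placed = defaultdict(list)
    for scaffold, hits in hits_by_scaffold.items():
        # Majority vote: the linkage group supported by most markers.
        best_group, _ = Counter(g for g, _ in hits).most_common(1)[0]
        positions = [p for g, p in hits if g == best_group]
        placed[best_group].append((median(positions), scaffold))

    return {g: [s for _, s in sorted(items)] for g, items in placed.items()}


# Hypothetical toy data: (scaffold, linkage group, map position in cM).
toy_markers = [
    ("scaf_A", 1, 10.0),
    ("scaf_A", 1, 12.5),
    ("scaf_B", 1, 3.0),
    ("scaf_B", 2, 40.0),  # a conflicting marker, outvoted by the other two
    ("scaf_B", 1, 4.0),
    ("scaf_C", 2, 7.0),
]

print(anchor_scaffolds(toy_markers))
```

The denser the marker map, the more scaffolds carry at least two markers, which is what would be needed to orient them as well as order them.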
Expected stats (ballpark): Given a genome of ~300 Mb, Fringy expects to have ~250 Mb in the assembly, ~10,000 contigs with N50=40 Kb giving rise to 100-200 scaffolds of ~500 Kb
Annotation & Curation: Following the rubric introduced by Alexie, “annotation” is the automated identification of genes, while “curation” refers to the manual review, editing, and confirmation of the automated annotations.
Annotation – there are many options for annotation software such as
GLEAN: the best tool currently available; integrates gene models, ESTs, and other relevant data into a single ‘best’ set of gene predictions
NCBI RefSeq pipeline.
Full Length cDNA: It was suggested that 100-200 well-annotated full-length cDNA sequences would be useful for verifying gene predictions. Pick a diverse array of genes, including genes with low-level expression that might not get well-covered by ESTs…
Curation – it was widely acknowledged that this is an enormous task that will take many hours of work from many, many people. Also, this task should be viewed as a continuously ongoing process rather than something that would produce a ‘final product’
Annotation jamboree – there was a lot of talk about organizing an Annotation Jamboree, essentially a ~two week session (perhaps hosted at NESCent) where everyone would convene to work on this effort. Fall 2009 was suggested as a tentative time.
Curation by gene group – another common and popular idea would be to assign curation tasks by gene groups, where a lab with an interest in a particular type or family of genes would be responsible for annotations. Offering authorship on the final genome paper could potentially be a good way to entice diverse groups into participating in the annotation/curation effort.
Curation & coursework – Adriana Briscoe suggested the idea of creating a course or course module where students would be responsible for curation of a gene set, and would have to defend their decisions and explain their results. (see below in Outreach section for further details).
Bioinformatic infrastructure – it was agreed that successful broadscale community (or student) participation in the curation effort would require establishing an accessible bioinformatic framework to allow curation to proceed in parallel. It was not clear whether such tools already exist or how exactly this would come into existence, though it would probably involve the servers set up by Bob Reed, and the ffrench-Constant group (particularly Alexie P.) would probably play a key role.
Time Frame: there are currently ~5 insect genome projects in the works at Baylor, but according to Fringy precedence would likely be determined primarily by who gets the samples to Baylor first. Once that happens, the sequencing would proceed quite quickly, perhaps completed in a month or less.
The bioinformatic challenges associated with generating a ‘well-groomed’ and widely accessible genome sequence are substantial and essentially indefinite; they will require continuous and ongoing efforts for years to come. The most obvious and immediate challenge we face is developing the hardware & software infrastructure for genome data sharing and curation. Another immediate challenge is communication & coordination within the community. Longer-term goals include integrating a wide range of genetic and diversity data into a database that also contains genome information.
It was not immediately clear at the meeting where the financial or human resources would come from to support these bioinformatic challenges in the long term, particularly with regard to the genomic data. The ffrench-Constant and Blaxter labs were both suggested as likely candidates during the meeting, but after further discussion it was clear that neither group currently had the resources to support this project in its entirety. Discussion centered around the need for a dedicated technician to develop, coordinate, and maintain the bioinformatic resources. Alexie P. suggested he (or the ffrench-Constant lab) could supervise this technician if money could be found to fund the position. There was consensus that all future grants coming out of HGC labs should write in funds for bioinformatics support (such as hiring a dedicated technician).
More immediately, there are several important resources already in place. Bob Reed has secured servers to host the genome database and enlisted experienced bioinformaticians to help with the set-up. Alexie and the ffrench-Constant lab are about to launch a GMOD version of ButterflyBase. And Jim Mallet has funds immediately available from his trans-species transcriptome project to fund bioinformatics support.
Genome Data sharing and curation
Servers – Bob Reed has secured 3 servers to host the genomic information and other related content/functions. One will be a genome browser, the second a BLAST server, and the third a backup. These will be based at UC Irvine.
Set Up – Scott Cain (CSHL) and David Clements (NESCent) are ‘on board’ to set up these servers to run a GMOD database (GBrowse). There will be a systems administrator at UC Irvine maintaining these systems.
Annotation/Curation software module – One important piece of the bioinformatic infrastructure that was discussed extensively (though more often implicitly than explicitly) is a gene curation module. The idea is to have an easy-to-use tool that will allow manual curation of gene predictions to be performed by pretty much anyone. This tool would be the backbone of any annotation jamboree or course module that has many people working in parallel on annotations.
Genome Database Long term – the database should be structured on a GMOD/CHADO schema, which is currently the ‘industry standard’ and will support the ‘diversity modules’ that O. McMillan has been working on. Other suggestions include modelling the interface on the pea aphid database (aphidbase.com) and incorporating an InterMine/FlyMine-like module (Gos Micklem at Cambridge heads this).
CamTools – Currently C. Jiggins will host/support data sharing at Cambridge via CamTools. There is tentative agreement in place among HGC labs that all ‘genomic’ scale data will be posted immediately on CamTools for the broader community to access. Chris has added a post to heliconius.org about the agreement underlying data sharing on the site.
ButterflyBase & Cornwall Resources
In contrast to data sharing (as hosted by CamTools and the UC-Irvine servers), Alexie shared resources available for data processing
96 CPUs & 15 Terabytes disk array, backed up in Cornwall. These are available to support Heliconius research but the precise role of the Cornwall group in bioinformatic analysis remains to be determined.
2500 CPUs HPC High performance computing in Dresden
A new release of ButterflyBase is coming soon. It will be based on a GMOD framework.
Community Organization & Communication
Heliconius.org – this website will serve as ‘point of departure’ for all things Heliconius, encompassing both information for public audiences as well as the research community. Chris has already updated the website using a WordPress content management system. If you want to post something please sign up as a user. Currently all posts are ‘public access’ but in the future we can password protect posts if we feel there is a need to do so.
Google sites – in the short term we are using tools from Google to organize and communicate
A google ‘wiki’ that allows content and announcements to be shared and communally edited
The meeting summary & MOU (this document)
Major Tasks and Working Groups:
The meeting culminated with a focused discussion of immediate tasks/goals and assignment of individual people to “working groups” responsible for periodically updating the broader community on progress in specific areas. These working groups consist of 1) Bioinformatics resources, 2) Animal Husbandry.
1) Get high molecular weight genomic DNA to Baylor. Brian Counterman is going to take care of this. Time frame is by end of Feb? He will experiment with several techniques including Phenol/Chloroform, Ethanol Precipitation, and kit-based approaches using non-critical tissues. We are shooting for a minimum of 60 μg of DNA.
2) Setting up Genome Server – Bob Reed has secured machines and server space through UC Irvine to host the genome database. In addition, he arranged for funding for an IT person to maintain the browser. Bob has also been in close contact with the GMOD folks who are happy to help us get the browser up and running but they will not maintain it.
3) Mapping populations need to be established using the inbred H. melpomene stock crossed to H. m. rosina. Chris suggested that Luana Maroja might initiate and maintain these crosses. A priority is to get the inbred line to Larry and elsewhere.
Of critical importance is whether we can use RAD-Tag mapping to generate linkage scaffolds to aid in the genomic assembly. Fringy seemed to think this was an excellent idea. Chris is collaborating with Mark Blaxter to develop this technique, and Tony Long is another potential collaborator.
4) Complete the MOU among the group members. Will be largely based on the existing draft written by Owen/Bob/Chris.
5) Summarize meetings and put out monthly newsletters…Sean and Jamie will head this effort
6) Pursue funding on an individual and community basis.
7) Communicate with Baylor and arrange payment! – Owen and Chris.
Management Committee: Owen and Chris will coordinate the interaction with Baylor by means of regular conference calls during the course of the project. The meetings will be advertised by email ahead of time so that others can join in if needed/interested, and results will be circulated or posted on heliconius.org.
Bioinformatics working group:
This working group consists of Alexie Papanicolaou, Paul Wilkinson, Owen McMillan, Jim Mallet, Rob Jones, and Chris Jiggins. The major goal is to pursue and develop bioinformatics infrastructure and manpower.
Alexie repeatedly argued that we need to provide a salary for a full-time database manager who could be primarily associated with the ffrench-Constant lab group. He suggested this would cost about 18-20K/year. It seemed clear that Alexie and Paul have the expertise to handle the main computational challenges associated with the genome browser and database but not the time.
One major goal is also to set up a browseable EST database maintained by a dedicated technician. Jim is going to discuss this with Mark.
Owen argued that we need to decide as a community what we want for our bioinformatics dollars. He suggested getting a CHADO butterfly base up and running was important and also highlighted the need for a diversity browser.
Finally, Alexie made it clear that developing the resources to handle the genome data is something that must be done NOW before the data starts coming in.
Animal Husbandry Working Group:
This working group includes Chris Jiggins, Brian Counterman, Owen McMillan, Larry Gilbert, and Marcus Kronforst. (others?).
The major goal of this group is to coordinate efforts to establish and protect critical butterfly stocks. Chris and Owen are managing the facility in Panama which will aim to maintain the most critical stocks, but it will be a priority to spread these around to other labs such that we have a back-up if anything is lost in Panama. Larry already agreed to take the melpomene melpomene line in Texas. This working group will handle Major Tasks 1 and 3 listed above. This group should also contribute to the Living Stock grant proposal.
I have established a server for sharing of unpublished genomic data among the community using a portal called CamTools run by the University of Cambridge. The agreement is that anyone using the site agrees to share ‘genome’ scale data sets with other users. Currently this is primarily Genome Survey Sequences, BAC clone sequences and transcriptomic sequence, but could be extended to other kinds of data as they become available. The aim is to make such information available for comparative analysis to all groups prior to publication. Here is my take on how you should approach use of these data:
1) If you plan a genome-wide comparative analysis using a data set generated by another researcher, then you should contact that person ahead of time and discuss the project.
2) If you are interested in fishing someone else’s data set for a particular gene, then feel free to download the data onto your own PC and have a look to see what is there. However, if you find what you are looking for and are interested in using that information to further your own work then please talk to the person who obtained the data. In general I think that everyone should be open to their sequences being used provided the proposed project doesn’t directly overlap with a project currently underway in the lab of whoever owns the data (or their collaborators). Every file is identified by the person who uploaded it – so it should be easy to get in touch with the original owner.
Of course, this site isn’t designed as a public portal for genomic data – there is no genome browsing or BLAST facility. It’s just a secure way of sharing large files.
Anyway, that’s my take on things – let me know if you disagree or have any further comments, or if you want to use/contribute to the site.
The Heliconius Genome Consortium recently met at Harvard to discuss the upcoming sequencing of several Heliconius genomes. The agreed plan is that we will start by sequencing Heliconius melpomene melpomene at around 15x coverage using 454 technology. This will include a proportion of both 20kb and 2kb insert paired end reads. In addition, we plan to resequence H. melpomene aglaope and H. cydno using Solexa at around 30x, as a proof of principle for genome resequencing.
A more detailed summary of the meeting should appear on this site in the near future.
And here are all the participants at Harvard:
Top: Alexie, Sean, Larry, Chris, Jamie, Owen, Mathieu, Jim
Middle: Bob, Ryan, Adriana, Brian, Rob, Durrell
Front: Kunte, Biff, Paul, Nicola, Marcus