January 8-10, 2009 * Harvard University
1. Genome Sequencing Strategy & Details
2. Bioinformatics, Databasing, & Tool building
3. Major Tasks & Working Groups
Summary by Jamie Walters, Sean Mullen, Chris Jiggins and Owen McMillan
Genome Sequencing Effort:
This community project has arisen through the collaboration and ‘federation’ of several labs each promising to contribute $15,000 towards a “Heliconius Genome Project”. The sequencing will be done in collaboration with the Baylor Genome Center.
Contributing members include (alphabetically):
ffrench-Constant, Kronforst, Jiggins, Joron, Mallet, Mavarez, McMillan, Mullen, Reed
Those who attended the meeting:
Stephen Richards (via Skype)
Durrell D. Kapan
Others who have played a major role in the consortium but couldn’t make it to Harvard:
The current plan is to focus the majority of sequencing efforts on a Panamanian (Darien) H. melpomene melpomene ‘reference sequence’. At the same time, lower coverage sequences will be generated for H. melpomene rosina and H. cydno (galanthus or chioneus), both also from Panama/Costa Rica. H. melpomene (genome = 295 Mb) was chosen over H. erato (~400 Mb) because of its smaller genome size. This strategy is built on the idea of using ~$100,000 initially to see how far we get with coverage, assembly, and comparative genomics, reserving the remainder in case manual assembly or further sequencing is required.
There is general agreement that it is important to keep a major ‘comparative’ angle within the project and to focus on ‘pushing the limits’ in whatever ways seem reasonable and convenient, so long as the major goal of obtaining a good genome sequence is achieved.
- Inbreeding of H. melpomene melpomene stock is ongoing in Panama; 5th generation currently ‘in production’
- Other lines not yet established, but Fringy suggested inbreeding wasn’t as important for the rosina and cydno as we will resequence from a single individual
- Sequencing methods
- H. melpomene melpomene reference sequence
- The question of male vs. female was raised. Several suggestions were made and the relative merits discussed:
- Males would give equal coverage of each chromosome, but we’d lose the W. Other suggestions included microdissection & isolation of the W chromosome for independent sequencing.
- Ultimately Fringy told us that because the W is such a small portion of the genome, the choice didn’t matter much, so the decision was made to sequence females in the hope of getting some W sequence.
- Multiple (6?) 454 runs involving several different library preparations (each 454 run should give 500 Mb in 250-500 bp reads).
• Fragment library (5 μg DNA required)
• 2 Kb library (5 μg DNA, 10 μg preferred)
• 20 Kb library (30 μg DNA required – includes some for ‘backup’ library)
• 30-40% of data generated will come from paired-end reads, while 60-70% will be randomly cut shotgun reads
• Overall minimum 50 μg DNA needed for H. melpomene melpomene sequencing
DNA preparation: Fringy suggested a Qiagen kit, though many suggested Phenol-Chloroform might be better. In any case, the extraction needs to show no degradation and a 260/280 ratio of 1.8-2.0, with quantity checked on an agarose gel against a size standard of known concentration.
• 12x 454 runs to get 20X coverage
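The coverage figures above can be checked with simple arithmetic. The sketch below assumes the numbers quoted in these notes (about 500 Mb of sequence per 454 run and a 295 Mb genome) and treats coverage as total sequenced bases divided by genome size. The function names are illustrative only, and real yields per run will vary.

```python
import math


def fold_coverage(runs, mb_per_run, genome_mb):
    """Average fold coverage = total sequenced Mb / genome size in Mb."""
    return runs * mb_per_run / genome_mb


def runs_needed(target_coverage, mb_per_run, genome_mb):
    """Smallest whole number of runs that reaches the target coverage."""
    return math.ceil(target_coverage * genome_mb / mb_per_run)


if __name__ == "__main__":
    # Assumed inputs taken from the notes: 500 Mb per run, 295 Mb genome.
    for runs in (6, 12):
        print(runs, "runs ->", round(fold_coverage(runs, 500, 295), 1), "X")
    print("runs for 20X:", runs_needed(20, 500, 295))
```

Under these assumptions, 6 runs give roughly 10X and 12 runs give just over 20X, which matches the figure quoted above.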
Other Races/Species: The H. melpomene rosina and H. cydno (galanthus) data will come from deep short-read sequencing, either Illumina or SOLiD, whichever makes the most sense at the time.
H. cydno is ~ 2.5% mtDNA divergent from the ‘reference’. The thought is we would comfortably align genes & coding regions, but that many intergenic regions might not assemble.
Use ½ of an Illumina run (3 lanes) for each of these additional lineages.
Following from Jim’s presentation of his project, we also decided to collect DNA for a rayed Amazonian race, H. m. aglaope, which might be sequenced as well as/in addition to H. m. rosina.
Assembly: This is potentially a very problematic issue. The best case scenario is that we can feed all the 454 data into the Newbler assembly software, wait a day, and have our scaffolds. A Newbler assembly is effectively free and there is a clear precedent of success for de novo assembly of organisms with genomes up to ~200 Mb (e.g. D. melanogaster), with an N50 of 20-50 kb for contigs and 3 Mb for scaffolds. The scary part is that Newbler currently chokes on the Helicoverpa data (~500 Mb genome @ 12x coverage): it will only assemble up to 6x and then dies. Fringy et al. are actively working on this. It is currently unclear how our data will assemble using Newbler…
Alternatives include: Arachne (Broad Institute), Jazz (JGI), MIRA
Manual assembly the ‘old fashioned’ way: three levels i) atlas assembly ii) overlap graph [one massive contig; massage to get rid of repeat joining; “lots of tuning”; Phrap] iii) join to data [mapping/FISH on chromosomes]
• The issue is that this manual approach is not scalable, takes months of work, and will likely cost ~$25,000 if it is needed.
Mapping Scaffolds: Once we get a good set of scaffolds, the next step will be to order, orient, and assign them to linkage groups. There was a clear consensus that linkage mapping with RAD-tag markers would be the best way to do this.
Mapping family: Luana and Chris will aim to obtain a large mapping family from a cross between the reference strain of H. m. melpomene (if possible the inbred line) and H. m. rosina. This will be used for RAD genotype mapping giving a high density SNP map with reference sequence tags at each marker, many of which we hope to align to genome contigs allowing long-range scaffolding of the finished assembly.
Expected stats (ballpark): Given a genome of ~300 Mb, Fringy expects to have ~250 Mb in the assembly, ~10,000 contigs with N50=40 Kb giving rise to 100-200 scaffolds of ~500 Kb
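For readers less familiar with the N50 statistic used in these estimates, here is a minimal sketch of how it is usually computed: sort the contig (or scaffold) lengths from largest to smallest and report the length at which the running total first reaches half of the assembled bases. The function name and the toy lengths are illustrative, not project data.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one non-zero length")


if __name__ == "__main__":
    # Toy example (lengths in kb): total is 300, half is 150.
    toy_contigs = [80, 70, 50, 40, 30, 20, 10]
    print("N50 =", n50(toy_contigs), "kb")
```

In the toy example the two largest contigs (80 + 70) already hold half of the 300 kb total, so the N50 is 70 kb.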
Annotation & Curation: Following the rubric introduced by Alexie, “annotation” is the automated identification of genes, while “curation” refers to the manual review, editing, and confirmation of the automated annotations.
Annotation – there are many options for annotation software, such as:
GLEAN: the best tool currently available; integrates gene models, ESTs, and other relevant data into a single ‘best’ set of gene predictions
NCBI RefSeq pipeline.
Full Length cDNA: It was suggested that 100-200 well-annotated full-length cDNA sequences would be useful for verifying gene predictions. Pick a diverse array of genes, including genes with low-level expression that might not get well-covered by ESTs…
Curation – it was widely acknowledged that this is an enormous task that will take many hours of work from many, many people. Also, this task should be viewed as a continuously ongoing process rather than something that would produce a ‘final product’.
Annotation jamboree – there was a lot of talk about organizing an Annotation Jamboree, essentially a ~two week session (perhaps hosted at NESCent) where everyone convenes to work on this effort. Fall 2009 was suggested as a tentative time.
Curation by gene group – another popular idea would be to assign curation tasks by gene group, where a lab with an interest in a particular type or family of genes would be responsible for its annotations. Offering authorship on the final genome paper could potentially be a good way to entice diverse groups into participating in the annotation/curation effort.
Curation & coursework – Adriana Briscoe suggested the idea of creating a course or course module where students would be responsible for curation of a gene set, and would have to defend their decisions and explain their results. (see below in Outreach section for further details).
Bioinformatic infrastructure – it was agreed that successful broadscale community (or student) participation in the curation effort would require establishing an accessible bioinformatic framework to allow curation to proceed in parallel. It was not clear whether such tools already exist or exactly how this framework would come into existence, though it would probably involve the servers set up by Bob Reed, and the ffrench-Constant group (particularly Alexie P.) would probably play a key role.
Time Frame: there are currently ~5 insect genome projects in the works at Baylor, but according to Fringy precedence would likely be determined primarily by who gets their samples to Baylor first. Once that happens, the sequencing would proceed quite quickly, perhaps being completed in a month or less.
The bioinformatic challenges associated with generating a ‘well-groomed’ and widely accessible genome sequence are substantial and essentially indefinite; they will require continuous and ongoing efforts for years to come. The most obvious and immediate challenge we face is developing the hardware & software infrastructure for genome data sharing and curation. Another immediate challenge is communication & coordination within the community. Longer-term goals include integrating a wide range of genetic and diversity data into a database that also contains genome information.
It was not immediately clear at the meeting where the financial or human resources would come from to support these bioinformatic challenges in the long term, particularly with regards to the genomic data. The ffrench-Constant and Blaxter labs were both suggested as likely candidates during the meeting, but after further discussion it was clear that neither group currently had the resources to support this project in its entirety. Discussion centered around the need for a dedicated technician to develop, coordinate, and maintain the bioinformatic resources. Alexie P. suggested he (or the ffrench-Constant lab) could supervise this technician if money could be found to fund the position. There was consensus that all future grants coming out of HGC labs should write in funds for bioinformatics support (such as hiring a dedicated technician).
More immediately, there are several important resources already in place. Bob Reed has secured servers to host the genome database and enlisted experienced bioinformaticians to help with the set-up. Alexie and the ffrench-Constant lab are about to launch a GMOD version of ButterflyBase. And Jim Mallet has funds immediately available from his trans-species transcriptome project to fund bioinformatics support.
Genome Data sharing and curation
Servers – Bob Reed has secured 3 servers to host the genomic information and other related content/functions. One will be a genome browser, the second a BLAST server, and the third a backup. These will be based at UC Irvine.
Set Up – Scott Cain (CSHL) and David Clements (NESCent) are ‘on board’ to set up these servers to run a GMOD database (GBrowse). There will be a systems administrator at UC Irvine maintaining these systems.
Annotation/Curation software module – One important piece of the bioinformatic infrastructure that was discussed extensively (though more often implicitly than explicitly) is a gene curation module. The idea is to have an easy-to-use tool that will allow manual curation of gene predictions to be performed by pretty much anyone. This tool would be the backbone of any annotation jamboree or course module that has many people working in parallel on annotations.
Genome Database Long term – the database should be structured on a GMOD/Chado schema, which is currently the ‘industry standard’ and will support the ‘diversity modules’ that O. McMillan has been working on. Other suggestions include modelling the interface on the pea aphid database (aphidbase.com) and incorporating an InterMine/FlyMine-like module (Gos Micklem at Cambridge heads this).
CamTools – Currently C. Jiggins will host/support data sharing at Cambridge via CamTools. There is a tentative agreement in place among HGC labs that all ‘genomic’ scale data will be posted immediately on CamTools for the broader community to access. Chris has added a post to heliconius.org about the agreement underlying data sharing on the site.
ButterflyBase & Cornwall Resources
In contrast to data sharing (as hosted by CamTools and the UC-Irvine servers), Alexie described the resources available for data processing:
96 CPUs & 15 Terabytes disk array, backed up in Cornwall. These are available to support Heliconius research but the precise role of the Cornwall group in bioinformatic analysis remains to be determined.
2500 CPUs of high-performance computing (HPC) in Dresden
A new release of ButterflyBase is coming soon. It will be based on a GMOD framework.
Community Organization & Communication
Heliconius.org – this website will serve as ‘point of departure’ for all things Heliconius, encompassing both information for public audiences as well as the research community. Chris has already updated the website using a WordPress content management system. If you want to post something please sign up as a user. Currently all posts are ‘public access’ but in the future we can password protect posts if we feel there is a need to do so.
Google sites – in the short term we are using tools from Google to organize and communicate
A google ‘wiki’ that allows content and announcements to be shared and communally edited
The meeting summary & MOU (this document)
Major Tasks and Working Groups:
The meeting culminated with a focused discussion of immediate tasks/goals and assignment of individual people to “working groups” responsible for periodically updating the broader community on progress in specific areas. These working groups consist of 1) Bioinformatics resources and 2) Animal Husbandry.
1) Get high molecular weight genomic DNA to Baylor. Brian Counterman is going to take care of this. Time frame is by end of Feb? He will experiment with several techniques including Phenol/Chloroform, Ethanol Precipitation, and kit-based approaches using non-critical tissues. We are shooting for a minimum of 60 μg of DNA.
2) Setting up Genome Server – Bob Reed has secured machines and server space through UC Irvine to host the genome database. In addition, he arranged for funding for an IT person to maintain the browser. Bob has also been in close contact with the GMOD folks, who are happy to help us get the browser up and running but will not maintain it.
3) Mapping populations need to be established using the inbred H. melpomene stock crossed to H. m. rosina. Chris suggested that Luana Maroja might initiate and maintain these crosses. A priority is to get the inbred line to Larry and elsewhere.
Of critical importance is whether we can use RAD-Tag mapping to generate linkage scaffolds to aid in the genomic assembly. Fringy seemed to think this was an excellent idea. Chris is collaborating with Mark Blaxter to develop this technique, and Tony Long is another potential collaborator.
4) Complete the MOU among the group members. Will be largely based on the existing draft written by Owen/Bob/Chris.
5) Summarize meetings and put out monthly newsletters. Sean and Jamie will head this effort.
6) Pursue funding on an individual and community basis.
7) Communicate with Baylor and arrange payment! – Owen and Chris.
Management Committee: Owen and Chris will coordinate the interaction with Baylor by means of regular conference calls during the course of the project. The meetings will be advertised by email ahead of time so that others can join in if needed/interested, and results will be circulated or posted on heliconius.org.
Bioinformatics working group:
This working group consists of Alexie Papanicolaou, Paul Wilkinson, Owen McMillan, Jim Mallet, Rob Jones, and Chris Jiggins. The major goal is to pursue and develop bioinformatics infrastructure and manpower.
Alexie repeatedly argued that we need to provide a salary for a full-time database manager who could be primarily associated with the ffrench-Constant lab group. He suggested this would cost ~18-20K/year. It seemed clear that Alexie and Paul have the expertise to handle the main computational challenges associated with the genome browser and database but not the time.
Another major goal is to set up a browseable EST database maintained by a dedicated technician. Jim is going to discuss this with Mark.
Owen argued that we need to decide as a community what we want for our bioinformatics dollars. He suggested getting a CHADO butterfly base up and running was important and also highlighted the need for a diversity browser.
Finally, Alexie made it clear that developing the resources to handle the genome data is something that must be done NOW before the data starts coming in.
Animal Husbandry Working Group:
This working group includes Chris Jiggins, Brian Counterman, Owen McMillan, Larry Gilbert, and Marcus Kronforst. (others?).
The major goal of this group is to coordinate efforts to establish and protect critical butterfly stocks. Chris and Owen are managing the facility in Panama which will aim to maintain the most critical stocks, but it will be a priority to spread these around to other labs such that we have a back-up if anything is lost in Panama. Larry already agreed to take the melpomene melpomene line in Texas. This working group will handle Major Tasks 1 and 3 listed above. This group should also contribute to the Living Stock grant proposal.