05 March 2014

We are now in the Big Data era, in which, according to the NYT, data scientist will become “the most sexy job in the world”. And the IT world frequently mentions that “90% of the data in the world today has been created in the last two years alone”. It seems a strong wave of data explosion is coming!

Amazing! However, I felt that we had nothing to do with it until recently, when I checked the NCBI SRA database for a talk. I found that we, the maize community, are right in the midst of the Big Data explosion. According to the database report and some accession lookups, about 13,000 giga base pairs (Gb) have been deposited by the maize community in the past several years, especially in the last two years. That is about 6,500x depth of coverage of the maize genome, roughly assuming the maize genome is 2 Gb in size. This estimate does not include ongoing projects and unsubmitted datasets. More importantly, this is happening just five years after the completion of the maize B73 reference genome project. Looking at the complete SRA database, maize NGS data actually accounts for only a small share: as of the end of Feb. 2014, more than 2,300 tera bases (Tb) reside there, so the roughly 13 Tb from maize is less than 1% of the total. And “90% of the genome sequencing data were generated in the past two years alone”. See, IT’s data theory aligns perfectly with the sequencing industry (see the figure below).
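The coverage figure above is simple arithmetic, and it can be reproduced with a few lines of Python. This is only a minimal sketch: the numbers are the rough ones quoted in this post, and the names `MAIZE_DEPOSITED_GB`, `MAIZE_GENOME_GB`, and `depth_of_coverage` are made up for illustration.

```python
# Rough figures quoted in the post (not fetched from SRA).
MAIZE_DEPOSITED_GB = 13_000  # giga base pairs deposited by the maize community
MAIZE_GENOME_GB = 2          # assumed maize genome size, a rough round number


def depth_of_coverage(deposited_gb: float, genome_gb: float) -> float:
    """Average number of times each base of the genome is covered."""
    return deposited_gb / genome_gb


coverage = depth_of_coverage(MAIZE_DEPOSITED_GB, MAIZE_GENOME_GB)
deposited_tb = MAIZE_DEPOSITED_GB / 1000  # 1 Tb = 1,000 Gb

print(f"Depth of coverage: about {coverage:,.0f}x")
print(f"Maize data deposited: about {deposited_tb:g} Tb")
```

Changing the assumed genome size (for example to 2.3 Gb) shows how sensitive the coverage estimate is to that one rough number.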

[Figure: growth of sequencing data in the SRA database]

With the data explosion, storing, sharing, and publishing NGS data has become a worldwide challenge. To tackle it, ~150 academic institutions and private companies have come together to join the Global Alliance for Genomics and Health. Google Genomics is one of them. With some IT tycoons joining, genomics has now become a multidisciplinary challenge. Genomics and its related bioinformatics are being upgraded to version 2.0. I never thought that, trained as a geneticist, I might one day work for Google. Good, bad, or ironic? That might be too optimistic, though. There is a long way to go, and industry is actually more interested in areas related to human health. As a relatively small research community, maize and plant genomics is less attractive from industry’s point of view.

Here, I will try to interpret the maize NGS data a little. This kind of study may be extended to other crop species later. On a close look, the data tell their own story, one that might not be extracted from the papers alone. Below, I list the top 10 maize projects according to their data output.

[Figure: top 10 maize projects by data output]

From the figure above, it seems that maize genetic studies are expanding beyond their traditional research areas. Thanks to the continuing decline in sequencing prices, many genome-wide studies have become possible, such as genotype-phenotype association, transcriptome, and methylome studies. At the same time, resources and talent are undergoing “exon reshuffling” in the process. Some Chinese groups, including those from CAU, CAAS, and Huazhong Agricultural University, are rising.

Top 10 projects (download complete table)

| Order | Center | Size (Gb) | Summary |
| --- | --- | --- | --- |
| 1 | CAAS | 2426 | Transcriptome |
| 2 | MSU-BUELL | 1247 | Pan Transcriptome |
| 3 | CAU | 1224 | Genome Re-sequencing |
| 4 | CSHL | 964 | Maize HapMap II |
| 5 | Academia Sinica | 626 | Transcriptome of maize embryonic leaves |
| 6 | University of Minnesota | 619 | Transcriptome of IBM RILs |
| 7 | CSHL | 572 | Methylome |
| 8 | ISU | 512 | Zeanome |
| 9 | Cornell University | 493 | Breeding efforts in Africa |
| 10 | MSU-BUELL | 450 | Gene Expression of vitamin biosynthesis |
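Since some centers appear more than once in the table, it can help to tally the output per center. Here is a minimal Python sketch that uses only the sizes listed above; the variable names are made up for illustration, and the per-center totals cover the top 10 projects only, not everything each center has deposited.

```python
from collections import defaultdict

# (center, size in Gb) for the top 10 projects, copied from the table above.
top10 = [
    ("CAAS", 2426),
    ("MSU-BUELL", 1247),
    ("CAU", 1224),
    ("CSHL", 964),
    ("Academia Sinica", 626),
    ("University of Minnesota", 619),
    ("CSHL", 572),
    ("ISU", 512),
    ("Cornell University", 493),
    ("MSU-BUELL", 450),
]

# Sum the sizes of all projects that share a center.
totals = defaultdict(int)
for center, size_gb in top10:
    totals[center] += size_gb

# Rank centers by their combined top-10 output, largest first.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
top10_total = sum(size_gb for _, size_gb in top10)

for center, size_gb in ranking:
    print(f"{center:<25}{size_gb:>6} Gb")
print(f"{'Top 10 total':<25}{top10_total:>6} Gb")
```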

In the U.S., the summary table above shows that different labs have their own focus. “Lab” here should sometimes be read as including the efforts of collaborators. Among these labs, Buckler’s group at Cornell, after finishing the Zea mays HapMap project, is now aiming to sequence every species of maize in the world using Genotyping-By-Sequencing (GBS). Recently, Buell’s lab at MSU released a large set of transcriptomic data. This group may lead maize transcriptome analysis in the future. A group at CSHL has done a lot of methylome work. Another group at CSHL, Jackson’s lab, focuses on developmental studies; their transcriptomic sequencing at various developmental stages may eventually lead to a developmentome (I invented this term) some day. Other research subjects include ISU’s Zeanome and PAV, the UC Davis R-I group’s domestication and evolution studies, and the University of Minnesota’s IBM RIL eQTL mapping, among others.

Sequencing alone cannot solve biological problems; it has to be paired with sophisticated analyses. When genetics meets Big Data, the analysis itself becomes a problem. Traditional training did not prepare geneticists with big data skills. In the future, I guess, those who master the data tools and make sense of the data will be more likely to succeed. Genetics is no longer limited to pipetting, pollinating, or cloning; coding skills should be added to our toolkit as well.

The Big Data era in maize is coming. Are you ready?
