Big data in life sciences: hype or hope?

21 June 2017
In science, business and society, the term "big data" has been on the rise for quite some time now. Over the last decade, however, the concept has evolved from an all-round buzzword into tangible applications that affect our lives in numerous ways. In the life sciences, too, big data technology is helping us cope with a seemingly endless tsunami of biological information. And considering the rapid pace of progress in our sector, we are only seeing the tip of the iceberg.

Big data and technology are usually mentioned in the same breath. Due to the data's immense volume, speed and complexity (see text box 'Understanding big data: the 5V mnemonic'), new high-performance tools are vital. Both hardware (infrastructure for data acquisition, storage and computing) and software (mining, analytics, visualization, database querying, etc.) are needed to leverage the power of big data. In all these areas, we have seen significant progress over the last decades, including cloud-based solutions for data storage and analysis, and novel computing paradigms for fault-tolerant, scalable and distributed data analysis (e.g. the MapReduce paradigm).
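The MapReduce idea can be sketched in a few lines: a map step emits key-value pairs from each input record, a shuffle step groups the pairs by key, and a reduce step aggregates each group. The snippet below is a minimal, single-machine illustration (the reads and the k-mer counting task are hypothetical examples, not data from the article); real frameworks distribute these same steps across many machines.

```python
from itertools import groupby

# Hypothetical input: short DNA reads.
reads = ["ATGCG", "GCGTA", "ATGAT"]

def mapper(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield (read[i:i + k], 1)

def reducer(kmer, counts):
    """Reduce step: sum all counts emitted for one k-mer."""
    return (kmer, sum(counts))

# Shuffle step: collect and group the intermediate pairs by key.
pairs = sorted(pair for read in reads for pair in mapper(read))
result = dict(
    reducer(kmer, (count for _, count in group))
    for kmer, group in groupby(pairs, key=lambda p: p[0])
)
print(result["ATG"])  # "ATG" occurs in two of the three reads, so this prints 2
```

Because each map call and each reduce call is independent, a framework can run them in parallel and rerun any that fail, which is what makes the paradigm fault-tolerant and scalable.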

Big data in the life sciences
Applications in life sciences have been a driving force behind many big data projects. The European Bioinformatics Institute (EBI) – one of the world's largest biological data repositories – currently stores 20 petabytes (1 petabyte = 10^15 bytes) of data and backups concerning genes, proteins and small molecules. Genomic data accounts for 2 petabytes, a number that more than doubles every year. Novel advances in genomics such as metagenomics and single-cell genomics will generate many times that data volume, with several types of "omics" information at the level of individual cells per patient, including different cell types and tissues.
This could become a huge asset for life scientists in the future: ideally, they will have all that information at their fingertips, thanks to scalable and intelligent data integration engines.

Artificial intelligence at VIB
At the VIB-UGent Center for Inflammation Research, Yvan Saeys and his team are currently developing and applying machine learning techniques to leverage these big biological data sets, with the aim of building intelligent data models. For example, the group is working on novel techniques for the visualization and analysis of high-throughput single-cell data, scaling to billions of cells and up to thousands of measurements per cell. These methods can be used in a variety of contexts, including single-cell cancer studies, high-dimensional immunophenotyping and high-throughput compound screening. Another example of this machine learning approach is the so-called deep neural network: a system with multiple hidden layers between the (complex) input and output layers. In the field of 3D electron microscopy, such networks enable us to automatically annotate and mine large-scale imaging data sets.
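To make the notion of "multiple hidden layers" concrete, here is a minimal sketch of a forward pass through a network with two hidden layers. The weights are tiny hand-picked numbers for illustration only; in a real deep network they are learned from data, and the layers are vastly larger. This is not the team's actual model, just the basic stacked-layer idea.

```python
def relu(v):
    """Nonlinear activation applied after each hidden layer."""
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    """One fully connected layer: each row of `weights` produces one output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

# Toy, fixed weights (hypothetical; a trained network would learn these).
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]   # input  -> hidden layer 1
W2, b2 = [[1.0, 1.0], [-1.0, 1.0]], [0.1, 0.0]   # hidden 1 -> hidden layer 2
W3, b3 = [[2.0, 1.0]], [0.0]                      # hidden 2 -> output

def forward(x):
    h1 = relu(dense(x, W1, b1))   # first hidden representation
    h2 = relu(dense(h1, W2, b2))  # second, more abstract representation
    return dense(h2, W3, b3)      # final output (e.g. a class score)

out = forward([3.0, 1.0])  # [8.2]
```

Each hidden layer transforms the previous layer's output into a more abstract representation, which is what lets deep networks pick out structures (such as cell boundaries in electron-microscopy images) directly from raw measurements.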

Big data science as a vital skill
Given the ever-increasing scale of the generated data, it is clear we need additional know-how and experience to use big data to our advantage and enable better hypotheses and interpretations of research results. To this end, young life scientists will need to be educated in big data science, in addition to their own fields of expertise. As for more senior scientists, they know better than anyone that lifelong learning is vital to stay at the forefront of research. Getting acquainted with the essentials and opportunities of big data will certainly be part of that picture.


Yvan Saeys & Daniel Peralta