Bioinformatics & Data Integration

Rationale and objectives

The objectives of WP 5 are:

  • to provide an integrated view of microbial diversity and function in the marine environment;
  • to develop innovative software approaches allowing users from biotechnology as well as ecosystems research to
    exploit information on microbial communities;
  • to support users in effectively managing, analyzing, and sharing genomic and metagenomic data.

The WP 5 team is building the Micro B3 Information System (Micro B3-IS) to provide the bioinformatics capacity for the processing, analysis and biotechnological exploitation of marine biodiversity data. It supports users in the complex process of handling sequence data arising from studies of a range of organisms (protists, viruses and prokaryotes), both in isolation and in mixed organism samples. The system handles a variety of sequencing platforms whilst integrating interpretation of sequence data with related environmental data.

The Micro B3-IS will support the following five generalized use cases:

  1. A user plans to generate sequence data from one or more samples and wants to manage all steps from sequencing, assembly, data processing, and data analysis all the way through to successful publication.
  2. A user has sequence data and wants to make comparisons to already processed and published data, in terms of gene occurrence, metabolic capabilities and community assemblage.
  3. A user wants to perform ecological statistical analyses in order to find patterns in the sequence data and to explain these patterns in the context of environmental variables from on-site measurements and environmental data grids from oceanographic databases.
  4. A user wants to query, view and download quality-controlled information on genes, genomes or metagenomes with specific traits, in certain geographic areas or from specific environmental conditions.
  5. A user wants to collaborate with other users on specific data and add additional data and information.
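
Use case 4 can be pictured with a small sketch. The Python snippet below filters quality-controlled records by a geographic bounding box and a temperature range. It is only an illustration under stated assumptions: the record type `MetagenomeRecord`, the function `select_records`, the field names and the idea of filtering on temperature are all hypothetical and do not describe the actual Micro B3-IS data model or its query interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class MetagenomeRecord:
    """Hypothetical record; the fields are assumptions, not the Micro B3-IS schema."""
    accession: str          # made-up identifier, for illustration only
    lat: float              # latitude in decimal degrees
    lon: float              # longitude in decimal degrees
    temperature_c: float    # water temperature at the sampling site
    quality_checked: bool   # True if the record passed quality control


def select_records(
    records: Iterable[MetagenomeRecord],
    bbox: Tuple[float, float, float, float],
    temp_range: Tuple[float, float],
) -> List[MetagenomeRecord]:
    """Return quality-controlled records inside bbox and within temp_range.

    bbox is (min_lat, min_lon, max_lat, max_lon). Boxes that cross the
    antimeridian (180 degrees longitude) are not handled in this sketch.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    t_low, t_high = temp_range
    return [
        r for r in records
        if r.quality_checked
        and min_lat <= r.lat <= max_lat
        and min_lon <= r.lon <= max_lon
        and t_low <= r.temperature_c <= t_high
    ]
```

A caller would pass any iterable of records together with the region and the temperature window of interest; the same pattern extends to other traits or environmental conditions.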

To perform these tasks, the Micro B3-IS consists of three main components. The data processing component comprises bioinformatics pipelines that perform tailored, automatic sequence annotation for metagenomes and genomes from protists, viruses and prokaryotes. Micro B3 uses social web technologies so that users can share data and collaborate on it. In the data integration component, the quality-controlled and processed data are integrated with contextual (meta)data from the environment obtained by on-site and remote sensing measurements (WP 3). This second component lays the basis for large-scale statistical analysis and modelling in ecosystems biology (WP 6) and provides candidates for biotechnological applications (WP 7). Thirdly, the Micro B3 community service module provides end-user tools for community sequence annotation, for visualization of large-scale, multidimensional datasets, and for statistical ecological analysis.

All data and services are being published on the web for browsing and for programmatic access via web services. Close interaction with WP 9 ensures timely and appropriate dissemination of the developments. Each component is composed of standards-based, integrated, and modular open source software.

Recent progress

Progress towards these objectives was made by implementing the Micro B3-IS, which integrates all software components developed within Micro B3. The technical infrastructure components as well as the development and communication environment are now established. This was achieved by setting up an internal Wiki, a software issue tracker and a source code repository. Most importantly, the WP 5 team established the first version of the Micro B3-IS, which is available for testing at http://mb3is.megx.net/.

Highlights in data processing so far are:

  1. Together with other work packages, WP 5 set up structures and retrieval services for harvesting environmental and sequence data based on a common interoperability ‘core’ between all Micro B3 data resources.
  2. The new processing pipelines filter the data and perform bioinformatic analysis of the sequence data for integration into the Micro B3-IS database.
  3. The integration of the developed software components (such as MegxBar and PubMap) supports the augmentation and curation of the databases.

A major achievement has been the implementation of a variety of secured web services, including a first implementation of the "bioinformatics discovery pipeline for biotechnology", which is useful for WP 7.

As part of the Micro B3 project, EMBL-EBI’s European Nucleotide Archive (ENA), a comprehensive molecular resource, has developed and launched important new public programmatic services to improve searchability and access to marine-related sequence data. The services allow software applications such as MEGX/MEGDB to directly discover and retrieve data from ENA to be included in ecological analysis and visualisation.

The three services launched are:

Teams from WP 3, WP 4 and WP 5 have resolved a core set of data classifiers in time and space that allow cross-domain data sets to be integrated. A report on interoperability with third-party resources is available; it focuses on the integration of existing oceanographic, biological and molecular data resources. It describes a common interoperability ‘core’ between all Micro B3 data resources and concentrates on the two major use cases: data submission and data retrieval.
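
To illustrate why shared classifiers in time and space make cross-domain integration possible, the following minimal Python sketch links a sequence sample to the nearest environmental measurement that lies within a distance and time tolerance. It is a sketch under stated assumptions, not the project's implementation: the types `Sample` and `Measurement`, the function `match_nearest`, the temperature field and the default tolerances are hypothetical and are not taken from the interoperability report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical sequence sample described by where and when it was taken."""
    sample_id: str
    lat: float
    lon: float
    time: datetime


@dataclass(frozen=True)
class Measurement:
    """Hypothetical environmental measurement with the same space-time keys."""
    lat: float
    lon: float
    time: datetime
    temperature_c: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def match_nearest(
    sample: Sample,
    measurements: Iterable[Measurement],
    max_km: float = 10.0,                      # assumed spatial tolerance
    max_dt: timedelta = timedelta(hours=6),    # assumed temporal tolerance
) -> Optional[Measurement]:
    """Return the spatially closest measurement within both tolerances, or None."""
    def distance(m: Measurement) -> float:
        return haversine_km(sample.lat, sample.lon, m.lat, m.lon)

    candidates = [
        m for m in measurements
        if abs(m.time - sample.time) <= max_dt and distance(m) <= max_km
    ]
    return min(candidates, key=distance) if candidates else None
```

Because both record types carry the same latitude, longitude and time fields, no domain-specific identifiers are needed for the join; suitable tolerances would depend on the data sets being combined.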

AWI, with contributions from MPIMM, implemented the Interactive Ecological Analysis Guide (IEAG) in conjunction with preliminary work towards planning the Ecological Analysis Tools for Microbial Ecology (EATME) software package. The IEAG’s main features and format were drafted, described in an internal evaluation report, and tested. Most of the guide’s content has been implemented, and a trial version was used in the Micro B3 multivariate statistics training course (17-21 June 2013). The guide will soon be made publicly available.

Currently, workflows for the analysis of, e.g., OSD-related data are being developed in close cooperation with the EU project BioVeL, with the ultimate aim of generating new information and knowledge for scientists (WP 6 and WP 7).

Overall, the WP 5 team has now set up the Micro B3-IS and established the software development environment and means of communication needed to continue specifying and implementing an increasing number of features iteratively.

Lead of WP 5: Renzo Kottmann, Max Planck Institute for Marine Microbiology