The longest-running ENA collaboration, the International Nucleotide Sequence Database Collaboration (INSDC), has been underway for over a quarter of a century and now serves as a model for data sharing in the life sciences.
Amongst EBI data services, it is the ENA that harbours data from the largest range of organisms: each of approximately 300,000 species names is associated with at least one archival nucleotide sequence record. Complete genome services at EBI have far more limited organism coverage; we know of about 4,500 completed and ongoing sequencing projects. Following the route of biological information from genomic nucleotides through transcription, translation, protein interactions, pathways and whole systems, organism coverage drops rapidly (e.g. ~150,000 species in UniProt, 209 in ArrayExpress, 22 in Reactome). The ENA uses the taxonomic classification of organisms that is maintained at NCBI and, through INSDC, collaborates closely with the group at NCBI that maintains it, so that newly found organisms are classified as their sequences are submitted.
Other EBI data services use the same classification (with a limited number of modifications), albeit with fewer nodes. EBI data are therefore already strongly organised according to NCBI species names and taxonomic classification. The logical connection points between EBI data and biodiversity and taxonomic data are at the level of organism names, the most exhaustive EBI representation of which is currently managed as part of ENA. While biodiversity and taxonomy communities require ‘deep links’ into EBI data at levels other than nucleotide sequence (e.g. links to protein functional information), these links are most simply resolved in two steps: primary links to the ENA representation of taxa, followed by secondary associations that are already maintained at EBI (such as through taxonomy and explicit cross-references).
Early in 2011, EBI deployed a public prototype for this service and has since deployed a more robust and extensible production service through back-end engineering developments. Components of the service include: an indexing warehouse, into which we routinely load mappings from sequence accessions to taxon identifiers; an XML feed for top-level taxonomic records; integration into ENA search tools; integration into the ENA browser, retaining a taxon entry point and a specific taxon view; and REST services in support of programmatic use of taxonomic information.
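The role of the indexing warehouse can be illustrated with a minimal sketch, which is not the ENA implementation. It assumes an in-memory index that is loaded with (accession, taxon identifier) pairs and can then list every accession at or below a given taxon. The class name `TaxonIndex`, its methods, the accessions and the taxon identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TaxonIndex:
    """Toy index from taxon identifiers to sequence accessions (hypothetical)."""

    def __init__(self, parent_of):
        # parent_of maps a taxon ID to its parent taxon ID; the root has no entry.
        self.parent_of = dict(parent_of)
        self.by_taxon = defaultdict(set)

    def load(self, mappings):
        """Load an iterable of (accession, taxon_id) pairs into the index."""
        for accession, taxon_id in mappings:
            self.by_taxon[taxon_id].add(accession)

    def lineage(self, taxon_id):
        """Return the path from taxon_id up to the root, starting with taxon_id."""
        path = [taxon_id]
        while path[-1] in self.parent_of:
            path.append(self.parent_of[path[-1]])
        return path

    def accessions_under(self, taxon_id):
        """Return sorted accessions mapped to taxon_id or any of its descendants."""
        return sorted(
            accession
            for mapped_taxon, accessions in self.by_taxon.items()
            if taxon_id in self.lineage(mapped_taxon)
            for accession in accessions
        )


# Made-up classification: taxon 1 is the root, 2 and 4 are its children,
# and 3 is a child of 2. These are not real NCBI taxon identifiers.
index = TaxonIndex(parent_of={2: 1, 3: 2, 4: 1})

# Made-up accession-to-taxon mappings, standing in for a routine load.
index.load([("ACC0001", 3), ("ACC0002", 3), ("ACC0003", 4)])

print(index.lineage(3))
print(index.accessions_under(2))
```

Walking the lineage on every query keeps the sketch short. A production warehouse would more plausibly precompute or index lineages so that queries at high-level taxa remain fast.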
Taxon-centric portal at ENA (EMBL-EBI)