High-throughput sequencing technologies, such as Roche 454 pyrosequencing and Illumina sequencing, enable semi-quantitative study of communities of single-celled organisms by generating hundreds of thousands of short sequence reads from a single environmental sample. However, identifying the taxa to which these reads belong requires a reliable database of reference sequences.
We maintain databases of taxa from all three domains of life found in marine and freshwater samples from the Canadian Arctic and subarctic, along with an accompanying file in FASTA format of the quality-checked reference sequences. These files are suitable for use in next-generation sequencing data-processing pipelines built on open-source software such as QIIME, mothur, or UPARSE, when the user wishes to assign taxonomic identities to short reads by sequence similarity.
Table 1. Number of sequences and sequence length for the three taxonomic databases

Domain     Number of sequences   Mean sequence length, bp (range)
Eukarya    766                   440 (216–657)
Bacteria   33,293                435 (304–571)
Archaea    2,288                 557 (532–591)
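Summary statistics of the kind reported in Table 1 can be recomputed from a reference FASTA file as a quick integrity check after download. The following is a minimal sketch in Python, not the procedure used to build the table. The function names and the two example records are hypothetical, and it assumes standard FASTA input in which each record starts with a ">" header line.

```python
from statistics import mean


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def length_summary(lines):
    """Return sequence count, rounded mean length, and length range."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "n": len(lengths),
        "mean": round(mean(lengths)),
        "min": min(lengths),
        "max": max(lengths),
    }


# Hypothetical two-record example; real use would pass an open file handle.
example = """>ref1 hypothetical record
ACGTACGT
ACGT
>ref2 hypothetical record
ACGTAC
""".splitlines()

summary = length_summary(example)
print(summary)
```

For a real file, the same function can be called as `length_summary(open(path))`, where `path` points to one of the reference FASTA files.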
The creation of these databases is described in Comeau et al. (2011, 2012). Briefly, we targeted the V4 variable region of the 18S rRNA gene for Eukarya, and the V6-V8 and V3-V5 variable regions of the 16S rRNA gene for Bacteria and Archaea, respectively. Reference sequences for Archaea were originally imported from the SILVA database and those for Bacteria from the Greengenes database; both are labeled with their original accession numbers from these databases. The Eukarya database was assembled de novo, based on taxa found in our studies. We have edited the taxonomic identifications to reflect recent developments in the literature, and have included high-quality sequences from environmental clone libraries alongside cultured representatives when the former represent clades that are widespread in arctic and subarctic aquatic environments. Taxonomic identification of uncultured clones is based on well-supported phylogenetic trees, and these clone sequences have been rigorously screened for potential chimeras using UCHIME (Edgar et al. 2011).
Because our focus is on single-celled organisms, the coverage of Metazoa, Fungi, and Streptophyta (land plants) in the Eukarya database is sufficient to identify and remove these sequences from a sample, but the database should not be used for detailed taxonomic analysis within these groups. By the same token, chloroplast reference sequences are included in the Bacteria database primarily so that chloroplast reads can be identified and removed from analysis.
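The removal step described above can be sketched as a simple filter applied to reads after taxonomic assignment. This Python example is illustrative only. It assumes that each read has already been assigned a semicolon-separated taxonomy string by a tool such as QIIME or mothur; the read identifiers, lineage strings, and the exact label "Chloroplast" are hypothetical and may not match the labels used in the databases.

```python
# Group names to exclude; the exact spellings are assumptions and should be
# checked against the labels that actually appear in the taxonomy file.
UNWANTED = ("Metazoa", "Fungi", "Streptophyta", "Chloroplast")


def is_unwanted(taxonomy, unwanted=UNWANTED):
    """Return True if any rank in the lineage exactly matches an unwanted group."""
    ranks = {rank.strip().lower() for rank in taxonomy.split(";")}
    return any(name.lower() in ranks for name in unwanted)


def keep_target_reads(assignments, unwanted=UNWANTED):
    """Drop reads whose assigned lineage contains an unwanted group."""
    return {
        read_id: taxonomy
        for read_id, taxonomy in assignments.items()
        if not is_unwanted(taxonomy, unwanted)
    }


# Hypothetical assignments (read ID -> lineage), for illustration only.
assignments = {
    "read1": "Eukaryota;Alveolata;Dinoflagellata",
    "read2": "Eukaryota;Opisthokonta;Metazoa",
    "read3": "Bacteria;Cyanobacteria;Chloroplast",
    "read4": "Bacteria;Proteobacteria;Gammaproteobacteria",
}

kept = keep_target_reads(assignments)
print(sorted(kept))
```

Matching whole rank names rather than substrings avoids discarding a lineage that merely contains one of these words inside a longer label.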
These databases have been successfully used in numerous studies of microbial communities in high-latitude coastal and offshore marine environments (e.g. Comeau et al. 2011, Monier et al. 2014), as well as high-latitude lakes and ponds (Comeau et al. 2012, Negandhi et al. 2014, Crevecoeur et al. 2015).
References
Edgar, R.C., B.J. Haas, J.C. Clemente, C. Quince, and R. Knight. 2011. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27(16): 2194-2200. doi: 10.1093/bioinformatics/btr381