Biodiversity Data Journal (BDJ), ISSN 1314-2836, 1314-2828, Pensoft Publishers
DOI: 10.3897/BDJ.3.e4552
Editorial / Correspondence, Data Management

Corrected data re-harvested: curating literature in the era of networked biodiversity informatics

Jeremy A. Miller, Teodor Georgiev, Pavel Stoev, Guido Sautter, Lyubomir Penev

Affiliations: Naturalis Biodiversity Center, Leiden, Netherlands; www.Plazi.org, Bern, Switzerland; Pensoft Publishers, Sofia, Bulgaria; National Museum of Natural History and Pensoft Publishers, Sofia, Bulgaria; Institute of Biodiversity & Ecosystem Research, Bulgarian Academy of Sciences and Pensoft Publishers, Sofia, Bulgaria

Corresponding author: Jeremy A. Miller (jeremy.miller@naturalis.nl).

Academic editor: Donat Agosti

Biodiversity Data Journal 3: e4552 (2015). Received: 20 January 2015. Accepted: 21 January 2015. Published: 21 January 2015.

© Jeremy A. Miller, Teodor Georgiev, Pavel Stoev, Guido Sautter, Lyubomir Penev. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Main Text

Science makes progress through a constant process of re-evaluation. Revision and error correction are inevitable and generally healthy for the advancement of science. In biodiversity literature, re-evaluation of earlier work can lead to new conclusions, such as a revised taxonomic determination. When significant errors are discovered, conscientious authors may correct the record by publishing an erratum or corrigendum.

Aggregated global biodiversity data is an increasingly powerful resource supporting research, conservation, policy, and public bioliteracy (Hardisty et al. 2013, Arzberger et al. 2004). Along with databases devoted to specimen collections and observation records, literature is an integral part of the biodiversity informatics ecosystem (Miller et al. 2012, Penev et al. 2012, Penev et al. 2011a, Penev et al. 2011b). Pensoft journals pioneered the routine distribution of primary specimen data from publications to a collection of online resources, including the Global Biodiversity Information Facility (GBIF) and the Encyclopedia of Life (EOL) (Penev et al. 2009, Penev et al. 2008, Penev et al. 2010, Smith et al. 2013, Chavan and Penev 2011, Penev et al. 2012, Faulwetter et al. 2014). In the era of digital biodiversity informatics, maintaining data quality presents new challenges. In the realm of corrected taxonomic literature, we argue the objective should be to amend the structured digital record so that the correct information appears on resources like GBIF and the disavowed data are expunged. At the same time, good publishing practice requires that the original document and associated data remain part of the permanent scientific record.

A recent paper on central European spiders included a number of taxonomic errors (Čandek et al. 2013). In a corrigendum published in this issue (Čandek et al. 2015), the authors duly correct the record. Data from the original publication had already been harvested by online resources including GBIF. To ensure that the data are corrected not only in the scientific literature but also in GBIF, the Darwin Core Archive (DwC-A) file, which is the vehicle for distributing content to a collection of online resources (GBIF 2010, Wieczorek et al. 2012), has been updated and submitted to GBIF. The supplier (Pensoft) must trigger re-indexing through the API (Application Programming Interface, a set of protocols that, in this context, is used to share data between software applications), which adds the content to the indexing queue. Indexing normally takes a few hours (Markus Döring, GBIF senior software developer, pers. comm.). The original DwC-A file nevertheless remains available for users to download from the journal web site. The original and corrected data files are clearly labeled as such and visible alongside the original publication. A link to the corrigendum will be added to the original publication's metadata to make the corrigendum easier to discover. In addition, the XML data file from the original article has been retained on the servers of Plazi, but its XML tags have been amended so that its contents are no longer exposed for harvest. A modified XML document combining the original data with all corrections specified in the corrigendum (i.e., a single corrected document) has been made available as a supplementary document linked to the corrigendum, and will be uploaded to Plazi upon publication of the corrigendum. This will present the corrected data in XML form, permitting the export of treatments and data to various aggregators (Penev et al. 2012).
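
The principle described here, publishing a corrected data file while leaving the original intact, can be illustrated with a minimal Python sketch. It is not the workflow used by Pensoft, Plazi, or GBIF. The record identifiers, the taxon names, and the helper functions `apply_corrections` and `to_tsv` are hypothetical placeholders; only `occurrenceID` and `scientificName` are actual Darwin Core terms, and the tab-separated layout is assumed here because it is a common choice for DwC-A data files.

```python
import csv
import io

# Hypothetical corrections keyed by occurrenceID. The identifiers and names
# are placeholders for illustration, not data from the corrigendum.
CORRECTIONS = {
    "occ-002": {"scientificName": "Genus correctus"},
}


def apply_corrections(original_rows, corrections):
    """Return a corrected copy of the records; the originals are not modified."""
    corrected = []
    for row in original_rows:
        new_row = dict(row)  # copy, so the original record stays intact
        new_row.update(corrections.get(row["occurrenceID"], {}))
        corrected.append(new_row)
    return corrected


def to_tsv(rows):
    """Serialise records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder records standing in for the original occurrence data.
original = [
    {"occurrenceID": "occ-001", "scientificName": "Genus alpha"},
    {"occurrenceID": "occ-002", "scientificName": "Genus mistakus"},
]

corrected = apply_corrections(original, CORRECTIONS)
print(to_tsv(corrected))
```

Because the corrections are applied to copies, both versions of the data remain available: the corrected records can be packaged for re-harvest, while the original records stay unchanged as part of the permanent record.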

This demonstrates a small but important step toward ensuring high data quality in the era of growing online networks of biodiversity data. Structured biodiversity data aggregated from many sources and freely available online are becoming increasingly valuable to a range of traditional and nontraditional data consumers (Moritz et al. 2011, Arzberger et al. 2004). It is in the interest of the general community, and of publishers in particular, to ensure that data are of the highest possible standard.

As large aggregations of data become increasingly important in myriad scientific disciplines, warnings are being sounded that the Achilles' heel of these otherwise promising enterprises is data quality. Big data need robust curatorial mechanisms to assure accuracy and reliability so that the promise of these great collaborative efforts is not squandered (Leonelli 2014, Mesibov 2013, Thessen and Patterson 2011, Hjarding et al. 2014, Belbin et al. 2013). An emerging solution is aimed at collections data from natural history research institutions, a major class of data suppliers to GBIF (Berendsohn et al. 2010, Robertson et al. 2014). The idea is to provide a mechanism for users to flag suspicious records and make possible errors known to data providers (who have the power to check and correct errors) and the broader user community (Wang et al. 2009, Tschöpe et al. 2013, Morris et al. 2013). Wide online access to primary biodiversity data through aggregating databases like GBIF gives unprecedented power for data comparison and scrutiny, well beyond what is possible with unnetworked collections databases and literature published on paper without structured digital data. Errors are inevitable in any field, but science is a self-correcting process. Well-curated, accessible, aggregated biodiversity data can be achieved with the participation of the whole community, including publishers, authors, institutional collections personnel, and end users.

Acknowledgements

Development of the data publishing toolkit was supported by EU BON (Building the European Biodiversity Observation Network), an FP-7 (European Union Seventh Framework Programme, 2007-2013) grant (No 308454). Thanks to Markus Döring (GBIF senior software developer) for answering our questions about the GBIF workflow.

References

Arzberger P, Schroeder P, Beaulieu A, Bowker G, Casey K, Laaksonen L, Moorman D, Uhlir P, Wouters P (2004) Promoting Access to Public Research Data for Scientific, Economic, and Social Development. Data Science Journal 3: 135-152. DOI: 10.2481/dsj.3.135

Belbin L, Daly J, Hirsch T, Hobern D, LaSalle J (2013) A specialist’s audit of aggregated occurrence records: An ‘aggregator’s’ perspective. ZooKeys 305: 67-76. DOI: 10.3897/zookeys.305.5438

Berendsohn WG, Chavan V, Macklin JA (2010) Recommendations of the GBIF task group on the global strategy and action plan for the mobilization of natural history collections data. Biodiversity Informatics 7: 67-71.

Čandek K, Gregorič M, Kostanjšek R, Frick H, Kropf C, Kuntner M (2013) Targeting a portion of central European spider diversity for permanent preservation. Biodiversity Data Journal 1: e980. DOI: 10.3897/bdj.1.e980

Čandek K, Gregorič M, Kostanjšek R, Frick H, Kropf C, Kuntner M (2015) Corrigendum: Targeting a portion of central European spider diversity for permanent preservation. Biodiversity Data Journal 3: e4301. DOI: 10.3897/BDJ.3.e4301

Chavan V, Penev L (2011) The data paper: a mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics 12: S2. http://www.biomedcentral.com/1471-2105/12/S15/S2 DOI: 10.1186/1471-2105-12-S15-S2

Faulwetter S, Markantonatou V, Pavloudi C, Papageorgiou N, Keklikoglou K, Chatzinikolaou E, Pafilis E, Chatzigeorgiou G, Vasileiadou K, Dailianis T, Fanini L, Koulouri P, Arvanitidis C (2014) Polytraits: A database on biological traits of marine polychaetes. Biodiversity Data Journal 2: e1024. DOI: 10.3897/bdj.2.e1024

GBIF (2010) Darwin Core Archives – How-to Guide, version 1, released on 1 March 2011 (contributed by Remsen D, Braak K, Döring M, Robertson T). Global Biodiversity Information Facility, Copenhagen, 21 pp. http://links.gbif.org/gbif_dwca_how_to_guide_v1

Hardisty A, Roberts D, The Biodiversity Informatics Community (2013) A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology 13: 16. http://www.biomedcentral.com/1472-6785/13/16/abstract DOI: 10.1186/1472-6785-13-16

Hjarding A, Tolley KA, Burgess ND (2014) Red List assessments of East African chameleons: a case study of why we need experts. Oryx FirstView: 1-7. DOI: 10.1017/s0030605313001427

Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data & Society 1(1): 1-11. DOI: 10.1177/2053951714534395

Mesibov R (2013) A specialist’s audit of aggregated occurrence records. ZooKeys 293: 1-18. DOI: 10.3897/zookeys.293.5111

Miller J, Dikow T, Agosti D, Sautter G, Catapano T, Penev L, Zhang Z-Q, Pentcheff D, Pyle R, Blum S, Parr C, Freeland C, Garnett T, Ford LS, Muller B, Smith L, Strader G, Georgiev T, Bénichou L (2012) From taxonomic literature to cybertaxonomic content. BMC Biology 10(1): 87. DOI: 10.1186/1741-7007-10-87

Moritz T, Krishnan S, Roberts D, Ingwersen P, Agosti D, Penev L, Cockerill M, Chavan V (2011) Towards mainstreaming of biodiversity data publishing: recommendations of the GBIF Data Publishing Framework Task Group. BMC Bioinformatics 12: S1. DOI: 10.1186/1471-2105-12-s15-s1

Morris RA, Dou L, Hanken J, Kelly M, Lowery DB, Ludäscher B, Macklin JA, Morris PJ (2013) Semantic Annotation of Mutable Data. PLoS ONE 8(11): e76093. DOI: 10.1371/journal.pone.0076093

Penev L, Catapano T, Agosti D, Georgiev T, Sautter G, Stoev P (2012) Implementation of TaxPub, an NLM DTD extension for domain-specific markup in taxonomy, from the experience of a biodiversity publisher. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012 [Internet]. National Center for Biotechnology Information (US), Bethesda (MD). http://www.ncbi.nlm.nih.gov/books/NBK100351/

Penev L, Erwin T, Miller J, Chavan V, Moritz T, Griswold C (2009) Publication and dissemination of datasets in taxonomy: ZooKeys working example. ZooKeys 11: 1-8. DOI: 10.3897/zookeys.11.210

Penev L, Lyal C, Weitzman A, Morse D, King D, Sautter G, Georgiev T, Morris R, Catapano T, Agosti D (2011) XML schemas and mark-up practices of taxonomic literature. ZooKeys 150: 89-116. DOI: 10.3897/zookeys.150.2213

Penev L, Hagedorn G, Mietchen D, Georgiev T, Stoev P, Sautter G, Agosti D, Plank A, Balke M, Hendrich L, Erwin T (2011) Interlinking journal and wiki publications through joint citation: Working examples from ZooKeys and Plazi on Species-ID. ZooKeys 90: 1-12. DOI: 10.3897/zookeys.90.1369

Penev L, Erwin T, Thompson FC, Sues H-D, Engel M, Agosti D, Pyle R, Ivie M, Assmann T, Henry T, Miller J, Ananjeva N, Casale A, Lourenco W, Golovatch S, Fagerholm H-P, Taiti S, Alonso-Zarazaga M, van Nieukerken E (2008) ZooKeys, unlocking Earth’s incredible biodiversity and building a sustainable bridge into the public domain: From “print-based” to “web-based” taxonomy, systematics, and natural history. ZooKeys Editorial Opening Paper. ZooKeys 1: 1-7. DOI: 10.3897/zookeys.1.11

Penev L, Agosti D, Georgiev T, Catapano T, Miller J, Blagoderov V, Roberts D, Smith V, Brake I, Rycroft S, Scott B, Johnson N, Morris R, Sautter G, Chavan V, Robertson T, Remsen D, Stoev P, Parr C, Knapp S, Kress WJ, Thompson F, Erwin T (2010) Semantic tagging of and semantic enhancements to systematics papers: ZooKeys working examples. ZooKeys 50: 1-16. DOI: 10.3897/zookeys.50.538

Robertson T, Döring M, Guralnick R, Bloom D, Wieczorek J, Braak K, Otegui J, Russell L, Desmet P (2014) The GBIF Integrated Publishing Toolkit: Facilitating the Efficient Publishing of Biodiversity Data on the Internet. PLoS ONE 9(8): e102623. DOI: 10.1371/journal.pone.0102623

Smith V, Georgiev T, Stoev P, Biserkov J, Miller J, Livermore L, Baker E, Mietchen D, Couvreur TLP, Mueller G, Dikow T, Helgen KM, Frank J, Agosti D, Roberts D, Penev L (2013) Beyond dead trees: integrating the scientific process in the Biodiversity Data Journal. Biodiversity Data Journal 1: e995. DOI: 10.3897/BDJ.1.e995

Thessen A, Patterson D (2011) Data issues in the life sciences. ZooKeys 150: 15-51. DOI: 10.3897/zookeys.150.1766

Tschöpe O, Macklin JA, Morris RA, Suhrbier L, Berendsohn WG (2013) Annotating biodiversity data via the Internet. Taxon 62(6): 1248-1258. DOI: 10.12705/626.4

Wang Z, Dong H, Kelly M, Macklin JA, Morris PJ, Morris RA (2009) In: 2009 WRI World Congress on Computer Science and Information Engineering, Vol. 3. Los Alamitos, California, 731-735. DOI: 10.1109/csie.2009.948

Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, Robertson T, Vieglais D (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. DOI: 10.1371/journal.pone.0029715