Corresponding author: Ed Baker (
Academic editor: Lyubomir Penev
We describe an implementation of the Darwin Core Archive (DwC-A) standard that allows for the exchange of biodiversity information contained within the
Biodiversity data are frequently published in esoteric formats or described using non-standard terminology, lacking sufficient contextual information to facilitate reuse, or a technical mechanism to facilitate data exchange. The
Scratchpads (
This paper describes our technical solution to implement a single DwC-A format to exchange data with a variety of external systems. This contrasts with previous methods of data reuse using DwC-A that have typically been developed for a single common dataset and intended for reuse by a single partner (e.g.
This project was funded by, and uses infrastructure developed by, the European Union-funded ViBRANT project (Contract no. RI-261532) and the Natural Environment Research Council-funded eMonocot project (Grant numbers 279981, 279984 & 27997).
Download page:
Bug database:
Vendor:
Platform: Scratchpads/Drupal
Programming language: PHP
Language: English
Service endpoint: [scratchpads.url]/dwca.zip
Type (Git, SVN, CVS, Arch, BK): Git
Browse uri (CVS, SVN, BK):
Other
The code for this project is licensed under the GNU General Public License, version 2.
The
In addition to the data files each archive has a meta.xml file that defines what the fields are in each data file, how rows and fields are terminated in each file, and what type of data are contained in each file (rowType). Of particular interest to our implementation is the fact that for a particular file in the archive the meta.xml can define multiple meanings by using different rowTypes. For example this allows us to treat descriptions.txt as both TDWG Descriptions (as defined at
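As an illustration of how one data file can carry more than one rowType, the following minimal sketch parses a hand-written meta.xml fragment and groups the declared rowTypes by file location. The fragment, the example.org rowType URIs and the function name row_types_by_file are assumptions made for this example; they are not taken from the archive produced by dwca_export.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace used by Darwin Core Archive descriptor files.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Hypothetical descriptor: the rowType URIs below are placeholders.
META_XML = """
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://example.org/terms/Taxon">
    <files><location>classification.txt</location></files>
    <id index="0"/>
  </core>
  <extension rowType="http://example.org/terms/DescriptionA">
    <files><location>descriptions.txt</location></files>
    <coreid index="0"/>
  </extension>
  <extension rowType="http://example.org/terms/DescriptionB">
    <files><location>descriptions.txt</location></files>
    <coreid index="0"/>
  </extension>
</archive>
"""


def row_types_by_file(meta_xml):
    """Return {file location: [rowType, ...]} for the core and all extensions."""
    root = ET.fromstring(meta_xml.strip())
    mapping = defaultdict(list)
    tables = root.findall("dwc:core", NS) + root.findall("dwc:extension", NS)
    for table in tables:
        row_type = table.get("rowType")
        for location in table.findall("dwc:files/dwc:location", NS):
            mapping[location.text.strip()].append(row_type)
    return dict(mapping)


if __name__ == "__main__":
    for filename, row_types in row_types_by_file(META_XML).items():
        print(filename, row_types)
```

In this sketch descriptions.txt is listed under two rowTypes, so a consumer that recognises only one of them can read the file and ignore the other declaration.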
The Drupal module we have written, dwca_export (Fig.
To conserve server resources the DwC-As are generated as a background process, triggered by a change being made to the content of the site. It is also possible for the Scratchpad administration team to rebuild the archives on demand using Drush via the command line (see Drush section below).
The functionality provided by this module can be enabled by the maintainer on any Scratchpad site through the Tools page found at Admin > Structure. Once enabled, the archive file will become available at [scratchpads.url]/dwca.zip after it has been created by the system.
This module is for Scratchpad maintainers who wish to share the content of their site with third party aggregators.
The classification.txt file forms the core of our star schema and is consistent between all third party users.
Scratchpads as a system are designed to facilitate classifications conforming to any (or no) code of nomenclature. While we change the Scratchpad interface to match the terminology of a particular code on a per-classification basis, Darwin Core standardises the terminology used for the taxonomic and nomenclatural status of a taxon (
The fields used are shown in Table
Descriptions used for describing the identification and biology of taxa in Scratchpads are defined using the TDWG
In Scratchpads it is possible to create one or more taxon description nodes for every taxon. These nodes allow entry of text into any of the fields described above. In a DwC-A each of these fields is considered a record in itself (Fig.
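The expansion of one description node into several records can be sketched as follows. The identifier format (node UUID, the # character, then the field name) follows the description of the Identifier field in this paper; the function name, the dictionary keys and the choice to skip empty fields are assumptions made for illustration.

```python
def node_to_rows(node_uuid, taxon_id, fields):
    """Turn one taxon description node into one row per filled-in field.

    `fields` maps a field name (e.g. "general", "ecology") to its text.
    Skipping empty fields is an assumption of this sketch.
    """
    rows = []
    for field_name, text in fields.items():
        if not text or not text.strip():
            continue
        rows.append(
            {
                "identifier": f"{node_uuid}#{field_name}",
                "taxon_id": taxon_id,
                "type": field_name,
                "text": text.strip(),
            }
        )
    return rows


if __name__ == "__main__":
    example = node_to_rows(
        "node-uuid-1",  # hypothetical node UUID
        "taxon-uuid-9",  # hypothetical taxon UUID
        {"general": "A small sedge.", "ecology": "Wet grassland.", "behaviour": ""},
    )
    for row in example:
        print(row["identifier"], row["type"])
```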
The way that the eMonocot Portal and EoL handle textual descriptions of taxa and narratives on taxon biology is very different and our implementation needs to accommodate both. The eMonocot portal ingests these data using the
The EoL agents file is a non-standard (not conforming to the star schema) extension to DwC-A that records information about people who have contributed to the other content in the archive (e.g. authors of textual content, the photographer of images). As this rowType is only understood by EoL the presence of this file in the archive does not cause problems for other consumers of the DwC-A. The structure of this file is described in Table
For the eMonocot project there is a need for comments on data in the portal and comment replies in the Scratchpad to be synchronised across both platforms. This synchronisation was implemented using the DwC-A, ensuring that this one export mechanism could serve all of the project's needs. This extension to the archive does not conform to the star schema as it does not link directly to the central classification.txt. However, as the rowType is, at present, only understood by eMonocot, it is ignored by other projects reading the archive. The data model used is a subset of the Open Annotation Data Model (
This file contains information from the Scratchpad site's bibliography. The structure of the file is described in Table
This table contains the museum/herbarium specimens and observation records present within a Scratchpad. At present Scratchpads use DwC 1.2.1 as the internal specimen/observation standard. This is currently being upgraded to DwC 1.4. Table
The vernacular names extension records common names, their language, where in the world they are used and any remarks (Table
The eMonocot project makes extensive use of its own unique identifiers derived from the World Checklist of Monocots (WCM). A separate Scratchpads/Drupal module, emonocot_dwca, handles the replacement of the Scratchpads UUID with the eMonocot/WCM identifier. In addition, the licences for taxon descriptive content from eMonocot sites are derived through a link to a publication in the Scratchpad bibliography. A licence field on that publication is used to assign a licence to all the descriptive content that has been copied from that resource. This function is not present in standard Scratchpads, and the emonocot_dwca module ensures that the correct licence is applied to content when it is exported in the DwC-A.
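The identifier replacement can be pictured as a lookup applied to every column that holds a taxon identifier. The sketch below is not the emonocot_dwca code (which is a PHP/Drupal module); the column names, the sample identifiers and the decision to leave unmapped values unchanged are assumptions.

```python
def replace_taxon_ids(rows, uuid_to_wcm,
                      id_columns=("taxon_id", "parent_id", "accepted_name_id")):
    """Return copies of `rows` with Scratchpads UUIDs swapped for WCM identifiers.

    Values without an entry in `uuid_to_wcm` are left as they are
    (an assumption of this sketch).
    """
    replaced = []
    for row in rows:
        new_row = dict(row)
        for column in id_columns:
            value = new_row.get(column)
            if value in uuid_to_wcm:
                new_row[column] = uuid_to_wcm[value]
        replaced.append(new_row)
    return replaced


if __name__ == "__main__":
    mapping = {"uuid-a": "wcm-1", "uuid-b": "wcm-2"}  # hypothetical identifiers
    rows = [{"taxon_id": "uuid-a", "parent_id": "uuid-b", "name": "Aus bus"}]
    print(replace_taxon_ids(rows, mapping))
```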
The module provides one additional Drush command, dwca_rebuild. This command needs to be called twice to build an archive. The first time it is called it creates a number of processes run via the POSIX nohup command (POSIX, the Portable Operating System Interface, is a set of standards for Unix-like operating systems) to prevent problems with timeouts when connected to a server via SSH. These processes run through the individual views used to create the DwC-A files and output their contents to text files. The second time it is run, the various description views are aggregated into a single file, and this, along with the other exported text files, is then zipped to create the archive.
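The second phase (aggregate the description exports, then zip everything) can be sketched in a few lines. This is an illustration in Python rather than the module's PHP code; the description_ file prefix, the function name build_archive and the directory layout are assumptions.

```python
import zipfile
from pathlib import Path


def build_archive(export_dir, archive_path, description_prefix="description_"):
    """Merge per-view description exports into descriptions.txt and zip the result."""
    export_dir = Path(export_dir)
    # Hypothetical naming: one exported text file per description view.
    parts = sorted(export_dir.glob(f"{description_prefix}*.txt"))

    combined = export_dir / "descriptions.txt"
    with combined.open("w", encoding="utf-8") as out:
        for part in parts:
            text = part.read_text(encoding="utf-8")
            # Make sure each part ends with a row terminator before appending the next.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)

    # Zip the merged file plus every other exported text/XML file,
    # leaving out the partial description exports.
    members = [
        p for p in sorted(export_dir.iterdir())
        if p.is_file() and p not in parts and p.suffix in {".txt", ".xml"}
    ]
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            archive.write(path, arcname=path.name)
    return archive_path
```

Splitting the work this way keeps the slow per-view exports out of the interactive session, so only the quick merge-and-zip step has to run once they have finished.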
This feature is used by the Scratchpads Team (or the server administrator where the site is not hosted by the Scratchpad project) and is not available to Scratchpads users.
From within a site directory: drush dwca_rebuild
Using an alias: drush @cypriioideae.e-monocot.org dwca_rebuild
This module has dependencies on a number of standard Drupal contributed modules (
A number of tools exist that can perform validation on a DwC-A file, for example the tools developed by EoL and GBIF. While these tools are useful they are insufficient for complete verification of an archive. The
In order to provide a consistent and correct DwC-A from a Scratchpad to any of our content partners the decision was made to create
Checking the structure of the file allows us to check for errors such as column transposition during any future development to the dwca_export module. The precise checking of data in each row also allows us to check the data in the Scratchpad. This gives us the potential to add a feature that warns users of potential errors in the content present within their Scratchpad. For example, running the validator against a number of sites identified several occasions where ISSNs had incorrectly been entered into the ISBN field. At present the validator is run as a stand-alone tool – future work will provide Scratchpad maintainers with a list of identified errors with links to edit content flagged by the validator.
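A row-level check of the kind that catches an ISSN in the ISBN field can be sketched with the standard check-digit rules for ISBN-10, ISBN-13 and ISSN. The paper does not describe the validator's internals, so the function names and the returned messages here are assumptions, not the validator's actual behaviour.

```python
import re


def _compact(value):
    """Remove spaces and hyphens and upper-case a trailing 'x' check character."""
    return re.sub(r"[\s-]", "", value).upper()


def is_valid_issn(value):
    s = _compact(value)
    if not re.fullmatch(r"\d{7}[\dX]", s):
        return False
    # Weights 8..2 on the first seven digits, check digit modulo 11.
    total = sum(int(ch) * w for ch, w in zip(s[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return s[7] == ("X" if check == 10 else str(check))


def is_valid_isbn(value):
    s = _compact(value)
    if re.fullmatch(r"\d{9}[\dX]", s):
        # ISBN-10: weights 10..1, total must be divisible by 11.
        total = sum((10 if ch == "X" else int(ch)) * w
                    for ch, w in zip(s, range(10, 0, -1)))
        return total % 11 == 0
    if re.fullmatch(r"\d{13}", s):
        # ISBN-13: alternating weights 1 and 3, total must be divisible by 10.
        total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
        return total % 10 == 0
    return False


def check_isbn_field(value):
    """Classify the content of an ISBN field."""
    if is_valid_isbn(value):
        return "ok"
    if is_valid_issn(value):
        return "looks like an ISSN"
    return "invalid"


if __name__ == "__main__":
    for candidate in ["0-306-40615-2", "978-0-306-40615-7", "1314-2828", "1234"]:
        print(candidate, "->", check_isbn_field(candidate))
```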
Once the module is enabled the Darwin Core Archive is available for use by our content partners in ViBRANT and eMonocot as well as by anybody else who has an interest in the data. Users of the data must abide by the licence (generally Creative Commons) that the site maintainer has specified. It should be noted that individual rows of data are likely to have different licences.
By default all of the content in the site that is available to the public via the standard web interface is available in the archive. Content creators can hide content from both the Scratchpad website and the Darwin Core Archive by marking that content as 'Unpublished'. It should be noted that, because web pages are cached and the archives are created by a cron task, content may take a few hours to disappear from both the website and the archive.
At present only content types specified by the Scratchpad team (Specimens, Taxonomy, References, Images, etc.) can be mapped into the Darwin Core Archive. In the future we plan to provide an interface to allow custom content types created by site maintainers to be mapped into the archive.
We would like to thank our partners in the eMonocot and ViBRANT projects who have helped to develop and consume the DwC-As that we create: Ben Clark (University of Oxford) and the eMonocot Content Team (Royal Botanic Gardens Kew), Andrew Hill and the team at (Vizzuality/ViBRANT), Lorna Morris and colleagues (Berlin Botanic Garden/ViBRANT).
In addition we would like to thank Cyndy Parr, Jen Hammock, Patrick Leary and the rest of the Encyclopedia of Life team for their help in resolving incompatibilities between the DwC-A standard and the EoL implementation.
The dwca_export Scratchpads/Drupal module was developed by Ed Baker and Simon Rycroft from a prototype by Lorna Morris and Andreas Kohlbecker.
The emonocot_dwca Scratchpads/Drupal module was developed by Ed Baker.
The validator tool was written by Ed Baker.
Vince Smith fielded the considerable correspondence between the eMonocot, Encyclopedia of Life and Scratchpad teams in pursuit of constructing a single archive that could satisfy the diverse needs of these different projects.
Vince Smith and Ed Baker wrote the manuscript.
Darwin Core Archive created from the eMonocot
The star schema showing the relation of Darwin Core Archive extension files to the core file.
Data flow from Scratchpad to Darwin Core Archive. The dwca_export module (orange) defines a number of custom Drupal Views (queries) that collect the data required for archive generation from the Scratchpad (blue) and combines them with the meta.xml which describes the information in the archive. For eMonocot Scratchpads the emonocot_dwca module (green) provides an intermediary function replacing the Scratchpads internal unique identifiers with those used throughout the eMonocot project (see eMonocot modifications section).
A single Taxon Description node on a Scratchpad corresponds to one or more rows in the description.txt file.
The fields used in our classification.txt – the core of our DwC-A star schema.
Field | Description | Term |
---|---|---|
Taxon ID | A universally unique identifier (UUID) of this name (World Checklist of Monocots [WCM] identifier for eMonocot Scratchpads) | |
Taxon Name | The taxon name, made by concatenating the unit names and unit indicators (the Scratchpads stores all parts of the scientific name, and indicators such as sp. & spp., in separate fields) | |
Taxonomic Status | See Table | |
Taxonomic Rank | e.g. species, genus, family | |
Taxon Author(s) | Plain text names of the author(s) of this taxon | |
Reference | Citation of the reference containing the description of this taxon | |
Reference ID | URL to the reference containing the description of this taxon within the Scratchpad | |
Taxonomic Parent | The parent of this name in the classification, if this name is accepted. | |
Nomenclatural Status | See Table | |
Accepted Name | The UUID of the associated accepted name, if this name is not accepted | |
Mapping from the Scratchpads taxonomy model to the GBIF Darwin Core taxonomy model
Scratchpads: Usage | Scratchpads: Unacceptability Reason | DwC: taxonomicStatus | DwC: nomenclaturalStatus |
---|---|---|---|
accepted/valid | accepted | ||
valid | valid | ||
not accepted / invalid | -None- | ||
not accepted / invalid | synonym | synonym | |
not accepted / invalid | homotypic (nomenclatural) synonym | homotypicSynonym | |
not accepted / invalid | heterotypic (taxonomic) synonym | heterotypicSynonym | |
not accepted / invalid | homonym (illegitimate) | heterotypicSynonym | illegitimum |
not accepted / invalid | superfluous renaming (illegitimate) | homotypicSynonym | superfluum |
not accepted / invalid | rejected name | synonym | rejiciendum |
not accepted / invalid | invalidly published, nomen nudum | synonym | nudum |
not accepted / invalid | invalidly published, other | synonym | invalidum |
not accepted / invalid | misapplied | misapplied | |
not accepted / invalid | pro parte | proParteSynonym | |
not accepted / invalid | horticultural | ||
not accepted / invalid | database artifact | ||
not accepted / invalid | orthographic Variant (misspelling) | synonym | orthographia |
not accepted / invalid | other | ||
not accepted / invalid | junior synonym | ||
not accepted / invalid | objective synonym | ||
not accepted / invalid | subjective synonym | ||
not accepted / invalid | original name/combination | ||
not accepted / invalid | subsequent name/combination | combinatio | |
not accepted / invalid | junior homonym | synonym | illegitimum |
not accepted / invalid | homonym & junior synonym | synonym | |
not accepted / invalid | unavailable, database artifact | ||
not accepted / invalid | unavailable, literature misspelling | synonym | orthographia |
not accepted / invalid | unavailable, incorrect original spelling | negatum | |
not accepted / invalid | unavailable, suppressed by ruling | oppressa | |
not accepted / invalid | unavailable, nomen nudum | synonym | nudum |
not accepted / invalid | unavailable, other | ||
not accepted / invalid | unjustified emendation | ||
not accepted / invalid | unnecessary replacement | synonym | superfluum |
not accepted / invalid | nomen oblitum | ||
not accepted / invalid | misapplied | ||
not accepted / invalid | pro parte | proParteSynonym | |
not accepted / invalid | nomen dubium | synonym | dubium |
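The mapping can be expressed as a simple lookup from the Scratchpads unacceptability reason to a (taxonomicStatus, nomenclaturalStatus) pair. The sketch below copies only a handful of rows whose columns are unambiguous in the table above. Reasons such as "misapplied" appear more than once with different mappings, so a real implementation would need more context than the reason alone. The function name and the empty-string fallback are assumptions.

```python
# Subset of the mapping table: reason -> (taxonomicStatus, nomenclaturalStatus).
STATUS_MAP = {
    "synonym": ("synonym", ""),
    "homotypic (nomenclatural) synonym": ("homotypicSynonym", ""),
    "heterotypic (taxonomic) synonym": ("heterotypicSynonym", ""),
    "homonym (illegitimate)": ("heterotypicSynonym", "illegitimum"),
    "rejected name": ("synonym", "rejiciendum"),
    "invalidly published, nomen nudum": ("synonym", "nudum"),
    "invalidly published, other": ("synonym", "invalidum"),
    "nomen dubium": ("synonym", "dubium"),
}


def map_status(unacceptability_reason):
    """Return (taxonomicStatus, nomenclaturalStatus) for a Scratchpads reason.

    Reasons not covered by this subset fall back to two empty strings
    (an assumption of this sketch).
    """
    return STATUS_MAP.get(unacceptability_reason, ("", ""))


if __name__ == "__main__":
    print(map_status("homonym (illegitimate)"))
    print(map_status("horticultural"))
```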
Mapping schema of Scratchpads textual descriptions of taxa to the GBIF (as used by eMonocot) and EoL extensions in the Darwin Core Archive.
Field | Description | GBIF term | EoL term |
---|---|---|---|
Taxon ID | The Scratchpads universally unique identifier for the taxonomic name (WCM identifier for eMonocot Scratchpads) | | |
Type | The type of the textual description: "general", "ecology", "behaviour", etc. | | |
Text | The textual description. E.g. if Type is "general" this field will contain a general description of the taxon | | |
Rights | Textual description of the rights associated with this content, e.g. "All Rights Reserved" | | |
AccessURI | The URL of the Scratchpad node containing this description | | |
Source | Scratchpad URL of the bibliographic reference this description is from (only applies to eMonocot sites) | | |
Copyright Owner | The copyright owner of the description | | |
Language | The language of the description | | |
CV Term | Image keywords | | |
Format | URL of the SPM type of the description, e.g. | | |
Type | MIME type of the textual content, generally: text/html | | |
agentID | UUID of the author who contributed the content to the site | | |
License | URL of the license used to release the content, if any | | |
Identifier | A unique identifier for this particular textual description. Formed by concatenating the universally unique identifier of the description node, the # character, and the field name. | | |
eol_agents
Field | Description | Term |
---|---|---|
User ID | The universally unique identifier of the user | |
Family Name | The user's last/family name | |
First Names | The user's given/first names | |
Full Name | Full name of the user: the two fields above concatenated with a space | |
Organisation | The organisation the user works for, if any | |
Username | The user's username on this Scratchpad | |
Structure of comments.txt – the non-standard extension for synchronising comments between a Scratchpad and the eMonocot portal
Field | Description | Term |
---|---|---|
CommentID | URL of the comment | |
Target | URL of the node the comment was made on | |
Title | Title of the comment | |
Body | The comment itself | |
Created | Date and time the comment was created | |
Modified | Date and time the comment was last edited | |
references
Field | Description | GBIF / PURL Term | EoL Term |
---|---|---|---|
Taxon ID | UUID of taxa in the publication | ||
Identifier | UUID of the reference in the Scratchpad site | | |
DOI | Digital Object Identifier | | |
ISBN | International Standard Book Number | | |
ISSN | International Standard Serial Number | | |
Citation | Plain text citation of the work | | |
Title |
|
|
|
|
|||
Node URL |
|
||
|
|||
Language |
|
|
|
Indicates if the publication is the original description of a taxon |
|
||
Publication Date |
|
||
Created Date | Date the reference was added to the Scratchpad | | |
Modified Date | Date the reference was last modified in the Scratchpad | | |
|
|||
|
specimens.txt
Field | Description | Term |
---|---|---|
Taxon ID | ||
Type Status | e.g. Holotype | |
Institution Code | e.g. BMNH for Natural History Museum, London | |
Collection Code | e.g. E for Entomology | |
Catalogue Number | Unique specimen identifier | |
Latitude | Decimal latitude | |
Longitude | Decimal longitude | |
vernacular_names.txt
Field | Description | Term |
---|---|---|
Taxon ID | ||
Vernacular Name | The vernacular (common) name | |
Language | The language of the vernacular name | |
Locality | Where the vernacular name is used | |
Remarks | Other information | |