Corresponding author: Falko Glöckler (
Academic editor: David Roberts
The BioCASe Monitor Service (BMS) is a web-based tool for coordinators of distributed data networks that provide information to web-portals and data aggregators via the BioCASe Provider Software. Building on common standards and protocols, it has three main purposes: (1) monitoring providers’ progress in data provision, (2) facilitating checks of data mappings with a focus on structure, plausibility and completeness, and (3) verifying compliance of provided data for transformation into other target schemas.
Herein two use cases, GBIF-D and OpenUp!, are presented in which the BMS is being applied for monitoring the progress in data provision and performing quality checks on the ABCD (Access to Biological Collection Data) schema mapping.
However, the BMS can potentially be used with any conceptual data schema and protocols for querying web services. Through flexible configuration options it is highly adaptable to specific requirements and needs. Thus, the BMS can be easily implemented into coordination workflows and reporting duties within other distributed data network projects.
In international biodiversity data initiatives a common goal is to build up distributed network infrastructures, e.g. in the
Depending on the focus and context of the data providing network, there are agreed data (exchange) standards and mandatory concepts. The project coordination is then also responsible for the quality assurance of the data provided and is obliged to continuously check the compliance and consistency of the data sources.
Furthermore, distributed networks (especially Europeana;
Generally, technical staff and end-users would benefit from a service that gives an overview of the data providers and indicates that the participating provider services are on-line.
This paper presents a service tool, the
The BMS has been developed through a collaborative effort between two data projects (
In distributed biodiversity data networks the individual providing institutions (providers) manage their data supply by installing a technical infrastructure (e.g. a middleware) on top of their own databases. This is done in order to allow one or several services or aggregating web portals to access the data via a central interface. Data are retrieved directly from the database located at the provider side without resorting to a centralized public architecture for storage. With this approach, providers keep control over their data provision and are flexible in assigning institution-specific data policies. The middleware creates an abstraction layer by mapping the original data model (software- or institution-specific) to a common domain-specific (exchange) schema, e.g. the ABCD schema (
In this step the provider can define the information flow by filtering the fields of the source database that are relevant concepts for the target network. The abstraction layer can then be directly harvested by domain specific harvesting tools, such as the GBIF
In some projects or initiatives this step is iterated, for example if data provision requires an intermediate transformation from the domain-specific format to a more general (not domain-specific) structure. This facilitates data indexing and aggregation for web-portals and services, which publish the data for the end-users.
A complex data flow such as this is applied in the
The two biodiversity projects
The
Furthermore, the relatively small effort in setting up the BPS enables a wide range of possibilities for data exchange, because multiple software products are able to harvest and interpret the same data sources. In the example of GeoCASE, GBIF harvests the paleontological, but not the geological data.
Official standard schemas and ontologies are designed and ratified by the scientific community i.e. the
The subsequent paragraphs briefly describe the data schemas which are most commonly used in natural history contexts and are ordinarily supported by the BPS:
The ABCD schema (
The ABCD
In order to provide DNA sample data together with their specimen data via the ABCD schema, generic concepts for supplementary contents (‘MeasurementsOrFacts’; see
DarwinCore (often abbreviated DwC) is a set of elements from different ontologies and schemas (e.g. Dublin Core;
A stand-alone format, the DarwinCore Archive, is a self-described DarwinCore file. It is intended to ease the cataloguing of big datasets by processing them without requiring a live connection to the provider. This format is also useful for publications, because it can create a citable snapshot of a dataset. The BPS is able to convert ABCD data into the DarwinCore Archive format.
The BPS communicates via its native query protocol (BioCASe Protocol; currently in version 1.3;
In addition, the protocol can combine the data with information on the operating system and the executed database queries, as well as warnings and error messages. Thus, it reports about the communication with the database and the status to give a comprehensible feedback on what is done in the background. This is particularly beneficial in the case where a malfunction needs to be debugged.
The BioCASe Monitor Service (BMS) is a web-based tool programmed in PHP (
The BMS consists of two interfaces: 1) the
The BioCASe Monitor is the entry point interface of the BMS. It consists of informative tables for each registered provider. These tables contain, at the minimum, a list of data sources, their access points (URI of the particular BPS data source), total number of records and date of last modification. For any concept of the provided data schema, a column can be displayed with the count of total and distinct values (see Fig.
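The per-concept counts described above can be illustrated with a minimal Python sketch. It is not the BMS implementation (which is written in PHP); the function name `count_values` and the sample values are hypothetical, and the sketch assumes that empty values are ignored.

```python
def count_values(values):
    """Return (total, distinct) counts for the non-empty values of one concept."""
    # Assumption: blank or missing values do not count towards either number.
    present = [v.strip() for v in values if v and v.strip()]
    return len(present), len(set(present))


# Hypothetical values of a scientific-name concept in one data source.
names = ["Abies alba", "Abies alba", "Picea abies", "", "Fagus sylvatica"]
total, distinct = count_values(names)
print(f"total={total}, distinct={distinct}")
```

A large gap between the two numbers is expected for concepts such as taxon names, whereas identifiers would normally show equal total and distinct counts.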
The greater the number of data sources, the greater the need to group them into logical units. Therefore, the BioCASe Monitor offers the possibility of creating blocks of data sources, which can be configured as collapsible (and respectively expandable) boxes (see Fig.
In addition, a customizable layout enables a free design for each block or table via
Loading the BioCASe Monitor's provider overview can be time consuming, depending on the number of data sources and requested concepts, but also due to long response times of the addressed servers. Eventually, this can result in a server timeout. In order to avoid this, the BioCASe Monitor sends requests asynchronously via AJAX (
In some cases (e.g. while dealing with big databases or focusing on a specific part of a dataset) it might be necessary to consider just a fraction of the data in a particular data source or to list subsets separately in the BioCASe Monitor overview. For this purpose, the BMS offers the possibility to pass filter criteria to the BioCASe Protocol, which allows fragmentation of a data source by defining different filters on the same access point. This feature increases the flexibility of coordinators to list desired metadata in a more structured way. Furthermore, it improves the performance when requesting large data sources. The link "View mapping" available for each data source refers to the second web interface of the BMS, the
The Mapping Checker in the BioCASe Monitor Service lists the available XML elements of the mapped schema of a single data source. It also displays more details on each field e.g. the x-path, which identifies a particular concept, the expected data type of the schema and some sample value (Fig.
The BPS can display values originating from a SQL database or from text directly inserted in the mapping. However, only values coming directly from an atomized database field can be filtered, which is flagged by the boolean attribute “searchable” in a data source’s capabilities. Thus, it is important to check the searchability of a given concept while debugging a mapping or identifying unexpected results of a request. The Mapping Checker facilitates this by simply requesting each element’s status and listing it in a column (see Fig.
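As a rough illustration of such a searchability check, the following Python sketch reads a "searchable" flag per concept from a capabilities document. The XML fragment is a simplified, hypothetical stand-in: a real BioCASe capabilities response is namespaced and structured differently, and the element and attribute layout shown here is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical capabilities fragment (assumption, not the real
# response format of the BioCASe Provider Software).
CAPABILITIES = """
<capabilities>
  <Concept searchable="1">/DataSets/DataSet/Units/Unit/UnitID</Concept>
  <Concept searchable="0">/DataSets/DataSet/Metadata/Description/Representation/Title</Concept>
</capabilities>
"""


def searchability(xml_text):
    """Map each concept path to True if it is flagged as searchable."""
    root = ET.fromstring(xml_text.strip())
    return {
        concept.text.strip(): concept.get("searchable") == "1"
        for concept in root.iter("Concept")
    }


for path, searchable in searchability(CAPABILITIES).items():
    print("searchable" if searchable else "not searchable", path)
```

Concepts reported as not searchable would be the first candidates to inspect when a filtered request returns unexpected results.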
Plausibility checks of a mapping have to include verification that an XML response has the correct content. This is necessary because an ABCD response can be technically correct while the contents of its XML elements and attributes do not correspond to the schema definition. In the BioCASe Mapping Checker, sample values from the first records of the XML response are displayed along with the corresponding concept. That way the user gets an impression of the values of specific concepts in the data source. This feature is essential for quality control by the domain expert. A more detailed insight into the concept's values is provided by a hyperlink in the x-path column. By clicking this link, a scan request is performed in order to display all unique values of the respective concept (see Fig.
Data aggregators can use different XML schemas, depending on the purpose and scope of the service consuming them. This often implies a transformation between the provider’s original XML schema and a target schema. An example is Europeana (see Fig.
The options provided with the Mapping Checker focus on single data sources, because it is designed as a stand-alone service interface totally independent from the BioCASe Monitor. It can be used for any given BioCASe data source URI that is transferred via a parameter in the service’s endpoint URL (e.g.
The output of the BioCASe Mapping Checker is HTML by default, but XML output is also supported. This makes the output suitable for further processing by other software (e.g. R-scripts for more elaborate statistics;
In the OpenUp! project, the BMS is used for the overall organization and monitoring of the progress (
In OpenUp!, the main focus of data provision is on making multimedia objects and associated metadata accessible through the Europeana platform. Therefore, the counts of total multimedia resource references, which are represented as URIs (and thus called
Furthermore, the BioCASe Monitor offers the possibility to export the compiled numbers and information in a tab-delimited format. This is especially helpful for the production of the regular reports in the project (for example
Another main task in the OpenUp! project is the assurance of the quality standards defined in the project and imposed by the use of the ABCD standard, as well as the assurance of the compliance of the provided data with the target schema
The completeness of the mapping in terms of mandatory concepts for the harvesting, but also the transformation procedure, is automatically checked for a default (zoology and botany), a mineralogy and a paleontology setting. This check is done against the transformation rules agreed upon in the project (
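A completeness check of this kind amounts to comparing the mapped concepts with the mandatory set of the chosen setting. The Python sketch below shows the idea only; the concept names in `MANDATORY` are hypothetical placeholders and do not reproduce the transformation rules agreed upon in the project.

```python
# Hypothetical mandatory-concept lists per setting (assumption for
# illustration; the real rules are defined by the project).
MANDATORY = {
    "default": {"UnitID", "FullScientificNameString", "FileURI"},
    "mineralogy": {"UnitID", "MineralRockName", "FileURI"},
    "paleontology": {
        "UnitID",
        "FullScientificNameString",
        "ChronostratigraphicName",
        "FileURI",
    },
}


def missing_concepts(mapped_concepts, setting="default"):
    """Return the mandatory concepts of a setting that the mapping lacks."""
    return sorted(MANDATORY[setting] - set(mapped_concepts))


mapped = ["UnitID", "FileURI"]
print(missing_concepts(mapped, "default"))
```

An empty result would indicate that the mapping satisfies the chosen setting; any listed concept would have to be mapped before harvesting and transformation.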
The flexibility in the implementation of the BMS into existing workflows has been proven in the OpenUp! project by finding a solution for a particular issue: ABCD is centered on two concepts, the metadata of a single collection (represented in the element ‘Dataset’;
The Mapping Checker allows not only the coordinator to perform the final validation before the official release for a test harvest by the
After the end of the funding period of the project, providers will be able to use the BioCASe Mapping Checker as a tool facilitating the correct setup of their data sources according to the Europeana standards. In addition, the BioCASe Monitor will show the intended increase of objects provided to Europeana as well as the expansion of the provider network.
In Germany data mobilization and provision to the international GBIF network is coordinated by the GBIF-D project (
In the BioCASe Monitor instance of GBIF-D (
A column for useful links has been created to manually enrich rows per data source with additional hyperlinks. For each column only a variable in the configuration file needs to be filled with the desired URL in order to create a new hyperlink. Thus, the overview can be supplemented with any related references. For example, once the data is harvested by GBIF international, there is a reference available (e.g.
As
The two described use cases demonstrate the wide-ranging possibilities of the BioCASe Monitor Service for coordinators and providers in distributed data networks. The tool facilitates plausibility and compliance checks by providing flexible compilations of metadata and numbers. For regular reporting duties in distributed network projects, metadata can simply be exported or copied and pasted into the respective document. Depending on users' demands, more export formats could be added at a later stage (e.g. Microsoft Excel).
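For readers who want to post-process such exports, a tab-delimited table of compiled numbers can be produced along the following lines. This Python sketch is illustrative only: the column names and figures are hypothetical, and it does not reflect how the BMS itself (written in PHP) generates its export.

```python
import csv
import io


def export_tsv(rows, fieldnames):
    """Serialize a list of dict rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical overview rows for two data sources.
rows = [
    {"data_source": "herbarium", "total_records": 1200},
    {"data_source": "entomology", "total_records": 845},
]
print(export_tsv(rows, ["data_source", "total_records"]))
```

Tab-delimited text of this shape can be pasted directly into spreadsheet software or a report table.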
Due to the premises and contexts of the projects for which the BMS has been created, the focus is primarily on usage with the ABCD schema and its extensions. However, the BioCASe Monitor Service is an easily adaptable tool and can thus also be applied to other schemas (e.g. DarwinCore) used in the area of (biodiversity) informatics. As the core part is the BioCASe Protocol, the same standard format is used for communication with the web services. Consequently, there are no direct limitations regarding different schemas as long as they are implemented in the BioCASe Provider Software. Tests to demonstrate the technical feasibility of this are planned.
The configuration of the Monitor Service is organized in a plain text file by using the easily comprehensible INI format (
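To give an impression of what an INI-based configuration of data sources could look like, the Python sketch below parses one section per data source. The section name, the keys (`access_point`, `concepts`, `link_*`) and the example.org URLs are hypothetical and are not taken from the actual BMS configuration file.

```python
import configparser

# Hypothetical configuration (assumption): one section per data source.
EXAMPLE_INI = """
[herbarium]
access_point = http://example.org/biocase/pywrapper.cgi?dsa=herbarium
concepts = UnitID, FullScientificNameString
link_gbif = http://example.org/dataset/1
"""


def load_sources(text):
    """Parse INI text into a dict of data sources with concepts and links."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sources = {}
    for name in parser.sections():
        section = parser[name]
        concepts = section.get("concepts", "")
        sources[name] = {
            "access_point": section["access_point"],
            "concepts": [c.strip() for c in concepts.split(",") if c.strip()],
            # Every key starting with "link_" becomes an extra hyperlink.
            "links": {
                key[len("link_"):]: value
                for key, value in section.items()
                if key.startswith("link_")
            },
        }
    return sources


print(load_sources(EXAMPLE_INI))
```

In this layout, adding a further hyperlink for a data source only requires one more `link_*` line in its section.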
The caching mechanism in the BioCASe Monitor both reduces superfluous traffic at the provider’s end point and improves the performance of the Monitor Service. Simple text files are used for storing the cached values in order to avoid the requirement of having an additional database. These cache files, which are tagged by a timestamp, will be re-used for a planned module that will automatically create graphics and diagrams of the progress of data provision. This more sophisticated feature is expected to be a useful addition to the service for the preparation of regular project reports.
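The principle of a timestamped, file-based cache can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the BMS code: the JSON file layout, the function name `cached_fetch` and the expiry parameter are all hypothetical.

```python
import json
import time
from pathlib import Path


def cached_fetch(cache_file, fetch, max_age_seconds, now=None):
    """Return a cached value if it is fresh enough, otherwise call fetch().

    The cache is a plain text (JSON) file holding the value and the
    timestamp at which it was stored, so no database is needed.
    """
    now = time.time() if now is None else now
    path = Path(cache_file)
    if path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if now - entry["timestamp"] < max_age_seconds:
            return entry["value"]
    # Cache is missing or stale: query the provider and store the result.
    value = fetch()
    path.write_text(
        json.dumps({"timestamp": now, "value": value}), encoding="utf-8"
    )
    return value
```

Here `fetch` stands for any function that queries a provider, for example one that requests the record count of a data source. Because each stored entry carries its timestamp, a series of such files could later serve as input for progress diagrams.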
It can be concluded that the BioCASe Monitor Service fills a gap in the management of distributed (biodiversity) data networks, as it facilitates many workflows concerning the coordination of the providers and the monitoring of their progress in data provision. It is open to adaptations and functionality extensions for both the network members and the end-users of data portals.
The OpenUp! project is a Best Practice Network, co-funded by the European Commission under the eContentplus programme, as part of the i2010 policy. GBIF-D is funded as a joint research project by the German Federal Ministry of Education and Research (BMBF).
Wiki:
Download page:
Platform: platform independent web-server with PHP
Programming language: PHP, JavaScript, HTML, CSS
Interface language: English
Type: SVN
Browse URI:
Creative Commons CCZero
The architecture in distributed networks using the BioCASe Provider Software (BPS). This is illustrated just for one of many providers (left box) in the distributed network. The BioCASe Monitor Service (BMS) is used for checking the data compliance and requirements prior to the harvesting by the indexing tool (HIT). For the data provision to Europeana an additional transformation from the ABCD or ABCDEFG schema to the Europeana schema (ESE or EDM) is necessary. The requirements for the transformation are checked by the BMS at the provider side.
Each concept can be flexibly displayed with the total and distinct count of values.
Collapsible or expandable boxes for grouping several data source entries (e.g. of the same provider) in the BioCASe Monitor.
Examples for flexible CSS layouts of the BioCASe Monitor interface.
The BioCASe Mapping Checker with source concept x-path, searchability status, data type, target element for transformation check, example values, total and distinct count of values, count of dropped values and information on constraints of the target schema.
XML response of a scan request for taxon names in a BioCASe data source.