Corresponding author: Carlo Allocca (
Academic editor: Viktor Senderov
A large percentage of scientific data with tabular structure are published on the Web of Data as interlinked RDF datasets. For the long-term preservation of such RDF-based digital objects, it is important to provide full support for reusing them in the future. In particular, such support should include means for users who have no familiarity with the RDF data model and who work only with the native format of the data, while still providing them with sufficient information. To achieve this, we need mechanisms to bring the data back to their original format and structure.
In this paper, we investigate how to perform the reverse process for column-based data sources. In particular, we devise an algorithm, RML2CSV, and exemplify its implementation in transforming an RDF dataset into its CSV tabular structure, through the use of the same RML mapping document that was used to generate the set of RDF triples. Through a set of content-based criteria, we attempt a comparative evaluation to measure the similarity between the rebuilt CSV and the original one. The results are promising and show that, under certain assumptions, RML2CSV reconstructs the same data with the same structure, offering more advanced digital preservation services.
To date, a large percentage of scientific data published on the Web of Data (
Based on a set of content-based criteria to measure the similarity between the original data source and the one reconstructed by RML2CSV, we evaluate the approach over a collection of real-world RDF datasets from the Biodiversity domain available in the MedOBIS repository (
The paper continues as in the following: The
First, we briefly introduce the R2RML and RML ([R2]RML) mapping languages to the extent that they concern our preliminary investigation (see (
To cope with the high expressivity of RML's mapping language and to monitor the complexity of the
1. given a mapping rule tmi, the
2. given a mapping rule tmi, the
3. given a mapping rule tmi, a
4. if tm1,...,tmn are triples maps of the same RML Mapping Document and defined according to 1-3, they all refer to the same CSV data source.
Basically, RML Lite allows only the mapping of CSV
When using RML to perform such a task for a CSV data source (CSV2RDF), this means writing down a set of rules (stored as [R2]RML
Conversely, RDF2CSV - the task of rebuilding the structure and the instances of the CSV data source from the RDF dataset - works in the opposite direction: the RML rules are used to rebuild the column-based structure and populate it with the data from the RDF dataset. To exemplify, the rule <#Dataset>, when applied for the reverse process, transforms the instances of the class: Dataset back into values of the column datasetID.
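To make the two directions concrete, the following Python sketch applies a subject template to a CSV row (CSV2RDF) and then inverts it to recover the column value from the generated URI (RDF2CSV). The template string, the example value and the helper names apply_template and reverse_template are assumptions chosen for illustration; they are not taken from the mapping document or from the RML2CSV implementation, and the sketch ignores the IRI-safe encoding that R2RML applies to template values.

```python
import re

PLACEHOLDER = r"\{([^{}]+)\}"


def apply_template(template, row):
    """CSV2RDF direction: fill each {column} placeholder with the row's value."""
    return re.sub(PLACEHOLDER, lambda m: row[m.group(1)], template)


def reverse_template(template, uri):
    """RDF2CSV direction: recover the column values from a generated URI.

    Returns a dict column -> value, or None if the URI does not match.
    """
    # re.split with a capturing group alternates literal text and column names.
    parts = re.split(PLACEHOLDER, template)
    columns = []
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)   # literal text of the template
        else:
            columns.append(part)         # column name of the placeholder
            pattern += "(.+?)"           # captures the cell value
    match = re.fullmatch(pattern, uri)
    if match is None:
        return None
    return dict(zip(columns, match.groups()))


# Hypothetical template for the <#Dataset> rule.
TEMPLATE = "http://example.org/dataset/{datasetID}"
uri = apply_template(TEMPLATE, {"datasetID": "ds42"})
recovered = reverse_template(TEMPLATE, uri)
```

The inversion only works when the literal parts of the template delimit the placeholders unambiguously, which is one reason a reverse process needs assumptions about how the mapping rules were written.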
What we have produced so far are only two dimensions (the columns and the cells) out of the three (the columns, the cells and the rows) that characterize a CSV data model.
We noticed that the root of this problem may lie in the fact that potential relationships between columns in the CSV data source are not expressed at the conceptual level through the mapping rules. As shown in Fig.
Based on such observation, we asked how we can make sure that we deal with types of scenario exemplified in Fig.
Informally, G will have (a) only one vertex,
An example of a CSV data source that satisfies the ICR assumption is shown in Fig.
Once the DTA and ICR are satisfied, the set of RML rules contains all the information required to rebuild the content, row by row, header included. In particular, each rule provides details such as the SubjectMap and the PredicateObjectMap that connects two rules (e.g. the predicate: consistsOf connects <#Dataset> with <#SamplingActivity>). Taking advantage of such structures, one way to rebuild a specific row is to exploit the set of rules from the most generic one to the most specific ones. In tree terminology, this means visiting the n-ary tree from the root to the leaves. We repeat this step for all the values that are instances of the root SubjectMap's Class. To exemplify the main idea, let us consider the RDF dataset and the set of rules of Fig.
To automate this process, we devised a
INPUT: 1) a set of RML mapping rules S; 2) an RDF dataset d.
OUTPUT: 1) a CSV file.
procedure RML2CSV(S, d)
    reversedCSV[] ← empty; // list of reversed rows
    dT ← IdentifyTheMostGenericRMLrule(S); // dT is the root node
    distinctSubjects[] ← SelectDistinctSubject(dT.getClassURI(), d);
    for each subji in distinctSubjects[] do
        partRevRowi[] ← empty; // list of rowItem
        currPred ← empty; // a predicate of an RDF triple
        reversedRowi[] ← ReverseRow(S, subji, partRevRowi[], currPred, dT, d);
        reversedCSV[].add(reversedRowi[]);
    Export reversedCSV[] as a CSV text file.
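The RML2CSV procedure above can be sketched in Python over an in-memory list of triples. This is a minimal sketch under stated assumptions, not the authors' Java implementation: triples are plain (subject, predicate, object) tuples, the header is passed in rather than derived from the rules, and the reverse_row callback is a hypothetical stand-in for the ReverseRow step.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def select_distinct_subjects(class_uri, triples):
    """Return the distinct subjects typed with class_uri, in first-seen order."""
    seen = []
    for s, p, o in triples:
        if p == RDF_TYPE and o == class_uri and s not in seen:
            seen.append(s)
    return seen


def rml2csv(root_class_uri, header, triples, reverse_row):
    """Rebuild one CSV row per instance of the root class and return CSV text.

    reverse_row(subject, triples) is a hypothetical callback that must
    return a dict mapping column name -> cell value.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    for subject in select_distinct_subjects(root_class_uri, triples):
        writer.writerow(reverse_row(subject, triples))
    return out.getvalue()


# Toy data with made-up identifiers; the duplicate typing triple is deliberate.
triples = [
    ("ex:d1", RDF_TYPE, "ex:Dataset"),
    ("ex:d2", RDF_TYPE, "ex:Dataset"),
    ("ex:d1", RDF_TYPE, "ex:Dataset"),
    ("ex:s1", RDF_TYPE, "ex:SamplingActivity"),
]
text = rml2csv("ex:Dataset", ["datasetID"], triples,
               lambda s, t: {"datasetID": s.split(":")[1]})
```

Iterating over distinct subjects of the root class is what yields exactly one output row per root instance, which matches the row-by-row reconstruction described in the text.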
The work has been supported by the LifeWatchGreece project, funded by GSRT, ESFRI Research Infrastructures, Structural Funds, OPCE II (Act Code: 384676).
Download page:
Programming language: Java
Operational system: Windows or Linux or Mac
Interface language: Java
Type: Git
Browse URI:
Module: packages gr.hcmr.imbbc.rmlreverse.
Creative Commons CCZero
We have implemented RML2CSV on top of RML because, in comparison to the other approaches, RML provides a uniform way to access different types of data sources such as CSV, XML, JSON and DB. Consequently, we believe that enabling the corresponding reverse processes within the same framework would not only strengthen it but also open it to a much larger community, and would allow it to be extended to other types of data source beyond CSV. The current implementation of RML2CSV can be found at
The general goal of evaluating RML2CSV is to answer the following (related) questions: 1. Does it solve the problem that it is supposed to solve? 2. Does it work correctly under all the assumptions? To answer these questions, we designed a set of content-based criteria to estimate the extent to which the reversed data source (csvr) overlaps, row by row, with the original one (csvo). To this end, we based the comparison on computing a similarity measure between csvr and csvo, as expressed in the following:
where the contentDistance intends to measure the number of rows and the extent to which they contain the same information. It is defined as in the following:
where m is the number of rows of csvo, rowir is computed by CorrRow(rowio), a function that calculates the corresponding i-th row in the reversed CSV, and rowDistance measures the number of cells and the extent to which they contain the same values. It is defined as in the following:
where n is the length of rowio , cellsir is computed by CellRow(cellio) which is a function to calculate the corresponding i-th cell in the reversed CSV and the
Combining (1), (2) and (3), we have that: if (3) is always equal to 0, meaning that any time we compare two rows they contain the same values, then (2) is equal to 0, meaning that csvr and csvo contain the same content. In this case, (1) would measure a similarity equal to 1. On the contrary, if (3) is always equal to 1, meaning that any time we compare two rows they contain different values, then (2) is equal to 1, meaning that csvr and csvo contain different content. In this case, (1) would measure a similarity equal to 0. To cope with the
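The behaviour described for (1), (2) and (3) can be illustrated with a short Python sketch. The exact formulas are not reproduced here, so the sketch rests on assumptions that are merely consistent with the boundary cases stated in the text: similarity is one minus the content distance, the content distance is the mean row distance over the m rows of the original CSV, the row distance is the mean cell distance over the n cells of the original row, and the cell distance is 0 for equal values and 1 otherwise. Rows and cells are paired by position, which stands in for the CorrRow and CellRow functions.

```python
def cell_distance(cell_o, cell_r):
    """Assumed 0/1 distance: 0 when both cells hold the same value, else 1."""
    return 0.0 if cell_o == cell_r else 1.0


def row_distance(row_o, row_r):
    """Assumed form of (3): mean cell distance over the n cells of row_o."""
    n = len(row_o)
    if n == 0:
        return 0.0 if not row_r else 1.0
    total = 0.0
    for i, cell_o in enumerate(row_o):
        # A cell missing from the reversed row counts as fully different.
        cell_r = row_r[i] if i < len(row_r) else None
        total += cell_distance(cell_o, cell_r)
    return total / n


def content_distance(csv_o, csv_r):
    """Assumed form of (2): mean row distance over the m rows of csv_o."""
    m = len(csv_o)
    if m == 0:
        return 0.0 if not csv_r else 1.0
    total = 0.0
    for i, row_o in enumerate(csv_o):
        # A row missing from the reversed CSV counts as fully different.
        total += row_distance(row_o, csv_r[i]) if i < len(csv_r) else 1.0
    return total / m


def similarity(csv_o, csv_r):
    """Assumed form of (1): 1 for identical content, 0 for disjoint content."""
    return 1.0 - content_distance(csv_o, csv_r)


original = [["a", "b"], ["c", "d"]]
one_cell_off = [["a", "b"], ["c", "x"]]
all_different = [["w", "x"], ["y", "z"]]
```

With these toy inputs, one differing cell out of four lowers the similarity by a quarter, while the two boundary cases reproduce the values 1 and 0 discussed in the text.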
The current evaluation is based on a collection of five CSV data sources from the Biodiversity domain, containing mainly occurrence data from MedOBIS, the Biogeographic information system for the eastern Mediterranean and Black Sea (Arvanitidis et al. 2006). They are characterized by different column-based structures containing from 4 to 12 columns (e.g. datasetID, language, fieldNumber and different types of measurements, to name just a few). Before transforming them into RDF datasets, we applied a preprocessing step to make sure that their content would not generate any of the issues analyzed in the
As can be seen, RML2CSV reconstructed all five CSVs with content overlapping up to 100% with the original ones. This very initial evaluation does not claim to demonstrate the correctness or completeness of the proposed approach, but it lays the groundwork and encourages us to carry out a thorough evaluation of the efficiency and effectiveness of RML2CSV.
We designed and implemented our algorithm, RML2CSV, taking into account the DTA and the ICR assumptions. Now, we discuss how to build upon the current achievements in order to suggest solutions for relaxing the two assumptions.
To the best of our knowledge, there is no other study investigating the reversing of an RDF dataset for reconstructing the original tabular data source of CSV type. On the contrary, several solutions exist to execute mappings from different types of data sources and serialisations to the RDF data model. The R2RML W3C recommendations (
In particular,
Similarly (
Unfortunately, all these existing approaches are rather limited for our scenario, either because they do not consider the reverse problem at all or because they address it in a different context and with different goals. While they contribute interesting elements for us to build on, we focus here on how to perform the reverse process for column-based structured data sources of CSV type, rebuilding the original data structure rather than an arbitrary one. Furthermore, as our solution is based on the [R2]RML mapping language, it provides the additional advantage that we can perform both transformations, CSV data source to RDF dataset and vice versa, within the same framework, which none of the discussed works does.
In this paper we argue that an important aspect of the long-term preservation of digital objects, such as RDF datasets, is to provide full support for reusing such data, including mechanisms to bring the data back to their original format. To achieve this, we investigated how to perform the reverse process for column-based data sources such as tabular data. In particular, we devised an algorithm, RML2CSV, for transforming an RDF dataset into its original data structure, through the use of the same RML mapping rules used to generate the set of RDF triples. The results of the evaluation showed that RML2CSV rebuilds the same data content with the same data structure under certain assumptions.
In the future, a thorough evaluation of RML2CSV efficiency will be performed. In addition, we plan to extend RML2CSV to deal with any type of constraint between columns (e.g. 1:n and, more generally, m:m) as discussed in
Algorithm 2 Reversing a single CSV row from an RDF Dataset through the use of RML mappings.
procedure ReverseRow(S, subji, partRevRowi[], currPred, dT, d)
    currentRowContent[] ← partRevRowi[];
    if dT ≠ null then
        currSubjectValue ← subji;
        if dT.PredicateObjectMaps[] = null ∧ dT is not the root node then
            return currentRowContent[];
        if dT.PredicateObjectMaps[] = null ∧ dT is the root node then
            termName ← dT.SubjectMap.getTemplate.localName; // we need to check if SubjectMap has a template
            rowItem ← termName@currSubjectValue;
            currentRowContent[].add(rowItem);
            return currentRowContent[];
        else
            for i = 1 to dT.PredicateObjectMaps[].length do
                currPred ← dT.PredicateObjectMaps[i].getPredicate;
                objectMap ← dT.PredicateObjectMaps[i].getObjectMap;
                if objectMap contains a parentTriplesMap then
                    parentTriplesMap ← objectMap.getParentTriplesMap();
                    tripleMapName ← parentTriplesMap.getName();
                    nextTripleMap ← Search(S, tripleMapName);
                    termName ← nextTripleMap.SubjectMap.getTemplate.localName; // we need to check if SubjectMap has a template
                    className ← nextTripleMap.SubjectMap.getClass;
                    // SPARQL query where (currSubjectValue, currPred, ?object) and (?object rdf:type className)
                    nextSubject ← SelectDistinctObject(d, currSubjectValue, currPred, className);
                    cellItem ← termName@nextSubject;
                    currentRowContent[].add(cellItem);
                    ReverseRow(S, nextSubject, currentRowContent[], currPred, nextTripleMap, d);
                if dT is the root then
                    termName ← dT.SubjectMap.getTemplate.localName;
                    rowItem ← termName@subji;
                    currentRowContent[].add(rowItem); // if it is not already added
                if objectMap contains a rr:reference then
                    print("Not detailed for space reason.");
            return currentRowContent[];
    else return null;
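The recursive descent of the ReverseRow procedure above, from the root rule down to the leaf rules, can be sketched in Python as follows. This is an illustrative simplification under stated assumptions: each rule is a plain dictionary with a class, one target column and a list of (predicate, child rule) links; the cell value is taken as the last path segment of the subject URI; and each subject has at most one typed object per link. The rule layout, the URIs and the pairing of SamplingActivity with the fieldNumber column are hypothetical, and rr:reference object maps are not handled.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def select_object(triples, subject, predicate, class_uri):
    """Return the first o with (subject, predicate, o) and (o, rdf:type, class_uri)."""
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == class_uri}
    for s, p, o in triples:
        if s == subject and p == predicate and o in typed:
            return o
    return None


def local_name(uri):
    """Assumed cell value: the last path segment of the subject URI."""
    return uri.rsplit("/", 1)[-1]


def reverse_row(rules, rule_name, subject, triples, row=None):
    """Collect column -> value pairs by walking the rules from root to leaves."""
    row = {} if row is None else row
    rule = rules[rule_name]
    # setdefault keeps the first value, so an item is never added twice.
    row.setdefault(rule["column"], local_name(subject))
    for predicate, child_name in rule["links"]:
        child = rules[child_name]
        next_subject = select_object(triples, subject, predicate, child["class"])
        if next_subject is not None:
            reverse_row(rules, child_name, next_subject, triples, row)
    return row


# Hypothetical rules and data for illustration only.
rules = {
    "Dataset": {"class": "ex:Dataset", "column": "datasetID",
                "links": [("ex:consistsOf", "SamplingActivity")]},
    "SamplingActivity": {"class": "ex:SamplingActivity",
                         "column": "fieldNumber", "links": []},
}
triples = [
    ("http://ex.org/dataset/ds1", RDF_TYPE, "ex:Dataset"),
    ("http://ex.org/sampling/f7", RDF_TYPE, "ex:SamplingActivity"),
    ("http://ex.org/dataset/ds1", "ex:consistsOf", "http://ex.org/sampling/f7"),
]
row = reverse_row(rules, "Dataset", "http://ex.org/dataset/ds1", triples)
```

Taking only the first matching object is what makes the sketch depend on 1:1 relationships between columns; with 1:n relationships, a single root instance would have to yield several rows.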
The authors would like to thank Nicolas Bailly for fruitful discussions and for providing valuable input to the issues of this article, and Anastasia Dimou for contributing tremendously to the paper even though declining co-authorship. This research has been supported by the LifeWatchGreece project (funded by GSRT, ESFRI Projects, Structural Funds, OPCE II).
An example of CSV data source exposed as an RDF dataset using a set of RML rules.
An example of output when reversing independent mappings.
Making explicit potential associations (guided by the target schema MarineTLO (
The graph structures underlying the RML rules of Figs
Extension of the example of Fig.
The results of comparing csv o with csv r (Suppl. material
Exposing a CSV data source with 1:n implicit constraints using RML rules Fig.
An example of very “low” quality mappings.
An example of a key value column.
RML2CSV evaluation
Data type: RML2CSV evaluation data
File: oo_45207.xlsx