Querying Distributed RDF Data Sources with SPARQL

Bibliographical Metadata
Subject:	Querying Distributed RDF Data Sources
Year:	2008
Authors:	Bastian Quilitz, Ulf Leser
Venue:	ESWC
Content Metadata
Problem:	SPARQL Query Federation
Approach:	Decompose a query into sub-queries, each of which can be answered by an individual service.
Implementation:	DARQ
Evaluation:	Evaluate the performance of the DARQ query engine.

Abstract

DARQ provides transparent query access to multiple SPARQL services, i.e., it gives the user the impression of querying one single RDF graph even though the real data is distributed on the web. A service description language enables the query engine to decompose a query into sub-queries, each of which can be answered by an individual service. DARQ also uses query rewriting and cost-based query optimization to speed up query execution.
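
The decomposition step described above can be illustrated with a minimal sketch. It is not DARQ's actual implementation: it assumes that a service description simply lists the predicates a service can answer, that every triple pattern has a bound predicate, and that a pattern matching several services is sent to each of them. The endpoint URLs, the SERVICE_DESCRIPTIONS table, and the decompose function are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical service descriptions: endpoint URL -> predicates it can answer.
SERVICE_DESCRIPTIONS = {
    "http://example.org/sparql/people": {"foaf:name", "foaf:mbox"},
    "http://example.org/sparql/papers": {"dc:creator", "dc:title"},
}


def decompose(triple_patterns, descriptions):
    """Group triple patterns into one sub-query per capable service.

    Assumption: a service is capable of answering a pattern if the
    pattern's predicate appears in that service's description.
    """
    sub_queries = defaultdict(list)
    for pattern in triple_patterns:
        _subject, predicate, _object = pattern
        matches = [url for url, preds in descriptions.items() if predicate in preds]
        if not matches:
            raise ValueError(f"no service can answer predicate {predicate!r}")
        for url in matches:
            sub_queries[url].append(pattern)
    return dict(sub_queries)


# A query with three triple patterns that spans both hypothetical services.
query = [
    ("?paper", "dc:creator", "?author"),
    ("?paper", "dc:title", "?title"),
    ("?author", "foaf:name", "?name"),
]
plan = decompose(query, SERVICE_DESCRIPTIONS)
```

In this sketch the two dc: patterns end up in one sub-query for the papers endpoint and the foaf: pattern in a second sub-query for the people endpoint; the results of the sub-queries still have to be joined on the shared variable ?author.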

Conclusion

DARQ offers a single interface for querying multiple, distributed SPARQL endpoints and makes query federation transparent to the client. One key feature of DARQ is that it relies solely on the SPARQL standard and is therefore compatible with any SPARQL endpoint implementing this standard. Using service descriptions provides a powerful way to dynamically add endpoints to and remove them from the query engine in a manner that is completely transparent to the user. To reduce execution costs we introduced basic query optimization for SPARQL queries. Our experiments show that the optimization algorithm can drastically improve query performance and allows SPARQL queries over distributed sources to be answered in reasonable time. Because the algorithm relies on only a very small amount of statistical information, we expect that further improvements are possible using more sophisticated techniques. An important issue when dealing with data from multiple data sources is the difference in the vocabularies used and in the representation of information. In further work, we plan to work on mapping and translation rules between the vocabularies used by different SPARQL endpoints. Also, we will investigate generalizing the query patterns that can be handled, as well as blank nodes and identity relationships across graphs.

Future work

In further work, we plan to work on mapping and translation rules between the vocabularies used by different SPARQL endpoints. Also, we will investigate generalizing the query patterns that can be handled, as well as blank nodes and identity relationships across graphs.

Approach

Positive Aspects: Query rewriting and cost-based query optimization to speed up query execution.

Implementations

Download-page: http://darq.sf.net/

Information Representation: RDF

Data Catalogue: Service Description

Runs on OS: Linux, SunOS 5.10

Vendor: Open Source

Uses Framework: ARQ

Has Documentation URL: http://darq.sf.net/

Programming Language: Java

Version: 1.0

Platform: Jena

Toolbox: No data available now.

GUI: No

Research Problem

Subproblem of: Querying Distributed RDF Data Sources

RelatedProblem: transparent query federation

Evaluation

Experiment Setup: We split all data over two Sun-Fire-880 machines (8x sparcv9 CPU, 1050 MHz, 16 GB RAM) running SunOS 5.10. The SPARQL endpoints were provided using Virtuoso Server 5.0.3 with an allowed memory usage of 8 GB. Note that, although we used only two physical servers, there were five logical SPARQL endpoints. DARQ was running on Sun Java 1.6.0 on a Linux system with Intel Core Duo CPUs, 2.13 GHz and 4 GB RAM. The machines were connected over a standard 100 Mbit network connection.

Evaluation Method: Evaluate the performance of the DARQ query engine.

Hypothesis: -

Description: In this section we evaluate the performance of the DARQ query engine. The prototype was implemented in Java as an extension to ARQ. We used a subset of DBpedia. DBpedia contains RDF information extracted from Wikipedia. The dataset is offered in different parts.

Dimensions: Performance

Benchmark used: subset of DBpedia.

Results: The experiments show that our optimizations significantly improve query evaluation performance. For query Q1 the execution times of optimized and unoptimized execution are almost the same. This is because the query plans for both cases are the same, and using bind joins for all sub-queries in order of appearance is exactly the right strategy. For queries Q2 and Q4 the unoptimized queries took longer than 10 min to answer and timed out, whereas the execution time of the optimized queries is quite reasonable. The optimized execution of Q1 and Q2 takes almost the same time because Q2 is rewritten into Q1.
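
The bind join mentioned in the results can be sketched as follows. In a bind join, each solution produced by the first sub-query is substituted into the second sub-query before that sub-query is evaluated, so the second source only returns matching data. The sketch below is an illustration under stated assumptions, not DARQ's code: in-memory lists of triples (PAPERS, PEOPLE) stand in for remote SPARQL endpoints, each sub-query is a single triple pattern, and all function names are hypothetical.

```python
def substitute(pattern, binding):
    """Replace variables in a triple pattern by their bound values."""
    return tuple(binding.get(term, term) for term in pattern)


def match(pattern, triple):
    """Return the variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable that occurs twice must match the same value both times.
            if binding.get(p, t) != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding


def evaluate(pattern, triples):
    """Stand-in for sending a one-pattern sub-query to a remote endpoint."""
    return [b for b in (match(pattern, t) for t in triples) if b is not None]


def bind_join(left_bindings, pattern, triples):
    """Join by pushing each left-hand solution into the right-hand pattern."""
    results = []
    for left in left_bindings:
        bound_pattern = substitute(pattern, left)
        for right in evaluate(bound_pattern, triples):
            results.append({**left, **right})
    return results


# Hypothetical data held by two separate sources.
PAPERS = [("p1", "dc:creator", "a1"), ("p2", "dc:creator", "a2")]
PEOPLE = [("a1", "foaf:name", "Alice"), ("a3", "foaf:name", "Carol")]

left = evaluate(("?paper", "dc:creator", "?author"), PAPERS)
joined = bind_join(left, ("?author", "foaf:name", "?name"), PEOPLE)
```

With this data only paper p1 has an author whose name is known to the second source, so joined contains a single solution. The order in which sub-queries are bound determines how many bindings are shipped between sources, which is why join ordering matters for execution time.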