Querying the Web of Data with Graph Theory-based Techniques

Querying the Web of Data with Graph Theory-based Techniques
Querying the Web of Data with Graph Theory-based Techniques
Bibliographical Metadata
Subject:	Querying Distributed RDF Data Sources
Keywords:	SPARQL, Linked Data, distributed query processing.
Year:	2006
Authors:	Xin Wang, Thanassis Tiropanis, Hugh C. Davis
Venue	Web and Internet Science
Content Metadata
Problem:	SPARQL Query Federation
Approach:	Graph theory-based techniques
Implementation:	GDS
Evaluation:	Performance and scalability evaluation

Abstract

The increasing amount of Linked Data on the Web enables users to retrieve quality and complex information and to deploy innovative, added-value applications. The volume of available Linked Data and their spread across a large number of repositories make a strong case for ecient distributed SPARQL queries. However, in practice, current distributed SPARQL query processing techniques face issues on performance and scalability. In our previous work we provided initial evidence that graph theory-based techniques can address performance issues better than other approaches such as DARQ. Here we further exploit the potential of graph algorithms and we show how they can address performance and scalability for distributed SPARQL queries even better. To that end, we present an improved engine called GDS and we evaluate it by providing a detailed comparison to existing approaches for distributed queries (i.e. DARQ and FedX). By analyzing the evaluation results, we try to identify promising techniques for distributed SPARQL processing, and to outline the problems that need to be addressed in future research.Property "Has abstract" (as page type) with input value "The increasing amount of Linked Data on the Web enables users to retrieve quality and complex information and to deploy innovative, added-value applications. The volume of available Linked Data and their spread across a large number of repositories make a strong case for ecient distributed SPARQL queries. However, in practice, current distributed SPARQL query processing techniques face issues on performance and scalability. In our previous work we provided initial evidence that graph theory-based techniques can address performance issues better than other approaches such as DARQ. Here we further exploit the potential of graph algorithms and we show how they can address performance and scalability for distributed SPARQL queries even better. To that end, we present an improved engine called GDS and we evaluate it by providing a detailed comparison to existing approaches for distributed queries (i.e. DARQ and FedX). By analyzing the evaluation results, we try to identify promising techniques for distributed SPARQL processing, and to outline the problems that need to be addressed in future research." contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Conclusion

In this paper we present a novel graph theory-based approach for improved performance and scalability of distributed SPARQL query processing. We propose several key techniques adopted in the distributed SPARQL query engine that we developed (GDS). First, an extended MST algorithm which supports arbitrary SPARQL queries and provides better performance. Second, a simplified cost model using run-time statistics that lows requirement of service descriptions but provides good cost estimations. Third, a combination of semi-join and bind-join along with local cache that reduce network tracffic. We also compare our approach with DARQ and FedX in terms of performance and scalability. The results suggest that graph theory-based approach using lightweight service descriptions can provide better performance and scalability over other approaches Although these results are encouraging, the potential of graph theory-based approach can be developed further. First, GDS applies MST-based algorithm to BGPs rather than the whole query. BGPs are optimized separately even they are from the same query (i.e. UNION and OPTIONAL queries). In the future we aim to work on representing the whole query as a single graph and therefore provide better optimization for queries having multiple BGPs. Second, GDS does not take advantage of FILTER in optimization, which would further improve performance. Third, accurate and detailed service descriptions matters. The more accurate statistics we have, the more optimal query plan we get. Currently collecting quality service descriptions are not feasible on a large scale since most SPARQL endpoints do not provide service descriptions. However, this situation is improving as more and more approaches coming up. For example, SPARQL 1.1 aggregation features enable us to collect service descriptions more efficiently, and VoiD encourages SPARQL endpoints to publish service descriptions as well. Fourth, aggregation features in the upcoming SPARQL 1.1 (e.g. BINDINGS) can also save much efforts of our approach. In addition to these improvements, we are planning to explore the co-reference issue in the Linked Data cloud. From the perspective of distributed SPARQL queries, this issue is getting worse as more data are published , and we plan to address this issue by using our Virtual Graph approach.Property "Has conclusion" (as page type) with input value "In this paper we present a novel graph theory-based approach for improved performance and scalability of distributed SPARQL query processing. We propose several key techniques adopted in the distributed SPARQL query engine that we developed (GDS). First, an extended MST algorithm which supports arbitrary SPARQL queries and provides better performance. Second, a simplified cost model using run-time statistics that lows requirement of service descriptions but provides good cost estimations. Third, a combination of semi-join and bind-join along with local cache that reduce network tracffic. We also compare our approach with DARQ and FedX in terms of performance and scalability. The results suggest that graph theory-based approach using lightweight service descriptions can provide better performance and scalability over other approaches Although these results are encouraging, the potential of graph theory-based approach can be developed further. First, GDS applies MST-based algorithm to BGPs rather than the whole query. BGPs are optimized separately even they are from the same query (i.e. UNION and OPTIONAL queries). In the future we aim to work on representing the whole query as a single graph and therefore provide better optimization for queries having multiple BGPs. Second, GDS does not take advantage of FILTER in optimization, which would further improve performance. Third, accurate and detailed service descriptions matters. The more accurate statistics we have, the more optimal query plan we get. Currently collecting quality service descriptions are not feasible on a large scale since most SPARQL endpoints do not provide service descriptions. However, this situation is improving as more and more approaches coming up. For example, SPARQL 1.1 aggregation features enable us to collect service descriptions more efficiently, and VoiD encourages SPARQL endpoints to publish service descriptions as well. Fourth, aggregation features in the upcoming SPARQL 1.1 (e.g. BINDINGS) can also save much efforts of our approach. In addition to these improvements, we are planning to explore the co-reference issue in the Linked Data cloud. From the perspective of distributed SPARQL queries, this issue is getting worse as more data are published , and we plan to address this issue by using our Virtual Graph approach." contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Future work

Explore the co-reference issue in the Linked Data cloud. From the perspective of distributed SPARQL queries, this issue is getting worse as more data are published, and we plan to address this issue by using our Virtual Graph approach.

Approach

Positive Aspects: {{{PositiveAspects}}}Property "Has PositiveAspects" (as page type) with input value "{{{PositiveAspects}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Negative Aspects: {{{NegativeAspects}}}Property "Has NegativeAspects" (as page type) with input value "{{{NegativeAspects}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Limitations: {{{Limitations}}}Property "Has Limitations" (as page type) with input value "{{{Limitations}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Challenges: {{{Challenges}}}Property "Has Challenges" (as page type) with input value "{{{Challenges}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Proposes Algorithm: {{{ProposesAlgorithm}}}Property "Proposes Algorithm" (as page type) with input value "{{{ProposesAlgorithm}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Methodology: {{{Methodology}}}Property "Uses Methodology" (as page type) with input value "{{{Methodology}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Requirements: {{{Requirements}}}Property "Has Requirements" (as page type) with input value "{{{Requirements}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Limitations: {{{Limitations}}}Property "Has Limitations" (as page type) with input value "{{{Limitations}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Implementations

Download-page: http://code.google.com/p/gdsparal/

Access API: -

Information Representation: RDF

Data Catalogue: Service Description

Runs on OS: OS independent

Vendor: -

Uses Framework: Jena

Has Documentation URL: http://code.google.com/p/gdsparal/

Programming Language: Java

Version: 1.0

Platform: Jena

Toolbox: -

GUI: No

Research Problem

Subproblem of: Querying Distributed RDF Data Sources

RelatedProblem: transparent query federation

Motivation: No data available now.

Evaluation

Experiment Setup: The three engines are run independently on a machine having an Intel Xeon W3520 processor, 12 GB memory and 1Gbps LAN.Property "Has ExperimentSetup" (as page type) with input value "The three engines are run independentlyon a machine having an Intel Xeon W3520 processor, 12 GBmemory and 1Gbps LAN." contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Evaluation Method : comparing the three distributed SPARQL engines

Hypothesis: -

Description: We deploy 6 SPARQL endpoints (Sesame 2.4.0) on 5 remote virtual machines. About 400,000 triples (generated by BSBM) are distributed to these endpoints following Gaussian distribution. We follow the metrics presented in (23). For each query, we calculate the number of queries executed per second (QPS) and average results count. For the whole test, we record the overall runtime, CPU usage, memory usage and network overhead. We perform 10 warm up runs and 50 testing runs for each engine. Time out is set to 30 seconds. In each run, only one instance of each engine is used for all queries, but cache is cleared after finishing each query. Warm up runs are not counted in query time related metrics, but included in system and network overhead.Property "Has Description" (as page type) with input value "We deploy 6 SPARQL endpoints (Sesame 2.4.0) on 5 remotevirtual machines. About 400,000 triples (generated byBSBM) are distributed to these endpoints following Gaussiandistribution.We follow the metrics presented in (23). For each query, wecalculate the number of queries executed per second (QPS)and average results count. For the whole test, we record theoverall runtime, CPU usage, memory usage and networkoverhead.We perform 10 warm up runs and 50 testing runs for eachengine. Time out is set to 30 seconds. In each run, onlyone instance of each engine is used for all queries, but cacheis cleared after finishing each query. Warm up runs are notcounted in query time related metrics, but included in systemand network overhead." contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Dimensions: Performance

Benchmark used: FedBench, Berlin SPARQL Benchmark (BSBM)

Results: We divide evaluation results of GDS, FedX and DARQ into three categories. First is query performance related metrics. Limited by space, we only provide QPS in this paper7. Second is system loads including CPU usage and memory usage. Third is network overhead including uploading data and downloading data. Here we especially compare GDS with FedX, because DARQ fails providing correct results or is time out on many queries.Property "Has Results" (as page type) with input value "We divide evaluation results of GDS, FedX and DARQ intothree categories. First is query performance related metrics.Limited by space, we only provide QPS in this paper7.Second is system loads including CPU usage and memoryusage. Third is network overhead including uploading dataand downloading data. Here we especially compare GDSwith FedX, because DARQ fails providing correct results oris time out on many queries." contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.