SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions

Bibliographical Metadata
Subject:	Querying Distributed RDF Data Sources
Year:	2011
Authors:	Olaf Görlitz, Steffen Staab
Venue:	COLD
Content Metadata
Problem:	SPARQL Query Federation
Approach:	query optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions
Implementation:	SPLENDID
Evaluation:	query execution performance evaluation

Abstract

In order to leverage the full potential of the Semantic Web it is necessary to transparently query distributed RDF data sources in the same way as it has been possible with federated databases for ages. However, there are significant differences between the Web of (linked) Data and the traditional database approaches. Hence, it is not straightforward to adapt successful database techniques for RDF federation. Reasons are the missing cooperation between SPARQL endpoints and the need for detailed data statistics for estimating the costs of query execution plans. We have implemented SPLENDID, a query optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions.

Conclusion

SPLENDID allows for transparent query federation over distributed SPARQL endpoints. In order to achieve good query execution performance, data source selection and query optimization are based on basic statistical information obtained from VOID descriptions. The use of open Semantic Web standards, like VOID and SPARQL endpoints, allows for flexible integration of various distributed and linked RDF data sources. We have described in detail the implementation of the data source selection and the join order optimization. The evaluation shows that our approach can achieve good query performance and is competitive with other state-of-the-art federation implementations. In our analysis of the source selection we came to the conclusion that at least predicate and type statistics should be included in VOID descriptions for RDF datasets. The use of third-party sameAs links, however, can significantly increase the number of requests and thus hamper the efficiency of query execution plans. The comparison of the two employed physical join implementations has shown that the network overhead plays an important role. Both hash join and bind join can significantly reduce the query processing time for certain types of queries. With SPLENDID we would also like to advocate the adoption of VOID statistics for Linked Data. As next steps, we plan to investigate whether VOID descriptions can easily be extended with more detailed statistics in order to allow for more accurate cardinality estimates and, thus, better query execution plans. On the other hand, the actual query execution has not yet been optimized in SPLENDID. Therefore, we plan to integrate optimization techniques as used in FedX. Moreover, the adoption of the SPARQL 1.1 federation extension will also allow for more efficient query execution.
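
The trade-off between the two physical joins can be illustrated with a small simulation. The sketch below is not SPLENDID's code: the `Endpoint` class, its `match` method, and the `requests` counter are hypothetical stand-ins for remote SPARQL endpoints, and counting one request per distinct binding is a simplification (real systems may batch bindings into fewer requests).

```python
from collections import defaultdict


class Endpoint:
    """Hypothetical in-memory stand-in for a remote SPARQL endpoint."""

    def __init__(self, triples):
        self.triples = list(triples)
        self.requests = 0  # number of simulated network round trips

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a variable."""
        self.requests += 1
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


def hash_join(left, right, p_left, p_right):
    """Fetch both patterns in full, then join locally on left.o == right.s."""
    table = defaultdict(list)
    for s, _, o in left.match(p=p_left):
        table[o].append(s)
    results = []
    for s, _, o in right.match(p=p_right):
        for left_s in table.get(s, []):
            results.append((left_s, s, o))
    return results


def bind_join(left, right, p_left, p_right):
    """Fetch the left pattern, then query the right endpoint once per
    distinct join value, with that value bound as the subject."""
    cache = {}
    results = []
    for left_s, _, join_value in left.match(p=p_left):
        if join_value not in cache:
            cache[join_value] = right.match(s=join_value, p=p_right)
        for _, _, o in cache[join_value]:
            results.append((left_s, join_value, o))
    return results
```

Under these assumptions the hash join always issues two requests but transfers every triple matching the right-hand pattern, whereas the bind join issues more requests and transfers only the joining triples. Which of the two is cheaper depends on the selectivity of the left-hand pattern and on the network latency per request.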

Future work

As next steps, we plan to investigate whether VOID descriptions can easily be extended with more detailed statistics in order to allow for more accurate cardinality estimates and, thus, better query execution plans. On the other hand, the actual query execution has not yet been optimized in SPLENDID. Therefore, we plan to integrate optimization techniques as used in FedX. Moreover, the adoption of the SPARQL 1.1 federation extension will also allow for more efficient query execution.
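
To see why more detailed statistics matter for cardinality estimates, consider how a simple estimate can be derived from per-predicate counts. The sketch below assumes a uniform distribution of values, which is a textbook simplification and not necessarily the formula SPLENDID uses. The class and function names are hypothetical; the three counts correspond to the voiD terms `void:triples`, `void:distinctSubjects`, and `void:distinctObjects`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredicateStats:
    """Per-predicate counts as a voiD property partition could provide them."""
    triples: int            # void:triples
    distinct_subjects: int  # void:distinctSubjects
    distinct_objects: int   # void:distinctObjects


def estimate_cardinality(stats, subject_bound=False, object_bound=False):
    """Estimate the result size of a triple pattern with a bound predicate.

    Assumes values are uniformly distributed, so binding the subject (object)
    divides the triple count by the number of distinct subjects (objects).
    """
    estimate = float(stats.triples)
    if subject_bound:
        estimate /= max(stats.distinct_subjects, 1)
    if object_bound:
        estimate /= max(stats.distinct_objects, 1)
    return estimate
```

For a predicate with 1000 triples and 100 distinct subjects, a pattern with a bound subject is estimated at 10 results. Skewed data can deviate strongly from such an estimate, and finer-grained statistics would reduce that error.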

Approach

Positive Aspects: No data available now.

Negative Aspects: No data available now.

Limitations: No data available now.

Challenges: No data available now.

Proposes Algorithm: No data available now.

Methodology: No data available now.

Requirements: No data available now.

Implementations

Download-page: https://github.com/semagrow/fork-splendid-server

Access API: No data available now.

Information Representation: RDF

Data Catalogue: VoID

Runs on OS: OS independent

Vendor: Open source

Uses Framework: -

Has Documentation URL: No data available now.

Programming Language: Java

Version: 1.0

Platform: Sesame

Toolbox: No data available now.

GUI: No

Research Problem

Subproblem of: Querying Distributed RDF Data Sources

RelatedProblem: retrieve and join the result tuples

Motivation: No data available now.

Evaluation

Experiment Setup: Due to the unpredictable availability and latency of the original SPARQL endpoints of the benchmark dataset, we used local copies of them, which were hosted on five 64-bit Intel(R) Xeon(TM) CPU 3.60GHz server instances running Sesame 2.4.2, with each instance providing the SPARQL endpoint for one life science and one cross domain dataset. The evaluation was performed on a separate server instance with a 64-bit Intel(R) Xeon(TM) CPU 3.60GHz and a 100Mbit network connection.

Evaluation Method: The goal of the evaluation is to show that SPLENDID is able to achieve good query execution performance for real world federation scenarios.

Hypothesis: -

Description: We investigated how the information from the VOID descriptions affects the accuracy of the source selection. For each query, we look at the number of sources selected and the resulting number of requests to the SPARQL endpoints. We tested three different source selection approaches, based on 1) predicate index only (no type information), 2) predicate and type index, and 3) predicate and type index and grouping of sameAs patterns as described in Section 4.2 of the paper.
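
The difference between the first two source selection approaches can be sketched as index lookups. This is a minimal illustration under assumptions, not SPLENDID's implementation: the function name, the dictionary-based indexes, and the prefixed names used as keys are hypothetical. The third approach (grouping of sameAs patterns) depends on details of the paper that are not reproduced here, so it is left out.

```python
RDF_TYPE = "rdf:type"


def select_sources(pattern, predicate_index, type_index=None):
    """Return the set of sources that may answer a triple pattern.

    pattern is (subject, predicate, object); None marks a variable.
    predicate_index maps a predicate to the set of sources that use it.
    type_index maps a class to the set of sources with instances of it.
    """
    _, p, o = pattern
    if p is None:
        # Unbound predicate: every known source is a candidate.
        all_sources = set()
        for sources in predicate_index.values():
            all_sources |= sources
        return all_sources
    if p == RDF_TYPE and o is not None and type_index is not None:
        # Type index available: keep only sources with instances of the class.
        return set(type_index.get(o, set()))
    return set(predicate_index.get(p, set()))
```

Because nearly every dataset contains `rdf:type` triples, a predicate index alone tends to select all sources for a type pattern. With the type index, a pattern such as `(None, "rdf:type", "foaf:Person")` is narrowed to the sources that actually hold instances of that class, which reduces the number of requests sent to the endpoints.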

Dimensions: Performance

Benchmark used: FedBench

Results: AliBaba and DARQ fail to return results for six out of the 14 queries for different reasons. AliBaba generates malformed subqueries for CD3, CD5, LS6, and LS7. DARQ cannot handle the unbound predicate in CD1 and LS2. For CD3 and CD5 DARQ opens too many connections to GeoNames. All other unsuccessful queries take longer than the time limit of five minutes. Overall, FedX has the best query evaluation performance. The reason is its novel and efficient query execution based on block transmission of result tuples and parallelization of joins. However, there is only a significant difference between FedX and SPLENDID for CD6, CD7, LS3, and LS5-7. For the other queries SPLENDID is close to FedX, and for CD3 and CD4 even slightly faster, which indicates that SPLENDID, indeed, generates better query execution plans.