Providing high-quality content and uniform access to it, through effective tools, is undoubtedly a major reason for the renown of the World Wide Web, as evidenced by the increasing, sometimes uncritical, trust placed in Web data sources for personal and business purposes. The latest advances on the Web are notably represented by the emergence of the multi-source paradigm, i.e., Web applications which, rather than relying on one particular data source, exploit information from various sources in order to construct valuable data or to ease access to it.
Multi-source Web applications nowadays cover a broad portion of the inestimable amount of data on the Web. These applications vary in the way sources are involved in content production. Multi-source Web systems may correspond to Web-scale collaborative systems (Wikipedia or Google Drive) where sources, i.e., contributors, are active actors in content creation, or to domain-specific Web systems (Google Hotel Finder, Yahoo! Finance, MarineTraffic, etc.) in which data that already reside on multiple heterogeneous sources are unified. The success of both types of Web applications is largely due to the integration of views from multiple structured sources; the degree to which their information, given its structure, corroborates each other is a qualitative clue about the relevance of the result. As telling examples, the online encyclopedia Wikipedia is built on a quite large community of active contributors working on structured articles, even though its growth has slowed in recent years [Voss, 2005; Halfaker et al., 2013]. Domain-specific Web applications such as travel booking Websites aggregate data from thousands of sources with well-known templates, including hundreds of Web portals of hotels, flight companies, etc. Relying on multiple sources, and thereby on a wider spectrum of data quality levels and a higher probability of conflicts, inevitably raises expectations about sources' trustworthiness and data relevance. This requires efficient uncertainty management and data quality estimation algorithms during data integration. Uncertain data integration is indeed a building block for multi-source Web applications: few users will continue to trust these systems if they cannot find data that is both reliable and relevant to their needs.
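To make the idea concrete, here is a minimal sketch, in Python with hypothetical source names and trust scores, of the kind of conflict resolution that uncertain data integration performs: each conflicting value is scored by the total trust mass of the sources supporting it.

```python
from collections import defaultdict

def resolve(claims, trust):
    """Pick the value whose supporting sources carry the largest total
    trust mass (a simple weighted vote), with a normalized confidence."""
    votes = defaultdict(float)
    for source, value in claims:
        votes[value] += trust[source]
    best = max(votes, key=votes.get)
    return best, votes[best] / sum(votes.values())

# Hypothetical example: three sources disagree on a hotel's star rating.
claims = [("siteA", 4), ("siteB", 4), ("siteC", 5)]
trust = {"siteA": 0.9, "siteB": 0.5, "siteC": 0.8}
print(resolve(claims, trust))  # -> (4, 0.636...) under these trust scores
```

This weighted vote is only one of the simplest possible policies; the chapters that follow refine it with richer models of source quality.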
Uncertain Multi-Version Tree Data
As highlighted earlier, Web-scale collaborative systems form a significant type of multi-source Web application, with well-known examples such as the free encyclopedia Wikipedia and the Google Drive collaboration tools. Content management in many collaborative editing systems, especially online platforms, is realized within a version control setting. A typical user, willing to build knowledge around a given subject by involving the views of other persons, creates and shares an initial version of the content, possibly consisting of only a title, for further revision. The version control system then tracks the subsequent versions of the shared content, as well as the changes between them, in order to enable fixing errors made during the revision process, querying past versions, and integrating content from different contributors. Much effort related to version control has been carried out both in research and in applications; see the surveys [Altmanninger et al., 2009; Koc and Tansel, 2011]. The prime applications were collaborative document authoring, computer-aided design, and software development systems. Currently, powerful version control tools such as Subversion [Pilato, 2004] and Git [Chacon, 2009] efficiently manage large source code repositories and shared file systems.
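As a toy illustration of the version tracking just described, the following Python sketch (with hypothetical class and function names, not tied to any specific tool) models versions as nodes of a DAG whose parent links support both querying past versions and merging concurrent contributions.

```python
import uuid

class Version:
    """A node in a version DAG: a content snapshot plus its parent
    versions (two for a merge, one for an edit, none for the root)."""
    def __init__(self, content, parents=()):
        self.id = uuid.uuid4().hex[:8]
        self.content = content
        self.parents = list(parents)

def history(version):
    """Yield the version and every ancestor once (a past-versions query)."""
    seen, stack = set(), [version]
    while stack:
        v = stack.pop()
        if v.id not in seen:
            seen.add(v.id)
            yield v
            stack.extend(v.parents)

root = Version("title only")
v1 = Version("title + section A", parents=[root])
v2 = Version("title + section B", parents=[root])   # concurrent edit
merged = Version("title + sections A and B", parents=[v1, v2])
print([v.content for v in history(merged)])
```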
Unfortunately, existing approaches leave no room for uncertainty handling, for instance uncertain data resulting from conflicts. Contradictions are usual in collaborative work, in particular when the setting is open, arising whenever concurrent edits change the same content and thus lead to ambiguities in content management. But conflicts are not the only source of uncertainty in the version control process. In fact, Web-scale collaborative editing platforms are inherently uncertain: platforms such as Wikipedia or Google Drive enable unbounded interactions between a large number of contributors, without prior knowledge of their level of expertise and reliability. Uncertainty in such environments is omnipresent, due to the unreliability of the sources, the incompleteness and imprecision of the contributions, the possibility of malicious edits and acts of vandalism, etc. This drives the need for an entirely new way to manage the different versions of the shared content and to access them. In other terms, a version control technique able to properly handle uncertain data may be very useful for this class of applications, as we illustrate next with two application scenarios.
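One simple way to picture what such an uncertainty-aware version control could compute is sketched below in Python. It assumes, purely for illustration, that each edit carries the reliability score of its author and that edits are independent; the confidence in a version is then the product of the reliabilities of the edits it is built from.

```python
from math import prod

# Hypothetical reliability scores per contributor; the independence of
# edits is an assumption made only to keep the sketch simple.
reliability = {"alice": 0.95, "bob": 0.6, "anonymous": 0.3}

def version_confidence(edit_authors):
    """Confidence that a version is correct when it is built from the
    given sequence of edits: the product of its edits' reliabilities."""
    return prod(reliability[author] for author in edit_authors)

# A version obtained through three successive edits:
print(version_confidence(["alice", "bob", "alice"]))  # 0.95 * 0.6 * 0.95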
Web Data Integration under Constraints
Integrating data from multiple heterogeneous sources is one of the most common tasks in classical multi-source Web systems. For example, many platforms on the Web, and in particular domain-specific applications like online hotel reservation services, use and integrate data from hundreds of different sources. More importantly, such platforms, aside from their own specific ways of gathering information, often try to capitalize on data already on the Web in order to meet users' expectations; nowadays, there is a proliferation of Web systems, e.g., sharing platforms, company Websites, or general Websites, that collect and keep up to date as much information as possible about various real-life areas. It is known that Web sources are mostly untrustworthy and that their data usually come with uncertainties; for instance, sources can be unreliable, and data can be obsolete or extracted using imprecise automatic methods. At the same time, dependent sources (copying between sources is a reality on the Web, as described in [Dong et al., 2010]) or geographical information (in the case of moving objects) may be involved in the data integration process, and these aspects should be carefully taken into consideration. When integrating uncertain data, the relationships between sources or the intrinsic nature of the information may indeed influence both the modeling and the assessment of uncertainties.
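To illustrate why copying relationships matter, the following Python sketch, with hypothetical copy probabilities, discounts the vote of a source suspected of copying, in the spirit of the copy-aware vote counting of [Dong et al., 2010].

```python
def effective_votes(supporters, copy_prob):
    """Effective number of independent votes for a value: each source's
    vote is discounted by the probability that it merely copied it."""
    return sum(1.0 - copy_prob.get(s, 0.0) for s in supporters)

supporters = ["siteA", "siteB", "siteC"]
copy_prob = {"siteC": 0.8}   # hypothetical: siteC likely copies siteA
print(effective_votes(supporters, copy_prob))  # 2.2 instead of 3 votes
```

Ignoring such dependencies would let a copied value accumulate spurious support, which is precisely what a copy-aware integration process must avoid.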
We consider in this work the problem of uncertain data integration given some constraints. We first study how to reconcile uncertain and dependent Web sources, and then we tackle the problem in the presence of spatio-temporal information.
Table of Contents
1 Introduction
1.1 Uncertain Multi-Version Tree Data
1.2 Web Data Integration under Constraints
1.3 Truth Finding with Correlated Data Attributes
I Uncertain Multi-Version Tree Data
2 Uncertain XML Version Control Model
2.1 Related Work
2.1.1 Research on Version Control
2.1.2 Uncertain Tree-Structured Data Models
2.1.3 Quality in Collaborative Editing Systems
2.2 Preliminaries
2.3 Probabilistic XML
2.4 Uncertain Multi-Version XML Setting
2.4.1 Multi-Version XML Documents
2.4.2 Uncertain Multi-Version XML Documents
2.4.3 Probabilistic XML Encoding Model
2.5 Conclusion
3 Updates in Uncertain XML Version Control
3.1 Updating Uncertain Multi-Version XML
3.1.1 Uncertain Update Operation
3.1.2 Uncertain Update over Probabilistic XML Encoding
3.2 Evaluation of the Uncertain XML Version Control Model
3.2.1 Performance Analysis
3.2.2 Filtering Capabilities
3.3 Conclusion
4 Merging in Uncertain XML Version Control
4.1 Related Work
4.2 A Typical Three-Way Merge Process
4.2.1 Edit Detection
4.2.2 Common Merge Cases
4.3 Merging Uncertain Multi-Version XML
4.3.1 Uncertain Merge Operation
4.3.2 Uncertain Merging over Probabilistic XML Encoding
4.4 Conclusion
II Structured Web Data Integration
5 Web Data Integration under Constraints
5.1 Related Work
5.2 Motivating Application
5.2.1 Multiple Web Sources
5.2.2 Uncertain Web Data Sources
5.2.3 Copying Relationships between Sources
5.3 Web Data Integration under Dependent Sources
5.3.1 Main Prerequisites
5.3.2 Probabilistic Tree Data Integration System
5.4 Uncertain Web Information on Moving Objects
5.4.1 Data Extraction
5.4.2 Uncertainty Estimation
5.4.2.1 Precision of Location Data
5.4.2.2 Computing User Trust Score
5.4.2.3 Integrating Uncertain Attribute Values
5.5 Maritime Traffic Application
5.5.1 Use Case
5.5.2 System Implementation
5.5.3 Demonstration Scenario
5.6 Conclusion
6 Truth Finding over Structured Web Sources
6.1 Related Work
6.2 Preliminaries and Problem Definition
6.2.1 Preliminary Definitions
6.2.2 Accu Truth Finding Algorithm
6.2.3 Problem Definition
6.3 Partition-Aware Truth Finding Process
6.3.1 Weight Function for Partitions
6.3.2 Exact Exploration Algorithm
6.3.3 Approximative Exploration
6.4 Experimental Evaluation
6.5 Conclusion
7 Research Perspectives
7.1 Uncertain Multi-Version Tree Data
7.2 Web Data Integration under Constraints
7.3 Truth Finding with Correlated Data Attributes
A Other Collaborations
8 Conclusion