Eris: efficiently measuring discord in multidimensional sources

Bibliographic record details
Title: Eris: efficiently measuring discord in multidimensional sources
Authors: Alberto Abelló, James Cheney
Contributors: Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació, Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering
Source: UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
Publisher information: Springer Science and Business Media LLC, 2022.
Publication year: 2022
Subject terms: Bases de dades; COVID-19 (Disease); Data fusion; Data alignment; Multidimensional schema; Decisió, Presa de; Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació::Bases de dades; Data integration (Computer science); COVID-19 (Malaltia); Decision-making; Trustworthiness
Description: Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and data fusion. To solve the latter, it is mostly assumed that ground truth can be determined. However, in general, the data gathering processes in the different sources are imperfect and cannot provide an accurate merging of values. Thus, in the absence of ways to determine ground truth, it is important to at least quantify how far from being internally consistent a dataset is. Hence, we propose definitions of concordant data and define a discordance metric as a way of measuring disagreement to improve decision-making based on trustworthiness. We define the discord measurement problem for numerical attributes: given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different conceptualizations of the same reality (e.g., granularities or units), we wish to assess whether the different sources are concordant or, if not, use the discordance metric to quantify how discordant they are. We also define a set of algebraic operators to describe the alignments of different data sources with correctness guarantees, together with two alternative relational database implementations that reduce the problem to linear or quadratic programming. These are evaluated against both COVID-19 and synthetic data, and our experimental results show that discordance measurement can be performed efficiently in realistic situations.
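Illustrative note: the reduction to quadratic programming mentioned in the description can be pictured with a toy case. The sketch below is not the authors' Eris implementation, and this record does not give the paper's actual definition of the discordance metric. It assumes a single alignment constraint (regional counts should add up to a reported national total) and assumes discord is the smallest sum of squared adjustments that makes the observations consistent. With one equality constraint, that quadratic program has a closed-form solution via a Lagrange multiplier, so no solver is needed. The function name `sum_discord` and the sample figures are hypothetical.

```python
from typing import List, Tuple


def sum_discord(parts: List[float], total: float) -> Tuple[float, List[float], float]:
    """Toy least-squares discord for one constraint: sum(parts) == total.

    Assumed objective (not taken from the paper): minimise the sum of squared
    adjustments to every observation, parts and total alike, subject to the
    adjusted parts adding up to the adjusted total.
    """
    n = len(parts)
    # How far the reported parts overshoot (+) or undershoot (-) the total.
    residual = sum(parts) - total
    # The Lagrange conditions spread the residual evenly over all n + 1
    # observations: each part moves by -shift, the total moves by +shift.
    shift = residual / (n + 1)
    adjusted_parts = [p - shift for p in parts]
    adjusted_total = total + shift
    # Sum of squared adjustments: (n + 1) * shift**2 == residual**2 / (n + 1).
    discord = residual ** 2 / (n + 1)
    return discord, adjusted_parts, adjusted_total


# Hypothetical data: three regional counts and a reported national total.
discord, adj_parts, adj_total = sum_discord([120.0, 80.0, 45.0], 250.0)
print(discord, adj_parts, adj_total)
# prints: 6.25 [121.25, 81.25, 46.25] 248.75
```

A discord of 0.0 means the observations already satisfy the constraint (concordant in this toy sense). Realistic inputs involve many interacting constraints across granularities and units, which is where a general linear or quadratic programming solver becomes necessary.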
Document type: Article
File description: application/pdf
Language: English
ISSN: 0949-877X
1066-8888
DOI: 10.1007/s00778-023-00810-3
DOI: 10.2139/ssrn.4184515
Rights: CC BY
Accession number: edsair.doi.dedup.....ff843f11aec056144a10f6d41b936974
Database: OpenAIRE