Academic Journal
Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government
| Title: | Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government |
|---|---|
| Authors: | Barrientos, Andrés F., Bolton, Alexander, Balmat, Tom, Reiter, Jerome P., de Figueiredo, John M., Machanavajjhala, Ashwin, Chen, Yan, Kneifel, Charley, DeLong, Mark |
| Source: | Ann. Appl. Stat. 12, no. 2 (2018), 1124-1156 |
| Publication Status: | Preprint |
| Publisher Information: | Institute of Mathematical Statistics, 2018. |
| Publication Year: | 2018 |
| Subject Terms: | FOS: Computer and information sciences, Applied Statistics, public, Disclosure, privacy, remote, Statistics - Applications, 01 natural sciences, United States--Officials and employees--Data processing, synthetic, Applications (stat.AP), Social sciences--Statistical methods, 0101 mathematics, Data protection |
| Description: | Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, i.e., data simulated from statistical models, while also providing users access to a verification server that allows them to assess the quality of inferences from the synthetic data. We present an application of the synthetic data plus verification server approach to longitudinal data on employees of the U.S. federal government. As part of the application, we present a novel model for generating synthetic career trajectories, as well as strategies for generating high dimensional, longitudinal synthetic datasets. We also present novel verification algorithms for regression coefficients that satisfy differential privacy. We illustrate the integrated use of synthetic data plus verification via analysis of differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up and which do not. The analysis on the confidential data reveals pay differentials across races not documented in published studies. |
| Document Type: | Article; Other literature type |
| File Description: | application/pdf |
| ISSN: | 1932-6157 |
| DOI: | 10.1214/18-aoas1194 |
| DOI: | 10.48550/arxiv.1705.07872 |
| Access URL: | https://projecteuclid.org/journals/annals-of-applied-statistics/volume-12/issue-2/Providing-access-to-confidential-research-data-through-synthesis-and-verification/10.1214/18-AOAS1194.pdf http://arxiv.org/abs/1705.07872 https://projecteuclid.org/journals/annals-of-applied-statistics/volume-12/issue-2/Providing-access-to-confidential-research-data-through-synthesis-and-verification/10.1214/18-AOAS1194.full https://projecteuclid.org/download/pdfview_1/euclid.aoas/1532743488 https://projecteuclid.org/euclid.aoas/1532743488 |
| Rights: | arXiv Non-Exclusive Distribution |
| Accession Number: | edsair.doi.dedup.....25c6694768aa8915fe569ef92c2a80e6 |
| Database: | OpenAIRE |