Synthetic data: provocations and frictions of emerging data regimes

Synthetic data is defined as information created by computer simulations or algorithms that reproduce the structural and statistical properties of so-called “real-world” data. In recent years, synthetic data has emerged as a promised fix for several of the AI industry’s problems: the demand for more and better-quality training data, the protection of personal data via anonymization, and the mitigation of data bias. A growing field of scholarship examines the new data economies and labour practices emerging in the synthetic data industry. Synthetic data also has implications at the level of policy, data regulation and notions of personal data. Although the idea of synthetic data has been known and widely used since the 1940s, its potential application as training data for AI presents new challenges and requires thorough investigation from the social sciences and humanities.
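To make the definition concrete, the following is a minimal, purely illustrative Python sketch of the simplest case: a statistical model is fitted to a “real” dataset and then sampled, so that the synthetic records reproduce the original’s aggregate statistical structure without copying any individual record. All data and variable names here are hypothetical.

```python
# Minimal sketch: synthetic data that reproduces the statistical
# properties (here, mean and covariance) of a "real" dataset.
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for sensitive "real-world" data: 1,000 records, 3 attributes.
real = rng.multivariate_normal(
    mean=[50, 120, 0.3],
    cov=[[25.0, 10.0, 0.1], [10.0, 80.0, 0.2], [0.1, 0.2, 0.01]],
    size=1000,
)

# Fit a simple generator (a multivariate Gaussian) to the real data...
mu, sigma = real.mean(axis=0), np.cov(real, rowvar=False)

# ...and sample fresh records from it. No synthetic row corresponds to
# a real individual, yet aggregate statistics are approximately kept.
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1000)

print(np.corrcoef(real, rowvar=False).round(2))       # real correlations
print(np.corrcoef(synthetic, rowvar=False).round(2))  # similar values
```

In practice the generator is usually far more complex (e.g. a GAN or a diffusion model), but the underlying logic is the same: what is preserved is the statistics, not the records.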

By surfacing frictions and provocations of synthetic data, the seminar aims to create a fruitful dialogue within a nascent field. We invite contributions on synthetic data and the synthetic turn from the social sciences and humanities, including critical data studies, cultural studies, cultural theory, philosophy, political economy, sociology and STS.

Attendance is free but requires registration. If you want to present a paper, please send a short abstract (around 150-200 words) to kpas@hum.ku.dk by 10 October.

Programme

09:30 – Coffee & croissants

09:45 – Welcome

10:00 – Keynote: James Steinhoff


Synthetic data: new challenges for critical data studies

This talk introduces critical research on synthetic data: the latest darling of the artificial intelligence industry. Synthetic data is data which is generated by a computational process rather than captured by recording phenomena from the real world or from digital platforms. With a synthetic dataset consisting of procedurally generated 3D models of faces, one can train a working facial recognition model that has never been exposed to a real human face (Wood et al. 2021). I contend that synthetic data has political, economic, epistemological and ontological implications for the critical study of data which have only just begun to be explored (Steinhoff 2022). It is no exaggeration to say that synthetic data is being positioned as the cure for all of the big problems facing AI: it is purported to be the solution to data bias, to privacy/surveillance, and to the labour costs associated with data collection and labelling. I discuss the nascent synthetic data industry and its global scope, as well as the most popular applications of the technology. It will also be necessary to explore the “reality gap”, the central technical problem posed by synthetic data: whether a model trained on synthetic data will function when deployed in the real world. I discuss the critical implications of synthetic data along several dimensions: the ontological connection between synthetic data and the human; the possibility of a data-intensive capitalism which does not rely on surveillance; the novel spatial and temporal dynamics of synthetic data; and the labour process by which synthetic data is produced in simulations. I conclude with some considerations on the future of synthetic data in relation to the problem of model collapse.
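As a minimal illustration of how the reality gap can be operationalised (a toy sketch on invented data, not the keynote’s method): train a model on “synthetic” data, then compare its accuracy on data drawn from a slightly different “real” distribution.

```python
# Hypothetical sketch of the "reality gap": train a classifier on
# synthetic data, then measure how much accuracy it loses on "real"
# data whose distribution the simulation only approximates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=2)

def make_data(shift, n=2000):
    """Two-class toy data; `shift` mimics simulation-to-reality drift."""
    X0 = rng.normal(0.0 + shift, 1.0, size=(n // 2, 5))
    X1 = rng.normal(1.0, 1.0, size=(n // 2, 5))
    return np.vstack([X0, X1]), np.repeat([0, 1], n // 2)

X_syn, y_syn = make_data(shift=0.0)    # the simulated world
X_real, y_real = make_data(shift=0.5)  # the deployment world

model = LogisticRegression().fit(X_syn, y_syn)
print("accuracy on synthetic data:", round(model.score(X_syn, y_syn), 3))
print("accuracy on real data:", round(model.score(X_real, y_real), 3))
# The drop between the two scores is one way to quantify the gap.
```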

References

Steinhoff, J., 2022. Toward a political economy of synthetic data: A data-intensive capitalism that is not a surveillance capitalism? New Media & Society 26(6).

Wood, E., Baltrušaitis, T., Hewitt, C., Dziadzio, S., Cashman, T.J. and Shotton, J., 2021. Fake it till you make it: face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3681-3691).


Bio

James Steinhoff is Assistant Professor in the School of Information and Communication Studies at University College Dublin. His research focuses on automation and the political economy of AI and data. He is the author of Automation and Autonomy: Labour, Capital and Machines in the Artificial Intelligence Industry (Palgrave 2021) and co-author of Inhuman Power: Artificial Intelligence and the Future of Capitalism (Pluto 2019).


11:30 – Lunch break

Presentations

(Authors have 15 minutes to present, followed by a short Q&A)

12:30 – Louis Ravn

12:50 – Charlotte Högberg

13:10 – Hannah Devinney

13:30 – Ekaterina Pashevich & Jens Ulrik Hansen

14:00 – Coffee & cake

14:30 – Roundtable


Investigating the ‘Global Synthetic Dataset’(s) on human trafficking: Towards synthetic data justice?

Louis Ravn, University of Copenhagen

Abstract

The rise of synthetic data has begun to inspire novel data-driven projects in highly sensitive contexts. A prominent example concerns the three ‘Global Synthetic Dataset’(s) (GSDs) – the 2024 iteration containing over 206,000 data points on survivors of human trafficking – that have been released by the UN’s International Organization for Migration and Microsoft. The aim of publishing these datasets is to aid the fight against human trafficking by leveraging synthetic data’s promise of privacy – a crucial requirement in this domain. In this provocation, I foreground how synthetic data’s ambivalences – oscillating between privacy promises and the foreclosure of critical debates – play out in the context of the GSDs. To that end, this talk draws on Taylor’s framework of data justice, asking how its pillars are reconfigured through synthetic data. Drawing on publicly available materials as well as three key informant interviews, I suggest that the GSDs continue long-standing data justice concerns but also engender new questions corresponding to synthetic data’s specificities. Accordingly, the provocation of this talk is to ask what a synthetic data justice framework might look like. This is urgent, as it directs scholars and practitioners towards critical questions about synthetic data as its use seeps into increasingly sensitive contexts.


“This ground truth is muddy anyway”: Dealing with ground truth data assemblages for medical AI

Charlotte Högberg, Lund University

Abstract

This paper explores assemblages of ground truth datasets for medical artificial intelligence (AI). Drawing on interviews and observations, I examine how researchers involved in developing medical AI relate to the referential truth basis of their work – their ground truths – and its epistemic implications. As datasets are assembled from different sources, and produced, augmented and synthesized, this study shows the role of human expertise, the perceived strengths and limits of expert-based annotations, the (in)stability of medical classifications and, at times, data frictions. The differing valuations of data sources and the perceived brittleness of documentation and labels shatter the image of classifications as stable, neutral entities. Limits and absences of ground truths, moreover, make visible the perceived promises and worries pertaining to medical AI and synthetic data. In sum, this paper shows how ground truths are negotiated as shaping medical phenomena, data work and algorithms alike, and points to the epistemic implications of these ideas and practices. To assess whether medical AI can be fair, trustworthy or transparent, we need more knowledge of the assumptions upon which it is built.


The problem of representation in synthetic data

Hannah Devinney, Linköping University

Katherine Harrison, Linköping University

Irina Shklovski, University of Copenhagen

Abstract

Data is often sparse, biased, or private. Synthetic data is attractive in part because it promises to solve these problems. In the case of bias, this promise is framed as data that can be as perfectly representative as needed for a given application. While there is little question that synthetic data can address data scarcity, and while questions of privacy are specific to the particular types of data generated, the challenge of bias and diversity in synthetic data remains far less explored.

Synthetic data generation approaches promise the capacity to generate improved versions of datasets with more balanced representation, for example by increasing the occurrence of rare events, providing robustness to outliers, or balancing datasets along a particular category (Whitney and Norman, 2024). However, such a claim raises its own questions. What do we mean by “diversity” or “representation” within data, whether synthetic or non-synthetic? What are the implications of this meaning with regards to data and algorithmic bias?

To produce “diverse” datasets we must first decide what sort of diversity we want to achieve. Should a dataset accurately reflect the world as-is, or should it represent some alternative reality or imagined future? We must also decide what statistical models and techniques might be able to approximate such diversity outcomes. The optimism around synthetic data as a solution to data bias implies a certainty about what synthetic data should look like. It also implies that extant statistical and modelling techniques can precisely produce the required distributions. Yet studies on language models suggest that representing imagined futures is much harder than it sounds (Hofmann et al., 2024; Devinney et al., 2024), and evaluations of synthetic datasets demonstrate that diversity through reproduction of statistical distributions can result in absurdity (Johnson and Hajisharif, 2024). This talk will explore the places where synthetic data’s promise of “representation” as a counter to bias falls short or collides with other issues.
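As a toy illustration of the balancing techniques at issue (a SMOTE-style interpolation sketch on invented data, not taken from any of the cited works): a rare class is “topped up” with synthetic rows interpolated between its existing members, which evens the counts while only re-describing what the original rows already contain.

```python
# Hypothetical sketch: balance a dataset along a category by
# interpolating synthetic rows between existing rare-class rows.
import numpy as np

rng = np.random.default_rng(seed=1)

X_major = rng.normal(0.0, 1.0, size=(950, 4))  # majority class
X_minor = rng.normal(2.0, 1.0, size=(50, 4))   # rare class

# Draw random pairs of rare-class rows and interpolate between them.
n_needed = len(X_major) - len(X_minor)
i = rng.integers(0, len(X_minor), size=n_needed)
j = rng.integers(0, len(X_minor), size=n_needed)
t = rng.random((n_needed, 1))
X_synth = X_minor[i] + t * (X_minor[j] - X_minor[i])

X_balanced = np.vstack([X_major, X_minor, X_synth])
print(X_balanced.shape)  # (1900, 4): equal counts per class, but the
# 900 new rows carry no information beyond the 50 originals -- which is
# exactly the representational question the abstract raises.
```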

References

Devinney, H., Björklund, J., and Björklund, H. (2024). We don’t talk about that: Case studies on intersectional analysis of social bias in large language models. In Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pages 33–44, Bangkok, Thailand. Association for Computational Linguistics.

Hofmann, V., Kalluri, P. R., Jurafsky, D., and King, S. (2024). AI generates covertly racist decisions about people based on their dialect. Nature, 633(8028):147–154.

Johnson, E. and Hajisharif, S. (2024). The intersectional hallucinations of synthetic data. AI & SOCIETY. 

Whitney, C. D. and Norman, J. (2024). Real risks of fake data: Synthetic data, diversity-washing and consent circumvention. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24), pages 1733–1744, New York, NY, USA. Association for Computing Machinery.


What is synthetic data anyway?

Ekaterina Pashevich, University of Copenhagen

Jens Ulrik Hansen, Roskilde University

Abstract

Synthetic data is seen as an alternative to the expensive and time-consuming collection of real-world data, as well as a solution to data insufficiency (Nikolenko, 2021). While its main promise is to assist businesses with privacy and compliance (Raghunathan, 2021), its potentially most intriguing application is providing synthetically generated datasets for training AI models (Wood et al., 2021). Considering this application of synthetic data, we raise the question: how much can we learn about the world from data that is intentionally not real?

Although synthetic data has been used since the 1940s to solve statistical problems through simulation methods (Jordon et al., 2022), and especially in the 1960s in experiments with computer vision (Nikolenko, 2021), its current application in training algorithmic models could have far-reaching consequences. Specifically, there is a concern that the real world might come to be shaped by how it is represented in synthetic datasets. In our presentation, we will explore the range of data manipulations that fall under the umbrella of “synthetic data” and examine the boundaries of what can truly be considered synthetic. Ultimately, we aim to assess whether synthetic data can fulfill its promises and what broader implications this may carry for society.

References

Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., ... & Weller, A. (2022). Synthetic Data – what, why and how? London: The Royal Society. arXiv preprint arXiv:2205.03257.

Nikolenko, S.I. (2021). Synthetic Data for Deep Learning. Cham: Springer Nature.

Raghunathan, T.E. (2021). Synthetic Data. Annual Review of Statistics and Its Application. https://doi.org/10.1146/annurev-statistics-040720-031848

Wood, E., Baltrušaitis, T., Hewitt, C., Dziadzio, S., Cashman, T. J., & Shotton, J. (2021). Fake it till you make it: face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 3681-3691).