Computational social science has the potential to address pressing challenges, but interdisciplinary collaboration is crucial.
The DSI Computational Social Science Working Group invites researchers to a new meeting series exploring the intersection of data science and the social sciences. Sessions will provide an informal space for sharing work in progress and discussing new methods, collaborations, and shared interests.
Join this working group to explore this exciting interdisciplinary area and potentially lay the groundwork for future projects.
This meeting series is made possible by support from the Institute for Social and Economic Research and Policy (ISERP).
The series is open to Columbia University faculty members and affiliated senior researchers interested in data and the social sciences. If you’d like to join these meetings, contact Isabella Plachter at ip2484@columbia.edu.
When available, abstracts for each speaker will be added and archived below.
Abstract: Only 37% of sub-Saharan Africans use the web, but many use WhatsApp. We find that teachers in Sierra Leone use a WhatsApp AI chatbot more often than web search for teaching assistance. Given the cost of bandwidth relative to compute, querying AI is 87% less costly than loading an average webpage. Teachers also rate AI responses as higher quality. I’ll describe how teachers use AI and discuss two ideas for designing information services for people not well served by the web.
Abstract: As data and machine learning become ever more pervasive, a vast literature has charted how to include “fairness” conditions inside various optimization pipelines. But how should “fairness interventions” be deployed in practice? This talk encourages a focus on those challenges, which typically sit outside the traditional pipeline but affect root causes and bottom lines. We’ll focus on two examples from recent collaborations, published at ACM EC and TheWebConf: understanding the effect of fairness training/prompts on programmers (in the pre-LLM era) and studying the cost of fairness in a data-exchange ecosystem.
Abstract: Racial bias in policing remains a central concern in academic research and public policy. While most existing work emphasizes individual-level explanations, this talk draws on theories of group threat and social learning to examine how the racial and ethnic composition of an officer’s peers shapes on-the-job decision-making. Using administrative data on 4.8 million daily patrol assignments and a quasi-experimental research design, we show that officer behavior is deeply influenced by workplace dynamics and peer composition. Patrol assignments exhibit high levels of racial homophily, even after accounting for sorting across districts and shifts. We also find that White officers are significantly less likely to use force against Black pedestrians when paired with racially diverse partners, especially early in their careers. These findings push policing research toward a more relational and ecological understanding of officer behavior and underscore the potential of organizational interventions to reduce racial disparities in use of force.
Abstract: Graduates earn a large earnings premium for completing a bachelor’s degree, but low-SES graduates still earn substantially less than their high-SES peers. We examine the role of the first (post-college) job transition. Students with financial, informational, or other disadvantages during the job search may be more likely to “undermatch” in their early career jobs. Using administrative data from a large, urban, public college system, we document large gaps in earnings five years after graduation by SES that remain unexplained even after controlling for GPA, field of study, and other pre-graduation characteristics. We then examine how features of the initial job transition relate to longer-term earnings and whether differences in the first job transition can explain the SES earnings gaps that exist five years after college. We find that first job transitions are rocky for many graduates, strongly predict earnings at Year 5, and are a substantial mediator of socioeconomic gaps in earnings five years after college graduation, reducing the unexplained gap by almost two-thirds.
Abstract: Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs about what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power on average by 31%. The results reveal that different types of engagement (continued reading, commenting, and voting) are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.
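(A minimal, self-contained sketch of the general idea, not the authors’ pipeline: sample several candidate continuations for a chapter, then summarize their dispersion as an “uncertainty” feature and the distance between the actual continuation and the average imagined one as a “surprise” feature. The generate_continuations function and the bag-of-words embedding below are placeholders for an LLM call and a proper text encoder.)

# Sketch: turn simulated continuations into "uncertainty" and "surprise" features.
# generate_continuations() is a placeholder for an LLM call (e.g., sampling k
# continuations at a nonzero temperature); the bag-of-words embedding stands in
# for whatever text encoder a real pipeline would use.
from collections import Counter
import numpy as np

def generate_continuations(chapter_text: str, k: int = 5) -> list[str]:
    # Placeholder: a real implementation would sample k continuations from an LLM.
    return [f"{chapter_text} ... possible continuation {i}" for i in range(k)]

def embed(text: str, vocab: list[str]) -> np.ndarray:
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def belief_features(chapter: str, actual_next: str, k: int = 5) -> dict:
    continuations = generate_continuations(chapter, k)
    vocab = sorted({w for t in continuations + [actual_next] for w in t.lower().split()})
    sims = np.stack([embed(t, vocab) for t in continuations])
    expected = sims.mean(axis=0)           # "expectation": the average imagined continuation
    uncertainty = sims.std(axis=0).mean()  # dispersion across imagined continuations
    surprise = np.linalg.norm(embed(actual_next, vocab) - expected)  # actual vs. expected
    return {"uncertainty": uncertainty, "surprise": surprise}

print(belief_features("The detective opened the door.", "Inside, the room was empty."))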
Abstract: The universal availability of ChatGPT and similar tools since late 2022 has prompted tremendous public excitement about, and experimentation with, the potential of large language models (LLMs) to improve learning experiences and outcomes, especially for learners from disadvantaged backgrounds. However, little research has systematically examined the real-world impacts of LLM availability on educational equity beyond theoretical projections and controlled studies of innovative LLM applications. To depict trends in post-LLM inequalities, we analyze 1,140,328 academic writing submissions from 16,791 college students across 2,391 courses between 2021 and 2024 at a public, minority-serving institution in the US. We find that students’ overall writing quality gradually increased following the availability of LLMs and that the writing quality gaps between linguistically advantaged and disadvantaged students became increasingly narrower. However, this equitizing effect was concentrated more among students with higher socioeconomic status. These findings shed light on digital divides in the era of LLMs, raise questions about the equity benefits of LLMs in their early stages, and highlight the need for researchers and practitioners to develop responsible practices that improve educational equity through LLMs.
Abstract: From political science and economics to public health and corporate strategy, the randomized experiment is a widely used methodological tool for estimating causal effects. In the past 15 years or so, there has been a growing interest in network experiments, where subjects are presumed to be interacting in the experiment and their interactions are of substantive interest. I will give a brief survey of some of my recent work on designing randomized network experiments. I’ll highlight the recent “Conflict Graph Design” which, given a pre-specified causal effect of interest and the underlying network, produces a randomization over treatment assignment with the goal of increasing the precision of effect estimation. Not only does this experiment design attain improved rates of consistency for several causal effects of interest, it also provides a unifying approach to designing network experiments.
Abstract: Practical problems present challenges for AI-based solutions that function well only for well-defined, sanitized problems and clean data. A significant amount of work is required for data conditioning, preprocessing, elimination of outliers, and adapting machine learning or deep learning models to the problem at hand. This talk summarizes our AI-based research addressing these issues in the following areas: (1) Real-time video-based analysis of traffic intersection scenes; (2) Monitoring trends in patients’ medical conditions using audio recordings; (3) Capturing plastics on waterways.
Abstract: I will give a broad overview of the work done in Columbia’s Labor Lab (www.laborlabcu.org) using machine learning and randomized experiments to solve organizational problems inside US labor unions. I will use the specific example of predicting strike captains in a large local union and present results from a pilot experiment showing the predictions’ usefulness in practice.
Abstract: Consider a setting where there are N heterogeneous units and p interventions. Our goal is to learn unit-specific potential outcomes for any combination of these p interventions, i.e., N × 2^p causal parameters. Choosing a combination of interventions is a problem that naturally arises in a variety of applications such as factorial design experiments, recommendation engines, combination therapies in medicine, conjoint analysis, etc. Running N × 2^p experiments to estimate the various parameters is likely expensive and/or infeasible as N and p grow. Further, with observational data there is likely confounding, i.e., whether or not a unit is seen under a combination is correlated with its potential outcome under that combination. To address these challenges, we propose a novel latent factor model that imposes structure across units (i.e., the matrix of potential outcomes is approximately rank r) and across combinations of interventions (i.e., the coefficients in the Fourier expansion of the potential outcomes are approximately s-sparse). We establish identification for all N × 2^p parameters despite unobserved confounding. We propose an estimation procedure, Synthetic Combinations, and establish that it is finite-sample consistent and asymptotically normal under precise conditions on the observation pattern. Our results imply consistent estimation given poly(r) × (N + s^2 p) observations, while previous methods have sample complexity scaling as min(N × s^2 p, poly(r) × (N + 2^p)). We use Synthetic Combinations to propose a data-efficient experimental design. Empirically, Synthetic Combinations outperforms competing approaches on a real-world dataset on movie recommendations. Lastly, we extend our analysis to causal inference where the intervention is a permutation over p items (e.g., rankings).
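(As a toy illustration of the sparsity-across-combinations half of that model, and not the authors’ estimator, one can encode each combination of p interventions in the Walsh-Fourier basis and fit an s-sparse coefficient vector with the Lasso; the data below are simulated.)

# Toy illustration of the Fourier (Walsh) representation of combination effects:
# each combination x in {0,1}^p is mapped to chi_S(x) = prod_{j in S} (-1)^{x_j}
# over all subsets S, and a sparse linear model in this basis is fit with the Lasso.
from itertools import combinations
import numpy as np
from sklearn.linear_model import Lasso

p = 4
subsets = [S for r in range(p + 1) for S in combinations(range(p), r)]  # all 2^p subsets

def walsh_features(x):
    # x is a 0/1 vector of length p; returns the 2^p Walsh basis values.
    return np.array([np.prod([(-1) ** x[j] for j in S]) if S else 1.0 for S in subsets])

rng = np.random.default_rng(0)
true_coef = np.zeros(2 ** p)
true_coef[[0, 1, 5]] = [2.0, 1.0, -0.5]              # s-sparse ground truth

X_comb = rng.integers(0, 2, size=(40, p))            # observed combinations for one unit
Phi = np.stack([walsh_features(x) for x in X_comb])
y = Phi @ true_coef + 0.1 * rng.standard_normal(40)  # noisy observed outcomes

model = Lasso(alpha=0.05, fit_intercept=False).fit(Phi, y)
x_new = np.array([1, 0, 1, 1])                       # predict an unseen combination
print(model.predict(walsh_features(x_new).reshape(1, -1)))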
Abstract: A crucial input into causal inference is the imputed counterfactual outcome. Imputation error can arise because of sampling uncertainty from estimating the prediction model using the untreated observations, or from out-of-sample information not captured by the model. While the literature has focused on sampling uncertainty, it vanishes with the sample size. Often overlooked is the possibility that the out-of-sample error can be informative about the missing counterfactual outcome if it is mutually or serially correlated. Motivated by the best linear unbiased predictor (BLUP) of Goldberger (1962) in a time series setting, we propose an improved predictor of the potential outcome when the errors are correlated. The proposed PUP is practical as it is not restricted to linear models, can be used with consistent estimators already developed, and improves mean-squared error for a large class of strong mixing error processes. Ignoring predictability in the errors can distort conditional inference. However, the precise impact will depend on the choice of estimator as well as the realized values of the residuals.
Abstract: We investigate the use of large language models (LLMs) to simulate human responses to survey questions, and perform uncertainty quantification to gain reliable insights. Our approach converts imperfect, LLM-simulated responses into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.
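(To give a flavor of this kind of correction, here is a generic prediction-powered-inference-style interval for a population mean that debiases LLM-simulated responses with a small set of real responses; it is not the adaptive procedure described in the talk, and the data are simulated.)

# Generic sketch: combine many LLM-simulated survey responses with a few real ones
# to form a bias-corrected confidence interval for the population mean response.
# (Prediction-powered-inference style; NOT the adaptive procedure from the talk.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
real = rng.normal(3.2, 1.0, size=80)            # real human responses (e.g., on a 1-5 scale)
sim_paired = real + rng.normal(0.4, 0.8, 80)    # LLM simulations for the same respondents (biased)
sim_only = rng.normal(3.6, 0.9, size=2000)      # LLM simulations with no matching human answer

rectifier = np.mean(real - sim_paired)          # estimated bias of the simulator
theta_hat = np.mean(sim_only) + rectifier       # bias-corrected estimate of the human mean

se = np.sqrt(np.var(sim_only, ddof=1) / len(sim_only)
             + np.var(real - sim_paired, ddof=1) / len(real))
z = stats.norm.ppf(0.975)
print(f"95% CI: [{theta_hat - z * se:.2f}, {theta_hat + z * se:.2f}]")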
Abstract: Multi-site (multi-context) studies have become a popular strategy for addressing one of the most common and challenging external validity concerns: whether findings depend on context. In such studies, scholars conduct causal studies in each site and evaluate whether findings generalize across sites. Despite this potential, there has been little guidance on the fundamental research design question: how should we select sites for external validity? Existing approaches have challenges: random sampling of sites is often infeasible, while the current practice of purposive sampling is suboptimal without statistical guarantees. We propose synthetic purposive sampling (SPS), which optimally selects diverse sites for external validity. SPS combines ideas from purposive sampling and the synthetic control method: it selects diverse sites such that non-selected sites are well approximated by a weighted average of the selected sites. We illustrate its general applicability using both experimental and observational studies. Overall, this paper offers a new statistical foundation for designing multi-site studies for external validity.
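(A stripped-down sketch of the selection idea, using made-up site-level covariates and a brute-force search rather than the paper’s optimization: choose the subset of sites whose convex combinations best reconstruct every non-selected site.)

# Toy sketch of synthetic-purposive-style site selection: pick k sites such that
# each non-selected site is well approximated by a convex (weighted-average)
# combination of the selected sites, measured on site-level covariates.
from itertools import combinations
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 candidate sites, 3 covariates each (made up)

def convex_fit_error(target, donors):
    # Best approximation of `target` by a convex combination of rows of `donors`.
    k = donors.shape[0]
    res = minimize(lambda w: np.sum((donors.T @ w - target) ** 2),
                   x0=np.full(k, 1.0 / k),
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   method="SLSQP")
    return res.fun

def total_error(selected, X):
    others = [i for i in range(len(X)) if i not in selected]
    return sum(convex_fit_error(X[i], X[list(selected)]) for i in others)

k = 3
best = min(combinations(range(len(X)), k), key=lambda S: total_error(S, X))
print("selected sites:", best)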
Abstract: We study the consequences of affirmative action in centralized college admissions systems. We develop an empirical framework to examine the effects of a large-scale program in Brazil that required all federal institutions to reserve half their seats for socioeconomically and racially marginalized groups. By exploiting admissions cutoffs, we find that marginally benefited students are more likely to attend college and are enrolled in higher-quality degree programs four years later. Meanwhile, there are no observed impacts for marginally displaced non-targeted students. To study the effects of larger changes in affirmative action, we estimate a joint model of school choices and potential outcomes. We find that the policy has impacts on college attendance and persistence that imply a virtually one-to-one income transfer from the non-targeted to the targeted group. These findings indicate that introducing affirmative action can increase equity without affecting efficiency.
Abstract: Two ongoing problems in experimental research are that (a) more credible ideas for addressing social problems are generated than can be experimentally tested and (b) treatment effects estimated in one setting may not necessarily apply to other settings. My talk will discuss the potential of forecasting tournaments as an additional tool for addressing these twin problems. In a completed study, we evaluate whether experts or laypeople can accurately forecast the efficacy of interventions to strengthen Americans’ democratic attitudes, and thus the potential of forecasters to identify the most promising interventions to test. All forecasts performed better than chance, but experts outperformed the lay public, and academic and practitioner experts differed in their sensitivity and specificity. Hence, depending on the relative importance of avoiding false-positive vs. false-negative forecasts, decision-makers may prefer different experts. In an ongoing study, I investigate whether laypeople can accurately forecast how the effects of RCT-tested educational interventions generalize to their specific school districts. I also plan to benchmark the accuracy of these forecasts against those from various large language models and to test whether inviting the public to participate in forecasting improves community support for evidence-based policymaking.
Abstract: Political spending is at an all-time high. It has skyrocketed from $3.1 billion in 2000 to $15.9 billion in 2024, a 416% increase. This increase has inspired campaigns to experiment with new types of political ads. In this research, we investigate a novel form of political advertising, which we refer to as “disloyalty ads”: ads in which a candidate disagrees with their party on a political issue. How do people react to disloyalty ads? Under what conditions do they boost candidate support? We investigate candidate-, voter-, and issue-level variation that predicts the success of this form of advertising. For this talk, I will focus on issue-level variation. Candidates using disloyalty ads can challenge their party on issues associated with the in-party (e.g., a Democrat challenging the expansion of LGBTQ+ rights). Alternatively, candidates can align themselves with the opposing party on issues associated with the out-party (e.g., a Democrat supporting limited government). Which approach improves candidate support? Across four studies (N = 5,142) using a new method that combines LLM embeddings and clustering algorithms, we find that candidates are better off signaling disloyalty on issues associated with the out-party.
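(A self-contained sketch of the embed-and-cluster step, with TF-IDF vectors standing in for LLM embeddings and KMeans as the clustering algorithm; the issue statements are invented examples, not the study’s stimuli.)

# Sketch of grouping political issue statements by embedding + clustering.
# TF-IDF stands in for LLM embeddings so the example runs without an API call;
# the statements below are invented, not the study's materials.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

issues = [
    "expand LGBTQ+ rights and protections",
    "strengthen environmental regulation",
    "cut taxes and shrink the federal government",
    "increase military and border spending",
    "protect abortion access nationwide",
    "reduce business regulation and red tape",
]

embeddings = TfidfVectorizer().fit_transform(issues)   # swap in LLM embeddings here
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for issue, label in zip(issues, labels):
    print(label, issue)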
Abstract: Democracy is in a global crisis, which technology has not so far helped us overcome. Large language models provide a new tool for finding answers to specific questions and summarizing complex information. However, the monolithic perspective on reality that chatbots provide – tuned by technology corporations – raises questions about their role in liberal democracy. I will argue that we can use AI technology to capture and map a plurality of perspectives and scale a classical element of participatory democracy: citizen assemblies. Citizen assemblies are deliberative forums of randomly selected citizens brought together to discuss a particular issue and share their opinions. The goal is for the opinions of the participants to impact the political process, e.g., through recommendations to representatives of the government. In recent years, citizen assemblies have gained some traction worldwide, particularly in Europe and North America. For instance, the French Citizens’ Convention for Climate (2019-2020) brought together 150 citizens to propose measures for reducing greenhouse gas emissions. Similarly, the Irish Citizens’ Assembly (2016-2018) played an important role in shaping public opinion and government policy on issues like abortion and same-sex marriage. AI technology can help us (1) scale citizen assemblies affordably to thousands or millions of participants and (2) map the landscape of nuanced opinions in a transparent and verifiable fashion. Assemblies can be recorded and transcribed automatically, and the variety of positions summarized using language models. Novel methods combining language models and multivariate analyses can then capture the complete space of opinions as a graph whose nodes represent opinions at a range of levels of abstraction, from broad to nuanced. People (including but not limited to the participants of the assemblies) can traverse the graph online, attach ratings of their degree of belief in the opinions, and add opinions not yet captured. Implementing this project at Columbia would require a broad collaboration among faculty spanning the humanities, the social sciences, the natural sciences, and engineering. The project would leverage modern AI to help make good on the promise of liberal democracy.