Data Science Day 2024
Thursday, April 4, 2024
4:00 am - 1:00 pm
The Data Science Institute’s flagship annual event connects innovators in industry and government to Columbia researchers who are propelling advances across every sector with data science. The 2024 event featured a keynote presentation from Prabhakar Raghavan, Senior Vice President, Google; opening remarks from Minouche Shafik, President, Columbia University; three sessions of Columbia-led lightning talks; and 100+ interactive posters and technology demonstrations.
Clifford Stein, Interim Director of the Data Science Institute and Wai T. Chang Professor of Industrial Engineering and Operations Research and Professor of Computer Science at Columbia University, was the master of ceremonies.
An industry leader and accomplished scholar, Raghavan is responsible for Google’s Knowledge & Information products, including Google Search, News, Assistant, Bard, Geo, Ads, Commerce and Payments.
Raghavan is one of the foremost authorities on Search and has over 20 years of research experience spanning algorithms, web search, and databases. He has published more than 100 papers, co-authored two widely used graduate texts, and holds 20 issued patents, including several on link analysis for web search.
Prabhakar holds a Ph.D. from U.C. Berkeley in Electrical Engineering and Computer Science and a Bachelor of Technology from the Indian Institute of Technology, Madras. He is a member of the National Academy of Engineering; a Fellow of the ACM and IEEE; a former editor in chief for the Journal of the ACM; and was a Consulting Professor of Computer Science at Stanford University. In 2009, he was awarded a Laurea honoris causa from the University of Bologna.
Before joining Google, Prabhakar founded and led Yahoo! Labs, served as CTO at Verity, and spent 14 years at IBM Research.
Talk Title: Beyond information retrieval: What does Search mean these days?
Classic information retrieval systems aimed to retrieve documents best matching a user’s query. The advent of the web dramatically changed this landscape and led to the creation of what is considered the modern day search engine. In this talk, we will explore this evolution and the nuanced questions that arise in the operation of a search engine at planetary scale: information quality vs. misinformation; limitations in query understanding, the corpus of content, and information fragmentation; and the economics and value exchange of web search. We are in the early stages of these staples being transformed by the advent of large language models. The future of search lies in the continued refinement of the interplay between evolving technologies, economic models, and user behaviors. We will delve into the new set of opportunities and challenges posed by this shift.
From selecting amongst equally accurate models for prediction tasks, to training learning-based algorithms on subjective clinical diagnostic data, to setting e-commerce prices based on consumer valuation, algorithmic models can raise thorny questions about the very meaning of fairness and how to address it. After three leading data scientists discuss work that intersects with these issues, a philosopher will lead the discussion where innovation confronts ethics and philosophy.
Adam Elmachtoub
Associate Professor, Department of Industrial Engineering and Operations Research, Columbia Engineering
Talk Title: Embedding Fairness into Pricing Algorithms
Abstract: Price discrimination algorithms, which offer different prices to customers based on differences in their valuations, have become common practice. While they allow sellers to increase their profits, they also raise several fairness concerns, e.g., by charging higher prices (or denying access) to protected groups when they have higher (or lower) valuations than the general population. In this talk, we consider the problem of setting prices for different groups under fairness constraints. We consider different notions of fairness related to prices, access, and consumer surplus under two fundamental settings: an unconstrained monopolist and a vehicle sharing system.
Shalmali Joshi
Assistant Professor, Department of Biomedical Informatics, Vagelos College of Physicians and Surgeons
Talk Title: Characterizing and Operationalizing (Systemic) Algorithmic Fairness in Psychiatric Diagnosis
Abstract: Diagnostic disparities are common in psychiatry due to the lack of quantitative biomarkers and reliance on subjective assessments. Learning-based algorithms are inevitably at risk of relying on statistical biases prevalent in healthcare data. Current notions of algorithmic fairness myopically view fairness in terms of performance metrics of statistical learning algorithms as opposed to actual human outcomes. This shift requires novel statistical tools that will first provide the necessary magnifying lens to characterize the mechanism by which broader societal disparities propagate into learning algorithms. Using such tools, we design improved learning algorithms that will explicitly mitigate some of the sources of unfairness. We operationalize this framework for psychiatric diagnosis of Schizophrenia in Medicaid and diagnosis of respiratory diseases in Medical Imaging Electronic Health Record data.
Emily Black
Assistant Professor, Department of Computer Science, Barnard College
Talk Title: Model Multiplicity and Less Discriminatory Alternatives
Abstract: Recent scholarship has brought attention to the fact that there often exist multiple models for a given prediction task with equal accuracy that differ in their individual-level predictions or aggregate properties. This phenomenon, which we call model multiplicity, can introduce a good deal of flexibility into the model selection process, creating a range of exciting opportunities. By demonstrating that there are many different ways of making equally accurate predictions, multiplicity gives model developers the freedom to prioritize other values in their model selection process without having to abandon their commitment to maximizing accuracy. However, multiplicity also brings to light a concerning truth: model selection on the basis of accuracy alone (the default procedure in many deployment scenarios) fails to consider what might be meaningful differences between equally accurate models with respect to other criteria such as fairness, robustness, and interpretability. Unless these criteria are taken into account explicitly, developers might end up making unnecessary trade-offs or could even mask intentional discrimination. In this talk, we’ll discuss how model multiplicity arises and its ramifications for the US legal and policy response to discriminatory algorithms, particularly through the disparate impact doctrine.
Moderator: Katja Maria Vogt
Professor, Department of Philosophy, Faculty of Arts and Sciences
Making the Most of an Image will delve into the full life cycle of data science. Talks will explore the frontiers of computational imaging; offer insight on fundamental challenges in coherent imaging systems; share new advances in radiomics, which combines medical imaging with data analysis for precision medicine; and discuss research that pairs imagery with music to evoke emotion and stimulate action.
Shree K. Nayar
T. C. Chang Professor, Department of Computer Science, Columbia Engineering
Talk Title: Future Cameras
Abstract: Computational imaging uses new optics to capture a coded image, and an appropriate algorithm to decode the captured image. This approach has enabled mobile devices to produce images that are rich, immersive, and interactive. In this talk, we will show examples of computational cameras that are transforming the way visual information is captured, communicated, and used by both humans and machines.
Arian Maleki
Associate Professor, Department of Statistics, Faculty of Arts and Sciences
Talk Title: Image Acquisition Challenges in the Presence of Speckle Noise: Theoretical and Practical Insights
Abstract: In addressing one of the most fundamental challenges in coherent imaging systems, the presence of speckle noise, this talk delves into both theoretical and empirical aspects of coherent image acquisition. Our theoretical framework is based on the deep image prior (DIP) hypothesis, which posits the existence of a convolutional neural network that takes i.i.d. noise as input and, with properly tuned parameters, is capable of generating natural images. Our theoretical results reveal that acquiring high-quality images in such systems requires more measurements than in systems with additive noise. Furthermore, they reveal the advantages and bottlenecks of multilooking, a mechanism of capturing multiple images of the same scene. On the applied side, we introduce the “Bagged-DIP” approach, leveraging the DIP hypothesis to enhance the performance of standard DIPs in image recovery from speckle-corrupted measurements.
Hortense Fong
Assistant Professor, Marketing Division, Columbia Business School
Talk Title: A Theory-Based Explainable Deep Learning Architecture for Music Emotion
Abstract: We develop a theory-based, explainable deep learning convolutional neural network (CNN) classifier to predict the time-varying emotional response to music. We design novel CNN filters that leverage the frequency harmonics structure from acoustic physics known to impact the perception of musical features. Our theory-based model is more parsimonious, but provides comparable predictive performance to atheoretic deep learning models, while performing better than models using handcrafted features. Importantly, the harmonics-based structure placed on the CNN filters provides better explainability for how the model predicts emotional response (valence and arousal), because emotion is closely related to consonance – a perceptual feature defined by the alignment of harmonics. We illustrate the utility of our model with an application involving digital advertising. Motivated by YouTube’s mid-roll ads, we conduct a lab experiment in which we exogenously insert ads at different times within videos. We find that ads placed in emotionally similar contexts increase ad engagement (lower skip rates, higher brand recall rates). Ad insertion based on emotional similarity metrics predicted by our theory-based, explainable model produces comparable or better engagement relative to atheoretic models.
Despina Kontos
Professor, Department of Radiology, Vagelos College of Physicians and Surgeons
Talk Title: Radiomics, Radiogenomics, and AI: The Emerging Role of Imaging Biomarkers in Precision Cancer Care
Abstract: Cancer risk prediction is increasingly playing a key role in personalized screening and prevention. In addition, cancer is a heterogeneous disease, with known inter-tumor and intra-tumor heterogeneity in solid tumors. Established histopathologic prognostic biomarkers generally acquired from a tumor biopsy may be limited by sampling variation. Radiomics is an emerging field with the potential to provide novel markers of cancer risk, as well as leverage the whole tumor via non-invasive sampling to extract high-throughput, quantitative features for the volumetric characterization of tumor heterogeneity. Recent studies have shown that radiomic phenotypes can also augment genetic and genomic assays in precision screening, prognosis, and treatment. Identifying novel computational imaging biomarkers and integrating them, through data science approaches and AI, with other emerging prognostic and predictive markers to better predict patient outcomes has the potential to ultimately transform precision cancer care.
Moderator: Yading Yuan
Herbert and Florence Associate Professor of Radiation Oncology (Physics) (in the Data Science Institute) at the Columbia University Medical Center
To develop and implement climate change solutions at scale, innovation, policy, and profitability are required. This session will explore the role of patents, policy, and models for urban decarbonization. An urban planner with experience in transportation, infrastructure, sustainability, and economic development will moderate the discussion.
James Hicks
Postdoctoral Research Scholar in the Faculty of Law; Lecturer in Law, Columbia Law School
Talk Title: Do Patents Drive Investment in Software?
Abstract: Data science is at the heart of modern innovation. How should the law support and foster research and development in this area? Advocates of strong software patents have long claimed that patents are essential to attract venture capital investment—the lifeblood of most early-stage companies. But we have little empirical evidence about whether this is true. Using a quasi-experimental approach, I investigate whether the grant of a patent makes a “business methods” software startup more likely to attract VC investment. In contrast to prior research, I find no evidence that patents play a role in channeling investment to these startups, nor that they lead to more successful downstream outcomes such as acquisitions and IPOs.
Douglas Almond
Professor of Economics and International and Public Affairs, School of International and Public Affairs
Talk Title: Methane Spikes when US LNG Unloaded in Europe
Abstract: Methane accounts for 30% of the increase in global temperatures since pre-industrial times (United Nations, 2021). We document a previously unidentified source of methane emissions, as captured by the TROPOMI instrument on the S5-P satellite. Matching the exact date and location of 1,131 LNG shipments originating from the United States between 2019 and 2022 to daily 0.01 degree by 0.01 degree methane concentration data, we conduct an event study analysis that reveals a systematic increase in methane concentrations immediately after US LNG exports arrive in port. LNG exports from the US release more than twice as much methane as non-US exports. Methane concentrations in Europe (the destination of 64% of US LNG exports) increase by 5.6 ppb when US LNG arrives, or one quarter of a standard deviation. Unburnt methane causes 86 times more global warming than an equivalent amount of CO2 over a 20-year period, an effect that does not depend on where the methane is released. Given updated estimates of methane leakage, further expansions of US LNG export infrastructure, not least the Calcasieu Pass 2 (CP2) facility being considered by the Biden Administration, should be expected to accelerate global warming. Meanwhile, rapid growth between 2022 and 2026 in solar and wind generation capacity within Europe implies that CP2’s first LNG shipments in 2026 will be dwarfed by non-fossil alternatives.
Authors: Douglas Almond, Xinming Du, Maya Norman, and Anna Papp
Bianca Howard
Assistant Professor, Department of Mechanical Engineering, Columbia Engineering
Talk Title: Decision-making for Building Decarbonization
Abstract: To address climate change, we need to transition to energy-efficient, energy-flexible, and low-carbon buildings. We need energy-efficient buildings to reduce the energy needs of the sector, energy-flexible buildings to enable better integration with a renewables-based grid, and low-carbon buildings to actively switch our carbon-intensive heating systems to low-carbon sources. To enable this transition, we need decision-making tools to aid the complex choices around which buildings should incorporate which retrofit measures at which cost. We also need intelligent buildings to decide the best operational schemes to deliver comfortable, flexible, and grid-resilient buildings. This talk will discuss the ongoing work in the Building Energy Research Laboratory to tackle these challenges using building physics, optimization, and machine learning.
Moderator: Kate Ascher
Paul Milstein Professor of Professional Practice of Urban Development in the Faculty of Architecture, Planning, and Preservation
Data Science Day is made possible by the support of the DSI Industry Affiliates Program.