Data Science Day 2019

Wednesday, April 3, 2019

Event Stats

  • 720+ attendees
  • 36 posters exhibited
  • 11 interactive demos

Read our Recap

All speakers and their respective roles/titles are accurate as of the time of the event (2019).

2019 Keynote Speaker

Brad Smith, President and Chief Legal Officer, Microsoft

Biography: Brad Smith is Microsoft’s president and chief legal officer. In this role Smith is responsible for the company’s corporate, external, and legal affairs. He leads a team of more than 1,400 business, legal and corporate affairs professionals working in 55 countries. These teams are responsible for the company’s legal work, its intellectual property portfolio, patent licensing business, corporate philanthropy, government affairs, public policy, corporate governance, and social responsibility work. He is also Microsoft’s chief compliance officer. Smith plays a key role in representing the company externally and in leading the company’s work on a number of critical issues including privacy, security, accessibility, environmental sustainability and digital inclusion, among others.

Smith joined Microsoft in 1993, and before becoming general counsel in 2002 he spent three years leading the Legal and Corporate Affairs (LCA) team in Europe, then five years serving as the deputy general counsel responsible for LCA’s teams outside the United States.

2019 Lightning Talks

Session I: Data Science Foundations: Today & Tomorrow

Michael Collins
Vikram S. Pandit Professor of Computer Science, Columbia University

Talk Title: Successes and Challenges in Neural Models for Speech and Language

Abstract: In recent years there has been dramatic progress in key problems in speech and natural language processing (NLP), largely driven by neural methods. In this talk Collins will describe a sequence of NLP/speech problems and neural architectures of increasing complexity. Collins will detail the successes of these approaches and also the (many) questions that they raise.

Liam Paninski
Professor of Computer Science, Columbia University

Talk Title: Neural Data Science

Abstract: The neural coding problem is perhaps the fundamental question in systems neuroscience. Given some input stimulus or movement, or thought, etc., what is the probability of a neural response? In other words, what is the neural code? Modern multi-neuronal recordings produce single-cell-resolution data on a large scale. Neural data science aims to extract meaning from the resulting huge new streams of data. This lecture will highlight some recent progress and current challenges in this rapidly growing field, where new methods for network analysis, dimensionality reduction, and optimal control — developed in lockstep with advances in experimental neurotechnology — promise breakthroughs in solving multiple fundamental neuroscience problems.

Tim Roughgarden
Professor of Computer Science, Columbia University

Talk Title: Studying Auctions for Online Advertising and Pricing in Thin Markets

Abstract: Auctions for online advertising power the business models of many big tech companies such as Google and Facebook. How should such auctions set prices for ads?  This problem is particularly challenging in thin markets with a relatively small number of competitors.  Professor Tim Roughgarden will discuss research on data-driven approaches to meeting these challenges.

Shipra Agrawal
Associate Professor of the Department of Industrial Engineering and Operations Research, Columbia University


Session II: How AI is Changing Industry

Simona Abis
Assistant Professor of Business, Columbia Business School

Talk Title: Man + Machine: The Future of Labor and Knowledge Production

Abstract: Technological advancement has always been at the core of innovation and development in most industries. From the Industrial Revolution to the present day, this has been strongly intertwined with the demand for labor and the skills required of the labor force. The current technological disruption, driven by advances in AI and computing power, is no different. To understand the economic implications of these advances, we must take into account the profit-maximizing motives of firms and how these technologies might change their needs, incentives, and decision-making processes.

Nima Mesgarani
Associate Professor of Electrical Engineering, Columbia University

Talk Title: Brain-controlled Assistive Hearing Technologies: Challenges and Opportunities

Abstract: Listening in noisy and crowded environments is exceptionally challenging for hearing-impaired listeners. Assistive-hearing devices can suppress certain types of background noise, but they cannot help a user attend to a single conversation amongst many without knowing which person is speaking. Recent scientific discoveries about speech processing in the human auditory cortex have motivated several new paths to enhance the efficacy of hearable technologies. These possibilities include speech neuroprosthesis, which aims to establish a direct communication channel with the brain; auditory attention decoding, where the similarity of a listener’s brainwaves to the sources in the acoustic scene is used to identify the attended source; and increased speech perception using electrical brain stimulation. In parallel, the field of speech signal processing has recently seen tremendous progress due to the emergence of deep learning models, where even solving the “cocktail party problem” is no longer out of reach. Mesgarani will discuss recent efforts to bring together the latest progress in brain-computer interfaces and speech processing technologies to design and actualize the next generation of assistive hearing devices, with the potential to augment speech communication in realistic and challenging acoustic conditions.

Julian Nyarko
Postdoctoral Research Scholar in the Faculty of Law

Talk Title: Corporate Climate: Using Machine Learning to Assess Climate Risk Disclosures and Susceptibility

Abstract: The risks associated with climate change are becoming increasingly relevant to investors. However, while the Securities and Exchange Commission mandates the disclosure of climate risks by public registrants, whether these companies actually make adequate disclosures has been difficult to verify. We leverage recent advancements in text analysis and machine learning to identify climate risk disclosures in corporate filings. We then create an objective framework for assessing which companies should be making these disclosures. By comparing the companies that disclose climate change-related risks with those that should, we are able to gain insights into the effectiveness of the current regulatory framework.

Garud N. Iyengar
Tang Family Professor of Industrial Engineering and Operations Research, Columbia Engineering


Session III: A Private, Secure, & Safe World

Ronghui Gu
Assistant Professor of Computer Science, Columbia University

Talk Title: Towards Building Trustworthy Blockchain Ecosystems

Abstract: Blockchain ecosystems are built on trust. Some people call it a “consensus,” some people call it a “belief.” However, the code written to implement such blockchain ecosystems is not trustworthy because of program bugs. Gu’s work is focused on making software systems reliable and secure through the use of a mathematical technique known as formal verification. As the backbone of modern software systems, operating system (OS) kernels impact the reliability and security of today’s computing hosts. OS kernels, however, are complicated and prone to bugs. In the past several years, Gu has designed and developed CertiKOS, the world’s first formally verified, concurrent OS kernel, proven to be bug-free and hacker-resistant. Gu uses CertiKOS and applies formal verification techniques to build trustworthy software that has applications to many technologies, including blockchain systems. He will discuss why his research is considered a significant scientific breakthrough as well as a giant leap for blockchain technology.

Mark Hansen
David and Helen Gurley Brown Professor of Journalism and Innovation; Director, David and Helen Gurley Brown Institute of Media Innovation, Columbia University

Talk Title: To Reduce Privacy Risks, the Census Plans to Report Less Accurate Data

Abstract: When the Census Bureau gathered data in 2010, it made two promises. The form would be “quick and easy,” and, “your answers are protected by law.” But mathematical breakthroughs, easy access to more powerful computing, and widespread availability of large and varied public data sets have made the bureau reconsider whether the protection it offers Americans is strong enough. The Census Bureau has decided to enforce stronger privacy protections than companies like Apple or Google had when they each first took up differential privacy. To preserve confidentiality, the bureau’s directors have determined they need to adopt a “formal privacy” approach, one that adds uncertainty to census data before it is published and achieves privacy assurances that are provable mathematically. Guaranteeing people’s confidentiality is critical and increasingly challenging, but some scholars worry that the new system will impede research.  Hansen will discuss the pros and cons from both perspectives.

Tamar Mitts
Assistant Professor of International and Public Affairs, Columbia SIPA

Talk Title: Global Radicalization in an Internet Age

Abstract: Between 2011 and 2016, the Islamic State successfully convinced tens of thousands of individuals around the world to join its ranks. Many attribute this surge in foreign recruits to sophisticated internet media campaigns developed by the group since 2011. Yet, there is currently very little empirical analysis of what was ‘marketed’ in ISIS’s propaganda, what messages resonated with potential recruits, and what types of content were more likely to radicalize. Employing information on network connections, we find that propaganda messages relating to grievances, ideology, and the material and social desires of potential recruits were highly effective at increasing online support for ISIS. Strikingly, however, we find that these messages became largely ineffective when propaganda included brutal violent scenes. These findings suggest that what attracted individuals to ISIS was not the violent content that made the group so famous, but the messages in its propaganda that conveyed the material and spiritual benefits of recruitment.

Ester Fuchs
Professor of International and Public Affairs and Political Science; Director, Urban and Social Policy Program, Columbia SIPA


Session IV: Improving Patient Outcomes Through Data Science

Andrea Baccarelli
Leon Hess Professor of Environmental Health Sciences; Chair, Department of Environmental Health Sciences, Columbia University

Talk Title: Data Science and Epigenomics – Solving 21st Century Public Health Challenges

Abstract: Epigenomics is the study of the programming of, and changes in, gene expression that do not depend on the DNA sequence. Remarkably, the human epigenome is a flexible, environmentally sensitive component of human biology that changes over time. This property has been used in the field, including in our lab, in attempts to develop new biosensors of environmental exposures and lifestyle. We have been mining epigenomics data to develop algorithms that can reveal someone’s (“true”) biological age, as well as predict whether someone is a smoker and, if so, how many cigarettes they have smoked during their lifetime. Baccarelli will present possible applications: for instance, he has developed a biosensor of exposure to toxic lead, which can estimate lifetime exposure to lead from a single drop of blood. Baccarelli will also discuss how, in the future, data science coupled with molecular biology can open new public health and commercial opportunities.

Carri W. Chan
Associate Professor of Business, Columbia Business School

Talk Title: An Examination of Early Transfers to the ICU Based on a Physiologic Risk Score

Abstract: Unplanned transfers of patients from general medical-surgical wards to the Intensive Care Unit (ICU) can occur due to unexpected patient deterioration. Such patients tend to have higher mortality rates and longer lengths-of-stay than direct admissions to the ICU. As such, the medical community has invested substantial efforts in the development of patient-risk scores with the intent to identify patients at risk of deterioration. In this work, Chan considers how one such risk score could be used to trigger proactive transfers to the ICU. Chan utilizes a retrospective dataset from 21 Kaiser Permanente Northern California hospitals to estimate the potential benefit of transferring patients to the ICU at various levels of patient risk of deterioration. In order to reduce the sensitivity of the findings to key identification and modeling assumptions, she uses a combination of multivariate matching and instrumental variable approaches. Using the empirical results to calibrate a simulation model, she finds that proactively transferring the most severe patients could reduce mortality rates and lengths-of-stay without increasing other adverse events; however, proactive transfers should be used judiciously, as being too aggressive could increase ICU congestion and degrade quality of care.

George M. Hripcsak
Vivian Beaumont Allen Professor of Biomedical Informatics; Chair, Department of Biomedical Informatics, Columbia University

Talk Title: Steering Medical Therapy Through Large-Scale Clinical Data

Abstract: Doctors frequently have questions about which drug is best to use, what side effects that drug might cause, or whether prescribing two drugs together will cause a problem. Yet the vast majority of questions like these have gone unanswered. Today, however, medical records and insurance data make it possible to answer these questions, though with a risk of getting the wrong answer. If, for example, healthier patients take one drug rather than another, then the first drug may appear to work better. The Observational Health Data Sciences and Informatics (OHDSI) initiative applies advanced data-science techniques to avoid such errors. OHDSI is an interdisciplinary and international collaborative with a coordinating center at Columbia University. With half a billion patient records, OHDSI conducts federated studies at sufficient scale to answer questions about diagnosis and treatment. This talk will illustrate OHDSI’s approach and discuss how its studies have provided significant insights on treatment pathways for chronic diseases around the world.

Lena Mamykina
Associate Professor of Biomedical Informatics, Vagelos College of Physicians and Surgeons, Columbia University


Select Photos

Jeannette M. Wing
Brad Smith
The Audience at Columbia’s Alfred Lerner Hall
2019 Poster and Demo Session


DSI Industry Affiliates have access to Data Science Day recordings after the event. If you are a current DSI Industry Affiliate, please contact us for a link to the videos.

Thank You

DSI Industry Affiliates Program