Data Science Day 2017
Wednesday, April 5, 2017
5:00 am - 12:30 pm
All speakers' roles and titles are listed as they were at the time of the event (2017).
Opportunities and Perils in Data Science
Over the last few decades, empiricism has become the third leg of computer science, adding to the field’s traditional bases in mathematical analysis and engineering. This shift has occurred due to the sheer growth in the scale of computation, networking and usage as well as progress in machine learning and related technologies. Resulting data-driven approaches have led to extremely powerful prediction and optimization techniques and hold great promise, even in the humanities and social sciences. However, no new technology arrives without complications: In this presentation, I will balance the opportunities provided by big data and associated A.I. approaches with a discussion of the various challenges. I’ll enumerate ten categories including those which are technical (e.g., resilience and complexity), societal (e.g., difficulties in setting objective functions or understanding causation), and humanist (e.g., issues relating to free-will or privacy). I’ll provide many example problems, and make suggestions on how to address some of the unanticipated consequences of Big Data.
Bio: Alfred Spector is Chief Technology Officer and Head of Engineering at Two Sigma, a firm dedicated to using information to optimize diverse economic challenges. Prior to joining Two Sigma, Dr. Spector spent nearly eight years as Vice President of Research and Special Initiatives, at Google, where his teams delivered a range of successful technologies including machine learning, speech recognition, and translation. Prior to Google, Dr. Spector held various senior-level positions at IBM, including Vice President of Strategy and Technology (or CTO) for IBM Software and Vice President of Services and Software research across the company. He previously founded and served as CEO of Transarc Corporation, a pioneer in distributed transaction processing and wide-area file systems, and he was a professor of computer science at Carnegie Mellon University. Dr. Spector received a bachelor’s degree in Applied Mathematics from Harvard University and a Ph.D. in computer science from Stanford University. He is an active member of the National Academy of Engineering and the American Academy of Arts and Sciences, where he serves on the Council.
Ramifications of our connected world on security and privacy. Leveraging our connected world for data driven policy. How is this changing Journalism?
Suman Jana
Associate Professor of Computer Science, Columbia Engineering
Talk Title: Security and Privacy in a Hyper-Connected World
Abstract: From smartphones to cameras, most modern computing devices are connected to the Internet and fitted with multiple high-bandwidth sensors that collect potentially sensitive data about their users’ environments. This abundance of contextual information allows app developers to improve the technology but poses significant security and privacy risks to individuals. Making matters worse, most devices support untrusted third-party applications downloaded through online software markets, i.e., “app stores.”
Susan McGregor
Associate Research Scholar, The Data Science Institute
Talk Title: Privacy Protections Make Us More Secure
Abstract: Privacy and security are often presented as mutually exclusive, with government agencies framing accessible encryption and other privacy-preserving technologies as national security threats. Yet in a world of ubiquitous networked computing, these technologies are essential to protecting basic constitutional and human rights, as well as America’s economic and political integrity. Journalists depend on these technologies to protect sources and colleagues, so that they can report freely and independently on the news of the day. I will discuss the myriad ways in which stronger privacy also strengthens security for journalists, academics and Americans more broadly.
Hollie Russon-Gilman
Columbia SIPA
Talk Title: How Civic Tech Feeds Urban Innovation
Abstract: New technologies are increasingly helping ordinary people make their cities more innovative and responsive to public needs. By strengthening democratic participation, civic technology, or civic tech, offers new ways of giving citizens access to government decision making. In this talk, I will discuss three ways that civic tech is fostering collaborative governance in U.S. cities—through innovation units, open data and crowdfunding. I will explain what appears to be working and why, and how greater digital literacy and Internet access can amplify these trends.
Laura Kurgan
Professor of Architecture, Planning and Preservation, Graduate School of Architecture, Planning and Preservation
(Moderator)
New hardware and software are being developed to expand data science tools across a broad range of industries.
Andrew Gelman
Higgins Professor of Statistics and Professor of Political Science, Faculty of Arts and Sciences
Talk Title: From Public Opinion to Probability Theory and Back
Abstract: What does the theory of stochastic processes have to do with public opinion? It goes like this. We learn about opinion through surveys. Most people won’t respond to a survey so we need to adjust our sample to match the population. To do this right, we need to adjust for many variables. This adjustment requires models with many parameters. To fit such a model involves exploring the needle of good fit within the haystack of possible parameter values. This exploration is performed most efficiently using programs such as Stan that use gradients and other measures of the geometry of the zone of parameter space that is consistent with data and prior information. Advanced mathematics is required to develop the fitting algorithms that use gradient information. Also the fitting should be fast: our models are all wrong, so we want to be able to fit lots of models and use graphical tools to explore the fit to data.
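The adjustment step described in this abstract can be illustrated with a minimal sketch of simple poststratification: average the responses within each demographic cell, then weight the cell averages by known population shares. This is only an illustration under stated assumptions. The function name, the cell labels, and the data are hypothetical, and it is not the many-parameter model fit with Stan that the talk refers to.

```python
from collections import defaultdict


def poststratify(responses, population_shares):
    """Estimate a population mean by reweighting per-cell sample means.

    responses: list of (cell, value) pairs from the survey sample.
    population_shares: dict mapping cell -> share of the population
        (assumed to sum to 1).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cell, value in responses:
        totals[cell] += value
        counts[cell] += 1

    estimate = 0.0
    for cell, share in population_shares.items():
        if counts[cell] == 0:
            # Simple poststratification cannot handle empty cells;
            # model-based adjustment is one way around this.
            raise ValueError(f"no respondents in cell {cell!r}")
        estimate += share * (totals[cell] / counts[cell])
    return estimate


# Hypothetical survey: 1 = supports a policy, 0 = does not.
# Older respondents are overrepresented relative to the population.
sample = [("young", 1), ("old", 0), ("old", 0), ("old", 1), ("old", 0)]
shares = {"young": 0.5, "old": 0.5}

raw_mean = sum(value for _, value in sample) / len(sample)
adjusted = poststratify(sample, shares)
print(raw_mean, adjusted)
```

With only a handful of cells this direct reweighting is enough. Adjusting for many variables at once produces sparse or empty cells, which is where the models with many parameters mentioned in the abstract come in.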
Stefano Fusi, Professor of Neuroscience, Vagelos College of Physicians and Surgeons; and Mingoo Seok, Associate Professor of Electrical Engineering, Columbia Engineering
Talk Title: Brain-Inspired Learning Machines
Abstract: Deep neural networks, or networks with multiple hidden layers, are revolutionizing machine learning and cognitive computing with their ability to classify images, videos and speech as rapidly and accurately as humans. The algorithms for training a deep neural network, however, are complex and require huge computational resources to tune the millions of parameters used in the training process. One way to reduce computing time is to cut each parameter’s precision from 32 bits to 1 bit. Theoretical studies show that classification performance remains strong despite the reduced precision. Looking to the brain for inspiration, we are developing new methods that build on this approach.
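As a rough illustration of what 1-bit parameters mean, the sketch below replaces each full-precision weight with its sign, so that each weight can be stored in a single bit. Sign-based binarization is an assumption made here for illustration, as are the function names and the sample values. The abstract does not say which scheme the speakers use.

```python
def binarize(weights):
    """Map each real-valued weight to +1 or -1 by its sign (one bit each)."""
    return [1 if w >= 0 else -1 for w in weights]


def dot(weights, inputs):
    """Weighted sum of inputs, as computed by a single linear unit."""
    return sum(w * x for w, x in zip(weights, inputs))


# Hypothetical weights and inputs for one unit.
full_precision = [0.42, -0.17, 0.03, -0.88]
inputs = [1.0, 2.0, -1.0, 0.5]

one_bit = binarize(full_precision)
full_out = dot(full_precision, inputs)
one_bit_out = dot(one_bit, inputs)

# The magnitudes differ, but in this example the sign of the output,
# which is what a threshold unit would act on, is unchanged.
print(one_bit, full_out, one_bit_out)
```

A single example like this does not show that accuracy is preserved in general. It only shows the storage trade-off that the abstract describes.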
Associate Research Scientist, Data Science Institute
Talk Title: The Rise of Open Source Software for Data Science
Abstract: The recent surge in data science applications was enabled not only by the availability of data, but also by free and open software tools for processing and analyzing data. Open source projects, primarily in the R and Python programming languages, have been the backbone of most recent work in data science. The scikit-learn project in particular has become the go-to solution for machine learning algorithms in many areas. I will discuss the scikit-learn library, its scope and development, and raise questions about the current model of open source tools based on volunteer labor.
John Wright
Associate Professor of Electrical Engineering, Columbia Engineering
(Moderator)
Data science is helping to diagnose, treat and prevent a range of diseases. One area of innovation has come from data collected and provided by patients themselves. We’d like to touch on some novel applications here.
Noémie Elhadad
Associate Professor of Biomedical Informatics, Vagelos College of Physicians and Surgeons
Talk Title: Citizen Endo: (Citizen + Data) Science for Endometriosis
Abstract: Ten percent of women are thought to suffer from endometriosis, a chronic condition associated with infertility and painful menstrual cycles. Despite its prevalence, we still know surprisingly little about how endometriosis develops, evolves, and how to best treat it. To learn more, my colleagues and I recently launched Citizen Endo, a crowdsourcing project that gathers data directly from endometriosis patients via a mobile phone app we developed. While most research so far has focused on physical manifestations of the disease, Citizen Endo, which now has more than 1,500 participants, aims to gain a more systemic understanding by analyzing the day-to-day experiences of women living with the condition.
Ying Kuen Cheung
Professor of Biostatistics, Mailman School of Public Health
Talk Title: Using Patient-Generated Data to Find the Best Health App
Abstract: Adaptive design is a methodology used in clinical trials that helps researchers compare different drug treatments based on interim observations of trial participants. Its interactive nature also makes it a good framework for evaluating dynamic health apps which evolve quickly over time. In this talk, I will show how adaptive design can be applied to app monitoring and recommendation within an ecosystem of health apps, and introduce SMART-AR, a framework for analyzing data in this ecosystem. I will discuss the analytics we are developing for Android mental health apps in a collaboration between Columbia and Northwestern universities.
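The core idea of adaptive design, using interim observations to shift how participants are assigned, can be sketched with a generic response-adaptive allocation rule. This is not SMART-AR, whose details the abstract does not give. The Beta(1, 1) prior, the proportional allocation rule, the function name, and the interim counts are all assumptions chosen for illustration.

```python
def allocation_probabilities(successes, failures):
    """Return allocation probabilities for each arm, proportional to the
    arm's posterior mean success rate under a Beta(1, 1) prior."""
    means = [(s + 1) / (s + f + 2) for s, f in zip(successes, failures)]
    total = sum(means)
    return [m / total for m in means]


# Hypothetical interim data for two apps:
# app A helped 8 of 10 users, app B helped 3 of 10.
probs = allocation_probabilities(successes=[8, 3], failures=[2, 7])
print(probs)
```

Under this rule the better-performing app is recommended to more new users, while the other app still receives some users so that its estimate can keep improving.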
Kenrick Cato
Assistant Professor of Nursing, School of Nursing
Talk Title: Identifying At-Risk Patients from Nursing Notes
Abstract: More than 200,000 patients die in U.S. hospitals each year from cardiac arrest, and more than 130,000 patients die of sepsis, a deadly immune response to bacterial infection. Many patient deaths could be prevented if the warning signs could be caught sooner. Our research suggests that the notes nurses periodically take to describe their patients’ condition can provide powerful clues as to which patients need extra oversight. I will discuss a project that I am leading at Columbia, Communicating Narrative Concerns Entered by RNs (CONCERN), in partnership with several other teaching hospitals, to automate the analysis of nursing notes in patient electronic health records. Our aim is to design and evaluate a clinical decision support system that identifies the words in nursing notes that best predict a life-threatening health problem. This project is scheduled to begin in June.
Itsik Pe’er
Associate Professor of Computer Science and Systems Biology, Columbia Engineering
(Moderator)
Data science is helping to remove the middleman from many industries, from finance to the media to transportation. One example is driverless cars and other changes in the transportation sector.
Costis Maglaras
David and Lyn Silfen Professor of Business and Dean, Columbia Business School
Talk Title: How Ride-hailing Platforms Optimize Performance
Abstract: Ride-hailing platforms such as Uber and Lyft match passenger demand to driver supply over a geographic network. They have at their disposal several control capabilities, such as how to prioritize passengers based on their destinations; when and which passenger requests to reject; and when and where to direct idling drivers on the network. I will discuss the impact of these types of matching and capacity optimization decisions on the network’s performance, and show that they lead to significant improvements in most scenarios and work particularly well during the morning and evening rush hours, when passenger flows across the network are most unbalanced.
Sharon Di
Assistant Professor of Civil Engineering and Engineering Mechanics, Columbia Engineering
Talk Title: A New Perspective for Shared Mobility Systems
Abstract: Ride-sharing and other shared mobility systems promise to reduce traffic, energy consumption, and the amount of time and money we spend traveling. Many of these benefits, however, remain unproven. If policymakers are to develop effective ride-sharing policies, they need to know how travelers are redistributed while using the system. In this talk I will discuss new research findings on ride-sharing that can be generalized to other shared mobility systems. This work is based on a mathematical model for predicting how travelers move across a network of roads. Early results suggest that high-occupancy toll lanes, proposed as one way to encourage carpooling and ease congestion, may be ineffective if poorly planned.
Eric L. Talley
Isidor and Seville Sulzbacher Professor of Law, Columbia Law School
Talk Title: A Machine Learning Classifier for Fiduciary Duty Waivers
Abstract: The SEC requires publicly traded companies to periodically disclose important information to the public, but predicting how and where companies will file these disclosures is extremely difficult. I will discuss a supervised learning strategy to analyze one type of SEC disclosure filing: fiduciary duty waivers, signed by company officers and directors waiving their fiduciary duties to shareholders. Using a lawyer-coded training set of these waivers, we calibrated a predictive classifier and extended it to the entire SEC database. The results of simulated out-of-sample Monte Carlo tests are strong by conventional evaluation criteria. We find mildly positive effects on securities market prices, suggesting that the waiving of fiduciary duty by company officers and directors is not received as a negative signal by investors and capital market participants.
Paul Glasserman
Jack R. Anderson Professor of Business, Columbia Business School
(Moderator)
DSI Industry Affiliates have access to Data Science Day recordings after the event. If you are a current DSI Industry Affiliate please contact us at datascience@columbia.edu for a link to the videos.