Exploring Big Data Solutions for the Northeast

Columbia To Lead $1.25 Million Research Project

As lead agency for NSF’s Northeast Big Data Hub, Columbia will bring together experts in the public and private sector to find data-driven solutions. (Darcy Peterka/Columbia University)

Columbia University will lead a $1.25 million project funded by the U.S. National Science Foundation (NSF) to share data, tools and ideas for tackling some of the big challenges facing the northeastern United States.

As lead agency for the Northeast Big Data Innovation Hub, one of four NSF-sponsored hubs, Columbia will bring together experts in the public and private sector to collaborate on data-driven solutions to problems in health care, energy, finance, urbanization, natural science and education.

Massive datasets and novel computational techniques are changing how individuals and societies approach day-to-day tasks. Data analytics promises to deliver individually tailored treatment to patients, massively reduce energy use in buildings, and radically improve teaching methods in schools, among other advances.

With 40 universities, and partners in industry, government and the non-profit sector, Columbia will identify high-priority needs in the region. A series of workshops over the next three years will give partners a chance to brainstorm and collaborate on projects that can bring about the greatest impact.

A few of the questions the Northeast Hub will address:

  • How do we encourage data sharing to maximize the potential for discovery?
  • How can open data principles be balanced against privacy and security concerns?
  • How can cities mine and share data to improve public services and adapt to climate change?
  • How can patient and environmental data be used to prevent and treat disease?

The Northeast is home to some of the oldest and most diverse cities in the United States, and many of the nation’s top universities, hospitals and banks. “It’s an ideal laboratory for testing the potential for data science to improve lives,” said Northeast Hub principal investigator Kathleen McKeown, a computer scientist at Columbia Engineering and director of the Data Science Institute. “The Northeast Hub will focus on extracting insights from large amounts of data that can bring about tangible results.”

The Hub will look at how data analytics can manage a flood of real-time information coming from a variety of energy sources and an increasingly local delivery system. (Vijay Modi/Columbia University)

The idea for a Big Data hub network came in 2012, after President Obama announced a $200 million National Big Data Research & Development Initiative to apply data analytics to education, environmental and biomedical research, and national security. NSF, one of six federal agencies involved, proposed an add-on initiative that would divide the country into “regional innovation hubs,” each harnessing experts in academia, industry, government, and the non-profit sector, to address problems too big for any one to take on alone.

Planning sessions were held earlier this year in four regions. On Monday, NSF announced leaders for the hubs—Columbia for the Northeast; Georgia Tech and North Carolina State University for the South; University of Illinois, Urbana-Champaign, for the Midwest; and the San Diego Supercomputer Center, University of California, San Diego, and University of Washington for the West.

The Northeast Hub includes all six New England states (Maine, Vermont, New Hampshire, Massachusetts, Rhode Island and Connecticut) as well as New York, New Jersey and Pennsylvania. General Electric, Microsoft and Ericsson are among 20 industry partners; NYC’s Office of Data Analytics, Brookhaven National Laboratory and the Regional Plan Association are among 20 government and non-profit partners.

The Northeast Hub will have six areas of focus:

Health: Led by informatics researcher George Hripcsak, chairman of Columbia’s Department of Biomedical Informatics, the group will analyze patient and biological data at scale, and examine ways of harnessing data from social media, environmental sensors and other alternative sources to deliver individualized treatment.

Data analytics is helping doctors understand medication use. Above, an analysis of several million patient records showed there is little agreement on the best anti-depressant to start patients on (inner circle), or what to prescribe as an alternative (outer rings). (Courtesy of George Hripcsak/OHDSI network)

Energy: Led by Abani Patra, at the State University of New York, Buffalo, the group will explore how data analytics can help manage the massive amounts of real-time information coming from an increasingly diverse energy supply (wind, solar, natural gas and other sources) and an increasingly local delivery system.

Cities and Regions: Led by urban research analyst Sanjay Seth, at the Regional Plan Association, and urban informatics researcher Constantine Kontokosta, deputy director of NYU’s Center for Urban Science and Progress, the group will look at how data analytics can improve the delivery of public services and make cities more equitable, sustainable and resilient. 

Finance: Led by computer scientist Michael Kearns, director of University of Pennsylvania’s Warren Center for Network and Data Sciences, the group will apply data analytics to our increasingly automated financial markets to understand their underlying connections and vulnerabilities.

Big Data in Education: Led by computer scientist Beverly Woolf at the University of Massachusetts, Amherst, and computer scientist Ryan Baker at Columbia’s Teachers College, the group will look at turning behavioral feedback from online courses into techniques for teaching subjects more effectively.

Discovery Science: Led by computational earth scientist Chris Hill at MIT, and computer scientist Manish Parashar, director of the Rutgers Discovery Informatics Institute, the group will look for ways to accelerate discovery in the natural sciences by applying machine learning tools and large-scale hardware and software systems to massive amounts of observational data in astrophysics, marine microbiology and materials design, among other disciplines.

The Northeast Hub will address four overarching themes:

Education: Led by computer scientist James Hendler, director of the Rensselaer Institute for Data Exploration and Applications at RPI, the group will develop data science education materials for K-12, college, and continuing and online instruction, and with the New York Hall of Science and other organizations develop public exhibits related to data analytics.

Data Sharing: Led by computer scientist Sam Madden at MIT’s Computer Science and Artificial Intelligence Laboratory, the group will study platforms and formats for regional data sharing, including software to allow researchers to annotate and publish their own data.

Ethics and Policy: Led by digital communication researcher Jennifer Stromer-Galley at Syracuse University, and technology researchers Mark Latonero and Karen Levy at the Data & Society Research Institute, the group will focus on questions tied to the ethical collection and use of big data, including consumer and health information.

Privacy and Security: Led by computer scientist Adam Smith, at Penn State, the group will focus on how to keep data safe but accessible at scale, all while protecting individual privacy.

The Northeast Hub’s executive committee will be led by McKeown, the principal investigator; Howard Wactlar, a computer scientist at Carnegie Mellon; Carla Brodley, a computer scientist at Northeastern University; Vasant Honavar, a computer scientist at Penn State; and Andrew McCallum, a computer scientist at University of Massachusetts, Amherst. A full list of partners is available on the Northeast Hub website.

The Hub will hold its first workshop on Dec. 16 at Columbia. Speakers will include counter-terrorism expert Michael Leiter, chief strategy officer at Leidos, a top U.S. defense contractor; and Keith Marzullo, director of the Federal Networking and Information Technology Research & Development program. An industry session will feature discussion of corporate data analytics and how companies in the Northeast might benefit from the Hub. A breakout session organized by leaders of the above sub-groups will focus on data available for sharing and plans for the coming year.

NSF announcement: Establishing a brain trust for data science

— Kim Martineau
