Columbia University Awarded $14 Million Grant to Develop Computer System That Can Translate and Summarize Documents from Different Languages into English

November 10, 2017

A research team led by Kathleen McKeown, founding director of the Data Science Institute, is designing a computer system that can process, translate and summarize documents from different languages into English.

The system, called SCRIPTS (System for Cross Language Information Processing, Translation and Summarization), will be capable of transcribing speech from sources such as videos, news broadcasts, and video content from social media. It will also process text documents such as newspapers, reports and social media posts. Given a query, the human-language technology system will retrieve documents in languages with relatively low presence on the internet (called low-resource languages) such as Hausa and Uyghur. It will find and translate the documents into English while producing short summaries of relevant portions for the user.

The team will develop automatic speech-recognition components that will work across a range of languages and fields. A submitted English query will be translated into the foreign language and matched against the stored foreign-language documents. Relevant documents will then be retrieved, translated into English and summarized. The system will also use other strategies, such as matching an English query against translated documents, or summarizing foreign-language documents and then translating the summary.
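The query-translation strategy described above can be illustrated with a toy sketch. Everything in it is hypothetical and is not drawn from SCRIPTS itself: the tiny bilingual dictionary, the placeholder tokens standing in for foreign-language words, the sample documents and all function names are assumptions made for illustration. Simple word lookup and term counting stand in for the far more sophisticated translation, retrieval and summarization components a real system would use.

```python
from collections import Counter

# Hypothetical toy dictionary: English words -> placeholder "foreign" tokens.
EN_TO_FX = {"water": "fx_water", "market": "fx_market", "price": "fx_price"}
FX_TO_EN = {fx: en for en, fx in EN_TO_FX.items()}

# Hypothetical stored foreign-language documents.
DOCS = {
    "doc1": "fx_market fx_price fx_price fx_rain",
    "doc2": "fx_water fx_rain",
    "doc3": "fx_song fx_dance",
}


def translate_query(query):
    """Translate an English query word by word, dropping unknown words."""
    return [EN_TO_FX[w] for w in query.lower().split() if w in EN_TO_FX]


def retrieve(query_terms, docs):
    """Rank documents by how often the translated query terms occur in them."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def translate_document(text):
    """Translate a document back into English; unknown tokens are bracketed."""
    return " ".join(FX_TO_EN.get(tok, f"<{tok}>") for tok in text.split())


def summarize(english_text, query, max_words=3):
    """Stand-in summary: keep only the words that match the query."""
    query_words = set(query.lower().split())
    relevant = [w for w in english_text.split() if w in query_words]
    return " ".join(relevant[:max_words])


def search(query, docs=DOCS):
    """Translate the query, retrieve matches, then translate and summarize them."""
    terms = translate_query(query)
    results = []
    for doc_id in retrieve(terms, docs):
        english = translate_document(docs[doc_id])
        results.append((doc_id, summarize(english, query)))
    return results


if __name__ == "__main__":
    for doc_id, summary in search("water price"):
        print(doc_id, "->", summary)
```

The alternative strategies mentioned above would reorder these same steps, for example by translating every stored document first and matching the English query against the translations.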

The research is supported by a $14 million grant, awarded through the MATERIAL program, from the Intelligence Advanced Research Projects Activity (IARPA), the federal organization that invests in research to support the U.S. intelligence community. Intelligence analysts study activity in countries all around the world and must read copious documents in many foreign languages. As it is now, analysts must wade through documents manually or use a computer system unable to translate uncommonly spoken languages into English. And current software systems don’t provide good translations of low-resource languages. SCRIPTS, however, will provide a full end-to-end system for information retrieval, translation and summarization. In so doing, it will be of immense help to intelligence analysts, says McKeown, who is the principal investigator on the project.

“Intelligence analysts have come to meetings and told us precisely the kind of system they need to be more efficient,” adds McKeown, a professor of Computer Science in Columbia Engineering and an affiliate of the Data Science Institute. “We are designing this computer system to help analysts, who work in every corner of the world, translate and wade through an immense amount of documents related to their work, allowing them to do their research faster and more effectively. It’s an exciting project that combines many overlapping technologies.”

Those technologies include machine learning, cross-language information retrieval (CLIR), neural networks, text summarization and machine translation. Some components of the system, such as machine translation, will be based upon translation systems that won first place in both the 2016 Conference on Machine Translation and the International Workshop on Spoken Language Translation. The approaches to CLIR and summarization, moreover, will be designed to work optimally with whatever cross-lingual resources are available, from cross-lingual phrases to full translations. In terms of computational efficiency, some 750 million words a day can be translated.

The members of the research team assembled by McKeown to develop SCRIPTS are all leading experts in their respective fields. The international team comprises researchers from the University of Cambridge, the University of Maryland, the University of Edinburgh, Yale and Columbia. Team leaders include McKeown, Julia Hirschberg, Michael Collins and Smaranda Muresan from Columbia; Hal Daumé III, Doug Oard, Marine Carpuat and Philip Resnik from the University of Maryland; Dragomir Radev from Yale; Steve Renals, Kenneth Heafield, Peter Bell, Rico Sennrich and Barry Haddow from the University of Edinburgh; and Mark Gales from the University of Cambridge.

“I’m humbled to have assembled such a great team,” says McKeown, who is prominent in the field of natural-language processing.

McKeown also has vast experience in developing computer systems, and is especially known for developing Columbia Newsblaster. Created in 2001, Newsblaster is an online system that automatically tracks the day’s news and offers multi-document summarization, clustering and text categorization. In 2010, she received the Anita Borg Women of Vision Award in Innovation for her work on Newsblaster. And as founding director of the Data Science Institute, she led large research projects such as DARPA GALE, IARPA FUSE, and NSF Digital Libraries. Having led all these teams, she knows a good research project when she sees one, and she has high hopes for her latest project: SCRIPTS.

“The international research team has top-notch people in their respective fields,” says McKeown. “I’m certain the team will make a significant advance in developing a state-of-the-art human language technology system that will help millions of researchers.”

— Robert Florida