WebTables Stands the Test of Time

October 4, 2018

VLDB Journal Article Wins 10-Year Best Paper Award for Co-Author Eugene Wu and Colleagues

Technology moves fast. Today’s hot news quickly becomes tomorrow’s cliché. So it’s significant when a research paper not only endures but continues to reverberate. That’s the case with the 2008 paper “WebTables: Exploring the Power of Tables on the Web,” of which Columbia’s Eugene Wu was a co-author. The paper was recently honored with the “10-Year Best Paper Award” by the VLDB Journal, which selects the article from the VLDB Proceedings of ten years earlier that has “best met the ‘test of time’, that is, that has had the most influence since its publication.” Wu’s co-authors on the paper were Michael Cafarella, now at the University of Michigan; Alon Halevy, now at Megagon Labs; Yang Zhang, MIT; and Daisy Zhe Wang, now at the University of Florida.

Ten years ago, Wu had just completed his undergraduate degree at Berkeley and was working at Google for a year before heading off to the Ph.D. program at MIT. Wu, who is now Assistant Professor of Computer Science at Columbia and co-chair of the Center for Data, Media, and Society at Columbia’s Data Science Institute, notes that in 2008 Google had one of the largest corpuses of internet pages that it regularly crawled. Wu and his Google colleagues were particularly interested in the structured data contained in tables. Until that time, large-scale collections of tables simply did not exist.

The release of Google’s MapReduce framework made it possible to go big, and to process all of the internet pages that Google had crawled. So Wu and his colleagues asked the question: “Can you create a large-scale extraction and analysis of tabular data available on the internet?” He notes that “the power of databases is to store structured data in a way that can answer complex questions the user may have. This forms the foundation of most data applications today. However, web pages were simply text, and we’d need to extract tables from them.”

Today, we almost take for granted the power of identifying and combining data in tables. For example, you could combine a table on the incidence of tuberculosis in Asia with similar tables from other continents to get a global picture of the disease. But that was not possible in 2008, since most tables existed only as text scattered across different webpages.

The team’s first order of business was to try extracting all data bearing the HTML “TABLE” tag, says Wu. “We simply wanted to know: Can this work?” By leveraging the MapReduce framework and simple HTML parsing, the team extracted 14 billion HTML tables. Most of those did not contain data: they were tabular layouts, calendars and other types of information graphically arrayed along a grid. But about 1% of those objects, or about 154 million items, were tables that did indeed contain data.
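To make the extract-then-filter step concrete, here is a minimal Python sketch. It pulls every top-level “TABLE” element out of a page with the standard library's HTML parser, then applies a crude filter to separate data tables from layout grids. The filter thresholds and the function names (`extract_tables`, `looks_like_data`) are illustrative assumptions, not the method the WebTables team used, and the sketch runs on a single page rather than at MapReduce scale.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every top-level <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._depth = 0     # nesting depth of <table> tags
        self._rows = None   # rows of the table being read
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and self._rows is not None:
            self._rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._rows = []
        elif self._depth == 1:
            if tag == "tr":
                self._close_row()
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._close_cell()
                self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self._close_row()
                self.tables.append(self._rows)
                self._rows = None
        elif self._depth == 1:
            if tag in ("td", "th"):
                self._close_cell()
            elif tag == "tr":
                self._close_row()

    def handle_data(self, data):
        # Text inside nested tables is folded into the enclosing cell.
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def looks_like_data(rows, min_rows=3, min_cols=2, min_filled=0.8):
    """Hypothetical heuristic: a regular grid with mostly non-empty cells."""
    rows = [r for r in rows if r]
    if len(rows) < min_rows or len(rows[0]) < min_cols:
        return False
    if any(len(r) != len(rows[0]) for r in rows):
        return False
    cells = [c for r in rows for c in r]
    return sum(1 for c in cells if c) / len(cells) >= min_filled
```

A table used only to position a navigation bar would typically fail the row and column minimums, while a table with a header row and several regular rows of values would pass. A production filter would need far more signal than this.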

The researchers then went even further. They created two tools that make the assemblage of data useful. One tool suggests the attributes that should be included in a particular table. For example, if you input a company’s stock symbol, the tool will suggest you add “Company, rank, and sales,” as headers on the table. Another tool recognizes similar attributes even if they have different names on different tables. That is, the tool recognizes that “hr” is the same as “home runs” on tables with baseball statistics. Tools such as these allow researchers to combine datasets from different tables. The work also increased the utility of search engines. Now, if you ask for the average temperature in Berkeley, for example, the answer will likely come from a table, notes Wu. Indeed, it is these contributions that make the work so durable and earned the team the “test of time” award, says Gerhard Weikum, Chair of the 2018 VLDB Endowment Awards Committee. The paper describes “ground-breaking and highly influential work on making ad-hoc web tables an asset for search engines,” he says.
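One way to picture the attribute-suggestion tool is as a lookup over statistics on which column headers tend to appear together. The Python sketch below is a simplified illustration of that idea under stated assumptions: the toy schemas are invented, and the scoring (the share of tables containing the seed attribute that also contain the candidate) is a stand-in chosen for clarity, not the algorithm from the paper.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(schemas):
    """Count how often each attribute, and each attribute pair, appears."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        attrs = sorted({a.lower() for a in schema})
        attr_counts.update(attrs)
        pair_counts.update(combinations(attrs, 2))
    return attr_counts, pair_counts


def suggest(seed, attr_counts, pair_counts, k=3):
    """Rank attributes by how often they co-occur with the seed attribute."""
    seed = seed.lower()
    if attr_counts[seed] == 0:
        return []
    scores = {}
    for (a, b), n in pair_counts.items():
        if a == seed:
            scores[b] = n / attr_counts[seed]
        elif b == seed:
            scores[a] = n / attr_counts[seed]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda attr: (-scores[attr], attr))[:k]


# Invented header rows standing in for a corpus of extracted tables.
TOY_SCHEMAS = [
    ["stock-symbol", "company", "rank", "sales"],
    ["stock-symbol", "company", "price"],
    ["name", "hr", "avg"],
    ["name", "home runs", "avg"],
]

attr_counts, pair_counts = build_cooccurrence(TOY_SCHEMAS)
suggestions = suggest("stock-symbol", attr_counts, pair_counts)
```

In the toy data, “hr” and “home runs” never share a table yet both appear alongside “name” and “avg.” That kind of pattern is a plausible signal for the synonym tool, though the sketch does not implement it.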

But the work is not finished. The authors “believe there are still tremendous opportunities around extracting and manipulating structured data on the Web. Indeed, we think the next decade holds even more promise for WebTables-style work than the last,” write Wu, Cafarella, Halevy, and Wang, together with Hongrae Lee, Jayant Madhavan, and Cong Yu of Google, in their retrospective analysis of the work, “Ten Years of WebTables.”

— Robert Florida