Cafarella Receives VLDB Test of Time Award for Structured Web Data Search

Michael Cafarella

Prof. Michael Cafarella and his collaborators have been awarded a 2018 Very Large Data Bases (VLDB) 10-Year Test-of-Time Award for their paper, "WebTables: exploring the power of tables on the web." The award goes to the VLDB paper published ten years earlier that has had the most influence since its publication.

In this paper, Cafarella and co-authors Alon Halevy, Zhe Daisy Wang, Eugene Wu, and Yang Zhang set out to determine how to provide search-engine-style access to huge volumes of structured web data.

Thanks to the HTML table tag, the World-Wide Web is packed full of structured data. Cafarella and his team extracted 14.1 billion of these tables from Google's general-purpose web crawl, then filtered out those used for formatting or other non-relational purposes, leaving an estimated 154 million that contained high-quality relational data. Each table in this smaller set can be considered a small structured database, and at the time the paper was written the set was the largest collection of web tables by at least five orders of magnitude.

Using this massive dataset, the team developed the WebTables system to examine effective methods of searching within collections of tables and how a collection this large could be used to increase search power.

The project resulted in several new techniques for keyword search over a collection of tables that achieve much higher search relevance than solutions based on a traditional search engine. The team also derived the "attribute correlation statistics database" (ACSDb) from the table collection, which records collection-wide statistics on elements in the different tables. In addition to improving search relevance, the ACSDb made possible several new applications: search term auto-complete, attribute synonym finding, and join-graph traversal, which allows a user to navigate between common terms from different tables using automatically generated join links.

The ACSDb marked the first time statistics on this scale had been collected for structured data design.
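
To make the idea of collection-wide attribute statistics concrete, here is a minimal sketch in Python. It is not the WebTables implementation: the toy schemas, the function names build_stats and suggest, and the simple co-occurrence ratio used for scoring are all assumptions made for illustration. It counts how often column names appear together across table schemas, then uses those counts to suggest columns that commonly accompany a given one, in the spirit of the auto-complete application described above.

```python
from collections import Counter
from itertools import combinations


def build_stats(schemas):
    """Count how many schemas contain each attribute and each attribute pair."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        # Normalize case and deduplicate so each schema counts an attribute once.
        attrs = sorted({name.lower() for name in schema})
        attr_counts.update(attrs)
        # Sorted input means each pair is stored in one canonical order.
        pair_counts.update(combinations(attrs, 2))
    return attr_counts, pair_counts


def suggest(attr, attr_counts, pair_counts, k=3):
    """Rank attributes by the fraction of attr's schemas they also appear in."""
    attr = attr.lower()
    total = attr_counts[attr]
    if total == 0:
        return []
    scores = {}
    for (a, b), n in pair_counts.items():
        if a == attr:
            scores[b] = n / total
        elif b == attr:
            scores[a] = n / total
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical schemas standing in for the column headers of web tables.
SCHEMAS = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["make", "model", "year", "color"],
    ["name", "address", "phone"],
]

attr_counts, pair_counts = build_stats(SCHEMAS)
print(suggest("make", attr_counts, pair_counts))
```

With these four toy schemas, "model" ranks first for "make" because it appears in every schema that contains "make". At web scale the same kind of counts would be drawn from millions of tables rather than a handful.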

Posted: June 13, 2018