An Exercise in Irrelevance - Data-management on the Web Scale, Alon Halevy

This is a live blog from Neuroinformatics 2009.

Data management: view from 50,000 feet. The dimensions are the amount of structure and the number of data sources. More structure, fewer data sources.

Distinguishes between parallelisation and heterogeneity. You can distribute data across tables in an organised way, which is parallelisation; or you can have lots of data spread across resources, with multiple entities and no common plan, which is heterogeneity.

Outline: data integration, then dataspaces suggested as a solution.

Databases are so successful because they provide a level of abstraction over the data. Data integration is a higher level of abstraction still, because you don’t have to worry about how the data is stored or structured.

The architecture: a mediated schema, which uses a mediation language, a mapping tool, and then a set of wrappers over the data sources, which map them to a common syntax (a relational database, for example).
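To make the wrapper idea concrete, here is a minimal sketch (mine, not from the talk). The schema, the two sources and all field names are invented for illustration; a real system would use a proper mediation language rather than hand-written Python functions.

```python
# Hypothetical mediated schema: the attribute names every wrapper must emit.
MEDIATED_SCHEMA = ("gene_id", "name", "species")

# Two made-up sources with different native layouts.
SOURCE_A = [{"id": "G1", "label": "BDNF", "organism": "human"}]
SOURCE_B = [("G2", "mouse", "Ntrk2")]  # (identifier, species, symbol)


def wrap_a(rows):
    """Wrapper for source A: rename its fields to the mediated schema."""
    for r in rows:
        yield {"gene_id": r["id"], "name": r["label"], "species": r["organism"]}


def wrap_b(rows):
    """Wrapper for source B: turn positional tuples into mediated records."""
    for ident, species, symbol in rows:
        yield {"gene_id": ident, "name": symbol, "species": species}


def query(predicate):
    """Run one query over every source, seeing only the mediated schema."""
    results = []
    for wrapper, rows in ((wrap_a, SOURCE_A), (wrap_b, SOURCE_B)):
        for record in wrapper(rows):
            if predicate(record):
                results.append(record)
    return results


if __name__ == "__main__":
    print(query(lambda r: r["species"] == "mouse"))
```

The caller of query() never sees how either source is laid out, which is the extra level of abstraction; the cost is that someone has to write and maintain one wrapper per source.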

So, we know how to do it, but the cost of building data integration systems is really high. Creating the mediated schema or ontology is hard; sometimes it’s impossible. Mapping sources to the mediated schema can be a nightmare, because you need many people from both sides of the mediation. There are some automated systems, but a human is always needed. Then there are data-level mappings (changing IDs, synonyms and so on), and social costs.

One of the problems with data integration is that it costs a lot early, but yields very little until much later, when it’s all done. What we really want is pay-as-you-go data management; we want useful data out early and constantly.

Every time a human does something with data, they are telling you some information about the data. If you can capture this information then you can do useful stuff with it.

Structured data on the web: the deep web, which is data behind forms; and two others. So, the deep web: knowledge which is not accessible through general-purpose search engines; cars, houses and so on are examples of this. Uses dataspaces as a way of doing this; learned 5000 different data sources in two months.

One possible way to access the deep web is to put queries against web forms. You have to guess what to put in; one way is to just use words from the form page itself in the first place. Currently, Google gives much knowledge from this deep web; it has the biggest impact on the deep web.
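A rough sketch of the "use words from the form page" guess, written after the talk and not a description of how Google actually does it. The sample page, the URL and the four-letter word filter are all my assumptions; the code only builds candidate query URLs and never sends a request.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormPageParser(HTMLParser):
    """Collect text-input names and the visible words on a form page."""

    def __init__(self):
        super().__init__()
        self.text_inputs = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # An <input> with no type attribute defaults to a text box.
        if tag == "input" and a.get("type", "text") == "text" and a.get("name"):
            self.text_inputs.append(a["name"])

    def handle_data(self, data):
        # Assumption: words of four or more letters are plausible seed queries.
        self.words.extend(re.findall(r"[a-z]{4,}", data.lower()))


def candidate_queries(html, action_url, limit=5):
    """Pair each text input with the first few distinct words on the page."""
    parser = FormPageParser()
    parser.feed(html)
    seeds = []
    for word in parser.words:
        if word not in seeds:
            seeds.append(word)
    return [
        f"{action_url}?{urlencode({name: word})}"
        for name in parser.text_inputs
        for word in seeds[:limit]
    ]


# Made-up form page for illustration.
PAGE = (
    "<html><body><h1>Used cars search</h1>"
    '<form action="/search"><input type="text" name="q">'
    '<input type="submit" value="Go"></form></body></html>'
)

if __name__ == "__main__":
    for url in candidate_queries(PAGE, "http://example.com/search"):
        print(url)
```

Whatever comes back from such probes would then tell you which seed words are worth keeping, but that feedback step is left out here.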

Web tables: can we exploit the knowledge from the tables better? There are 14 billion tables on the web, of which about 154 million are interesting; the rest are formatting or whatever. The first problem is to identify schema elements; these are expressible in HTML but actually no one uses it. So you have to guess. They got 2.6 million schemas. Would be good to put these into autocomplete (although not sure where).
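The talk did not say how the guessing is done, so the following is only a toy heuristic of my own: treat the first row as the schema if none of its cells is numeric or empty and at least one column underneath is entirely numeric. The function names and sample tables are invented.

```python
def looks_numeric(cell):
    """True if the cell parses as a number once thousands commas are removed."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def guess_schema(rows):
    """Return the first row as the schema if it looks like a header, else None."""
    if len(rows) < 2:
        return None
    header, body = rows[0], rows[1:]
    # A header made of numbers or blanks is probably just another data row.
    if any(looks_numeric(c) or not c.strip() for c in header):
        return None
    # Require some column whose body is all numeric, so that a grid of
    # navigation links (a layout table) is not mistaken for data.
    for col in range(len(header)):
        column = [r[col] for r in body if col < len(r)]
        if column and all(looks_numeric(c) for c in column):
            return header
    return None


if __name__ == "__main__":
    cars = [["make", "year", "price"],
            ["Ford", "2004", "3,500"],
            ["Honda", "2006", "5,200"]]
    layout = [["Home", "About"], ["News", "Contact"]]
    print(guess_schema(cars))
    print(guess_schema(layout))
```

This misses plenty of real tables (anything with only text columns, for a start), which is presumably why only a small fraction of the tables on the web yield usable schemas.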

Fusion Tables lets you upload data and collaborate on the visualisation of it. It changes the visualisation options depending on the data types.

Conclusions: bottom-up data integration, which is more realistic than top-down. Dataspaces are an approach. Fusion Tables is good.