Norvig on Data

Peter Norvig, et al. "The Unreasonable Effectiveness of Data" in IEEE Intelligent Systems (2009)


For many tasks, words and word combinations provide all the representational machinery we need to learn from text. Human language has evolved over millennia to have words for the important concepts; let’s use them. Abstract representations (such as clusters from latent analysis) that lack linguistic counterparts are hard to learn or validate and tend to lose information.

In reality, three orthogonal problems arise: choosing a representation language, encoding a model in that language, and performing inference on the model.

Semantic interpretation deals with imprecise, ambiguous natural languages, whereas service interoperability deals with making data precise enough that the programs operating on the data will function effectively. The fact that the word “semantic” appears in both “Semantic Web” and “semantic interpretation” means that the two problems have often been conflated, causing needless and endless consternation and confusion.

We know how to build sound inference mechanisms that take true premises and infer true conclusions. But we don’t have an established methodology to deal with mistaken premises or with actors who lie, cheat, or otherwise deceive.

Using a Semantic Web formalism just means that semantic interpretation must be done on shorter strings that fall between angle brackets.

So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.

For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
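
The advice quoted above, to keep the data in a nonparametric model built from the words already there, can be illustrated with a table of word-pair counts. This is only a minimal sketch, not the authors' method: the function names and the three-sentence corpus are hypothetical, and a real application would count over a far larger body of text.

```python
from collections import Counter, defaultdict


def bigram_counts(sentences):
    """Count how often each word follows each other word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with its immediate successor.
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def most_likely_next(following, word):
    """Return the most frequently observed successor, or None if the word was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "follow the data",
    "gather the data",
    "follow the words",
]

model = bigram_counts(corpus)
print(most_likely_next(model, "the"))  # prints "data"
```

Nothing here is summarized into a fixed set of parameters: the model is simply the counts, so it keeps growing in detail as more text is added.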


The authors skewer the notion that ontologies offer useful meaning in any but the most important domains, and even then only when the primary players in that domain choose to cooperate.

Clay Shirky similarly skewers Paul Otlet's Mundaneum from the '30s for an unrealizable dependence on ontology.

See Norvig on Chomsky, where he similarly skewers an almost offhand remark by Chomsky that elevates models as the focus of scientific inquiry.

Federated wiki suggests a middle ground between ontology and statistics: the co-evolution of datasets and the mechanisms that employ them. Google has made much of rampant but isolated digital publications. But what more could happen if that isolation were removed?