One Hour Wikipedia

Channeling both Chomsky and Norvig, I routinely consult Wikipedia by reading it in its entirety to answer questions no one had thought to pose.

Wikipedia's markup rules exert structure on natural language. Each speech act by an author is followed by a check that the parser has properly interpreted the extra symbols one must know to write for the encyclopedia. It thus becomes a gigantic corpus of semi-structured text.

At the same time, Wikipedia's markup rules allow one human to speak to another without regard for the computer's ability to understand. Each speech act is followed by relentless revision, where other humans in turn attempt to improve the utility of what has been said. It thus becomes a gigantic corpus of semi-unstructured text of some utility.

In order to avail myself of the utility within Wikipedia and similar texts, where rules have been only partially enforced, I have developed a parsing technology and related methodology I call Exploratory Parsing.

# Process

I explore by posing somewhat vague questions, finding unexpected answers, and then refining and repeating my questions accordingly. I am excited by the regularities I find and by the rare exceptions.

I pose my questions using a rigorous syntax based on decades of technology developed by computer scientists for parsing computer programs, but still rooted in Chomsky. Recent innovations by Bryan Ford, his parsing expression grammars (2004), allow me to apply those efficiencies to texts as large as Wikipedia.
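
To give the flavor of the notation, here is a minimal sketch of such a grammar, rendered with Python's parsimonious library purely for illustration; the rules, names, and toy input are mine, not one of my actual queries.

```python
# A minimal parsing expression grammar (PEG) in the style Ford described,
# written with the Python "parsimonious" library for illustration only.
from parsimonious.grammar import Grammar

wiki_links = Grammar(r"""
    page   = (link / other)*        # scan a page, link by link
    link   = "[[" target "]]"       # an internal wiki link
    target = ~r"[^\[\]]+"           # anything up to the closing brackets
    other  = ~r"[^\[]+" / "["       # plain text that is not a link
""")

tree = wiki_links.parse("See [[Noam Chomsky]] and [[Peter Norvig]].")
print(tree)  # a parse tree with a node for every grammatical term
```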

I once studied how company names relate to their domain names by parsing Wikipedia's infoboxes.
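
A sketch of the kind of extraction that study required; the infobox field names and the name-to-domain comparison below are a reconstruction of the idea, shown with a plain regular expression rather than a full grammar.

```python
# A sketch of the company-name vs. domain-name comparison. The field names
# ("name", "website") and the normalization are illustrative assumptions.
import re

INFOBOX_FIELD = re.compile(r"^\s*\|\s*(name|website)\s*=\s*(.+)$", re.M)

sample = """{{Infobox company
| name    = Example Corporation
| website = {{URL|example.com}}
}}"""

fields  = dict(INFOBOX_FIELD.findall(sample))
company = fields["name"].lower().replace(" ", "")       # "examplecorporation"
domain  = re.sub(r"[{}]|URL\|", "", fields["website"])  # "example.com"
print(company.startswith(domain.split(".")[0]))         # does the name echo the domain?
```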

I once studied how Wikipedians used videos by parsing all variations of video file and player embeds.

I once studied how street addresses had been written in order to transform them to use a new format.

I once studied copyright declarations and found by accident how many claimants couldn't spell "copyright".
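
The counting machinery behind that accident can be sketched briefly; the variant spellings below are illustrative guesses, and the two sample declarations stand in for the corpus.

```python
# A sketch of tallying copyright declarations while also catching
# misspellings. The variant list is a guess, not the study's findings.
import re
from collections import Counter

CLAIM = re.compile(r"\b(copyright|copywright|copyrite|copywrite)\b", re.I)

pages = ["Copyright 2004 Example Corp.", "Copywright 1999 A. Claimant."]
counts = Counter(word.lower() for page in pages for word in CLAIM.findall(page))
print(counts)  # every character was read; the result is just a count
```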

Although my results are counts, like Norvig's n-grams, I look at each and every character in the light of the query of the moment. My results are not statistical.

# Speed

It takes me an hour to parse all of Wikipedia on a laptop, which is comparable to the time it takes just to copy it.

I have devised mechanisms that let me reduce this cycle time for exploring to minutes or seconds.

I stream results from the parser as it parses. Simple mistakes in my queries surface in seconds; trends in the data surface in minutes.
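
A sketch of that streaming loop; the pages() generator below stands in for an iterator over the real dump, and the reporting interval is arbitrary.

```python
# A sketch of streaming counts out of the parser as it runs, so that
# query mistakes surface early rather than after a full pass.
import re
import sys
from collections import Counter

def pages():
    """Stand-in for streaming pages out of a Wikipedia dump."""
    yield "See [[Noam Chomsky]] and [[Peter Norvig]]."
    yield "Links: [[Parsing]] and [[Noam Chomsky]]."

counts = Counter()
for n, page in enumerate(pages(), 1):
    counts.update(m.group(1) for m in re.finditer(r"\[\[([^\]|]+)", page))
    if n % 10000 == 0:  # periodic progress report on the real corpus
        print(n, counts.most_common(5), file=sys.stderr)

print(counts.most_common(5))
```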

I save usage samples for every grammatical term so that I can understand what I find in context.
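
A sketch of that bookkeeping; the cap of five samples per term is arbitrary, and the parser would call saw() on each match.

```python
# A sketch of saving a few usage samples per grammatical term, so that a
# count can later be read back in its original context.
from collections import defaultdict

samples = defaultdict(list)

def saw(term, text):
    """Record that a grammatical term matched some text, keeping a few examples."""
    if len(samples[term]) < 5:
        samples[term].append(text)

saw("link", "[[Noam Chomsky]]")
saw("template", "{{URL|example.com}}")
print(samples["link"])
```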

I exploit the nested contexts provided by parsing to construct representative subsets of the corpus that are appropriately diverse while 100 to 1000 times shorter.
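
One way such a subset could be built, assuming the parser reports the chain of enclosing rules for each match; the contrived examples below show the idea.

```python
# A sketch of building a short but diverse subset: keep one example of each
# nesting context and discard the repetitive bulk. Treating a context as
# the chain of enclosing rule names is an assumption for illustration.
def representative_subset(matches):
    """Keep the first match seen in each distinct nesting context."""
    seen, subset = set(), []
    for context, text in matches:
        if context not in seen:
            seen.add(context)
            subset.append(text)
    return subset

matches = [("page/infobox/website", "{{URL|a.com}}"),
           ("page/infobox/website", "{{URL|b.com}}"),
           ("page/link", "[[Noam Chomsky]]")]
print(representative_subset(matches))  # two entries stand in for three
```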

I tune my queries using these shortcuts, then unleash them on the whole corpus to get a fresh subset or a final result.

# Impact

Although I have promoted Ford's work within my communities with some success, I have yet to see anyone else use exploratory parsing at all, despite my offering to help write queries to get people started.

I attribute this to the bad experience most programmers have had with grammars in their university compiler construction classes. Once held up as a shining example of theory, parser construction became a small and neglected part of a student's one experience writing a large program. Now even faculty shrug off this technology, observing that very few graduates ever write a compiler. What a shame that parsing and compiling have been thus conjoined.