When you mix a researcher, a massive online encyclopedia and a supercomputer, the result is a collection of insights and visualizations into what Wikipedia looks like mapped across time and space. In a partnership with high-end computing vendor SGI, University of Illinois researcher Kalev Leetaru was able to mine the entire corpus of Wikipedia posts and make some interesting discoveries along the way.
If there’s a report detailing Leetaru’s findings, I haven’t been able to find it, but even this snippet from the project’s Facebook page is pretty insightful:
From this analysis, Wikipedia is seen to have four periods of growth in its historical coverage: 1001-1500 (Middle Ages), 1501-1729 (Early Modern Period), 1730-2003 (Age of Enlightenment), 2004-2011 (Wikipedia Era) and its continued growth appears to be focused on enhancing its coverage of historical events, rather than increased documenting of the present. The average tone of Wikipedia’s coverage of each year closely matches major global events, with the most negative period in the last 1,000 years being the American Civil War, followed by World War II. The analysis also shows that the “copyright gap” that blanks out most of the twentieth century in digitized print collections is not a problem with Wikipedia where there is steady exponential growth in its coverage from 1924 to today.
Leetaru also visualized the findings and created a couple of 30-second videos showing Wikipedia coverage and sentiment over time and geography. His visualizations (like the one above mapping “[e]very year from 1000 AD to 2012 referenced in Wikipedia plotted and cross referenced when mentioned in the same article”) are beautiful as works of art, although one can’t readily decipher who the influential people, organizations and years are.
Still, the project is a valuable reminder of just how far we’ve come in terms of data-analysis techniques and the computing power necessary to run them. This is why the idea of big data is so popular, even if the possibilities haven’t been fully realized yet. Analyses that would have taken weeks or days now take hours, minutes or seconds, which means anyone with the right data and the right gear can learn a heck of a lot if they can keep coming up with good questions.