A History of Film in Wikipedia

What story does the link structure of Wikipedia tell about the history of film? My timeline shows the most influential movies of the last 100 years, as ranked by a link analysis algorithm run on Wikipedia. Each blue bar shows the influence of a year's top-scoring movie and is labelled with that movie.


In a nutshell, “influence” represents the chance that an infinitely bored web surfer will be browsing the page after surfing for a long period of time. For technical folks, this is the PageRank calculated from intra-Wikipedia links. See my post on ranking universities using Wikipedia for more details.
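For readers who want to see the bored-surfer idea in code, here is a minimal sketch of PageRank by power iteration. It is not the WikiRank code: the function name `pageRank`, the damping factor of 0.85, the fixed iteration count, and the three-page toy graph are all assumptions chosen for illustration.

```go
package main

import "fmt"

// pageRank runs power iteration on a directed link graph.
// links[i] lists the pages that page i links to.
// damping is the probability that the surfer follows a link
// instead of jumping to a random page (0.85 is a common choice,
// assumed here, not taken from the post).
func pageRank(links [][]int, damping float64, iterations int) []float64 {
	n := len(links)
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1.0 / float64(n) // start with a uniform distribution
	}
	for it := 0; it < iterations; it++ {
		next := make([]float64, n)
		dangling := 0.0 // rank held by pages with no outgoing links
		for i, out := range links {
			if len(out) == 0 {
				dangling += rank[i]
				continue
			}
			share := rank[i] / float64(len(out))
			for _, j := range out {
				next[j] += share
			}
		}
		for j := range next {
			// Dangling rank is spread evenly so the total stays at 1.
			next[j] = (1-damping)/float64(n) + damping*(next[j]+dangling/float64(n))
		}
		rank = next
	}
	return rank
}

func main() {
	// Hypothetical three-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
	links := [][]int{{1, 2}, {2}, {0}}
	for page, r := range pageRank(links, 0.85, 50) {
		fmt.Printf("page %d: %.3f\n", page, r)
	}
}
```

The returned scores sum to 1, so each one can be read as the long-run probability of finding the surfer on that page.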

Using movie infoboxes from Wikipedia, I’ve identified the release year for each movie. The influence of the highest-scoring film in each year sets the length of that year's blue bar on the left side of the graph.
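The per-year step can be sketched roughly as follows. This is an illustration and not the WikiRank code: the `Film` and `YearSummary` types, the `summarizeByYear` function, and the sample scores are all hypothetical. The same pass also totals each year's influence, which is what you need to work out a film's share of its year.

```go
package main

import (
	"fmt"
	"sort"
)

// Film is a hypothetical record: a title, the release year taken
// from the infobox, and an influence (PageRank) score.
type Film struct {
	Title     string
	Year      int
	Influence float64
}

// YearSummary holds a year's top film and the summed influence
// of every film released that year.
type YearSummary struct {
	Year  int
	Top   Film
	Total float64
}

// TopShare is the fraction of the year's influence held by its top film.
func (s YearSummary) TopShare() float64 {
	if s.Total == 0 {
		return 0
	}
	return s.Top.Influence / s.Total
}

// summarizeByYear groups films by release year and returns one
// summary per year, sorted by year.
func summarizeByYear(films []Film) []YearSummary {
	byYear := map[int]*YearSummary{}
	for _, f := range films {
		s, ok := byYear[f.Year]
		if !ok {
			s = &YearSummary{Year: f.Year, Top: f}
			byYear[f.Year] = s
		}
		if f.Influence > s.Top.Influence {
			s.Top = f
		}
		s.Total += f.Influence
	}
	out := make([]YearSummary, 0, len(byYear))
	for _, s := range byYear {
		out = append(out, *s)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Year < out[j].Year })
	return out
}

func main() {
	// Made-up titles and scores, for illustration only.
	films := []Film{
		{"Film A", 1927, 3.0},
		{"Film B", 1927, 1.0},
		{"Film C", 1915, 2.0},
	}
	for _, s := range summarizeByYear(films) {
		fmt.Printf("%d: %s (%.0f%% of the year's influence)\n",
			s.Year, s.Top.Title, 100*s.TopShare())
	}
}
```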

The visualization is coded in d3.js and the influence code is my custom WikiRank project (coded in Go!). Feel free to fork it.

A couple of interesting tidbits from the graph:

  • Titanic Titan: Of the few movies released in my lifetime, Titanic is the most influential, and it almost matches Star Wars IV. Remember that the graph represents influence and not quality.
  • World War II: The early war years were an amazing time in film. Starting with The Wizard of Oz in 1939, there is a marked increase in influence which persists until just after America joins the war. More classics: Citizen Kane, Casablanca and Fantasia!
  • The early years: There are very few blockbusters pre-1930, with two notable exceptions: Metropolis forms a solitary peak in 1927, and 1915 stands out with The Birth of a Nation. Drilling deeper, Metropolis actually represents 31% of the film influence mass for 1927.

If you want to crunch the data yourself, I’ve made a machine-readable list of films and their influence available online.

Find something interesting? Visualization ideas? Comment or tweet @cosbynator