What story does the link structure of Wikipedia tell about the history of film? My timeline represents the most important movies of the last 100 years – generated via a link analysis algorithm on Wikipedia. The blue bars correspond to influence labelled with the most important movie of that year.
In a nutshell, “influence” represents the chance that an infinitely bored web surfer will be browsing the page after surfing for a long period of time. For technical folks, this is the PageRank calculated from intra-Wikipedia links. See my post on ranking universities using Wikipedia for more details.
Using movie infoboxes from Wikipedia, I’ve identified the release year for all movies. The influence of the highest scoring film for that year gives the shape of the blue bars on the left side of the graph.
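For the curious, here is a minimal sketch of how the bars could be derived once every film has a release year and an influence score. The film names and numbers are made up, and `top_film_per_year` is a hypothetical helper, not a function from WikiRank. It also reports the top film's share of its year's total influence.

```python
from collections import defaultdict

# Hypothetical (title, release_year, influence) rows; the numbers are made up.
films = [
    ("Film A", 1927, 0.31),
    ("Film B", 1927, 0.05),
    ("Film C", 1939, 0.40),
    ("Film D", 1939, 0.35),
]

def top_film_per_year(rows):
    """Return {year: (title, influence, share_of_year)} for each year's top film."""
    by_year = defaultdict(list)
    for title, year, influence in rows:
        by_year[year].append((influence, title))
    result = {}
    for year, entries in by_year.items():
        total = sum(influence for influence, _ in entries)
        best_influence, best_title = max(entries)
        # The share is the top film's fraction of that year's influence mass.
        result[year] = (best_title, best_influence, best_influence / total)
    return result

bars = top_film_per_year(films)
for year in sorted(bars):
    title, influence, share = bars[year]
    print(year, title, influence, round(share, 2))
```

The bar height for a year would be the second element of each tuple, and the label would be the first.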
The visualization is coded in d3.js and the influence code is my custom WikiRank project (coded in Go!). Feel free to fork it.
A couple of interesting tidbits from the graph:
Titanic Titan: Titanic is the most influential movie of the few movies released in my lifetime and almost matches Star Wars IV. Remember that the graph represents influence and not quality.
World War II: The early war years were an amazing time in film. Starting with The Wizard of Oz in 1939, there is a marked increase in influence which persists until just after America joins the war. More classics: Citizen Kane, Casablanca and Fantasia!
The early years: There are very few blockbusters pre-1930, with two notable exceptions: in 1927, Metropolis forms a solitary peak, and 1915 excels with The Birth of a Nation. Drilling deeper, Metropolis actually represents 31% of the film influence mass for 1927.
To crunch the data for yourself, I’ve made a machine-readable list of influence and films available online.
Find something interesting? Visualization ideas? Comment or tweet @cosbynator
Can an algorithm critique movies? With my last blog article on algorithmic university rankings, I promised a series of blog posts answering these little questions. The movie industry has a rich history of analysis, much of it framing content of a movie in the context of others:
Like The Birth of a Nation and Citizen Kane, Star Wars was a technical watershed that influenced many of the movies that came after – Roger Ebert
Restricted to movies, Wikipedia is a structured map of comparative analysis. Does it hold an implicit movie ranking?
The list is hard to argue with. Citizen Kane even appears at the top of the American Film Institute’s 100 YEARS…100 MOVIES. Every film here is a household name. The rankings are a measure of influence and not necessarily quality. In particular, The Passion of the Christ strikes me as out of place. It is ranked so highly because it was so controversial and had ramifications for the influential Mel Gibson.
My ranking is purely algorithmic: I didn’t pick any of the movies or manipulate the rankings. The relative column measures the importance of a movie relative to Citizen Kane. Unlike my university rankings, movies are much closer together in influence. For example, The Wizard of Oz is almost as important as Citizen Kane (not to mention Gone With the Wind!).
The algorithm boils down to a bored surfer on Wikipedia. As someone clicks around randomly in Wikipedia articles, they will land on more influential films more often than on less influential ones. I compute a number, called the PageRank, which quantifies how often a surfer will visit a certain page by clicking randomly. The PageRank is actually part of the algorithm used by Google to show you search results on the internet – only applied to movies instead of websites. The above ranking is the top 25 PageRanked movies.
Try this at home
The code for this project is all available in the WikiRank repository on GitHub. Creating lists like these requires two steps: link analysis and selecting out the movie articles. The first step is a standard implementation of PageRank in pagerank.go. The second is a little more sophisticated – Wikipedia has an alphabetical listing of all movies ever made. By pulling down this list (see scrape_movies.py) and following obvious links (i.e. not lists, etc.) I was able to get a master list of movies. I join the master list with my influence rankings and sort it to produce a global ranking of every movie ever made.
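The join-and-sort step can be sketched in a few lines. This is only an illustration under assumptions: the score dictionary, the title list and the `rank_subset` helper are hypothetical, and the real project may store things differently.

```python
# Hypothetical inputs: article title -> PageRank score, and a scraped list of
# movie titles. All names and numbers are made up for illustration.
pagerank_scores = {
    "Movie X": 3.0e-5,
    "Movie Y": 5.0e-5,
    "Some City": 9.0e-4,
    "Movie Z": 1.0e-5,
}
movie_titles = ["Movie X", "Movie Y", "Movie Z", "Movie Without Article"]

def rank_subset(scores, titles):
    """Keep only titles that have a score and sort them by descending score."""
    joined = [(title, scores[title]) for title in titles if title in scores]
    return sorted(joined, key=lambda pair: pair[1], reverse=True)

ranking = rank_subset(pagerank_scores, movie_titles)
for position, (title, score) in enumerate(ranking, start=1):
    print(position, title, score)
```

Titles with no matching article are silently dropped here; a real pipeline would probably want to log them so that scraping mistakes are visible.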
There are hundreds of different methods to rank universities. Choosing a method is tough, because what does it mean for one university to be better than another? I was considering the question in Data Mining class and decided a good gauge is influence. For example, Harvard is ‘better’ than McGill because it has contributed more to world knowledge. And there is a great proxy for world knowledge: Wikipedia.
Wikipedia is semi-structured and machine-readable. Can Wikipedia be used to measure influence, and will that measure produce a meaningful ranking? Let’s see.
World University Ranking
The top fifteen universities picked by my algorithm are listed above. If you are interested in a particular university, punch it into the text box and click.
The ranking algorithm is actually a variant of Google’s: I have computed the PageRank for each Wikipedia article by following intra-wiki links. The code details are available but here’s a quick introduction.
Imagine an infinitely bored surfer, Thomas, who reads every article for exactly one second and then chooses a new article in two ways: 85% of the time Thomas clicks a random link in the current page and 15% of the time he navigates to Special:Random, which leads to a random page in Wikipedia. If Thomas browses for 1 day, how many seconds is he expected to stay on the page for Harvard University? (the answer is about 14 seconds).
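Thomas's walk can be approximated with textbook power iteration. The sketch below is a generic version run on a made-up four-page graph, not the code in pagerank.go. How dead-end pages are handled is my assumption, since the description above does not say. Multiplying a page's PageRank by the 86,400 seconds in a day gives the expected viewing time, so a stay of about 14 seconds corresponds to a PageRank of roughly 14 / 86,400 ≈ 0.00016.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # 15% of the time the surfer jumps to a uniformly random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # 85% of the time the surfer follows a random outgoing link.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no links spreads its rank uniformly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy graph.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)

seconds_per_day = 24 * 60 * 60
expected_seconds = {page: r * seconds_per_day for page, r in ranks.items()}
print(expected_seconds)
```

On the real Wikipedia graph the dictionaries would be far too slow and memory-hungry; this only shows the shape of the computation.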
Influential pages are linked to by other influential pages. Thus, Thomas will visit important pages more often than unimportant ones. The expected amount of viewing time for a page is called the PageRank. Consider another question: if Thomas is viewing a page for a university what are the chances that he is viewing Harvard? (the answer is around 2%). Formally, the answer is the random walk’s stationary distribution, conditioned on visiting a university page. Informally, it is the influence of Harvard amongst other universities.
The PageRank gives a total ordering on all universities in the world as shown in the ranking widget. The widget also has one extra column: the relative importance. The measure quantifies the influence of a university with respect to Harvard. Formally, the importance is the odds ratio between the PageRank of Harvard and the PageRank of the university.
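Both quantities are simple ratios once the PageRank values are known. Here is a small sketch with made-up numbers and hypothetical helper names. I am assuming the relative column divides each score by the top score, which is my reading of the description rather than something confirmed by the code.

```python
# Hypothetical PageRank values; the names and numbers are made up.
ranks = {
    "Univ A": 4.0e-4,
    "Univ B": 1.0e-4,
    "Univ C": 5.0e-5,
    "Not a university": 2.0e-3,
}
universities = ["Univ A", "Univ B", "Univ C"]

def conditional_share(scores, subset):
    """Probability of being on each page, given the surfer is on a subset page."""
    total = sum(scores[name] for name in subset)
    return {name: scores[name] / total for name in subset}

def relative_to_top(scores, subset):
    """Score of each page divided by the top score in the subset (assumed)."""
    top = max(scores[name] for name in subset)
    return {name: scores[name] / top for name in subset}

share = conditional_share(ranks, universities)
relative = relative_to_top(ranks, universities)
print(share)
print(relative)
```

Note that pages outside the subset never enter either calculation, which is what "conditioned on visiting a university page" means in practice.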
The idea of a ranking of influence inspires questions about the distribution of universities. How many Harvards are there in the world? Do the world’s top ten actually contribute that much more to knowledge?
This visualization shows a log-scaled histogram of the influence of the universities of the world. Hovering / clicking a bar will show you a few sample universities within the tier.
The distribution looks roughly normal to me, indicating that influence is distributed log-normally (although I’m not a statistician!). It indicates that influential universities are much rarer than less influential universities. These results are surprising to me – I had always assumed that there were a few elite institutions and most others were of similar influence. The widget shows that influence is actually widespread, and that there are many different “tiers” of universities (with the majority on the very low end). Who knew?
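A log-scaled histogram just buckets each score by its order of magnitude. Below is a minimal sketch with made-up scores. The `log_histogram` helper and the one-bin-per-decade choice are my assumptions, not necessarily how the widget bins its data.

```python
import math
from collections import Counter

def log_histogram(scores, bins_per_decade=1):
    """Count scores per logarithmic bin.

    With b bins per decade, bin k covers [10**(k / b), 10**((k + 1) / b)).
    """
    counts = Counter()
    for score in scores:
        if score > 0:  # log10 is undefined for zero or negative scores
            counts[math.floor(math.log10(score) * bins_per_decade)] += 1
    return dict(sorted(counts.items()))

# Hypothetical influence scores.
scores = [2e-4, 3e-5, 5e-5, 7e-6, 8e-6, 9e-6, 1.5e-6]
hist = log_histogram(scores)
for exponent, count in hist.items():
    print(f"1e{exponent} to 1e{exponent + 1}: {'#' * count}")
```

If the bin counts trace out a bell shape over the exponents, that is the visual hint of a log-normal distribution.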
The ranking seems reasonable – the top fifteen are household names that more or less agree with rankings like the Academic Ranking of World Universities. There are a couple of interesting quirks worth calling out:
Columbia: Unlike other lists, Columbia sits above big names like Yale and Stanford. I computed a list of the “biggest influencers” – pages that constitute most of the PageRank mass for Columbia. These are Columbia University Press (0.019%), The New York Times (0.019%), Columbia Journalism Review (0.0094%), Columbia Encyclopedia (0.0083%) and New York (0.0072%). The influence scores are relatively small since Columbia is linked by a lot of articles. In fact, I found that Columbia is frequently linked in the citation section of many articles (e.g. China). I hypothesize that Columbia appears more frequently as a link, as opposed to free text, within citation sections on Wikipedia. One could argue this is a bug or the algorithm correctly identifying a pattern.
English-speaking bias: The first university from outside the English-speaking world is Humboldt University of Berlin, at 36th place. Other lists often rank places like ETH Zurich in the top twenty (down at 116th in the ranking). The feature is unsurprising since the data is only the English language Wikipedia. The problem could be “fixed” by blending in other subdomains, but academic literature probably has an English bias too.
The project is still a work in progress. Recently, I’ve been experimenting with different ways of presenting these algorithmic top ten lists. My code is all available on GitHub (it’s messy!). The “magic” of PageRank happens in pagerank.go but there is a disgusting amount of data massaging to get the Wikipedia graph into a rankable state. Some of that is in extractgraph.go. My PageRank code is an interpretation of the pseudo-code in the wonderful Mining of Massive Datasets book (free online!).
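One plausible way to build a “biggest influencers” list like the one for Columbia is to look at how much rank each linking page passes along in a single step of the random walk. The sketch below uses that definition on a made-up graph with made-up PageRank values. The formula and the `top_influencers` helper are my assumptions; the exact definition behind the percentages quoted above may differ.

```python
def top_influencers(links, ranks, target, damping=0.85, k=5):
    """Rank pages linking to `target` by the rank they pass to it in one step."""
    contributions = {}
    for page, targets in links.items():
        if target in targets:
            # A page splits its rank evenly across its outgoing links.
            fraction = targets.count(target) / len(targets)
            contributions[page] = damping * ranks[page] * fraction
    ordered = sorted(contributions.items(), key=lambda pair: pair[1], reverse=True)
    return ordered[:k]

# Hypothetical toy graph and made-up PageRank values.
links = {
    "Press": ["Univ"],
    "Newspaper": ["Univ", "City"],
    "City": ["Newspaper"],
    "Univ": ["City"],
}
ranks = {"Press": 0.05, "Newspaper": 0.30, "City": 0.35, "Univ": 0.30}

influencers = top_influencers(links, ranks, "Univ")
for page, contribution in influencers:
    print(page, round(contribution, 4))
```

Under this definition a page matters when it is itself highly ranked and has few outgoing links, which is why a heavily linked hub can end up contributing little per link.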
Although PageRank gives an influence score for each page, I still have to identify all the universities of the world. This is the process of finding a master list for the education category. Unfortunately, Wikipedia’s list is too unstructured to mine easily. Instead, I scraped the 4icu.org university list. After extracting a list of names, I look them up in Wikipedia’s page titles to find my ranking. The process is made robust because of Wikipedia’s dense set of redirects that map acronyms and alternate spellings to a canonical source page.
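The name lookup can be pictured as following redirects until a real article title turns up. This sketch uses made-up names, and `resolve_title` is a hypothetical helper; it assumes the redirects have already been extracted into a plain dictionary.

```python
def resolve_title(name, titles, redirects, max_hops=5):
    """Follow redirects from `name` to a canonical article title, or return None."""
    current = name
    for _ in range(max_hops):
        if current in titles:
            return current
        if current not in redirects:
            return None
        current = redirects[current]
    # Give up after max_hops so that a redirect loop cannot hang the lookup.
    return current if current in titles else None

# Hypothetical article titles and redirect table.
titles = {"Example State University"}
redirects = {
    "ESU": "Example State University",
    "Example State Univ.": "ESU",
}

for name in ["ESU", "Example State Univ.", "Unknown College"]:
    print(name, "->", resolve_title(name, titles, redirects))
```

Names that resolve to None are the ones that would need manual attention or a fuzzier match.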
The code operates off a giant Wikipedia XML dump. These are all available here (I used a dump from a month ago). The project was also my first foray into the Go programming language but I’ll withhold my judgement for this article.
I’m pleased with my early results but they represent a holistic view of ranking. People choose universities with a field in mind; in computer science Carnegie Mellon University is regarded as world-class while my Wikipedia-scoring has it as 55th in the world. There is a PageRank variant called Topic-Sensitive PageRank (TSPR) that scores based on a subject-biased random walker. I’d love to see if using TSPR changes the rankings substantially.
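To give a feel for the idea, here is a rough sketch of a subject-biased walker: the only change from ordinary PageRank is that the random jump lands on a chosen set of topic pages instead of anywhere. The toy graph, the single-topic setup and the dead-end handling are all assumptions for illustration, and nothing here has been run on Wikipedia.

```python
def topic_sensitive_pagerank(links, topic_pages, damping=0.85, iterations=100):
    """PageRank whose random jumps land only on `topic_pages`.

    `topic_pages` is assumed to be a non-empty set of pages in the graph.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    teleport = {
        p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages
    }
    rank = dict(teleport)
    for _ in range(iterations):
        # The 15% jump goes to a topic page instead of a uniformly random page.
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: dead ends send their rank back to the topic pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank

# Hypothetical toy graph with one topic page.
toy_links = {
    "CS": ["Univ X"],
    "Univ X": ["CS", "Univ Y"],
    "Univ Y": ["Univ X"],
    "History": ["Univ Y"],
}
topic_ranks = topic_sensitive_pagerank(toy_links, {"CS"})
print(topic_ranks)
```

In this toy graph the page closest to the topic page ends up ranked highest among the universities, and a page that is neither a topic page nor linked from anywhere gets no rank at all.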
Furthermore, universities are only the tip of the iceberg. My WikiRank system can produce an ordered list of anything. I’ve already started producing a list of movies (sneak preview: Citizen Kane is on top). Expect a few more linkbait top ten blog posts in the future!
I’m a Master’s student studying Artificial Intelligence at Stanford University. Notice something I missed? Post in the comments or tweet @cosbynator.