Can an algorithm critique movies? In my last blog article, on algorithmic university rankings, I promised a series of posts answering little questions like this one. The movie industry has a rich history of analysis, much of it framing the content of a movie in the context of others:
Like The Birth of a Nation and Citizen Kane, Star Wars was a technical watershed that influenced many of the movies that came after – Roger Ebert
Restricted to movies, Wikipedia is a structured map of comparative analysis. Does it hold an implicit movie ranking?
Film According to Wikipedia
|Rank||Movie||Relative|
|1||Citizen Kane||100.0%|
|2||The Wizard of Oz||96.6%|
|3||Gone with the Wind||84.3%|
|5||Star Wars Episode IV: A New Hope||76.9%|
|8||Star Wars Episode V: The Empire Strikes Back||52.0%|
|13||The Passion of the Christ||46.5%|
|14||Snow White and the Seven Dwarfs||46.3%|
|15||Star Wars Episode VI: Return of the Jedi||46.0%|
|16||The Birth of a Nation||45.1%|
|17||2001: A Space Odyssey||43.7%|
|18||The Lion King||43.4%|
|19||Raiders of the Lost Ark||42.9%|
|20||The Lord of the Rings: The Return of the King||42.5%|
|21||The Dark Knight||42.2%|
|23||The Silence of the Lambs||39.3%|
The list is hard to argue with. Citizen Kane even appears at the top of the American Film Institute’s 100 YEARS…100 MOVIES. Every film here is a household name. The rankings measure influence, not necessarily quality. In particular, The Passion of the Christ strikes me as out of place. It ranks so highly because it was so controversial and had ramifications for the influential Mel Gibson.
My ranking is purely algorithmic: I neither picked the movies nor manipulated the rankings. The relative column measures the importance of a movie relative to Citizen Kane. Unlike the universities in my earlier rankings, movies are much closer together in influence. For example, The Wizard of Oz is almost as important as Citizen Kane (not to mention Gone with the Wind!).
The algorithm boils down to a bored surfer on Wikipedia. As someone clicks around randomly in Wikipedia articles, they will land on more influential films more often than on less influential ones. I compute a number, called the PageRank, which quantifies how often a surfer will visit a certain page by clicking randomly. PageRank is actually part of the algorithm Google uses to show you search results on the internet, only here it is applied to movies instead of websites. The ranking above shows the top 25 movies by PageRank.
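To make the bored surfer concrete, here is a minimal sketch of PageRank by power iteration in Python. It is not the code from my repository. The function name, the damping factor of 0.85, the fixed iteration count, and the toy graph with placeholder page names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate how often a random surfer visits each page.

    links maps a page title to the list of pages it links to.
    damping and iterations are illustrative defaults, not tuned values.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the surfer equally likely to be on any page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Pages with no outgoing links hand their rank to everyone equally.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # With probability (1 - damping) the surfer jumps to a random page.
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}

        # Otherwise the surfer follows one of the current page's links.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank

    return rank


# Hypothetical toy graph: three pages link to "Film C", so it should win.
toy_links = {
    "Film A": ["Film B", "Film C"],
    "Film B": ["Film C"],
    "Film C": ["Film A"],
    "Film D": ["Film C"],
}
toy_rank = pagerank(toy_links)
```

The scores sum to one, so each can be read as the fraction of time the surfer spends on that page. In the toy graph, "Film D" has no inbound links and ends up with the lowest score.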
Try this at home
The code for this project is all available in the WikiRank repository on GitHub. Creating lists like these requires two steps: link analysis and selecting out the movie articles. The first step is a standard implementation of PageRank in pagerank.go. The second is a little more sophisticated: Wikipedia has an alphabetical listing of all movies ever made. By pulling down this list (see scrape_movies.py) and following the obvious links (i.e. not links to other lists, etc.), I was able to get a master list of movies. I join the master list with my influence rankings and sort it to produce a global ranking of every movie ever made.
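The join-and-sort step can be sketched in a few lines of Python. This is an illustration rather than the repository's code: the function name `rank_movies`, the input shapes, and the sample titles and scores are all hypothetical. It assumes you already have a score per article title and a list of titles known to be movies.

```python
def rank_movies(scores, movie_titles):
    """Join article scores with a master list of movies and sort.

    scores maps an article title to its influence score.
    movie_titles is the master list of titles that are movies.
    Returns (position, title, percent of the top movie's score) tuples.
    """
    # Keep only articles that are on the master list of movies.
    movie_scores = {t: scores[t] for t in movie_titles if t in scores}
    if not movie_scores:
        return []

    # The top movie defines 100% in the relative column.
    top = max(movie_scores.values())
    ordered = sorted(movie_scores.items(), key=lambda item: item[1], reverse=True)

    return [
        (position, title, 100.0 * score / top)
        for position, (title, score) in enumerate(ordered, start=1)
    ]


# Made-up scores: "Some Actor" scores highest but is not a movie, so it drops out.
sample_scores = {"Film A": 4.0, "Film B": 1.0, "Film C": 2.0, "Some Actor": 9.0}
sample_movies = ["Film A", "Film B", "Film C", "Film D"]
sample_table = rank_movies(sample_scores, sample_movies)
```

Dividing by the top score is what produces a relative column like the one in the table above, where the first-place movie sits at 100% by construction.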
Follow up: I’ve broken the most influential movies down by year in a new post about a History of Film in Wikipedia.
Disagree? Bugs? Humbugs? Post in the comments below or fire me a tweet @cosbynator