Ranking Your University Using PageRank on Wikipedia

Dusk at Stanford University
There are hundreds of different methods to rank universities. Choosing a method is tough, because what does it mean for one university to be better than another? I was considering the question in Data Mining class and decided a good gauge is influence. For example, Harvard is ‘better’ than McGill because it has contributed more to world knowledge. And there is a great proxy for world knowledge: Wikipedia.

Wikipedia is semi-structured and machine-readable. Can Wikipedia be used to measure influence, and will that measure produce a meaningful ranking? Let’s see.

World University Ranking


The top fifteen universities picked by my algorithm are listed above. If you are interested in a particular university, punch it into the text box and click.

The Algorithm

The ranking algorithm is actually a variant of Google’s: I have computed the PageRank for each Wikipedia article by following intra-wiki links. The code details are available but here’s a quick introduction.

Imagine an infinitely bored surfer, Thomas, who reads every article for exactly one second and then chooses a new article in one of two ways: 85% of the time Thomas clicks a random link on the current page, and 15% of the time he navigates to Special:Random, which leads to a random page in Wikipedia. If Thomas browses for one day, how many seconds is he expected to spend on the page for Harvard University? (The answer is about 14 seconds.)

Influential pages are linked to by other influential pages. Thus, Thomas will visit important pages more often than unimportant ones. The expected share of viewing time that a page receives is called its PageRank. Consider another question: if Thomas is viewing a university’s page, what are the chances that he is viewing Harvard’s? (The answer is around 2%.) Formally, the answer is the random walk’s stationary distribution, conditioned on visiting a university page. Informally, it is the influence of Harvard amongst other universities.
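To make Thomas concrete, here is a minimal sketch of the random-surfer computation in Python. It is not the code in pagerank.go: the tiny link graph and the name `toy_links` are made up for illustration, and the handling of pages without outgoing links (jump to a random page) is an assumption.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for the random surfer described above.

    links maps each page to the list of pages it links to. Every link
    target must also appear as a key. A page with no outgoing links is
    treated as if the surfer always jumps to a random page from it.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability 1 - damping (15%) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # With probability damping (85%) he follows a random link.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dead end: spread this page's mass over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy graph; the real graph is all of Wikipedia.
toy_links = {
    "Harvard University": ["Boston", "Yale University"],
    "Yale University": ["Harvard University"],
    "McGill University": ["Harvard University", "Montreal"],
    "Boston": ["Harvard University"],
    "Montreal": ["McGill University"],
}
ranks = pagerank(toy_links)

# Expected seconds on a page during one day of one-second page views.
seconds_per_day = 24 * 60 * 60
harvard_seconds = ranks["Harvard University"] * seconds_per_day

# Chance of viewing Harvard given that the current page is a university.
universities = ["Harvard University", "Yale University", "McGill University"]
university_mass = sum(ranks[u] for u in universities)
harvard_share = ranks["Harvard University"] / university_mass
```

The numbers this toy graph produces mean nothing on their own. The point is the shape of the calculation: one PageRank vector over every page, then a renormalization over just the university pages.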

The PageRank gives a total ordering on all universities in the world, as shown in the ranking widget. The widget also has one extra column: the relative importance. This measure quantifies the influence of a university with respect to Harvard. Formally, the importance is the ratio of the university’s PageRank to Harvard’s PageRank.
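As a rough sketch, and assuming the importance column is simply a university's score divided by Harvard's, the calculation looks like this. The scores and the name "Example University" are invented for illustration, not taken from the real ranking.

```python
def relative_importance(scores, baseline="Harvard University"):
    """Divide every score by the baseline's score."""
    base = scores[baseline]
    return {name: score / base for name, score in scores.items()}


# Hypothetical PageRank scores, for illustration only.
example_scores = {
    "Harvard University": 2.0e-4,
    "Example University": 1.0e-5,
}
importance = relative_importance(example_scores)

# Sorting by importance gives the same order as sorting by PageRank.
ordering = sorted(importance, key=importance.get, reverse=True)
```

Because every score is divided by the same constant, the column changes how the numbers read without changing the order.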

Influence Distribution

The idea of a ranking of influence raises questions about the distribution of universities. How many Harvards are there in the world? Do the world’s top ten actually contribute that much more to knowledge?

This visualization shows a log-scaled histogram of the influence of the universities of the world. Hovering / clicking a bar will show you a few sample universities within the tier.

On the log scale, the distribution looks roughly normal to me, which would mean that influence is distributed log-normally (although I’m not a statistician!). It indicates that influential universities are much rarer than less influential universities. These results are surprising to me – I had always assumed that there were a few elite institutions and most others were of similar influence. The widget shows that influence is actually widespread, and that there are many different “tiers” of universities (with the majority on the very low end). Who knew?
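For anyone who wants to reproduce this kind of picture, here is a minimal sketch of log-scaled bucketing. The base-10 buckets, the `bins_per_decade` parameter and the sample values are all assumptions for illustration; the widget may bin differently.

```python
import math
from collections import Counter


def log_histogram(values, bins_per_decade=1):
    """Count values per logarithmic bucket.

    With bins_per_decade=1, bucket k holds values in [10**k, 10**(k + 1)).
    Non-positive values have no logarithm and are skipped.
    """
    counts = Counter()
    for value in values:
        if value <= 0:
            continue
        bucket = math.floor(math.log10(value) * bins_per_decade)
        counts[bucket] += 1
    return dict(sorted(counts.items()))


# Hypothetical relative-importance values, for illustration only.
sample = [1.0, 0.5, 0.3, 0.05, 0.04, 0.02, 0.004]
tiers = log_histogram(sample)
```

Each bucket spans a factor of ten, so every step down is a tier of universities with roughly a tenth of the influence of the tier above.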

New Information

The ranking seems reasonable – the top fifteen are household names that more or less agree with rankings like the Academic Ranking of World Universities. There are a couple of interesting quirks worth calling out:

  • Columbia: Unlike on other lists, Columbia sits above big names like Yale and Stanford. I computed a list of the “biggest influencers” – pages that constitute most of the PageRank mass for Columbia. These are Columbia University Press (0.019%), The New York Times (0.019%), Columbia Journalism Review (0.0094%), Columbia Encyclopedia (0.0083%) and New York (0.0072%). The influence scores are relatively small since Columbia is linked from a lot of articles. In fact, I found that Columbia is frequently linked in the citation sections of many articles (e.g. China). I hypothesize that Columbia appears more frequently as a link, as opposed to free text, within citation sections on Wikipedia. One could argue either that this is a bug or that the algorithm is correctly identifying a pattern.

  • English-speaking bias: The first university from outside the English-speaking world is Humboldt University of Berlin, at 36th place. Other lists often rank places like ETH Zurich in the top twenty; here it is down at 116th. This is unsurprising since the data comes only from the English-language Wikipedia. The problem could be “fixed” by blending in other language subdomains, but academic literature probably has an English bias too.

The Code

The project is still a work in progress. Recently, I’ve been experimenting with different ways of presenting these algorithmic top ten lists. All of my code is available on GitHub (it’s messy!). The “magic” of PageRank happens in pagerank.go, but there is a disgusting amount of data massaging to get the Wikipedia graph into a rankable state. Some of that is in extractgraph.go. My PageRank code is an interpretation of the pseudo-code in the wonderful Mining of Massive Datasets book (free online!).

Although PageRank gives an influence score for each page, I still have to identify which pages are universities. That means finding a master list of the world’s universities. Unfortunately, Wikipedia’s list is too unstructured to mine easily. Instead, I scraped the 4icu.org university list. After extracting a list of names, I look them up in Wikipedia’s page titles to find each one’s ranking. The process is made robust by Wikipedia’s dense set of redirects, which map acronyms and alternate spellings to a canonical source page.
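The lookup step can be sketched roughly as follows. This is not the project's actual matching code: the function name `resolve_title` and the small `titles` and `redirects` tables are hypothetical stand-ins for what would be extracted from the dump.

```python
def resolve_title(name, titles, redirects):
    """Return the canonical article title for a scraped name, or None."""
    seen = set()
    current = name
    # Follow redirects until reaching a real article, guarding against cycles.
    while current not in titles and current in redirects and current not in seen:
        seen.add(current)
        current = redirects[current]
    return current if current in titles else None


# Hypothetical extract of page titles and redirects, for illustration only.
titles = {"Massachusetts Institute of Technology", "University of Warwick"}
redirects = {"MIT": "Massachusetts Institute of Technology"}

found = resolve_title("MIT", titles, redirects)
missing = resolve_title("The University of Warwick", titles, redirects)
```

In this sketch a name with no matching title or redirect comes back as None, so the university silently drops out of the ranking unless those misses are logged and checked by hand.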

My scraping code is written in a few lines of Python and available in scrape_universities.py.

The code operates off a giant Wikipedia XML dump. These are all available here (I used a dump from a month ago). The project was also my first foray into the Go programming language but I’ll withhold my judgement for this article.

Stay Tuned

I’m pleased with my early results, but they represent a holistic view of ranking. People choose universities with a field in mind: in computer science, Carnegie Mellon University is regarded as world-class, while my Wikipedia scoring has it at 55th in the world. There is a PageRank variant called Topic-Sensitive PageRank (TSPR) that scores based on a subject-biased random walker. I’d love to see if using TSPR changes the rankings substantially.
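For the curious, here is a minimal sketch of the subject-biased walker, assuming the only change from plain PageRank is that random jumps land on a chosen set of topic pages instead of anywhere in Wikipedia. The toy graph and the choice of topic pages are invented for illustration, and nothing here has been run on the real data.

```python
def topic_sensitive_pagerank(links, topic_pages, damping=0.85, iterations=100):
    """PageRank where random jumps land only on the given topic pages.

    links maps each page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = list(links)
    topic = [p for p in pages if p in topic_pages]
    if not topic:
        raise ValueError("topic_pages must contain at least one known page")
    # Jump distribution: uniform over topic pages, zero everywhere else.
    jump = {p: (1.0 / len(topic) if p in topic_pages else 0.0) for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * jump[p] for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dead end: send this page's mass back to the topic pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] * jump[p]
        rank = new_rank
    return rank


# Hypothetical toy graph, for illustration only.
toy_links = {
    "Carnegie Mellon University": ["Computer science"],
    "Computer science": ["Carnegie Mellon University", "Algorithm"],
    "Algorithm": ["Computer science"],
    "Harvard University": ["Law"],
    "Law": ["Harvard University"],
}
cs_ranks = topic_sensitive_pagerank(toy_links, {"Computer science", "Algorithm"})
```

In this toy graph nothing reachable from the computer science pages links to Harvard, so Harvard's topic score is zero while Carnegie Mellon's is not. Real Wikipedia is far more connected, so the effect would be a reweighting rather than a cutoff.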

Furthermore, universities are only the tip of the iceberg. My WikiRank system can produce an ordered list of anything. I’ve already started producing a list of movies (sneak preview: Citizen Kane is on top). Expect a few more linkbait top ten blog posts in the future!

I’m a Master’s student studying Artificial Intelligence at Stanford University. Notice something I missed? Post in the comments or tweet @cosbynator.

  • http://danielcadenas.com/ Daniel Cadenas

    Nice! I just implemented a PageRank library for Go and I wonder how it would perform with this. You are welcome to give it a try: https://github.com/dcadenas/pagerank

    • tdimson

      Cool! Looks like a pretty reasonable implementation. Ironically, my most expensive step is parsing the wikipedia XML and not actually performing power iteration. It is something like a 100:1 ratio in time.

  • Matt Williams

    I’m a bit surprised that the University of Warwick doesn’t show up on the list at all. 4icu gives it as the #11 university in the UK. Perhaps there’s some other big ones that have been missed somehow?

    • tdimson

      Ah, this is kind of a bug / data issue. There is no redirect on wikipedia from “The University of Warwick” (4icu’s name) to “University of Warwick”. Anyway, I crunched the data by hand and it is 153rd at about 5.4% of Harvard. Thanks!

  • http://santini.si.unimi.it/ Massimo Santini