There are hundreds of different methods to rank universities. Choosing a method is tough, because what does it mean for one university to be better than another? I was considering the question in Data Mining class and decided a good gauge is influence. For example, Harvard is ‘better’ than McGill because it has contributed more to world knowledge. And there is a great proxy for world knowledge: Wikipedia.
Wikipedia is semi-structured and machine-readable. Can Wikipedia be used to measure influence and will that measure produce meaningful ranking? Let’s see.
World University Ranking
The top fifteen universities picked by my algorithm are listed above. If you are interested in a particular university, punch it into the text box and click.
The ranking algorithm is actually a variant of Google’s: I have computed the PageRank for each Wikipedia article by following intra-wiki links. The code details are available but here’s a quick introduction.
Imagine an infinitely bored surfer, Thomas, who reads every article for exactly one second and then chooses a new article in one of two ways: 85% of the time Thomas clicks a random link on the current page, and 15% of the time he navigates to Special:Random, which leads to a random page in Wikipedia. If Thomas browses for one day, how many seconds is he expected to spend on the page for Harvard University? (The answer is about 14 seconds.)
Influential pages are linked to by other influential pages. Thus, Thomas will visit important pages more often than unimportant ones. The expected share of viewing time that a page receives is called its PageRank. Consider another question: if Thomas is viewing a university's page, what are the chances that he is viewing Harvard? (The answer is around 2%.) Formally, the answer is the random walk’s stationary distribution, conditioned on visiting a university page. Informally, it is the influence of Harvard amongst other universities.
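To make the surfer model concrete, here is a minimal sketch of PageRank by power iteration on a made-up four-page graph. It is not the code from pagerank.go: the function name, the toy link data and the handling of dead-end pages (the surfer jumps to a random page) are all assumptions for illustration. Multiplying a page's score by the 86,400 seconds in a day gives Thomas's expected viewing time, so the 14 seconds quoted for Harvard corresponds to a PageRank of roughly 0.00016.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for the random-surfer model.

    links maps each page title to the list of titles it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # 15% of the time the surfer jumps to a uniformly random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # 85% of the time the surfer follows a random outgoing link.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: on a dead-end page the surfer jumps anywhere.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph; these are placeholders, not real Wikipedia links.
toy_links = {
    "Harvard University": ["Boston"],
    "Boston": ["Harvard University", "Massachusetts"],
    "Massachusetts": ["Boston", "Harvard University"],
    "Obscure stub": ["Boston"],
}
ranks = pagerank(toy_links)

SECONDS_PER_DAY = 24 * 60 * 60
expected_seconds = {page: r * SECONDS_PER_DAY for page, r in ranks.items()}
```

The scores always sum to one, so they can be read directly as the fraction of time the surfer spends on each page.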
The PageRank gives a total ordering on all universities in the world, as shown in the ranking widget. The widget also has one extra column: the relative importance. The measure quantifies the influence of a university with respect to Harvard. Formally, the importance is the ratio between the PageRank of Harvard and the PageRank of the university.
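Both derived quantities are simple arithmetic on the raw scores. The sketch below uses made-up PageRank values and hypothetical university names. The direction of the ratio (a university's score divided by Harvard's, so that Harvard scores 1.0) is an assumption, since the text only says the measure is taken with respect to Harvard.

```python
def university_share(ranks, universities):
    """Chance of being on each university, given the surfer is on some university."""
    total = sum(ranks[u] for u in universities)
    return {u: ranks[u] / total for u in universities}


def relative_importance(ranks, universities, reference):
    """Score of each university as a fraction of the reference university's score."""
    return {u: ranks[u] / ranks[reference] for u in universities}


# Made-up scores for illustration only.
example_ranks = {
    "Harvard University": 0.00016,
    "University A": 0.00008,
    "University B": 0.00004,
    "Boston": 0.00050,  # not a university, so it is excluded below
}
example_universities = ["Harvard University", "University A", "University B"]

share = university_share(example_ranks, example_universities)
importance = relative_importance(
    example_ranks, example_universities, "Harvard University"
)
```

Conditioning only renormalizes over the university pages, so it never changes the ordering of the universities.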
The idea of a ranking of influence raises questions about the distribution of universities. How many Harvards are there in the world? Do the world’s top ten actually contribute that much more to knowledge?
This visualization shows a log-scaled histogram of the influence of the universities of the world. Hovering / clicking a bar will show you a few sample universities within the tier.
The distribution looks roughly normal to me, indicating that influence is distributed log-normally (although I’m not a statistician!). It indicates that influential universities are much rarer than less influential universities. These results are surprising to me – I had always assumed that there were a few elite institutions and most others were of similar influence. The widget shows that influence is actually widespread, and that there are many different “tiers” of universities (with the majority on the very low end). Who knew?
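A log-scaled histogram like this can be approximated by grouping scores into tiers by order of magnitude. The sketch below is one simple way to do that, assuming one bin per power of ten. The function name and the scores are invented for illustration, and the actual widget may bin differently.

```python
import math
from collections import Counter


def log_tiers(scores):
    """Count how many scores fall into each power-of-ten tier."""
    tiers = Counter()
    for score in scores.values():
        # Tier -2 holds scores from 0.01 up to (but not including) 0.1, and so on.
        tiers[math.floor(math.log10(score))] += 1
    return dict(sorted(tiers.items()))


# Made-up relative-importance scores; not taken from the real ranking.
sample_scores = {
    "University A": 1.0,
    "University B": 0.5,
    "University C": 0.2,
    "University D": 0.05,
    "University E": 0.03,
    "University F": 0.004,
}
tier_counts = log_tiers(sample_scores)
```

If the counts per tier trace out a bell shape, the scores themselves are roughly log-normal, which is the pattern described above.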
The ranking seems reasonable – the top fifteen are household names that more or less agree with rankings like the Academic Ranking of World Universities. There are a couple of interesting quirks worth calling out:
- Columbia: Unlike other lists, Columbia sits above big names like Yale and Stanford. I computed a list of the “biggest influencers” – pages that constitute most of the PageRank mass for Columbia. These are Columbia University Press (0.019%), The New York Times (0.019%), Columbia Journalism Review (0.0094%), Columbia Encyclopedia (0.0083%) and New York (0.0072%). The influence scores are relatively small since Columbia is linked by a lot of articles. In fact, I found that Columbia is frequently linked in the citation section of many articles (e.g. China). I hypothesize that Columbia appears more frequently as a link, as opposed to free text, within citation sections on Wikipedia. One could argue that this is either a bug or the algorithm correctly identifying a pattern.
- English-speaking bias: The first university from outside the English-speaking world is Humboldt University of Berlin, at 36th place. Other lists often rank places like ETH Zurich in the top twenty, whereas here it is down at 116th. The feature is unsurprising since the data is only the English-language Wikipedia. The problem could be “fixed” by blending in other subdomains, but academic literature probably has an English bias too.
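For the “biggest influencers” mentioned in the Columbia item, one plausible way to attribute a page's score to its in-links is to measure how much PageRank each linking page passes along in a single step. This is a sketch under that assumption and not necessarily how the figures above were computed. The page names and scores are placeholders.

```python
def link_contributions(links, ranks, target, damping=0.85):
    """Rank the pages linking to target by the PageRank they pass to it."""
    contributions = {}
    for source, targets in links.items():
        hits = targets.count(target)
        if hits:
            # A page splits its damped score evenly across its outgoing links.
            contributions[source] = damping * ranks[source] * hits / len(targets)
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)


# Placeholder graph and scores, invented for illustration.
example_links = {
    "Press": ["Target University"],
    "Newspaper": ["Target University", "City"],
    "City": ["Newspaper"],
}
example_scores = {
    "Press": 0.1,
    "Newspaper": 0.4,
    "City": 0.3,
    "Target University": 0.2,
}
top_influencers = link_contributions(
    example_links, example_scores, "Target University"
)
```

A page that links out to many articles passes only a sliver to each, which would be consistent with every individual influencer looking small when a target is linked from a lot of places.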
The project is still a work in progress. Recently, I’ve been experimenting with different ways of presenting these algorithmic top ten lists. My code is all available on GitHub (it’s messy!). The “magic” of PageRank happens in pagerank.go but there is a disgusting amount of data massaging to get the Wikipedia graph into a rankable state. Some of that is in extractgraph.go. My PageRank code is an interpretation of the pseudo-code in the wonderful Mining of Massive Datasets book (free online!).
Although PageRank gives an influence score for each page, I still have to identify all the universities of the world. This is the process of finding a master list for the education category. Unfortunately, Wikipedia’s list is too unstructured to mine easily. Instead, I scraped the 4icu.org university list. After extracting a list of names, I look them up in Wikipedia’s page titles to find each university's ranking. The process is made robust by Wikipedia’s dense set of redirects, which map acronyms and alternate spellings to a canonical source page.
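The lookup step can be sketched as follows, assuming the page titles and redirects have already been extracted from the dump into a set and a dictionary. The function name, the hop limit and the sample entries are illustrative and are not taken from the actual code or from a real dump.

```python
def resolve_title(name, titles, redirects, max_hops=5):
    """Follow redirects from a scraped name to a canonical article title.

    Returns None when the name cannot be matched to an article.
    """
    current = name
    for _ in range(max_hops):
        if current in titles:
            return current
        if current not in redirects:
            return None
        current = redirects[current]
    # The hop limit guards against redirect loops.
    return current if current in titles else None


# Illustrative data only.
article_titles = {"Massachusetts Institute of Technology"}
redirect_map = {
    "MIT": "Massachusetts Institute of Technology",
    "M.I.T.": "MIT",
}

resolved = resolve_title("M.I.T.", article_titles, redirect_map)
missing = resolve_title("Unknown College", article_titles, redirect_map)
```

Names that come back as None would need manual matching or could simply be dropped from the ranking.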
My scraping code is written in a few lines of Python and available in scrape_universities.py.
The code operates off a giant Wikipedia XML dump. The dumps are all available here (I used one from a month ago). The project was also my first foray into the Go programming language but I’ll withhold my judgement for this article.
I’m pleased with my early results but they represent a holistic view of ranking. People choose universities with a field in mind; in computer science Carnegie Mellon University is regarded as world-class while my Wikipedia scoring has it as 55th in the world. There is a PageRank variant called Topic-Sensitive PageRank (TSPR) that scores based on a subject-biased random walker. I’d love to see if using TSPR changes the rankings substantially.
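As a rough illustration of the idea, the sketch below changes only the random jump: instead of landing on any page, the surfer teleports to a page from a chosen topic set. The toy graph, the page names and the choice to send dead-end pages back into the topic set are all assumptions. Selecting a real topic set (say, computer science articles) is the hard part and is not shown.

```python
def topic_sensitive_pagerank(links, topic_pages, damping=0.85, iterations=100):
    """PageRank where random jumps land only on pages in topic_pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    topic = [p for p in pages if p in topic_pages]
    if not topic:
        raise ValueError("topic_pages must contain at least one page in the graph")
    teleport = {p: (1.0 / len(topic) if p in topic_pages else 0.0) for p in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        # 15% of the time the surfer jumps to a random page within the topic.
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: dead-end pages send the surfer back to the topic set.
                for p in topic:
                    new_rank[p] += damping * rank[page] / len(topic)
        rank = new_rank
    return rank


# Hypothetical graph: each university is linked from one subject page.
subject_links = {
    "Algorithms": ["CS University"],
    "Painting": ["Arts University"],
    "CS University": ["Algorithms", "Painting"],
    "Arts University": ["Painting", "Algorithms"],
}
cs_ranks = topic_sensitive_pagerank(subject_links, {"Algorithms"})
arts_ranks = topic_sensitive_pagerank(subject_links, {"Painting"})
```

In this toy graph the two universities are perfectly symmetric, so plain PageRank would tie them. Biasing the jump toward one subject page is what breaks the tie.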
Furthermore, universities are only the tip of the iceberg. My WikiRank system can produce an ordered list of anything. I’ve already started producing a list of movies (sneak preview: Citizen Kane is on top). Expect a few more linkbait top ten blog posts in the future!
I’m a Master’s student studying Artificial Intelligence at Stanford University. Notice something I missed? Post in the comments or tweet @cosbynator.