College Rankings: A Big Data Approach

College and university rankings are a serious but tricky business. The most notable one in the US is published by U.S. News & World Report. It releases its results towards the end of every year, when students start applying to colleges. The rankings change slightly from year to year, enough to make each new edition worth publishing.

In addition to US News, other news media, such as Forbes and the Times in London, also publish university rankings annually. Recently, some rankings produced by Chinese universities have grabbed people's attention. Among them, the Shanghai Ranking is probably the most worth mentioning.

In the US News rankings, schools are divided into liberal arts colleges, which mostly focus on four-year undergraduate education, and universities, which are further subclassified into national and regional ones. Without this split, small liberal arts schools would not be treated fairly under any of the current ranking methodologies, because size always weighs in no matter which methodology is adopted. In the US News rankings, the SAT and ACT scores of incoming students are among the most important factors, together with others such as graduation rates and peer reviews. SAT and ACT scores are intended to measure the quality of incoming students, whereas the graduation rate is presumed to stand for the outcome.

The Shanghai Ranking is more tailored to academic excellence; therefore, the numbers of Nobel Prize laureates, Fields Medal winners and the like are given very heavy weights. In some other rankings, the percentage of graduate students in the total student population is taken into consideration. The simple rationale might be that the more graduate students there are, the more research activity there is.

In my ranking, however, most raw data are taken from Wikipedia. Currently, there are over 4.4 million articles in the English version of Wikipedia alone. By comparison, the Britannica has only about 66 thousand. More and more big data projects now take the entire Wikipedia as raw input, as it has become a rough reflection of total human knowledge. In Wikipedia, articles link to one another because the facts they cover are related. For example, the article about the great British physicist Stephen Hawking mentions that he graduated from the University of Oxford and the University of Cambridge, and that he was once a visiting scholar at the California Institute of Technology. Therefore, three links run from Stephen Hawking's article to the articles about the three schools. In my rankings, I take slightly more than 1500 schools, a proper subset of the schools that have Wikipedia entries, and I count the links to those school articles from the rest of Wikipedia. This total number of incoming links correlates, to some degree, with a school's reputation. Of course, this simple yet comprehensive measure may not give a complete, detailed picture of each school. But it should not be treated lightly. When the data set is big enough, the counting tells a lot of truth.
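The link counting described above can be sketched roughly as follows. This is only an illustration, not the code behind the rankings: the function name `count_inbound_links`, the dictionary-based input format, and the choice to count each linking article once per school and to ignore self-links are all assumptions.

```python
from collections import Counter

def count_inbound_links(outlinks, schools):
    """Count, for each school article, how many other articles link to it.

    outlinks: dict mapping an article title to the titles it links to
              (hypothetical input format, e.g. parsed from a Wikipedia dump).
    schools:  set of school article titles to be ranked.
    """
    counts = Counter({school: 0 for school in schools})
    for source, targets in outlinks.items():
        # Assumption: a source article counts once per school, however
        # many times it repeats the link, and self-links are ignored.
        for target in set(targets):
            if target in schools and target != source:
                counts[target] += 1
    return counts

# Toy data modeled on the Stephen Hawking example in the text.
outlinks = {
    "Stephen Hawking": [
        "University of Oxford",
        "University of Cambridge",
        "California Institute of Technology",
        "Physics",
    ],
    "Physics": ["University of Cambridge"],
    "University of Oxford": ["University of Oxford", "University of Cambridge"],
}
schools = {
    "University of Oxford",
    "University of Cambridge",
    "California Institute of Technology",
}
link_counts = count_inbound_links(outlinks, schools)
```

On this toy input, Cambridge collects three inbound links, while Oxford and Caltech collect one each.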

In addition to the inbound links, I sampled 230,000 people who have entries in Wikipedia and counted their alma maters. People who have more influence contribute more to their alma maters' rankings. For example, in the 2013-10-01 dump of Wikipedia, Bill Clinton has 8465 inbound links, whereas Hillary Clinton has 3428 and Eisenhower 4557, so Bill contributed significantly more to the rankings of his alma maters, Oxford and Yale, than Hillary and Eisenhower did to theirs, Wellesley and West Point, respectively. In Wikipedia, the reference to an alma mater does not follow any standard format. In many cases, an article simply states that somebody graduated from Harvard Law School instead of Harvard University. I have to mine the Wikipedia category hierarchy to find out that Harvard University, to which the alma mater points should be credited, is actually the parent organization of Harvard Law School. After the alma mater parameter was introduced, some small schools moved up. For example, Amherst College was ranked 125 in the 2014-01 version of my rankings (http://www.nicksrankings.com/index2014-01.html) and 108 in the 2014-02 version (http://www.nicksrankings.com/index2014-02.html). But still, schools such as the California Institute of Technology did not achieve the rankings they deserved.
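A minimal sketch of how the alma mater credit might be computed is given below. It is an illustration under stated assumptions, not the actual implementation: the names `resolve_school` and `alma_mater_scores` are hypothetical, the parent lookup is a plain dictionary standing in for whatever is mined from the category hierarchy, and each person is assumed to contribute their raw inbound link count to each resolved school (the text does not give the exact weighting).

```python
def resolve_school(name, parent_of, tracked):
    """Walk up the (assumed) parent mapping until a tracked school is found."""
    seen = set()
    while name not in tracked and name in parent_of and name not in seen:
        seen.add(name)  # guard against cycles in the mapping
        name = parent_of[name]
    return name if name in tracked else None

def alma_mater_scores(people, parent_of, tracked):
    """Credit each tracked school with the inbound links of its alumni."""
    scores = {school: 0 for school in tracked}
    for person in people:
        credited = set()
        for raw_name in person["alma_maters"]:
            school = resolve_school(raw_name, parent_of, tracked)
            if school is not None:
                credited.add(school)  # credit a school once per person
        for school in credited:
            # Assumption: the weight is the person's raw inbound link count.
            scores[school] += person["inbound_links"]
    return scores

tracked = {
    "University of Oxford", "Yale University", "Wellesley College",
    "United States Military Academy", "Harvard University",
}
parent_of = {"Harvard Law School": "Harvard University"}
people = [
    # Link counts and alma maters as given in the text above.
    {"name": "Bill Clinton", "inbound_links": 8465,
     "alma_maters": ["University of Oxford", "Yale University"]},
    {"name": "Hillary Clinton", "inbound_links": 3428,
     "alma_maters": ["Wellesley College"]},
    {"name": "Dwight D. Eisenhower", "inbound_links": 4557,
     "alma_maters": ["United States Military Academy"]},
    # Hypothetical person, included only to exercise the parent lookup.
    {"name": "Hypothetical Alumnus", "inbound_links": 10,
     "alma_maters": ["Harvard Law School"]},
]
scores = alma_mater_scores(people, parent_of, tracked)
```

Here the hypothetical alumnus of Harvard Law School ends up credited to Harvard University, which is the behavior the category mining is meant to achieve.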

In the latest version of my rankings (2014-03, http://nicksrankings.com/), two additional parameters are introduced to measure the educational resources per student. One is the endowment per student. The other is the professor/student ratio. These have further raised the rankings of small colleges. For example, the usual top three liberal arts colleges, Amherst, Swarthmore and Williams, are ranked 39, 49 and 50, respectively. Caltech moves up to 30. Two schools are particularly worth mentioning: Rockefeller University and King Abdullah University of Science and Technology. Rockefeller has a USD 1.65 billion endowment with only slightly more than 200 PhD students. King Abdullah University of Science and Technology was founded as recently as 2009 with the goal of being an Arab MIT. It has a USD 10 billion endowment and is obviously the fastest-developing university from any perspective.
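To get a feel for the endowment-per-student parameter, here is a small worked example using the Rockefeller figures quoted above. The function name `endowment_per_student` is hypothetical, and the student count is rounded down to 200 as an assumption, so the result is a slight overestimate. How this value is combined with the other parameters is not shown here.

```python
def endowment_per_student(endowment_usd, students):
    """Return the endowment divided by the number of students."""
    if students <= 0:
        raise ValueError("students must be positive")
    return endowment_usd / students

# Rockefeller University: USD 1.65 billion endowment; the student count
# is approximated as 200 ("slightly more than 200" in the text).
rockefeller = endowment_per_student(1.65e9, 200)
print(f"USD {rockefeller:,.0f} per student")  # USD 8,250,000 per student
```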

The global overall rankings include all schools in my database. But I also make Liberal Arts a separate category. It may come as a surprise to some that the United States Military Academy (West Point) and the United States Naval Academy rank in the top two, surpassing the traditional top three: Amherst, Williams and Swarthmore. I guess those alumni generals contributed significantly. Those who made history deserve more than those who wrote it.

In the subject rankings, I take advantage of the category system of Wikipedia. (I will write separately about a tricky problem with the Wikipedia category hierarchy.) I take most of the articles under a category and again count their links to a school to get the score of that school in that particular subject. For example, I count the inbound links to Harvard University from (almost all) articles under Category: Mathematics to get Harvard's reputation in Mathematics.
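The subject counting can be sketched along the same lines. The code below is an assumption-laden illustration, not the method actually used: the names `articles_under` and `subject_counts`, the dictionary inputs, the depth limit `max_depth`, and the cycle guard are all hypothetical choices, and the article titles in the toy data are made up.

```python
from collections import Counter, deque

def articles_under(category, subcats, members, max_depth=3):
    """Collect articles in a category and its subcategories (breadth-first).

    subcats: dict mapping a category to its direct subcategories.
    members: dict mapping a category to the articles directly in it.
    max_depth is an assumed cutoff to keep the traversal on topic.
    """
    seen = {category}
    queue = deque([(category, 0)])
    articles = set()
    while queue:
        current, depth = queue.popleft()
        articles.update(members.get(current, ()))
        if depth < max_depth:
            for child in subcats.get(current, ()):
                if child not in seen:  # category graphs can contain cycles
                    seen.add(child)
                    queue.append((child, depth + 1))
    return articles

def subject_counts(category, subcats, members, outlinks, schools):
    """Count links to each school from articles under the given category."""
    counts = Counter()
    for article in articles_under(category, subcats, members):
        for target in set(outlinks.get(article, ())):
            if target in schools and target != article:
                counts[target] += 1
    return counts

# Toy data with hypothetical article titles and a deliberate category cycle.
subcats = {"Mathematics": ["Algebra"], "Algebra": ["Mathematics"]}
members = {"Mathematics": ["Article A"], "Algebra": ["Article B"]}
outlinks = {
    "Article A": ["Harvard University"],
    "Article B": ["Harvard University", "Yale University"],
    "Article C": ["Harvard University"],  # not under Mathematics
}
schools = {"Harvard University", "Yale University"}
math_counts = subject_counts("Mathematics", subcats, members, outlinks, schools)
```

On this toy input Harvard gets two links and Yale one; the link from "Article C" is ignored because that article is not under the category.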

All scores are computed in this way: the school ranked number one is given 100, and every other score is calculated as log(raw count of the school)/log(raw count of number one)*100. All school names are taken from the titles of the corresponding Wikipedia articles. Because I only processed the English version of Wikipedia, schools from non-English-speaking regions may not get the fairest treatment. I may consider compensating for this in the future by counting other high-quality language versions, particularly German, as its quality and quantity justify it. But I cannot foresee myself counting the Chinese version, as, based on my current estimates, both its size and its quality are too low to serve this purpose.
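The scoring formula translates almost directly into code. The sketch below follows the formula as stated; the function name `normalized_scores` and the handling of edge cases (a count of zero is given a score of 0, and a top count of 1 or less is rejected) are assumptions, since the text does not cover them. The school names and counts are made up.

```python
import math

def normalized_scores(raw_counts):
    """Scale raw counts so the top school gets 100, using a log ratio."""
    top = max(raw_counts.values())
    if top <= 1:
        # log(1) == 0 would make the ratio undefined.
        raise ValueError("the top raw count must be greater than 1")
    log_top = math.log(top)
    scores = {}
    for name, count in raw_counts.items():
        if count <= 0:
            scores[name] = 0.0  # assumption: no links means a score of 0
        else:
            scores[name] = 100.0 * (math.log(count) / log_top)
    return scores

# Hypothetical raw counts, chosen only to make the arithmetic easy to follow.
example = normalized_scores({"School A": 10000, "School B": 100, "School C": 1})
```

With these made-up counts, School A scores 100, School B scores about 50 (100 is the square root of 10000), and School C scores 0. Because the formula is a ratio of logarithms, the base of the logarithm does not affect the result, and large gaps in raw counts are compressed considerably.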

A picture is worth a thousand words. I have incorporated Google Maps since the 2014-02 version of my rankings. Visual discovery has never been easier. It is not surprising to find that most of the top 500 schools are geographically concentrated in the northeastern corner of the US and in Western Europe, followed closely by California.