Monday, July 29, 2013

Lecture 1: Course overview, PageRank, and vote weighing.


Course overview:

The major part of this course will be my project, worth 60%. "Only" 40% goes to the final exam.
To put it another way, if I do well in my project, it's the same as doing well in one and a half finals!

PageRank:

Prof. Aleks says that this was behind Google's initial success.

So the problem is this:
Suppose you have a world wide web with billions of pages. Nearly all pages link to other pages, and any page can be linked to by other pages. Suppose that you want to build a search engine, and you need a measure of the popularity of each page. A good measure of popularity is the number of links to the page: for example, the more pages that link to my blog, the more popular it is.

However, as a cunning individual, I can simply create hundreds of other blog pages, each linking to my own blog. This increases my blog's rank in search engines without me having to make my blog any good.

To counter this, one might factor in the popularity of the pages that link to my blog when calculating its popularity. A link from a more popular page (e.g. Wikipedia) would contribute more to my blog's popularity than a link from a random corner of the internet. This way, manipulators cannot create circle-jerk "islands" of pages: if none of the pages are popular to begin with, the links they use in an attempt to manipulate a page's rank have little effect. This turns out to be very effective.

A difficulty of factoring in the popularity of the pages that link to my blog is that their popularity depends on the popularity of the pages that link to them, which in turn depends on the pages that link to those, and so on. Somewhere in that chain there will be a cycle, and in practice there will be countless cycles, making the counting of links a not-so-easy process.

Matrix algorithm:

First, accept that the formula for a page's popularity,

$$\phi(P) = \sum_{Q \to P} \frac{\phi(Q)}{\#(Q)}$$

(where $P$ is any page in the world wide web, $\#(P)$ is the number of outgoing links that $P$ has, $\phi(P)$ is the popularity rank of $P$, and the sum runs over every page $Q$ that links to $P$)

is only a property and not an actual formula.

Then, use this matrix equation to come up with the ranks:

$$p_j = \sum_i p_i \, g_{ij} \qquad \text{(in matrix form, } \mathbf{p} = \mathbf{p}G\text{)}$$

where $p_i$ is the popularity of page $i$, and $g_{ij}$ is $1/(\text{number of outgoing links page } i \text{ has})$ if $i$ links to $j$, and $0$ otherwise.
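For example, in a hypothetical three-page web (just for illustration) where page 1 links to pages 2 and 3, page 2 links to page 3, and page 3 links back to page 1, the matrix would be

$$G = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$$

(page 1 has two outgoing links, so each of its entries gets $1/2$).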

To calculate $\mathbf{p}$, which is the left eigenvector of the matrix $G$, we just iteratively evaluate the equation, and magically, $\mathbf{p}$ converges. We'll find that the resulting popularities meet the criteria, since applying the matrix $G$ (which recalculates the popularities of all the pages) gives back the same set of popularities.
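As an illustration (this is my own sketch, not the simulation mentioned further down), power iteration on that hypothetical three-page web can be written in a few lines of C#; the class name and the 100-iteration cap are arbitrary choices:

```csharp
using System;

class PowerIterationSketch
{
    static void Main()
    {
        // Hypothetical three-page web from the example above (0-indexed):
        // page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
        // g[i, j] = 1 / (number of outgoing links of i) if i links to j, else 0.
        double[,] g =
        {
            { 0.0, 0.5, 0.5 },
            { 0.0, 0.0, 1.0 },
            { 1.0, 0.0, 0.0 }
        };

        int n = g.GetLength(0);
        double[] p = { 1.0 / n, 1.0 / n, 1.0 / n }; // start with uniform popularity

        // Repeatedly apply p <- p * G; p settles down to the left eigenvector.
        for (int iter = 0; iter < 100; iter++)
        {
            var next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[j] += p[i] * g[i, j];
            p = next;
        }

        // Prints roughly 0.4, 0.2, 0.4 for this particular mini-web.
        Console.WriteLine(string.Join(", ", p));
    }
}
```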

Unfortunately for Google, there are now billions of web pages to rank, so the full matrix is enormous and god knows how long complete convergence would take. Solution: a rather heuristic and simple method, the random surfer, can be used to produce similar results.

Random surfer:

The random surfer starts on a random page. Then it randomly clicks links, keeping track of how many times it has visited each page. After a random number of clicks, it gets bored and goes to another random page.

The random surfer method is quite robust. Did the surfer meet a dead end? Just skip to a random page. Did our surfer get stuck in one of those circle-jerk islands? It will randomly jump back to the rest of the internet soon, and only come back sparingly, since the island is unpopular and few pages link into it.
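A sketch of that idea in C# (again my own, on the same hypothetical three-page web; the 15% boredom chance and the one-million-step count are arbitrary choices):

```csharp
using System;

class RandomSurferSketch
{
    static void Main()
    {
        // Same hypothetical three-page web: outgoing links per page.
        int[][] links =
        {
            new[] { 1, 2 }, // page 0 links to pages 1 and 2
            new[] { 2 },    // page 1 links to page 2
            new[] { 0 }     // page 2 links to page 0
        };

        const int steps = 1000000;
        var rng = new Random();
        var visits = new long[links.Length];
        int page = rng.Next(links.Length); // start on a random page

        for (int step = 0; step < steps; step++)
        {
            visits[page]++;
            // Dead end, or bored with 15% chance: jump to a random page.
            if (links[page].Length == 0 || rng.NextDouble() < 0.15)
                page = rng.Next(links.Length);
            else
                page = links[page][rng.Next(links[page].Length)];
        }

        // Visit frequencies approximate the popularity ranks. They won't match
        // the matrix result exactly, because the random jumps nudge the numbers.
        for (int i = 0; i < visits.Length; i++)
            Console.WriteLine($"page {i}: {(double)visits[i] / steps:F3}");
    }
}
```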

Here is a C# simulation of page ranking using the matrix method and the random surfer method.
It creates a random set of pages that randomly link to other pages, with the last page intentionally made super popular. Then it compares the matrix method and the random surfer method, showing that both converge to give each page a similar popularity rank.

Vote weighing

IMDB uses "weighted average ratings" for its user voting system for films. They claim "Various filters are applied to the raw data" in order to get "a more accurate vote average".
Their exclusive algorithm, which "will not be disclosed", will actually be disclosed here if IMDB happens to use the same vote weighing algorithm I'm about to describe.

Say we have a series of judges and several movies that need a rating out of 10. Take a look at these votes:


             Harry Potter   Fast and Furious   Twilight
Judge 1           8                 7              4
Judge 2           7                 7              3
Judge 3           5                 0             10

You can clearly see that judge #3 is rigging and/or has poor taste. If it were the former, it would be wise to disregard him altogether, and here it is fairly obvious to do so. However, the voter could be more subtle:

             Harry Potter   Fast and Furious   Twilight
Judge 1           8                 7              4
Judge 2           7                 7              3
Judge 3           8                 6             10

Notice judge 3 now disguises himself as a legit voter on the first two movies in order to gain trust from any sort of weighing algorithm, and then applies a skewed vote to his favourite movie in order to inflate its rating. Those who choose to vote at all contain a higher proportion of rigging fans than the general population, so it's important to somehow filter the results.

The way explained in the lecture is to use a fixed-point iterative algorithm, which involves giving each judge a trust value and adjusting the values until they're consistent (a sort of trial and error):

Assign every judge the same trust value. (The higher the trust, the more reliable the judge.)
1. Evaluate the rating of each film, which is the trust-weighted average of its votes:

$$x_j = \frac{\sum_i w_i \, v_{j,i}}{\sum_i w_i}$$

(where $x_j$ is the rating of the $j$th film, $w_i$ is the weight of the $i$th judge, and $v_{j,i}$ is the vote of the $i$th judge for the $j$th film)

2. Re-evaluate the trustworthiness of each judge, which is proportional to the inverse of the Euclidean distance between their votes and the current ratings:

$$w_i \propto \frac{1}{\sqrt{\sum_{j=1}^{m} \left(v_{j,i} - x_j\right)^2}}$$

(where $m$ is the number of movies)

3. Repeat from (1) until the judges' trustworthiness values stop changing.

Once the trustworthiness of each judge is established, the final rating of each film can be determined using step 1.
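Here is a minimal C# sketch of this loop (my own, not the lecture's or IMDB's code), run on the second judge table above; the trust normalisation, the tiny epsilon, and the stopping threshold are arbitrary choices:

```csharp
using System;
using System.Linq;

class VoteWeighingSketch
{
    static void Main()
    {
        // votes[j, i] = vote of judge i for film j (the second table above).
        double[,] votes =
        {
            { 8, 7, 8 },  // Harry Potter
            { 7, 7, 6 },  // Fast and Furious
            { 4, 3, 10 }  // Twilight
        };
        int films = votes.GetLength(0), judges = votes.GetLength(1);

        // Start: every judge gets the same trust value.
        double[] w = Enumerable.Repeat(1.0, judges).ToArray();
        var ratings = new double[films];

        for (int iter = 0; iter < 100; iter++)
        {
            // Step 1: rating of each film = trust-weighted average of its votes.
            for (int j = 0; j < films; j++)
            {
                double num = 0, den = 0;
                for (int i = 0; i < judges; i++) { num += w[i] * votes[j, i]; den += w[i]; }
                ratings[j] = num / den;
            }

            // Step 2: trust = inverse Euclidean distance between a judge's votes
            // and the current ratings (the epsilon avoids division by zero).
            var newW = new double[judges];
            for (int i = 0; i < judges; i++)
            {
                double sumSq = 0;
                for (int j = 0; j < films; j++)
                {
                    double diff = votes[j, i] - ratings[j];
                    sumSq += diff * diff;
                }
                newW[i] = 1.0 / (Math.Sqrt(sumSq) + 1e-9);
            }
            // Normalise so trust values stay on a comparable scale each iteration.
            double total = newW.Sum();
            for (int i = 0; i < judges; i++) newW[i] = newW[i] / total * judges;

            // Step 3: stop once the trust values barely change.
            double change = 0;
            for (int i = 0; i < judges; i++) change += Math.Abs(newW[i] - w[i]);
            w = newW;
            if (change < 1e-6) break;
        }

        Console.WriteLine("trust:   " + string.Join(", ", w.Select(x => x.ToString("F2"))));
        Console.WriteLine("ratings: " + string.Join(", ", ratings.Select(x => x.ToString("F2"))));
        // Judge 3's lone skewed vote drags his trust down, which pulls the
        // Twilight rating back towards the two honest judges' votes.
    }
}
```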

I've written a no-frills simulation that demonstrates this weighing technique, and you can actually try to rig some votes. You'll see that if you try to rig, your trustworthiness plummets, making all your votes (including the rigged ones) count for less. The more obvious the rig, the less successful it becomes.

Reflection:

Vote weighing and the matrix part of PageRank feel very similar. They both iterate and eventually converge, and I still haven't found the intuition to explain why or how they converge, probably because matrices look intimidating. 

The connection between the random surfer and the popularity property is clear though, in an "ooooohhhhh right!" kind of way.

Thinking about the project: I could end up doing some kind of voice-changing software, or a voice tone enhancer for online business talks that still lets the speaker sound like the same person.

Also I hope I got everything right.