In the original Google paper, Sergey Brin and Lawrence Page stated that PageRank can be thought of as a model of user behavior:
PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page.
This page compares the PageRank algorithm and the random surfer on a few examples to check whether the results are similar. We will not detail the PageRank algorithm, which is already explained on this page.
The random surfer algorithm is implemented in MATLAB. The network is described by an adjacency matrix that represents the links between pages:
$$ A = \left[ \begin{matrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{matrix} \right] $$
The above adjacency matrix describes the following directed graph:
Values on each node represent the PageRank already calculated with the Google PageRank algorithm presented on this page.
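In MATLAB, this adjacency matrix can be entered directly. Given the row indexing used in the loop below (A(Page,:)), \(A(i,j)=1\) means that page \(i\) links to page \(j\):
% Adjacency matrix: A(i,j)=1 means page i links to page j
A = [ 0 1 0 ;
      0 0 1 ;
      1 1 0 ];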
To implement the random surfer algorithm, let's start with the following vector for the number of visits for each page (node of the graph). N is the number of nodes (here N=3):
N = 3;                 % Number of nodes
Visits = zeros(N,1);   % Number of visits for each page
Let's define the starting page the surfer picked randomly:
Page = randi(N);
Let's define the damping factor \(d\): at each page, the random surfer gets bored and requests another random page with probability \(1-d\).
d=0.85;
The algorithm is quite simple; here we loop 1000 times. At each step:
%% Iterate 1000 times
for i=1:1000
    %% Increase visits for the current page
    Visits(Page,1) = Visits(Page,1) + 1;
    %% Randomly pick a new page from the links going out of the current page
    % L lists the pages linked from Page
    L = find(A(Page,:)==1);
    % Idx is the index of the followed link
    Idx = randi(length(L));
    % Next page
    Page = L(Idx);
    %% Damping factor: the surfer requests another random page 15% of the time
    if (rand(1)>d)
        Page = randi(N);
    end
end
At the end, the PageRank is calculated by normalizing the visits:
$$ PageRank(i) = \dfrac{N \times Visits(i)}{\sum_{j=1}^{N}{Visits(j)}} $$
Here is the MATLAB code:
PR = N*Visits/sum(Visits);
Letting the random surfer travel over 1000 pages provides the following results:
$$ Visits = \left[ \begin{matrix} 215 \\ 402 \\ 382 \end{matrix} \right] $$
The associated PageRank is:
$$ PageRank = \left[ \begin{matrix} 0.6456 \\ 1.2072 \\ 1.1471 \end{matrix} \right] $$
First observation: the PageRank calculated with the random surfer is very close to the one given by the PageRank algorithm:
Node | PageRank | Random Surfer |
---|---|---|
1 | 0.6444 | 0.6456 |
2 | 1.1922 | 1.2072 |
3 | 1.1634 | 1.1471 |
Here is the evolution of the PageRanks with the random surfer algorithm:
The algorithm needs about 200 steps to converge, whereas the Google PageRank algorithm needs fewer than 20 steps, as shown in the figure below:
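One way to reproduce such an evolution plot is to record the normalized visit counts at each iteration of the main loop. Here is a minimal sketch (the variable PRhistory is an illustrative name, not part of the original script):
% Record the PageRank estimate at each step (PRhistory is illustrative)
PRhistory = zeros(1000,N);
Visits = zeros(N,1);
Page = randi(N);
for i=1:1000
    Visits(Page,1) = Visits(Page,1) + 1;
    L = find(A(Page,:)==1);
    Page = L(randi(length(L)));
    if (rand(1)>d)
        Page = randi(N);
    end
    % Current estimate of the PageRank after i steps
    PRhistory(i,:) = (N*Visits/sum(Visits))';
end
plot(PRhistory);   % One curve per node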
Of course, the random surfer algorithm does not work if a page has no outgoing link. I tried to implement a new version where the surfer picks a random page if the current page has no link, but the results are not consistent with the PageRank algorithm.
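For reference, here is a sketch of how the link-following step can be adapted for such dangling pages (my reconstruction of that variant, not necessarily the exact code used):
% Variant (sketch): jump to a random page when the current page
% has no outgoing link
L = find(A(Page,:)==1);
if isempty(L)
    Page = randi(N);            % Dangling page: pick any page at random
else
    Page = L(randi(length(L)));
end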
Another interesting case is when a page has no incoming link:
As before, the values on each node are the PageRanks calculated with the Google PageRank algorithm presented on this page.
Node | PageRank | Random Surfer |
---|---|---|
1 | 0.1500 | 0.1411 |
2 | 1.4595 | 1.4595 |
3 | 1.3905 | 1.3994 |
The values given by the random surfer algorithm are consistent with the PageRank algorithm.
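The original graph figure is not reproduced here; as an illustration, here is one adjacency matrix consistent with the table above (an assumption on my part). Page 1 has no incoming link, so its PageRank is the minimum value \(1-d = 0.15\):
% One graph consistent with the values above (assumption):
% 1 -> 2, 2 -> 3, 3 -> 2; page 1 has no incoming link
A = [ 0 1 0 ;
      0 0 1 ;
      0 1 0 ];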
The last case is a disconnected graph:
Since the surfer jumps to a new page 15% of the time, the whole graph is still visited:
Node | PageRank | Random Surfer |
---|---|---|
1 | 1 | 0.9690 |
2 | 1 | 0.9449 |
3 | 1 | 1.0370 |
4 | 1 | 1.0490 |
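The figure for this case is not reproduced here either; as an assumption consistent with the table (every node has a PageRank of exactly 1), one such disconnected graph is two separate 2-cycles:
% A disconnected graph consistent with the table (assumption):
% two separate cycles 1 <-> 2 and 3 <-> 4
A = [ 0 1 0 0 ;
      1 0 0 0 ;
      0 0 0 1 ;
      0 0 1 0 ];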
Note that the random surfer is less accurate than the Google PageRank algorithm.
The random surfer algorithm has been implemented in MATLAB. All results presented on this page have been calculated with the following script:
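The version below is assembled from the snippets above and may differ in details from the original script.
% Random surfer simulation of PageRank (first example's graph)
N = 3;                          % Number of nodes
A = [ 0 1 0 ;                   % Adjacency matrix
      0 0 1 ;
      1 1 0 ];
d = 0.85;                       % Damping factor
Visits = zeros(N,1);            % Number of visits for each page
Page = randi(N);                % Random starting page
for i=1:1000
    % Increase visits for the current page
    Visits(Page,1) = Visits(Page,1) + 1;
    % Randomly pick a new page from the links going out of the current page
    L = find(A(Page,:)==1);
    Page = L(randi(length(L)));
    % The surfer requests another random page 15% of the time
    if (rand(1)>d)
        Page = randi(N);
    end
end
% Normalize the visits to get the PageRank
PR = N*Visits/sum(Visits)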