Timotej Kovacka

Software engineer in London

PageRank

In this article, we will explore the PageRank algorithm. PageRank is one of the most influential algorithms of the 21st century, and yet it is quite likely that you have never heard of it, even though you probably make use of it every day. PageRank was published in 1998 by Page and Brin, who built the website google.com at Stanford to test the capabilities of the algorithm.

The reason I chose to write about PageRank is that when I was 17, my math teacher had just started teaching linear algebra and, with it, the concepts of eigenvalues and eigenvectors. He made a comment about how these concepts sit at the center of Google searches, and it left such a deep impression on me that five years later, here I am, writing my first blog article about the topic. My fascination with this revelation has not subsided a single bit; it has only grown with a deeper understanding of the problem this algorithm solves. Hopefully, by the end of this article, you too will have a similar appreciation of the brilliant yet simple algorithm behind Google searches. I will provide you with two explanations of the algorithm: first a simplified version that can be understood without any background in mathematics, and then a second that goes into more depth about the algorithm and the principles behind it, without requiring you to have read the former.

The popularity of the World Wide Web (WWW) surged at the end of the 20th century: it already hosted over 100 million web pages, with a doubling time of less than one year. How to efficiently search and rank the relative importance of web pages became a serious question. The most popular search engines at the time were AltaVista, Lycos, and Yahoo, which mostly used hyperlink counts to web pages as their main ranking criterion; PageRank brought a somewhat different approach.

The main problem was that the WWW created new, previously unseen challenges for information retrieval: it was very large and heterogeneous. Web pages were highly diverse, ranging from "What should I eat to look like Arnold Schwarzenegger" to journals about quantum mechanics. On top of this, search engines on the Web must contend with inexperienced users and with pages intentionally designed to manipulate their ranking functions.

Something similar to searching and ranking web pages on the internet is academic citation analysis: one can imagine the hypertext links in a web page as citations in a research paper. At the time PageRank was created, there was already a good understanding of academic citation analysis, so why not reuse it, since the two concepts are similar? Despite this resemblance, there are significant differences between web pages and academic publications. Academic publications are heavily reviewed before being published, unlike web pages, which can be created easily, artificially inflating "citation" counts. Academic papers are also well-defined units of work, roughly similar in quality, number of citations, and purpose: to extend the body of knowledge. Web pages, on the other hand, vary on a much larger scale in quality, usage, citations, and length. PageRank was created to measure the relative importance of web pages, allowing us to rank them.

The Instagram model

To explain PageRank, I will compare web pages to Instagram accounts, for two reasons. First, I want to use a model with which a large number of people are familiar, and Instagram is a perfect example, as our generation knows the application well. Note that we will consider all Instagram accounts in this model to be public, so we can see whom they follow and who follows them. Second, we want to focus on only a few aspects, all of which are present on Instagram: the followers of an account and the accounts it follows. These represent how web pages point to each other through the links inside them. So how do we know which Instagram accounts are relatively more important?

graph TD
subgraph "Popular Account Example"
LH[Lewis Howes<br>2M Followers]
F1[Follower 1]
F2[Follower 2]
F3[Follower 3]
F4[...]
F5[Follower 2M]
F1 -->|contributes rank| LH
F2 -->|contributes rank| LH
F3 -->|contributes rank| LH
F4 -->|contributes rank| LH
F5 -->|contributes rank| LH
end

We can say that this depends on the followers one has. Take, for example, Lewis Howes and his Instagram account "lewishowes": even with no prior knowledge of his account, we can make claims about its importance. His account has 2 million followers, so naturally we would say he is a relatively important person. From here onwards, we will use the notion of Rank to describe the relation between the followers of an account and its relative importance, just as I did with Lewis.

It is also important to note that not all accounts have equal Rank, just like web pages. If one were to compare Google and this blog, the former would always come out on top. The more popular an account is, the higher its Rank, but this is not all there is to it! We can also exploit information about the accounts that follow us to estimate Rank more accurately. If an account has many followers but follows only a few accounts, those few gain more importance than if the same account followed as many accounts as followed it. We can say that one's Rank is divided equally among the accounts one follows.

graph TD
subgraph "The Rock's Rank Distribution"
TR[The Rock<br>300M Followers]
A1[Account 1]
A2[Account 2]
A3[Account 3]
A4[...]
A5[Account 500]
TR -->|Rank/500| A1
TR -->|Rank/500| A2
TR -->|Rank/500| A3
TR -->|Rank/500| A4
TR -->|Rank/500| A5
style TR fill:#f9f,stroke:#333
end

A good example of this is an account like that of Dwayne "therock" Johnson, who has over 300 million followers and follows only around 500 accounts. We have established that accounts with a large number of followers will have a high Rank. So what can we say about the 500 accounts he follows, without knowing anything about them? And what if he followed 300 million accounts instead: would this change the way we feel about them? If your answer in the first case was that they are unique, important, or worth checking out, and in the second that they are unremarkable, then you would be absolutely right, and this confirms our initial hypothesis.

Unfortunately, PageRank is not quite as simple as this. There is yet another difficulty with our model, but thankfully there is also a solution to it. Let me introduce the last obstacle on our way to understanding this groundbreaking algorithm. We have established that Rank is tied to the followers one has, and because of this we need to consider a special case: one where I follow only one account and that account only follows me. Since Rank is divided equally among the accounts an account follows, in this case the Rank gets stuck, never changing. A similar situation occurs if somebody does not follow anyone, since Rank has nowhere to go from there. This is called a Rank sink.

graph TD
subgraph "Rank Sink Examples"
A1[Account 1]
A2[Account 2]
A3[Account 3]
A1 -->|All rank| A2
A2 -->|All rank| A3
A3 -->|All rank| A1
B1[Dead End<br>Account]
B2[Account X]
B3[Account Y]
B2 -->|rank lost| B1
B3 -->|rank lost| B1
style A1 fill:#faa,stroke:#333
style A2 fill:#faa,stroke:#333
style A3 fill:#faa,stroke:#333
style B1 fill:#faa,stroke:#333
end

To solve this, we have to introduce a way of escaping the sink. We will add something like a boredom factor that allows a user to "get bored" of an account and jump to some other account at random.

Okay, I know, that was a lot of definitions and comparisons, but bear with me a little while longer; we are almost at the end. Since PageRank is an algorithm, and not everyone may be familiar with what that term means, let me give you a very short definition from the Oxford dictionary: "A documented series of steps which leads to the transformation of some data". The beauty of this particular algorithm is how, through such a series of steps, it creates a ranking of web pages in terms of their Rank. The best way to show how the algorithm works is to use surfers (users). Let me show you what I mean by that:

  1. Imagine that initially there is a random number of surfers at each account we know of.
  2. Then, simultaneously, each surfer moves to one of the accounts that the current account follows, with each choice being equally likely. With a certain chance (representing the boredom factor), a surfer instead jumps to some other account that isn't on that list.
  3. Repeat step 2 until the number of surfers at each account no longer changes.
  4. The number of surfers at each account now represents that account's PageRank.
graph TD
subgraph "Random Surfer with Damping Factor"
U1[User's<br>Current Page]
N1[Next Page 1]
N2[Next Page 2]
R[Random Page<br>in Network]
U1 -->|85% Follow Links| N1
U1 -->|85% Follow Links| N2
U1 -.->|15% Random Jump| R
style R fill:#aaf,stroke:#333
style U1 fill:#f9f,stroke:#333
end
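The surfer process above can be sketched in a few lines of Python. This is a minimal simulation, not the real computation: the follow graph and the account names are made up, and for simplicity it tracks one surfer over many steps rather than many surfers at once (the long-run visit frequencies approximate the same Ranks):

```python
import random

# Hypothetical "follow" graph: account -> list of accounts it follows.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["carol"],
    "carol": ["alice"],
    "dave":  ["alice", "carol"],  # note: nobody follows dave
}

def simulate_surfer(follows, steps=100_000, boredom=0.15, seed=0):
    """Follow links at random, occasionally getting bored and jumping anywhere.
    Returns each account's share of total visits, an approximation of its Rank."""
    rng = random.Random(seed)
    accounts = list(follows)
    visits = {a: 0 for a in accounts}
    current = rng.choice(accounts)
    for _ in range(steps):
        visits[current] += 1
        out = follows[current]
        if not out or rng.random() < boredom:
            current = rng.choice(accounts)  # "got bored": jump to a random account
        else:
            current = rng.choice(out)       # follow one of the followed accounts
    total = sum(visits.values())
    return {a: v / total for a, v in visits.items()}

ranks = simulate_surfer(follows)
```

In this toy graph, "carol" is followed by three accounts and ends up with the highest visit share, while "dave", whom nobody follows, is only ever reached through the boredom jump and ranks last.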

Now that we know the Ranks of all the accounts in our network, when someone searches for "Kim" on Instagram, we simply retrieve and suggest the accounts relevant to "Kim" that have accumulated the highest Rank.

This is the very essence of how searching works on Google. There are other important factors used in Google's real system, but none is as important as PageRank. To summarize: the real challenge is to assign a Rank to a web page (account) based on the links (followers) that lead to it from other pages, and this is what PageRank is in its essence, a way of assigning importance across a network of web pages. Once this is done, when somebody searches the network, we find the web pages that correspond to the search query and show the ones with the highest PageRank.

From a mathematical point of view

PageRank is a method for computing a rank for every web page based on the graph of the web. This graph is a directed graph consisting of nodes (web pages) and edges (hypertext links) between them. Each node has two sets of edges: a set of forward links (outedges) and a set of backlinks (inedges). One can never know whether one has found all the backlinks of a particular page, but once we download (view) a page, we know all of its forward links at that time.

Web pages vary greatly in the backlinks they have; a web page like "google.com" has an overwhelming number of backlinks compared to an average page. In general, highly linked pages are relatively more important than pages with few links. Interestingly, simple backlink counting does not capture relative importance and breaks the common-sense notion of importance: a web page with a single link from Google is more important than a web page with multiple links from obscure places. PageRank is an attempt to approximate this relative importance purely from the link structure of a network.

The intuitive description of PageRank is this: a web page has a high rank if the sum of the ranks of its backlinks is high. Let $u$ be a web page. Then let $F_u$ be the set of pages $u$ points to and $B_u$ be the set of pages that point to $u$. Let $N_u = |F_u|$ be the number of links from $u$ and let $c$ be a factor used for normalization. Then a simple version of PageRank can be calculated as:

$$R(u) = c \sum_{v \in B_u}\frac{R(v)}{N_v}$$

The important thing to note here is that a page's rank is divided evenly among its forward links, contributing to the ranks of the pages it points to. Another noteworthy point is that this relationship is recursive, but it may be computed by starting with any set of ranks and iterating until it converges.

And this is where the aforementioned linear algebra comes into play. Another way to represent this is as follows: let $A$ be a square matrix with rows and columns corresponding to web pages, with $A_{u,v} = 1/N_u$ if there is an edge from $u$ to $v$ and $A_{u,v} = 0$ otherwise. If we treat $R$ as a vector over web pages, then we have $R = cAR$, so $R$ is an eigenvector of $A$ (with eigenvalue $1/c$).

More importantly, we want the dominant eigenvector of $A$. It can be computed by repeatedly applying $A$ to any non-degenerate start vector. Even though we have just begun, I already find this relationship beautiful in its simplicity, given what it actually describes and implies. But there is a problem with this simplified version: if two web pages point to each other but to no other page, and a third page points to one of them, then during iteration this loop accumulates rank but never distributes any. The algorithm's creators call this a rank sink, and to solve it they introduce a rank source.
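To make the power-iteration idea concrete, here is a minimal sketch in plain Python on a hypothetical three-page web (pages 0, 1, 2 with links 0→1, 0→2, 1→2, 2→0). It repeatedly pushes each page's rank evenly along its forward links and renormalizes, which converges to the dominant eigenvector:

```python
# Hypothetical toy web: page -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3

def power_iterate(links, n, iters=100):
    """Repeatedly apply the link matrix to a start vector, normalizing each step."""
    r = [1.0 / n] * n                      # any non-degenerate start vector
    for _ in range(iters):
        nxt = [0.0] * n
        for u, outs in links.items():
            share = r[u] / len(outs)       # rank split evenly over forward links
            for v in outs:
                nxt[v] += share
        s = sum(nxt)                       # normalization plays the role of c
        r = [x / s for x in nxt]
    return r

r = power_iterate(links, n)
```

For this graph the iteration settles at $R = (0.4,\ 0.2,\ 0.4)$: pages 0 and 2 back each other up through the cycle, while page 1 only receives half of page 0's rank. (This toy graph has no rank sinks or dangling pages; those are handled below.)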

Let $E(u)$ be some vector over the web pages that corresponds to a source of rank. Then the PageRank of a set of web pages is an assignment, $R'$, to the web pages which satisfies

$$R'(u) = c \sum_{v \in B_u}\frac{R'(v)}{N_v} + cE(u)$$

such that $c$ is maximized and $\|R'\|_1 = 1$.

And now for a little more linear algebra: we can write $R' = c(AR' + E)$. Since $\|R'\|_1 = 1$, we can rewrite the equation as $R' = c(A + E \times \mathbf{1})R'$, where $\mathbf{1}$ is the vector of all ones, and from this we know that $R'$ is an eigenvector of $(A + E \times \mathbf{1})$.

The computation of PageRank is now quite easy, if we ignore the fact that it needs to be carried out for millions of pages. It is computed as follows:

Algorithm PageRank(G(V,E), c, ε, maxIter)
# Initialize PageRank for all vertices
∀v ∈ V: R(v) ← 1/|V|
iter ← 0
diff ← ∞
while (diff > ε ∧ iter < maxIter) do
    # Compute new rank for each vertex
    ∀u ∈ V: Rₙₑₓₜ(u) ← (c × ∑ᵥ∈B(u)(R(v)/Nᵥ)) + (1−c)/|V|
    # Compute L1 norm difference
    diff ← ||Rₙₑₓₜ − R||₁
    # Update current ranks
    R ← Rₙₑₓₜ
    iter ← iter + 1
end while
return R
where:
B(u): set of vertices pointing to u
Nᵥ: number of outbound edges from vertex v
c: damping factor (typically 0.85)
||•||₁: L1 norm
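The pseudocode above translates almost line for line into Python. This is a sketch under the same simplifications (the uniform $(1-c)/|V|$ rank source, dangling pages simply not redistributing their rank), and the toy graph at the bottom is made up for illustration:

```python
def pagerank(graph, c=0.85, eps=1e-8, max_iter=100):
    """PageRank per the pseudocode: `graph` maps each vertex
    to the set of vertices it links to (its forward links)."""
    vertices = list(graph)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        nxt = {v: (1 - c) / n for v in vertices}       # rank source term
        for u, outs in graph.items():
            if outs:                                   # dangling pages are ignored here
                share = c * rank[u] / len(outs)
                for v in outs:                         # rank split over forward links
                    nxt[v] += share
        diff = sum(abs(nxt[v] - rank[v]) for v in vertices)  # L1 norm difference
        rank = nxt
        if diff < eps:
            break
    return rank

# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
ranks = pagerank(graph)
```

With this graph the ranks sum to 1 (no dangling pages), and "c", which is linked from both "a" and "b", comes out on top.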

That will be all for our little mathematical window. The last important hindrance to this model is dangling links: links that point to pages with no outgoing links. They affect the model because it is not clear how their rank should be distributed, so the authors decided to remove them while computing PageRank and add them back after the calculation, even though this slightly reduces the accuracy of PageRank.

Another great property of PageRank is that it converges reasonably fast: around 52 iterations on the data used at the time the '98 paper was published, and 45 iterations on half the data. This suggests that PageRank scales very well even for extremely large collections, as the scaling factor is roughly linear in log(n).

Even though PageRank is very important for Google searches, there are other factors, such as standard IR measures, proximity, and anchor text. Before Google was established, the various search engines did not account for the relative importance of web pages when evaluating search queries; Google changed this.

Final remarks

The importance of algorithms like PageRank is evident now, 24 years later, as Google is one of the leading technology companies in the world. Its search engine held 92% of the market as of 2022 and has only grown since its creation. Personally, I find it fascinating on many levels that such an idea shifted the direction of the Internet to an extent that probably nothing else has. Feel free to leave your thoughts in the comment section below.

Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.