What happens when we do a web search and how does Google rank search results?
How does Google’s ranking and website evaluation process work, from the crawling and analysis of a site (crawl timelines, frequencies, and priorities) to the indexing and filtering processes within its databases?
These are the questions that come to our minds when we think of ranking our blog post on the first page of Google search results. If this is the case with you, then you’re in the right place.
In this article, I’ll guide you through Google’s complete ranking and evaluation process and explain how its algorithm works. Though the article is going to be long, be patient and read to the end if you really want to learn.
And I assure you that by the end of the article you’ll be satisfied and will have picked up something beneficial.
The first thing to understand is that when you do a Google search, you aren’t actually searching the web; you’re searching Google’s index of the web. Google builds this index with software programs called spiders.
Spiders start by fetching a few web pages. Then they follow the links on those pages and fetch the pages they point to, follow all the links on those pages and fetch the pages they link to, and so on, until Google has indexed a pretty big chunk of the web: many billions of pages stored across thousands of machines.
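That link-following process is essentially a breadth-first traversal of the web’s link graph. Here is a minimal sketch in Python; the toy `web` dictionary and the `fetch_links` callback are invented stand-ins for real HTTP fetching and link extraction, which are far more complex in practice.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: fetch seeds, follow their links, and so on."""
    frontier = deque(seed_urls)   # pages waiting to be fetched
    indexed = []                  # pages fetched so far, in crawl order
    seen = set(seed_urls)
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        indexed.append(url)
        for link in fetch_links(url):   # links found on this page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed

# Toy "web": each page maps to the pages it links to.
web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
print(crawl(["a"], lambda u: web.get(u, [])))  # → ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the spider from fetching the same page twice even when many pages link to it.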
Now, suppose I want to know how fast a cheetah can run.
I type in my search, say, “cheetah running speed”, and hit Enter.
Google’s software searches its index to find every page that includes those search terms. In this case, there are hundreds of thousands of possible results.
How Does Google Decide Which Few Documents You Really Want?
Google comes to a conclusion by asking questions, more than 200 of them, like: how many times does this page contain your keywords?
Do the words appear in the title, in the URL, directly adjacent?
Does the page include synonyms for those words?
Is this page from a quality website or is it low quality, even spamming?
What is this page’s PageRank? That’s a formula invented by Google founders Larry Page and Sergey Brin that rates a web page’s importance by looking at how many outside links point to it, and how important those links are.
Finally, Google combines all those factors together to produce each page’s overall score and sends you back your search results about half a second after you submit your search. Google takes its commitment to delivering useful and impartial search results very seriously.
As far as I know, Google never accepts payment to add a site to its index, update it more often, or improve its ranking. So you need not worry: if you write useful content, you have a real chance of ranking on the very first page of Google’s search results.
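The published descriptions of PageRank frame it as an iterative computation over the link graph. The sketch below is a simplified version of that idea, not Google’s actual implementation; the 0.85 damping factor is the commonly cited value, while the toy graph and function names are my own.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a link graph.

    links: {page: [pages it links to]}
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = rank[p] / len(outgoing)   # split score among out-links
                for q in outgoing:
                    new_rank[q] += damping * share
            else:  # dangling page: spread its score evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# "c" receives links from both "a" and "b", so it ends up most important.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # → 'c'
```

Notice that a page’s score depends not just on how many links point to it, but on the scores of the pages doing the linking, which is exactly the “how important are those links” part of the description above.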
How Google’s Ranking and Website Evaluation Process Works
So let me give you a feel for how the Google infrastructure works, how it all fits together, and how Google’s crawling, indexing, and serving pipeline operates.
Let’s dive right in.
So there are three things that you really want to do well if you want to be the world’s best search engine.
- You want to crawl the web comprehensively and deeply.
- You want to index those pages.
- And then you want to rank or serve those pages and return the most relevant ones first.
Crawling The Web
Crawling is actually more difficult than you might think. When Google started, it didn’t manage to crawl the web for something like three or four months, and the team had to set up a war room.
But a good mental model is that Google basically takes PageRank as the primary determinant: the more PageRank you have, i.e. the more people who link to you and the more reputable those people are, the more likely it is that Google will discover your page relatively early in the crawl.
In fact, you could imagine crawling in strict PageRank order, and you’d get the CNNs of the world, the New York Times of the world, and other very high-PageRank sites.
And if you think about how things used to be, Google used to crawl on a 30-day cycle. It would crawl for several weeks, then index for about a week, and then push that data out, which would take about another week.
And that’s what the Google Dance was.
Indexing And Filtering Process Within Databases
Sometimes you would hit one data centre that had old data and sometimes you would hit a data centre that had new data.
Now there are various interesting tricks you can do. For example, after you’ve crawled for 30 days, you can imagine recrawling the high-PageRank sites, so you can see if anything new or important has hit the CNN home page.
But for the most part, this is not fantastic, right?
Because if it takes you 30 days to crawl the web, you’re going to be out of date.
So eventually, in 2003 I believe, as part of an update called Update Fritz, Google switched to crawling a significant chunk of the web every day. So if you imagine breaking the web into a certain number of segments, you could crawl one segment and refresh it every night.
And so at any given point, your main base index would only be so out of date, because you’d loop back around and refresh each segment, and that works very, very well.
Instead of waiting for everything to finish, you’re incrementally updating your index. And Google has got even better over time: at this point, it can get very, very fresh, and anytime it sees updates, it can find them very quickly.
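To make the segment idea concrete, here is a minimal sketch of such a rotation, assuming an invented fixed number of segments with one refreshed per night; the real scheduling is of course far more sophisticated.

```python
def refresh_schedule(num_segments, days):
    """Rotate through web segments, refreshing one per night.

    No segment is ever more than num_segments days stale, because
    the rotation loops back around to it.
    """
    schedule = []
    for day in range(days):
        segment = day % num_segments   # tonight's segment
        schedule.append(segment)
    return schedule

# With 4 segments, each one gets refreshed every 4th night.
print(refresh_schedule(4, 8))  # → [0, 1, 2, 3, 0, 1, 2, 3]
```

The point of the rotation is the bounded staleness: with four segments, the oldest data in the base index is at most four days old, instead of the 30 days of the old crawl cycle.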
In the old days, you would have not just a main or base index, but also what were called Supplemental Results, or the Supplemental Index: something Google wouldn’t crawl and refresh quite as often, but which contained a lot more documents.
And so you could almost imagine having really fresh content, then a layer of Google’s main index, and then more documents that are not refreshed quite as often, but with a lot more of them.
So that’s just a little bit about the crawl and how to crawl comprehensively.
What you do then is pass things around, and you basically say: OK, I have crawled a large fraction of the web, and within that web you have, for example, one document.
And indexing is basically taking documents, which store words in word order, and reorganizing them so that each word points to the documents it appears in.
Well, let’s just work through an example.
Suppose you search for Angelina Jolie.
In a document, the words Angelina and Jolie appear right next to each other.
But what you want in an index is which documents does the word Angelina appear in, and which documents does the word Jolie appear in?
So you might say Angelina appears in documents 1, and 2, and 89, and 555, and 789.
And Jolie might appear in documents 2, and 8, and 73, and 555, and 1,000.
And so the whole process of indexing is this reversal: instead of having the documents in word order, you have the words, each with its list of documents in document order.
So it’s: OK, these are all the documents that a given word appears in.
Now when someone comes to Google and they type in Angelina Jolie, you want to say, OK, what documents might match Angelina Jolie?
Well, document 1 has Angelina, but it doesn’t have Jolie. So it’s out.
Document 2 has both Angelina and Jolie, so that’s a possibility.
Document 8 has Jolie but not Angelina.
89 and 73 are out because they don’t have the right combination of words.
555 has both Angelina and Jolie. And 789 and 1,000 are also out, because each has only one of the two words.
So when someone comes to Google and they type in Chicken Little, Britney Spears, Shadab Alam, Katy Perry, whatever it is, it finds the documents that Google believes have those words, either on the page or maybe in backlinks, in anchor text pointing to that document.
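The worked example above can be sketched in a few lines of Python. This is a bare-bones inverted index with AND-style document selection, assuming a toy corpus whose document IDs mirror the posting lists in the example; real indexes also store word positions, compress their postings, fold in anchor text, and much more.

```python
def build_inverted_index(docs):
    """Reverse documents-in-word-order into word -> sorted doc IDs."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def select_documents(index, query):
    """Document selection: keep only docs containing every query word."""
    postings = [set(index.get(w, [])) for w in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Toy corpus matching the posting lists in the example above.
docs = {
    1: "Angelina",
    2: "Angelina Jolie",
    8: "Jolie",
    73: "Jolie",
    89: "Angelina",
    555: "Angelina Jolie",
    789: "Angelina",
    1000: "Jolie",
}
index = build_inverted_index(docs)
print(select_documents(index, "Angelina Jolie"))  # → [2, 555]
```

Only documents 2 and 555 survive, exactly the two candidates the walkthrough above arrives at: every other document has one of the words but not both.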
Once you have done what’s called document selection, you try to figure out, how should you rank those?
Ranking The Pages
Ranking the pages is a really tricky task to perform. Google uses PageRank, as well as over 200 other factors, in its rankings.
Let’s say, OK, maybe this document is really authoritative. It has a lot of reputation because it has a lot of PageRank. But it only has the word Jolie once.
And it just happens to have the word Angelina somewhere else on the page, whereas here is a document that has the words Angelina and Jolie right next to each other, so there is proximity.
And it’s got a lot of reputation. It’s got a lot of links pointing to it.
So Google tries to balance that off.
You want to find reputable documents that are also about what the user typed in.
And that’s kind of the secret sauce, trying to figure out a way to combine those 200 different signals in order to find the most relevant document.
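Nobody outside Google knows how those 200-plus signals are actually combined, but the balancing act can be illustrated with a toy scoring function. Everything here is invented for illustration: it simply trades off one relevance signal (how close the query words sit to each other) against one reputation signal (a PageRank-like score), with made-up weights.

```python
def score(doc, query_words):
    """Toy blend of two signals: word proximity and reputation."""
    words = doc["text"].lower().split()
    positions = [words.index(w) for w in query_words if w in words]
    if len(positions) < len(query_words):
        return 0.0                     # missing a query word: not selected
    # Proximity: 1.0 when the words are adjacent, smaller as they spread out.
    proximity = 1.0 / (1 + max(positions) - min(positions))
    return 0.7 * proximity + 0.3 * doc["pagerank"]

docs = [
    # High reputation, but the query words are far apart on the page.
    {"text": "angelina mentioned once and jolie far away", "pagerank": 0.9},
    # Lower reputation, but the words sit right next to each other.
    {"text": "angelina jolie biography", "pagerank": 0.4},
]
best = max(docs, key=lambda d: score(d, ["angelina", "jolie"]))
print(best["text"])  # → 'angelina jolie biography'
```

With these weights, the on-topic document beats the merely famous one, which is the balance the article describes: reputable documents that are also about what the user typed in.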
So at any given time, hundreds of millions of times a day, someone comes to Google, and Google tries to find the closest data centre to them.
You type in something like Angelina Jolie, and it sends that query out to hundreds of different machines all at once, each of which looks through the tiny fraction of the web that it has indexed.
And it finds: OK, these are the documents we think are the best match. Then Google takes that page and tries to show it with a useful snippet, so that you can decide whether the document is a good fit for you.
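That fan-out pattern is often called scatter-gather: the index is split into shards, every shard is queried at once, and the partial answers are merged. Here is a minimal single-threaded sketch with invented helper names and a toy pair of shards; in production the shard queries run in parallel on separate machines, which is how the answer comes back in under half a second.

```python
def search_shard(shard_index, query_words):
    """Each machine scans only its own slice of the index."""
    hits = []
    for doc_id, text in shard_index.items():
        words = text.lower().split()
        if all(w in words for w in query_words):
            hits.append(doc_id)
    return hits

def scatter_gather(shards, query):
    """Send the query to every shard, then merge the partial answers."""
    query_words = query.lower().split()
    results = []
    for shard in shards:               # in production: all shards at once
        results.extend(search_shard(shard, query_words))
    return sorted(results)

# Two toy shards, each holding a fraction of the document collection.
shards = [
    {1: "angelina", 2: "angelina jolie"},
    {555: "angelina jolie", 1000: "jolie"},
]
print(scatter_gather(shards, "Angelina Jolie"))  # → [2, 555]
```

No single shard knows about all the matching documents; the full answer only exists after the merge step.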
If you leave the page immediately, then Google understands that this particular document does not suit that keyword, and it will push the page back in the rankings.
This is how the Google Algorithm works.
I hope this gives you a little bit of a feel for how Google ranks search results, how the crawling system works, how Google indexes documents, and how things get returned in under half a second through that massive parallelization.
I believe this has helped you, and if you want to know more, there are a whole bunch of articles and academic papers about Google, PageRank, and how Google works.