Why Google Doesn’t Crawl and Index Every URL
Google’s John Mueller has written a very detailed and candid explanation of why Google (and third-party SEO tools) don’t crawl and index every URL or link on the web. He explained that crawling is not objective, that it is expensive, that it can be inefficient, that the web changes constantly, and that there is spam and junk to filter, all of which has to be taken into account.
John wrote this detailed answer on Reddit in response to the question “Why don’t SEO tools show all backlinks?”, but he answered it from a Google Search perspective. He said:
There is no objective way to properly crawl the web.
It’s theoretically impossible to crawl everything, since the number of actual URLs is effectively infinite. Since nobody can afford to keep an infinite number of URLs in a database, all web crawlers make assumptions, simplifications, and guesses about what is realistically worth crawling.
And even then, for practical reasons, you can’t crawl all of that all the time: the internet doesn’t have enough connectivity and bandwidth for it, and it costs a lot of money to access many pages regularly (for the crawler and for the site’s owner).
Past that, some pages change quickly, while others haven’t changed in 10 years. So crawlers try to save effort by focusing more on the pages they expect to change, rather than on the ones they expect not to change.
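The scheduling idea Mueller describes can be sketched in a few lines. This is a hypothetical toy model, not Google’s actual algorithm: each page keeps a recrawl interval that shrinks when the page was changed on the last visit and grows when it wasn’t. The class name, interval values, and bounds are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Toy per-URL crawl state (hypothetical, for illustration only)."""
    url: str
    interval_days: float = 7.0  # current recrawl interval

    def record_visit(self, changed: bool) -> None:
        # Recrawl sooner if the page changed, back off if it did not,
        # clamped to an arbitrary 1-day..1-year range.
        if changed:
            self.interval_days = max(1.0, self.interval_days / 2)
        else:
            self.interval_days = min(365.0, self.interval_days * 2)


page = PageRecord("https://example.com/news")
page.record_visit(changed=True)   # interval shrinks: 7.0 -> 3.5
page.record_visit(changed=False)  # interval grows back: 3.5 -> 7.0
```

A real crawler would fold in many more signals (sitemap hints, server capacity, page importance), but the shape of the trade-off is the same: spend crawl budget where change is expected.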
And then we get to the part where crawlers try to figure out which pages are actually useful. The web is filled with junk that nobody cares about, pages that have been spammed into uselessness. These pages may still change regularly, they may have reasonable URLs, but they’re just destined for the landfill, and any search engine that cares about its users will ignore them. Sometimes it’s not just obvious junk either. More and more, sites are technically fine but just don’t reach “the bar” from a quality point of view to merit being crawled more.
Therefore, all crawlers (including SEO tools) work on a very simplified set of URLs; they have to figure out how often to crawl, which URLs to crawl more often, and which parts of the web to ignore. There are no fixed rules for any of this, so every tool has to make its own decisions along the way. That’s why search engines index different content, why SEO tools list different links, and why any metrics built on top of them differ so much.
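The “simplified set of URLs” point can also be sketched: instead of keeping everything, a crawler keeps a bounded frontier of the URLs that clear some quality bar, truncated to its crawl budget. Everything here is a stand-in assumption: the scores, the bar, and the function name are invented for illustration, not taken from any real crawler.

```python
import heapq


def build_frontier(candidates, quality_scores, budget, quality_bar=0.5):
    """Keep at most `budget` URLs whose estimated quality clears the bar.

    `quality_scores` maps URL -> score in [0, 1]; both the scores and
    the bar stand in for whatever signals a real crawler would use.
    """
    eligible = [(score, url) for url, score in quality_scores.items()
                if url in candidates and score >= quality_bar]
    # Highest-quality URLs first, truncated to the crawl budget.
    return [url for score, url in heapq.nlargest(budget, eligible)]


scores = {"https://a.example/": 0.9,
          "https://b.example/spam": 0.1,
          "https://c.example/docs": 0.7}
frontier = build_frontier(scores.keys(), scores, budget=2)
print(frontier)  # -> ['https://a.example/', 'https://c.example/docs']
```

Because every tool picks its own scores, bar, and budget, two tools fed the same web end up with different frontiers, which is exactly why their link lists and metrics diverge.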
I thought it would be good to highlight this, as it is useful for SEOs to read and understand.
Forum discussion at Reddit.