In a Search Off the Record episode, Gary Illyes, Lizzi Sassman, and Dave Smart debunked myths about crawling — specifically crawl budgets and how Google prioritizes crawling.
There has been plenty of speculation about crawl budgets, with site owners assuming that they must stay within an allotted budget for their pages to be indexed. According to Google’s Search Relations team, however, that isn’t the case.
What Is a Crawler?
First and foremost, a crawler is software that fetches information and resources from websites. Smart states:
“So if a search engine wants to index and rank something, it needs to go and fetch it first, so they use a crawler. They go and fetch whatever they need to do and then that can go through all the wonderful indexing and ranking stuff. But to get to that point first you have to go download something. You have to go and fetch something first.”
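Googlebot’s internals aren’t public, but the basic fetch step Smart describes can be sketched in a few lines of Python. The requests library, the user agent string, and the example URL below are illustrative assumptions, not anything Google actually uses:

```python
import requests  # widely used third-party HTTP client, chosen only for illustration

def fetch(url: str, user_agent: str = "example-crawler/0.1") -> str | None:
    """Download a page so it can later go through indexing and ranking."""
    try:
        response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        # If the web server is overloaded or unreachable, the fetch fails
        # and can simply be retried later.
        return None

html = fetch("https://example.com/")
```

Everything downstream (indexing, ranking, serving results) starts from pages retrieved by a step like this.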
This explanation applies to Googlebot as well. While there are many crawlers built for various purposes, Illyes confirms that Google’s works much the same way:
“Internally, what Dave said is pretty much what we have for Google. Basically we have a piece of software that is fetching from the internet on the request of a team or an individual, like a Googler. You instruct it to fetch something from the internet, and then it will schedule that for fetching and in a few maybe seconds or minutes or even hours, if the web server is overloaded, then you will get back whatever you asked it to fetch.”
So a crawler fetches information on request. But how does it know where to start? Smart explains:
“You kind of need to do it by looking at what's known, finding somewhere to start, a starting point. And from that you get the links and stuff, and then you would try and determine what's important to go and fetch now, and maybe what can wait until later and maybe what's not important at all.”
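In other words, a crawler maintains a frontier of known URLs and works through it in priority order. A rough sketch of that loop, assuming hypothetical fetch(), extract_links(), and score() helpers, none of which reflect Google’s actual implementation:

```python
import heapq

def crawl(seed_urls, fetch, extract_links, score, max_pages=100):
    """Start from known URLs, fetch the most promising one, discover new
    links, and repeat, deferring or skipping less important URLs."""
    # heapq is a min-heap, so negate scores to pop the highest-priority URL first.
    frontier = [(-score(url), url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    fetched = 0

    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)              # the fetch step sketched earlier
        fetched += 1
        if page is None:
            continue
        for link in extract_links(page):
            if link not in seen:       # only schedule what isn't already known
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
```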
Sassman then asks how the crawler decides what’s important and what isn’t. Smart answers that the crawler should be asking a few questions (sketched as a toy scoring function after the list):
- Has the information already been crawled?
- Does the information come from somewhere important?
- Is the information spammy?
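Those three questions map naturally onto a priority score, the kind of thing the score() helper in the sketch above would compute. The stub predicates and weights here are purely invented for illustration:

```python
# Stub predicates standing in for real signals (crawl history, link-graph
# importance, spam classifiers); all of this is an illustrative assumption.
def already_crawled(url: str) -> bool:
    return False

def from_important_source(url: str) -> bool:
    return True

def looks_spammy(url: str) -> bool:
    return False

def score(url: str) -> float:
    """Turn the three questions above into a single crawl priority."""
    value = 1.0
    if already_crawled(url):
        value -= 0.5   # refreshing a known page is usually less urgent
    if from_important_source(url):
        value += 1.0   # content referenced from important places gets priority
    if looks_spammy(url):
        value = 0.0    # spam isn't worth fetching at all
    return value
```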
Sassman explains that while crawl budget appears in Google’s documentation, it is a concept meant to help people understand that crawling involves finite resources:
“I mean, it's a concept, I guess, more so to explain to people that there are finite resources. At least that's how I'm interpreting it, is that we need some kind of vehicle to explain that we can't charge you more and you can't charge us more, and that there's a limit to these things. Charge. Well, it's budget, I don't know, money.”
According to Illyes, Google’s crawling system has two main components: a scheduler, which determines which pages to crawl and when, and a limiter in the fetchers, which ensures that Google doesn’t inundate sites. And although crawling is limited, the limit can’t be raised by paying for it.
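Neither component is something a site owner can buy more of, but the limiter’s job is easy to picture: cap how often any single host gets fetched. A minimal sketch, with the one-second interval as an arbitrary assumption rather than a real Google setting:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

class HostLimiter:
    """Rate-limits fetches per host so the crawler doesn't inundate a site."""

    def __init__(self, min_interval_seconds: float = 1.0):
        self.min_interval = min_interval_seconds
        self.last_fetch = defaultdict(float)  # host -> time of its last fetch

    def wait_for(self, url: str) -> None:
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_fetch[host]
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)  # back off instead of hammering
        self.last_fetch[host] = time.monotonic()
```

In a system like the frontier sketch above, the scheduler would call wait_for(url) just before each fetch: the scheduling side decides what to crawl, the limiter decides how fast.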
How Does Content Affect Crawling?
What does that mean from a content standpoint? The more demand search sees for your content, the more Googlebot will crawl your site. Illyes explains:
“But from the perspective of search, which is the only thing that I know how works at Google, basically if search demand goes down, then that also correlates to the crawl limit going down. So if you want to increase how much we crawl, then you somehow have to convince search that your stuff is worth fetching, which is basically what the scheduler is listening to.”
According to Google, search demand could refer to search queries. In this context, if the number of search queries for a particular topic decreases, websites optimized for that topic may not be crawled as frequently as before.
Additionally, not every URL on a site gets indexed, because some pages may not be deemed important to users. Illyes explains:
“I think I would prefer if people would try to identify what are the important URLs that they actually care about to show up in search versus I want everything from my site to show up because like, realistically, we don't have infinite storage space, so we have to wrangle with what we put in the index and what we exclude.”
So, SEOs must determine which URLs they want to appear in the search results.
How to Get Your Site Crawled More
While Googlebot’s crawling capacity is finite, there are still ways to get your site crawled more. Illyes states:
“To get more URLs, that's basically just how sought after your content is in search, because if we see that people are looking for your content and they are linking to your content, then that will naturally increase how much search wants to crawl from your site.”
Focusing on the usefulness of your content can earn your site more crawling. Illyes explains:
“Scheduling is very dynamic. As soon as we get the signals back from search indexing that the quality of the content has increased across this many URLs, we would just start turning up demand.”
Improving page experience can also lead Googlebot to crawl your site faster, and using sitemaps and internal links can make crawling more efficient and benefit your site in the long term.
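A sitemap is one of the few crawl-related levers entirely in a site owner’s hands. The snippet below is a minimal sketch that writes one with Python’s standard library; the URLs and dates are placeholders:

```python
import xml.etree.ElementTree as ET

# Placeholder pages; a real sitemap would list the URLs you actually
# want to show up in search, per the advice above.
pages = [
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/important-page", "2024-01-10"),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod  # helps crawlers spot changed pages

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```

Referencing the generated file from robots.txt or submitting it in Search Console, together with solid internal linking, gives the scheduler clearer hints about which URLs matter.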
Summary
Google recently shared new information about how web crawling works that challenges common assumptions about crawl budgets. According to Google, crawling is dynamic and determined primarily by the quality and usefulness of a website’s content.
Therefore, website owners should focus on creating high-quality content to encourage more crawling without worrying about hitting any limits.