SEO, Industry News, Educative Content

Here’s How Google’s Crawling Infrastructure Actually Works, According to Google’s Own Engineers.


Rasit

Mar 19, 2026 · 8 min read

Google’s Gary Illyes and Martin Splitt sat down on the Search Off The Record podcast (Episode 105, published March 12, 2026) and explained, in more detail than Google has shared publicly before, how the company’s crawling infrastructure actually operates. The episode carries a simple title (“Google crawlers behind the scenes”) but the content is surprisingly technical, and several of the details change how SEOs should think about crawl limits, crawl budgets, and what “Googlebot” even means.

The central disclosure: Googlebot is not a standalone program. It never was. It’s one client of an internal software-as-a-service platform that handles all of Google’s web crawling across every product.

Googlebot Is a Client, Not the System

Illyes describes the name “Googlebot” as a misnomer that made sense in the early 2000s when Google had a single product and therefore a single crawler. The reality in 2026 is very different.

What powers web crawling at Google today is an internal SaaS platform. Illyes gives it the placeholder name “Jack” for the purposes of the discussion, noting that the real internal name isn’t relevant to the public. Jack exposes API endpoints. Any Google team that needs to fetch content from the internet makes an API call to Jack, passing parameters like the desired user agent string, the robots.txt product token to obey, and timeout thresholds. The team doesn’t manage the crawling mechanics itself. Jack handles those centrally.

Illyes summarizes the entire system in one sentence: “It’s basically you tell it, fetch something from the internet without breaking the internet. And then it will do that if the restrictions on the site allow it.”
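To make the description concrete, here is a minimal sketch of what a "Jack"-style request might look like. Every name in it (`FetchRequest`, `product_token`, and so on) is invented for illustration; Google has not published the real interface, only the parameters Illyes mentions (user agent string, robots.txt product token, timeout thresholds).

```python
# Hypothetical sketch of the "Jack"-style internal API described in the
# podcast. All identifiers here are invented placeholders, not Google's.
from dataclasses import dataclass

@dataclass
class FetchRequest:
    url: str
    user_agent: str          # user agent string the fetch should present
    product_token: str       # robots.txt product token to obey
    timeout_seconds: float   # per-fetch timeout threshold
    max_bytes: int = 15 * 1024 * 1024  # platform default: 15MB

# A team such as Search would override the size default downward:
search_request = FetchRequest(
    url="https://example.com/page.html",
    user_agent="Googlebot/2.1 (+http://www.google.com/bot.html)",
    product_token="Googlebot",
    timeout_seconds=30.0,
    max_bytes=2 * 1024 * 1024,  # Search overrides to 2MB for HTML
)
```

The point of the sketch is the division of labor: the calling team only declares parameters, and the central platform handles the crawling mechanics.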

Splitt clarifies the architecture further: Googlebot is not the infrastructure. It’s just the name that one particular team (Google Search) uses for their fetches sent through the central SaaS. Google News, Google Shopping, AdSense, Gemini, NotebookLM, and every other product that needs web content all send their crawl requests through the same underlying system.

Google’s crawling infrastructure documentation at developers.google.com lists Googlebot-News, Storebot-Google, Google-Extended, the AdSense crawler, and Google-NotebookLM as distinct products sharing the same foundation.

The 15MB Limit Is an Infrastructure Default, Not a Googlebot Rule

One of the most cited crawl limits in SEO has been the 15MB file size limit. The podcast clarifies exactly where that number comes from and how it actually works.

Illyes explains that the 15MB limit is set at the infrastructure level, meaning it’s the default for the entire SaaS platform. Any crawler that doesn’t explicitly override that setting gets a 15MB limit. The system starts fetching bytes from the server, runs an internal counter, and when it reaches 15MB, stops receiving data and signals the server to stop sending.

But individual teams can override the default, and they do. Google Search (Googlebot) overrides it downward to 2MB for HTML and supported text-based files. The PDF team overrides it upward to 64MB because PDF documents can be enormous (Illyes notes that the HTTP standard exported as a PDF is around 96MB).
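The byte-counting behavior Illyes describes can be sketched in a few lines: stream the response, count bytes, and stop reading once the cap is reached. The cap is a parameter so different "clients" can override the 15MB platform default, as Search (2MB) and the PDF team (64MB) do. This is an illustrative reimplementation, not Google's code.

```python
# Minimal sketch of a capped streaming fetch: run a counter over the
# incoming bytes and stop receiving once the limit is hit.
import urllib.request

def capped_fetch(url, max_bytes=15 * 1024 * 1024, chunk_size=64 * 1024):
    """Fetch at most max_bytes of a URL, then stop reading."""
    received = bytearray()
    with urllib.request.urlopen(url) as resp:
        while len(received) < max_bytes:
            chunk = resp.read(min(chunk_size, max_bytes - len(received)))
            if not chunk:
                break
            received += chunk
    # Closing the connection signals the server to stop sending.
    return bytes(received)

# A Search-style client would call it with its own override:
html = capped_fetch("data:text/html,<html>hello</html>",
                    max_bytes=2 * 1024 * 1024)
```

Anything the server sends beyond the cap is simply never read, which matches the "ignored, not errored" behavior the documentation describes.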

The reason for the override is infrastructure protection. Processing a 14MB HTML file through rendering, conversion, and indexing would overwhelm Google’s systems given the volume of pages they crawl. The 2MB limit isn’t about what’s technically possible to fetch. It’s about what’s sustainable to process at scale.

Illyes puts it directly: “There’s a bunch of things that are for our own protection or our infrastructure’s protection.”

Crawlers vs Fetchers: A Distinction That Affects robots.txt

The podcast introduces a distinction that has practical implications for how sites manage crawler access.

Google’s infrastructure has two types of clients: crawlers and fetchers. Crawlers run continuously, processing a constant stream of URLs for a given team. No human is waiting on the other end. They operate on behalf of automated systems like Search indexing, Shopping feeds, and news aggregation.

Fetchers operate on a single URL at a time and are always triggered by a user action. Someone clicks a button, shares a link, or requests a preview, and the fetcher goes and gets the page. Illyes explains: “Basically, there’s someone on the other end who’s waiting for the response.”

The practical significance: fetchers generally ignore robots.txt rules because they act in response to explicit human action rather than automated crawling. When Google added Google Messages (the bot that generates link previews in chat threads) to its user-triggered fetchers list in January 2026, it noted that the fetcher doesn't obey robots.txt because a human initiates each request.

For site owners managing crawler access, the distinction means robots.txt controls apply to the automated crawlers (Googlebot, Storebot-Google, etc.) but not to user-triggered fetchers. Blocking Googlebot in robots.txt prevents Search indexing but doesn’t prevent a Google Messages user from generating a preview of a shared link.
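The distinction can be sketched with the standard library's robots.txt parser. The `fetcher_may_fetch` helper is a deliberate simplification of Google's documented behavior, not an API Google provides: crawlers consult the matching product token, while user-triggered fetchers skip the check entirely.

```python
# Sketch of the crawler-vs-fetcher distinction as a site owner
# experiences it, using the stdlib robots.txt parser.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def crawler_may_fetch(user_agent, url):
    # Automated crawlers obey the matching robots.txt product token.
    return parser.can_fetch(user_agent, url)

def fetcher_may_fetch(user_agent, url):
    # User-triggered fetchers act on explicit human requests, so per
    # Google's documentation the robots.txt check is bypassed.
    return True
```

With this robots.txt, `crawler_may_fetch("Googlebot", ".../private/page")` is false, but a link-preview fetcher would still retrieve the page, which is exactly the gap described below.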

What the Limits Look Like in Practice

The documentation clarification from February 2026 reorganized where these numbers live. Here’s how they stack up:

The infrastructure default across all Google crawlers and fetchers is 15MB. Any content beyond that limit is ignored. Individual products can set different limits for their crawlers and for different file types.

For Google Search specifically, Googlebot crawls the first 2MB of HTML and supported text-based files and the first 64MB of PDFs. CSS and JavaScript files referenced in the HTML are each fetched separately, and each fetch is bound by the file size limit independently of the main document.

Once any of these limits are reached, Googlebot stops the fetch and sends only the already-downloaded portion for indexing. The limit applies to uncompressed data, which is an important detail for measurement.

John Mueller clarified on Bluesky that “2MB of HTML is quite a bit” and that it’s extremely rare for sites to run into issues. According to the HTTP Archive Web Almanac, the median HTML page on mobile is roughly 33KB, and 90% of pages have less than 151KB of HTML. Hitting the 2MB ceiling requires a page more than 60 times larger than the median.

The Silent Truncation Problem

The most concerning finding from independent testing (notably by Spotibo in February 2026) is what happens when a page exceeds the limit.

Google Search Console doesn’t flag it. The URL status shows “URL is on Google” and “Page is indexed.” Everything looks normal. But the actual indexed content gets silently truncated at the 2MB mark. Spotibo’s test showed a 3MB HTML file cut off mid-word around line 15,210, with the text literally stopping at “Prevention is b” before the closing HTML tag.

For files well above 15MB that exceed the infrastructure-level limit entirely, Google can’t even process the indexing request. And in neither case does Search Console provide a clear explanation.

The URL Inspection tool doesn’t help either, because it uses the “Google-InspectionTool” crawler rather than Googlebot itself. That crawler operates under the general 15MB fetch limit, not the 2MB indexing limit, which means it can display content that Googlebot’s indexing would actually truncate. The tool most SEOs reach for to verify crawling is, for this specific issue, actively misleading.
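Since the tooling won't flag it, a direct measurement is the simplest diagnostic: check the uncompressed HTML size against the 2MB figure and look at what sits just before the cutoff. The helper below is illustrative; only the 2MB limit comes from Google's documentation.

```python
# Report whether a page's HTML would be silently truncated at
# Googlebot's documented 2MB (uncompressed) indexing limit.
GOOGLEBOT_HTML_LIMIT = 2 * 1024 * 1024  # 2MB, uncompressed

def truncation_report(html_bytes):
    size = len(html_bytes)
    if size <= GOOGLEBOT_HTML_LIMIT:
        return f"OK: {size:,} bytes, under the {GOOGLEBOT_HTML_LIMIT:,}-byte limit"
    # Show the text right before the cutoff, i.e. where indexing stops.
    tail = html_bytes[GOOGLEBOT_HTML_LIMIT - 40:GOOGLEBOT_HTML_LIMIT]
    return (f"TRUNCATED: {size:,} bytes; indexed content stops "
            f"mid-page after ...{tail.decode('utf-8', errors='replace')!r}")

# Synthetic 3MB page, mirroring the kind of file Spotibo tested:
page = (b"<html><body>"
        + b"Prevention is better than cure. " * 100_000
        + b"</body></html>")
print(truncation_report(page))
```

Run against a real page's source, this reveals exactly which content and internal links fall below the cutoff, the part Search Console never reports.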

Why the “Not Monolithic” Part Is the Biggest Takeaway

Splitt closes the relevant section of the podcast by affirming that Google’s crawling infrastructure is “not monolithic.” The system is flexible in terms of fetch limits and other configurations, and he describes it as software-as-a-service where web search is one client among many.

For SEOs, the implication is that thinking about “Googlebot” as one thing with one set of behaviors is increasingly inaccurate. Different Google products crawl with different user agents, different file size limits, different frequency patterns, and different robots.txt tokens. A page that’s fully accessible to Googlebot for Search might be handled differently by Storebot-Google for Shopping or Google-Extended for Gemini.

As Google introduces more AI-driven products and specialized crawlers, the number of clients using the central crawling SaaS will likely grow. The November 2025 documentation migration that separated crawling docs from Search Central was the first signal. The February 2026 file size limit clarification was the second. The podcast episode providing the architectural explanation is the third.

What to Do With All of This

For most sites, none of these limits require action. The vast majority of HTML pages are well under 2MB, and standard CSS and JavaScript bundles from well-optimized sites stay within safe thresholds.

The sites that should pay attention are those with genuinely large HTML pages: sites with massive inline data tables, sites embedding large amounts of inline JavaScript or CSS rather than external files, sites with extremely long-form single-page content, and sites with bloated template markup that adds significant overhead to every page.

For those sites, the practical steps are straightforward. Move inline CSS and JavaScript into external files so they’re fetched separately and don’t count against the HTML size limit. Keep the most important content (headings, primary text, structured data, metadata) near the top of the HTML source so it falls within the first 2MB even if the rest gets truncated. Use Search Console’s crawl stats to monitor how Googlebot interacts with the site. And search for a quote from deep in a page to verify that Google indexed the full content, which is the method Mueller recommends over checking byte counts.
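For the monitoring step, a quick size check is easy to script. The sketch below fetches a page and reports its uncompressed HTML size against the 2MB limit; the user agent string and thresholds are assumptions for illustration, and since the limit applies to uncompressed data, the response is decompressed before measuring.

```python
# Quick check: how close is a page's uncompressed HTML to the 2MB limit?
import gzip
import urllib.request

LIMIT = 2 * 1024 * 1024  # Googlebot's 2MB HTML limit (uncompressed)

def html_size_check(url):
    req = urllib.request.Request(url, headers={"User-Agent": "size-check/1.0"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # The limit applies to uncompressed bytes, so decompress if the
        # server responded with gzip.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return len(body), f"{100 * len(body) / LIMIT:.1f}% of the 2MB limit"
```

A page reporting single-digit percentages is nowhere near the ceiling; anything approaching 100% is a candidate for the quote-search verification Mueller recommends.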

For link-building campaigns and digital PR placements, the relevance is about the pages being linked to. If a landing page receiving backlinks has bloated HTML that gets silently truncated, the content and internal links below the cutoff point are invisible to Google’s index. Verifying that target pages are fully indexed before investing in external link acquisition is a basic step that becomes more important with clearer knowledge of where the cutoffs sit.

Watch the full podcast episode: https://youtu.be/JpweMBnpS4Q