Crawling: how search engines discover pages (and what blocks them)

Crawling is how bots discover your pages. Learn crawl paths, crawl budget basics, and common issues like robots.txt rules, noindex, and soft 404s.

2026-03-07 · 2 min read

Crawling is the process by which search engine bots (like Googlebot) discover and fetch URLs.

It’s step one. If a page isn’t crawled, it won’t be indexed. If it isn’t indexed, it can’t rank.

Simple, but crawling failures are surprisingly common.

How crawling usually happens

Search engines typically discover URLs from:

  • internal links
  • XML sitemaps
  • external links
  • redirects and canonical hints (sometimes)

If your page has no internal links pointing to it, you’re basically hiding it.
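
Internal links are hard to audit by hand, but the sitemap side is easy to sanity-check. A minimal sketch, assuming a standard sitemap at https://example.com/sitemap.xml (a placeholder):

```python
# Minimal sketch: list the URLs a sitemap exposes to crawlers.
# The sitemap URL is a placeholder; swap in your own.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.fromstring(resp.read())

# Works for a plain <urlset>; a sitemap index file nests further <sitemap> entries.
urls = [loc.text for loc in tree.findall("sm:url/sm:loc", NS)]
print(f"{len(urls)} URLs in sitemap")
for url in urls[:10]:
    print(url)
```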

Crawlability blockers (the usual suspects)

If crawling is broken, I check these first:

robots.txt

Robots rules can block crawling entirely. This is often accidental after a staging-to-production switch.
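
If you want to verify this quickly, Python's standard library ships a robots.txt parser. A minimal sketch; the site and page URLs are placeholders:

```python
# Minimal sketch: ask robots.txt whether Googlebot may fetch a given URL.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

page = "https://example.com/some/important-page"
if rp.can_fetch("Googlebot", page):
    print("robots.txt allows Googlebot to crawl", page)
else:
    print("robots.txt BLOCKS Googlebot from", page)
```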

Noindex and X-Robots-Tag

Noindex doesn’t block crawling; it tells engines not to index. But it still matters because people sometimes noindex important pages without noticing.

Also watch out for X-Robots-Tag headers, especially on PDFs.
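
A quick way to surface both signals is to look at the HTTP response headers and the HTML together. A rough sketch; the URL is a placeholder and the regex is deliberately naive (it won't handle every markup variation):

```python
# Minimal sketch: surface noindex from both the X-Robots-Tag header and the meta robots tag.
import re
import urllib.request

url = "https://example.com/important-page"
req = urllib.request.Request(url, headers={"User-Agent": "crawl-debug/1.0"})

with urllib.request.urlopen(req) as resp:
    x_robots = resp.headers.get("X-Robots-Tag", "")
    body = resp.read(200_000).decode("utf-8", errors="ignore")

if "noindex" in x_robots.lower():
    print("X-Robots-Tag header says noindex:", x_robots)

# Look for <meta name="robots" content="...noindex..."> in the HTML.
meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)', body, re.I)
if meta and "noindex" in meta.group(1).lower():
    print("Meta robots tag says noindex:", meta.group(1))
```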

Canonical issues

A canonical pointing to the wrong page can make the crawled page effectively invisible in the index, even if it’s fetched.
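
One way to spot this is to compare the rel=canonical target with the URL you actually fetched. A rough sketch with a placeholder URL; a mismatch isn't automatically a problem, but it's where I'd look next:

```python
# Minimal sketch: compare a page's rel=canonical target with the fetched URL.
import re
import urllib.request

url = "https://example.com/products/blue-widget"
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode("utf-8", errors="ignore")

# Naive match: assumes rel comes before href in the <link> tag.
match = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)', html, re.I)
canonical = match.group(1) if match else None

if canonical is None:
    print("No rel=canonical found")
elif canonical.rstrip("/") != url.rstrip("/"):
    print("Canonical points elsewhere:", canonical)
else:
    print("Canonical is self-referencing")
```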

Soft 404s and thin pages

Pages that look like errors (empty templates, “not found” content) but return 200 OK often get treated as soft 404s. Google crawls them, then decides they’re not worth indexing.
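
A crude heuristic catches many of these: a 200 response whose body is tiny or reads like an error page. A sketch; the word-count threshold, phrases, and URLs are illustrative assumptions, not rules Google uses:

```python
# Minimal sketch: flag pages that return 200 but look empty or error-like.
import re
import urllib.request

SUSPECT_PHRASES = ("not found", "no results", "page unavailable")

def looks_like_soft_404(url: str) -> bool:
    with urllib.request.urlopen(url) as resp:
        status = resp.status
        html = resp.read().decode("utf-8", errors="ignore")
    if status != 200:
        return False  # a real 404/410 is not a *soft* 404
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag strip
    thin = len(text.split()) < 150        # illustrative threshold
    error_like = any(p in text.lower() for p in SUSPECT_PHRASES)
    return thin or error_like

for url in ["https://example.com/search?q=zzzz", "https://example.com/old-category/"]:
    print(url, "-> possible soft 404" if looks_like_soft_404(url) else "-> looks fine")
```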

Crawl budget (when you should care)

For small sites, crawl budget is rarely the main bottleneck.

You should care when:

  • you have tens/hundreds of thousands of URLs
  • you generate infinite URL parameters (filters, sessions)
  • your server is slow or unstable
  • you publish lots of new pages and they take weeks to show up

In those cases, cleaning up low-value URLs and improving server response time can make crawling noticeably better.
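
Server logs are the fastest way to see where crawl activity actually goes. A sketch that counts which URL parameters Googlebot keeps hitting, assuming a common combined-format access log (adjust the parsing to your log format):

```python
# Minimal sketch: estimate how much Googlebot crawl activity goes to parameterized URLs.
from collections import Counter
from urllib.parse import urlsplit

LOG_PATH = "access.log"   # placeholder path
param_hits = Counter()
total = 0

with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        total += 1
        try:
            # Combined log format: ... "GET /path?x=1 HTTP/1.1" ...
            path = line.split('"')[1].split(" ")[1]
        except IndexError:
            continue
        query = urlsplit(path).query
        if query:
            # Count hits per parameter name to spot filters/session IDs eating crawl budget.
            for pair in query.split("&"):
                param_hits[pair.split("=")[0]] += 1

print(f"Googlebot requests: {total}")
for name, hits in param_hits.most_common(10):
    print(f"  ?{name}= appears in {hits} crawled URLs")
```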

Practical crawl diagnostics

If you’re debugging crawling, these checks are worth doing (a scripted version of the last two follows the list):

  • Does the page have at least one internal link from a crawlable page?
  • Is the page in your XML sitemap (and is the sitemap clean)?
  • Does the page return a normal 200 response and render content?
  • Is the server fast enough for bots (TTFB matters)?

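A rough scripted version of the last two checks, with a placeholder URL; the TTFB measured here is from your machine, not what Googlebot sees, so treat it as a ballpark:

```python
# Minimal sketch: status code, approximate TTFB, and body size for one URL.
import time
import urllib.request

def quick_crawl_check(url: str) -> None:
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        first_chunk = resp.read(1)                  # rough time to first body byte
        ttfb_ms = (time.perf_counter() - start) * 1000
        status = resp.status
        body = first_chunk + resp.read()
    print(url)
    print(f"  status: {status}")
    print(f"  approx TTFB: {ttfb_ms:.0f} ms")
    print(f"  body size: {len(body)} bytes")

quick_crawl_check("https://example.com/important-page")
```
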
If you want a fast broad scan, start with SEO Audit Tool. It helps catch robots/canonical/soft 404 patterns that are hard to see one URL at a time.

One-line definition: Crawling in the Glossary.
