Crawling: how search engines discover pages (and what blocks them)

Crawling is how bots discover your pages. Learn crawl paths, crawl budget basics, and common issues like robots.txt rules, noindex, and soft 404s.

2026-03-07 · 2 min read

Crawling is the process by which search engine bots (like Googlebot) discover and fetch URLs.

It’s step one. If a page isn’t crawled, it won’t be indexed. If it isn’t indexed, it can’t rank.

Simple, but crawling failures are surprisingly common.

How crawling usually happens

Search engines typically discover URLs from:

  • internal links
  • XML sitemaps
  • external links
  • redirects and canonical hints (sometimes)

If your page has no internal links pointing to it, you’re basically hiding it.
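
Internal links are hard to audit by hand, but the sitemap side is easy to sanity-check. A minimal sketch, assuming a standard sitemap at https://example.com/sitemap.xml (a placeholder):

```python
# Minimal sketch: list the URLs a sitemap exposes to crawlers.
# The sitemap URL is a placeholder; swap in your own.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.fromstring(resp.read())

# Works for a plain <urlset>; a sitemap index file nests further <sitemap> entries.
urls = [loc.text for loc in tree.findall("sm:url/sm:loc", NS)]
print(f"{len(urls)} URLs in sitemap")
for url in urls[:10]:
    print(url)
```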

Crawlability blockers (the usual suspects)

If crawling is broken, I check these first:

robots.txt

Robots rules can block crawling entirely. This is often accidental after a staging-to-production switch.
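
If you want to verify this quickly, Python's standard library ships a robots.txt parser. A minimal sketch; the site and page URLs are placeholders:

```python
# Minimal sketch: ask robots.txt whether Googlebot may fetch a given URL.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

page = "https://example.com/some/important-page"
if rp.can_fetch("Googlebot", page):
    print("robots.txt allows Googlebot to crawl", page)
else:
    print("robots.txt BLOCKS Googlebot from", page)
```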

Noindex and X-Robots-Tag

Noindex doesn’t block crawling; it tells engines not to index. But it still matters because people sometimes noindex important pages without noticing.

Also watch out for X-Robots-Tag headers, especially on PDFs.
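
A quick way to surface both signals is to look at the HTTP response headers and the HTML together. A rough sketch; the URL is a placeholder and the regex is deliberately naive (it won't handle every markup variation):

```python
# Minimal sketch: surface noindex from both the X-Robots-Tag header and the meta robots tag.
import re
import urllib.request

url = "https://example.com/important-page"
req = urllib.request.Request(url, headers={"User-Agent": "crawl-debug/1.0"})

with urllib.request.urlopen(req) as resp:
    x_robots = resp.headers.get("X-Robots-Tag", "")
    body = resp.read(200_000).decode("utf-8", errors="ignore")

if "noindex" in x_robots.lower():
    print("X-Robots-Tag header says noindex:", x_robots)

# Look for <meta name="robots" content="...noindex..."> in the HTML.
meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)', body, re.I)
if meta and "noindex" in meta.group(1).lower():
    print("Meta robots tag says noindex:", meta.group(1))
```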

Canonical issues

A canonical pointing to the wrong page can make the crawled page effectively invisible in the index, even if it’s fetched.
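
One way to spot this is to compare the rel=canonical target with the URL you actually fetched. A rough sketch with a placeholder URL; a mismatch isn't automatically a problem, but it's where I'd look next:

```python
# Minimal sketch: compare a page's rel=canonical target with the fetched URL.
import re
import urllib.request

url = "https://example.com/products/blue-widget"
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode("utf-8", errors="ignore")

# Naive match: assumes rel comes before href in the <link> tag.
match = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)', html, re.I)
canonical = match.group(1) if match else None

if canonical is None:
    print("No rel=canonical found")
elif canonical.rstrip("/") != url.rstrip("/"):
    print("Canonical points elsewhere:", canonical)
else:
    print("Canonical is self-referencing")
```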

Soft 404s and thin pages

Pages that look like errors (empty templates, “not found” content) but return 200 OK often get treated as soft 404s. Google crawls them, then decides they’re not worth indexing.
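
A crude heuristic catches many of these: a 200 response whose body is tiny or reads like an error page. A sketch; the word-count threshold, phrases, and URLs are illustrative assumptions, not rules Google uses:

```python
# Minimal sketch: flag pages that return 200 but look empty or error-like.
import re
import urllib.request

SUSPECT_PHRASES = ("not found", "no results", "page unavailable")

def looks_like_soft_404(url: str) -> bool:
    with urllib.request.urlopen(url) as resp:
        status = resp.status
        html = resp.read().decode("utf-8", errors="ignore")
    if status != 200:
        return False  # a real 404/410 is not a *soft* 404
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag strip
    thin = len(text.split()) < 150        # illustrative threshold
    error_like = any(p in text.lower() for p in SUSPECT_PHRASES)
    return thin or error_like

for url in ["https://example.com/search?q=zzzz", "https://example.com/old-category/"]:
    print(url, "-> possible soft 404" if looks_like_soft_404(url) else "-> looks fine")
```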

Crawl budget (when you should care)

For small sites, crawl budget is rarely the main bottleneck.

You should care when:

  • you have tens/hundreds of thousands of URLs
  • you generate infinite URL parameters (filters, sessions)
  • your server is slow or unstable
  • you publish lots of new pages and they take weeks to show up

In those cases, cleaning up low-value URLs and improving server response time can make crawling noticeably better.
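
Server logs are the fastest way to see where crawl activity actually goes. A sketch that counts which URL parameters Googlebot keeps hitting, assuming a common combined-format access log (adjust the parsing to your log format):

```python
# Minimal sketch: estimate how much Googlebot crawl activity goes to parameterized URLs.
from collections import Counter
from urllib.parse import urlsplit

LOG_PATH = "access.log"   # placeholder path
param_hits = Counter()
total = 0

with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        total += 1
        try:
            # Combined log format: ... "GET /path?x=1 HTTP/1.1" ...
            path = line.split('"')[1].split(" ")[1]
        except IndexError:
            continue
        query = urlsplit(path).query
        if query:
            # Count hits per parameter name to spot filters/session IDs eating crawl budget.
            for pair in query.split("&"):
                param_hits[pair.split("=")[0]] += 1

print(f"Googlebot requests: {total}")
for name, hits in param_hits.most_common(10):
    print(f"  ?{name}= appears in {hits} crawled URLs")
```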

Practical crawl diagnostics

If you’re debugging crawling, these checks are worth doing (a scripted version of the last two follows the list):

  • Does the page have at least one internal link from a crawlable page?
  • Is the page in your XML sitemap (and is the sitemap clean)?
  • Does the page return a normal 200 response and render content?
  • Is the server fast enough for bots (TTFB matters)?

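A rough scripted version of the last two checks, with a placeholder URL; the TTFB measured here is from your machine, not what Googlebot sees, so treat it as a ballpark:

```python
# Minimal sketch: status code, approximate TTFB, and body size for one URL.
import time
import urllib.request

def quick_crawl_check(url: str) -> None:
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        first_chunk = resp.read(1)                  # rough time to first body byte
        ttfb_ms = (time.perf_counter() - start) * 1000
        status = resp.status
        body = first_chunk + resp.read()
    print(url)
    print(f"  status: {status}")
    print(f"  approx TTFB: {ttfb_ms:.0f} ms")
    print(f"  body size: {len(body)} bytes")

quick_crawl_check("https://example.com/important-page")
```
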
If you want a fast broad scan, start with SEO Audit Tool. It helps catch robots/canonical/soft 404 patterns that are hard to see one URL at a time.

One-line definition: Crawling in the Glossary.
