You've probably typed a URL like yoursite.com/robots.txt into a browser out of
curiosity and landed on a plain text file with a handful of cryptic lines — "User-agent,"
"Disallow," "Sitemap." It looks minor. It isn't. That small file is one of the first things
a search engine crawler reads before it touches anything else on your site.
Get it wrong and you can accidentally hide your entire site from Google. Leave it out entirely and crawlers just assume everything is fair game. Understanding what robots.txt actually does — and what it doesn't — is one of the fastest ways to avoid a silent, easy-to-miss SEO mistake.
robots.txt is a plain text file placed at the root of a domain that tells search engine crawlers which parts of a site they're allowed to request. It uses simple rules like "Disallow" and "Allow" per user-agent, and can point crawlers to a sitemap. It doesn't remove pages from search results or restrict who can view a page — it only manages crawler access.
What is robots.txt, exactly?
robots.txt follows the Robots Exclusion Protocol, a long-standing convention that gives site owners a standardized way to talk to crawlers before those crawlers request anything else.
- It's a request, not a lock. The file tells well-behaved crawlers what they shouldn't request. It has no technical ability to stop a browser, a person, or a non-compliant bot from reaching a URL directly.
- It works by user-agent. Rules can target a specific crawler, like Googlebot, or apply to every bot using a wildcard, so different crawlers can be given different levels of access.
- Two core directives do most of the work. `Disallow` marks a path as off-limits to crawling, while `Allow` carves out an exception inside a disallowed folder.
- It can point to your sitemap. A `Sitemap:` line gives crawlers a direct path to your full list of URLs, independent of the disallow/allow rules.
In short: robots.txt manages traffic at the front door. It decides who gets waved through to crawl a page — it says nothing about whether that page can later show up in search results.
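To make the points above concrete, here is a minimal sketch that feeds a small robots.txt to Python's standard-library `urllib.robotparser` and asks what each crawler may request. The file contents, the paths, and the `SomeBot` crawler name are illustrative assumptions, not rules taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; paths and domain are illustrative only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Allow: /admin/help/
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask whether `agent` may request `path` under the rules above."""
    return parser.can_fetch(agent, "https://yoursite.com" + path)


print(allowed("Googlebot", "/drafts/post"))    # False: Googlebot's own group blocks it
print(allowed("Googlebot", "/admin/"))         # True: Googlebot follows only its own group
print(allowed("SomeBot", "/admin/"))           # False: falls back to the wildcard group
print(allowed("SomeBot", "/admin/help/faq"))   # True: the Allow exception applies
print(parser.site_maps())                      # ['https://yoursite.com/sitemap.xml']
```

One caveat about the tooling: Python's parser applies the first rule that matches, in file order, while Google applies the most specific (longest) matching rule. Listing the `Allow` exception before the broader `Disallow`, as the sketch does, gives the same answer under both approaches.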
Why robots.txt matters for SEO
A file this small has an outsized effect on how efficiently — and how safely — your site gets crawled:
- It protects your crawl budget. Blocking low-value paths like internal search results or filtered URL parameters frees up crawler attention for pages that actually matter to rankings.
- It keeps admin and staging areas out of the way. Login pages, cart pages, and staging subfolders rarely need to be crawled, and excluding them keeps crawl activity focused.
- One mistake can block a whole site. A single stray `Disallow: /` left over from a staging environment can quietly stop Google from crawling a live production site.
- It's the first file crawlers check. Search engines fetch robots.txt before crawling anything else on a domain, so its rules apply to every crawl that follows. Crawlers may cache the file (Google generally refreshes its copy within about 24 hours), so an edit can take a little while to be picked up.
When a freshly launched site fails to appear in search, a common culprit is a `Disallow: /` that never got removed after launch, not a penalty and not an algorithm update.
Step-by-step: creating and adding a robots.txt file
- Decide what actually needs blocking. Most sites need very few rules — think admin paths, internal search, or duplicate parameter URLs, not entire sections.
- List the user-agents you want to address. Use `User-agent: *` for a rule that applies to every crawler, or name a specific one if you need different treatment for it.
- Write your Disallow and Allow rules. Each rule is a path, not a full URL: `Disallow: /admin/` blocks that folder for the user-agent listed directly above it.
- Add a Sitemap line. Point crawlers to your sitemap's full URL, such as `Sitemap: https://yoursite.com/sitemap.xml`, so they can discover your pages efficiently.
- Save the file as robots.txt. It must be plain text, named exactly `robots.txt` in all lowercase.
- Upload it to the root of your domain. It has to be reachable at `https://yoursite.com/robots.txt` directly; a subfolder location won't be recognized.
- Test it before trusting it. Use Google Search Console's robots.txt report to confirm the file was fetched and parsed without errors, and the URL Inspection tool to check that specific URLs are allowed or blocked exactly as intended.
Common mistakes with robots.txt
1. Blocking the entire site by accident
A single leftover line like Disallow: / under a wildcard user-agent tells every
crawler to skip the whole domain — a common survivor from a staging environment that never
got cleaned up at launch.
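A launch checklist can catch this automatically. The sketch below is a rough check, not a complete audit: `blocks_whole_site` is a hypothetical helper that uses Python's standard `urllib.robotparser` to ask whether a given crawler may fetch the site root.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, agent: str = "*") -> bool:
    """Return True if `agent` is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


staging_leftover = "User-agent: *\nDisallow: /\n"
narrow_rules = "User-agent: *\nDisallow: /admin/\n"

print(blocks_whole_site(staging_leftover))  # True: everything is off-limits
print(blocks_whole_site(narrow_rules))      # False: only /admin/ is blocked
```

Run against the production file as part of a deploy, a `True` result is a signal to stop and look at the file before crawlers do.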
2. Using robots.txt to try to remove pages from search
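Disallowing a URL stops crawling, but it doesn't remove anything from the index. A page that's already indexed, or one that other sites link to, can still show up in results without a description. The correct tool for removal is a noindex directive, not a Disallow rule, and the page has to stay crawlable so that crawlers can actually read the noindex.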
3. Blocking CSS and JavaScript files
Search engines render pages the way a browser does, which means they need access to stylesheets and scripts. Blocking them can cause a page to be evaluated as broken or poorly built even when it looks fine to a visitor.
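One way to spot this mistake is to list the stylesheets and scripts a page references and test each against the robots.txt rules. The following is a simplified sketch: `blocked_assets` and the sample page are hypothetical, and only assets on the page's own host are checked, since files on other hosts are governed by those hosts' robots.txt files.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class AssetCollector(HTMLParser):
    """Collect script sources and stylesheet links from a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("src"):
            self.urls.append(attributes["src"])
        elif tag == "link" and "stylesheet" in (attributes.get("rel") or "").lower():
            if attributes.get("href"):
                self.urls.append(attributes["href"])


def blocked_assets(robots_txt, page_url, html, agent="Googlebot"):
    """Return same-host CSS/JS URLs that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    collector = AssetCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    absolute = [urljoin(page_url, url) for url in collector.urls]
    return [
        url for url in absolute
        if urlparse(url).netloc == host and not parser.can_fetch(agent, url)
    ]


# Hypothetical rules and page markup, for illustration only.
ROBOTS_TXT = "User-agent: *\nDisallow: /assets/\n"
PAGE = '<link rel="stylesheet" href="/assets/site.css"><script src="/js/app.js"></script>'

print(blocked_assets(ROBOTS_TXT, "https://yoursite.com/pricing", PAGE))
# ['https://yoursite.com/assets/site.css']
```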
4. Assuming it's a privacy or security control
The file itself is public and readable by anyone, so listing a sensitive path in it can actually draw attention to that path rather than hiding it — real access control needs authentication, not a crawler directive.
Real-world examples
How different types of sites typically shape their robots.txt rules:
Notice the pattern: every example blocks a narrow, specific path. None of them reach for broad, sweeping rules — that restraint is exactly what keeps robots.txt safe to use.
robots.txt vs. other crawl controls
robots.txt is one of several tools for managing how search engines treat your pages. Here's how it compares to the others.
| Method | Controls | Stops indexing? | Best for |
|---|---|---|---|
| robots.txt | Crawling access | No, not reliably | Managing crawl budget, keeping bots out of low-value paths |
| Meta robots noindex | Indexing of a page | Yes | Keeping a specific, crawlable page out of search results |
| X-Robots-Tag header | Indexing, non-HTML files | Yes | Non-HTML files like PDFs or images that can't hold a meta tag |
| XML sitemap | Discovery, not blocking | No | Helping crawlers find and prioritize pages you want indexed |
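Because robots.txt and noindex act at different stages, they can work against each other: a noindex on a page that crawlers are not allowed to fetch is never read. The sketch below classifies a page by combining both signals. It is a simplified illustration: `index_control_status` and its labels are hypothetical, and it ignores crawler-specific forms such as `googlebot: noindex` in the header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(headers, html):
    """True if an X-Robots-Tag header or a meta robots tag says noindex."""
    header_directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives.update(d.strip().lower() for d in value.split(","))
    meta = MetaRobotsParser()
    meta.feed(html)
    return "noindex" in header_directives | meta.directives


def index_control_status(robots_txt, url, headers, html, agent="Googlebot"):
    """Label a page by how its crawl and index controls interact."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    crawlable = robots.can_fetch(agent, url)
    noindex = has_noindex(headers, html)
    if noindex and not crawlable:
        return "conflict"       # the noindex can never be read
    if noindex:
        return "noindex"        # crawlable, and asks to stay out of results
    if not crawlable:
        return "crawl-blocked"  # not crawled, but may still be indexed via links
    return "indexable"


# Hypothetical rules and responses, for illustration only.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"

print(index_control_status(
    ROBOTS_TXT, "https://yoursite.com/private/report.html",
    {}, '<meta name="robots" content="noindex">'))        # conflict
print(index_control_status(
    ROBOTS_TXT, "https://yoursite.com/files/guide.pdf",
    {"X-Robots-Tag": "noindex, nofollow"}, ""))           # noindex
```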
Generate your robots.txt right now — free
The Rebrixe robots.txt Generator builds a clean, correctly formatted file with the rules you choose — no account, no watermark, and nothing to write by hand.