What Is robots.txt?

You've probably typed a URL like yoursite.com/robots.txt into a browser out of curiosity and landed on a plain text file with a handful of cryptic lines — "User-agent," "Disallow," "Sitemap." It looks minor. It isn't. That small file is one of the first things a search engine crawler reads before it touches anything else on your site.

Get it wrong and you can accidentally hide your entire site from Google. Leave it out entirely and crawlers just assume everything is fair game. Understanding what robots.txt actually does — and what it doesn't — is one of the fastest ways to avoid a silent, easy-to-miss SEO mistake.

Quick Answer

robots.txt is a plain text file placed at the root of a domain that tells search engine crawlers which parts of a site they're allowed to request. It uses simple rules like "Disallow" and "Allow" per user-agent, and can point crawlers to a sitemap. It doesn't remove pages from search results or restrict who can view a page — it only manages crawler access.

What is robots.txt, exactly?

robots.txt follows the Robots Exclusion Protocol, a long-standing convention that gives site owners a standardized way to talk to crawlers before those crawlers request anything else.

In short: robots.txt manages traffic at the front door. It decides who gets waved through to crawl a page — it says nothing about whether that page can later show up in search results.

Why robots.txt matters for SEO

A file this small has an outsized effect on how efficiently, and how safely, your site gets crawled.

📊 Quick stat: Many accidental "why did my traffic drop to zero" cases trace back to one line, a leftover Disallow: / that never got removed after launch, rather than to a penalty or an algorithm update.

Step-by-step: creating and adding a robots.txt file

  1. Decide what actually needs blocking. Most sites need very few rules — think admin paths, internal search, or duplicate parameter URLs, not entire sections.
  2. List the user-agents you want to address. Use User-agent: * for a rule that applies to every crawler, or name a specific one if you need different treatment for it.
  3. Write your Disallow and Allow rules. Each rule is a path, not a full URL — Disallow: /admin/ blocks that folder for the user-agent listed directly above it.
  4. Add a Sitemap line. Point crawlers to your sitemap's full URL, such as Sitemap: https://yoursite.com/sitemap.xml, so they can discover your pages efficiently.
  5. Save the file as robots.txt. It must be plain text and named exactly robots.txt, all lowercase.
  6. Upload it to the root of your domain. It has to be reachable at https://yoursite.com/robots.txt directly — a subfolder location won't be recognized.
  7. Test it before trusting it. Use Google Search Console's robots.txt report to confirm the file was fetched and parsed without errors, and the URL Inspection tool to confirm specific URLs are allowed or blocked exactly as intended.
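
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive generator: the helper names build_robots_txt and parse_robots_txt, the /admin/ and /search/ paths, and the ExampleBot user-agent are all assumptions made up for the example. The check uses urllib.robotparser from the Python standard library, which is one parser among many and may not match every search engine's behavior in edge cases.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(groups, sitemap_url=None):
    """Render {user_agent: [(directive, path), ...]} as robots.txt text."""
    lines = []
    for user_agent, rules in groups.items():
        lines.append(f"User-agent: {user_agent}")
        for directive, path in rules:
            lines.append(f"{directive}: {path}")
        lines.append("")  # a blank line separates one group from the next
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip("\n") + "\n"


def parse_robots_txt(text):
    """Parse robots.txt text the way a well-behaved crawler might."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hypothetical rules: a few narrow paths for everyone, one bot shut out fully.
robots_txt = build_robots_txt(
    {
        "*": [("Disallow", "/admin/"), ("Disallow", "/search/")],
        "ExampleBot": [("Disallow", "/")],
    },
    sitemap_url="https://yoursite.com/sitemap.xml",
)
parser = parse_robots_txt(robots_txt)

print(robots_txt)
print(parser.can_fetch("*", "https://yoursite.com/admin/login"))       # False
print(parser.can_fetch("*", "https://yoursite.com/blog/hello"))        # True
print(parser.can_fetch("ExampleBot", "https://yoursite.com/blog/hello"))  # False
print(parser.site_maps())  # ['https://yoursite.com/sitemap.xml']
```

The printed text is what would be saved as robots.txt and uploaded to the root of the domain. Running the same checks against the live file after upload is still worth doing, since a local parse cannot catch a wrong upload location.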

Try the Rebrixe robots.txt Generator (free): build a correct, ready-to-upload robots.txt file in seconds. No coding required.

Common mistakes with robots.txt

1. Blocking the entire site by accident

A single leftover line like Disallow: / under a wildcard user-agent tells every crawler to skip the whole domain — a common survivor from a staging environment that never got cleaned up at launch.
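
A launch checklist can catch this automatically. The sketch below assumes a hypothetical helper named homepage_blocked and uses Python's standard urllib.robotparser. It only checks whether the root URL is blocked for a given user-agent, which is the usual symptom of a leftover staging rule rather than a complete audit.

```python
from urllib.robotparser import RobotFileParser


def homepage_blocked(robots_txt, user_agent="*"):
    """Return True if the rules stop this user-agent from fetching "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


staging_rules = "User-agent: *\nDisallow: /\n"
production_rules = "User-agent: *\nDisallow: /admin/\n"
empty_disallow = "User-agent: *\nDisallow:\n"  # an empty value blocks nothing

print(homepage_blocked(staging_rules))     # True
print(homepage_blocked(production_rules))  # False
print(homepage_blocked(empty_disallow))    # False
```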

2. Using robots.txt to try to remove pages from search

Disallowing a URL stops crawling, but a page that's already indexed, or one that other sites link to, can still show up in results without a description. The correct tool for removal is a noindex directive, not a Disallow rule, and the page has to stay crawlable so the crawler can actually read that directive.

3. Blocking CSS and JavaScript files

Search engines render pages the way a browser does, which means they need access to stylesheets and scripts. Blocking them can cause a page to be evaluated as broken or poorly built even when it looks fine to a visitor.
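
One way to spot this is to run a page's stylesheet and script URLs through the same rules a crawler would apply. The following is a rough sketch: the helper name blocked_assets, the /assets/ rule, and the asset URLs are assumptions for illustration, and the standard-library parser used here is only an approximation of how any particular search engine evaluates rules.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt, asset_urls, user_agent="Googlebot"):
    """Return the asset URLs that the rules forbid this user-agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rule that hides a whole assets folder from every crawler.
robots_txt = "User-agent: *\nDisallow: /assets/\n"
assets = [
    "https://yoursite.com/assets/site.css",
    "https://yoursite.com/assets/app.js",
    "https://yoursite.com/images/logo.png",
]

for url in blocked_assets(robots_txt, assets):
    print("blocked:", url)
```

An empty result means every listed file can be fetched for rendering under these rules.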

4. Assuming it's a privacy or security control

The file itself is public and readable by anyone, so listing a sensitive path in it can actually draw attention to that path rather than hiding it — real access control needs authentication, not a crawler directive.
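
To see how little effort that takes, here is a small sketch of what anyone can do with a copy of the file. The function name advertised_paths and the /internal-reports/ path are invented for the example.

```python
def advertised_paths(robots_txt):
    """List every non-empty Disallow path found in robots.txt text."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


sample = (
    "User-agent: *\n"
    "Disallow: /admin/\n"
    "Disallow: /internal-reports/  # meant to be private\n"
)
print(advertised_paths(sample))  # ['/admin/', '/internal-reports/']
```

Every path in that output is a hint about where to look, which is why sensitive areas belong behind authentication instead.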

💡 Pro tip: After any site migration or launch, re-check robots.txt first. It's one of the most common places a temporary "block everything" rule from staging quietly makes it into production.

Real-world examples

How different types of sites typically shape their robots.txt rules:

Blog: block internal search
Disallow: /?s=
Keeps thin, duplicate internal search-result pages out of the crawl queue while leaving articles fully open.

E-commerce store: block cart and checkout
Disallow: /cart/
Excludes session-specific, non-indexable pages while keeping every product and category page crawlable.

SaaS product: block the app, allow the marketing site
Disallow: /app/
Separates the logged-in product experience from the public marketing pages that actually need to rank.

Small business site: minimal rules, full sitemap
Allow: /
Leaves the whole site open to crawling and simply points to the sitemap for efficient discovery.

Notice the pattern: every blocking rule targets a narrow, specific path. None of the examples reach for broad, sweeping rules, and that restraint is exactly what keeps robots.txt safe to use.
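
Narrowness depends on small details, because a rule matches every URL path that starts with it. The sketch below compares the /app/ rule from the SaaS example with a looser /app variant. The helper name is_allowed and the sample paths are assumptions, and the check runs on Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rule, path, user_agent="*"):
    """Check one path against a single rule placed under User-agent: *."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return parser.can_fetch(user_agent, path)


for rule in ("Disallow: /app/", "Disallow: /app"):
    for path in ("/app/dashboard", "/apple-picking-guide", "/pricing"):
        print(f"{rule!r:20} {path:24} allowed={is_allowed(rule, path)}")
```

With the trailing slash, only URLs inside the /app/ folder are blocked. Without it, the hypothetical /apple-picking-guide page is blocked too, which is the kind of accidental reach the examples avoid.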

robots.txt vs. other crawl controls

robots.txt is one of several tools for managing how search engines treat your pages. Here's how it compares to the others.

Method | Controls | Stops indexing? | Best for
--- | --- | --- | ---
robots.txt | Crawling access | No, not reliably | Managing crawl budget, keeping bots out of low-value paths
Meta robots noindex | Indexing of a page | Yes | Keeping a specific, crawlable page out of search results
X-Robots-Tag header | Indexing, including non-HTML files | Yes | Non-HTML files like PDFs or images that can't hold a meta tag
XML sitemap | Discovery, not blocking | No | Helping crawlers find and prioritize pages you want indexed
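
For the two noindex rows, the difference is where the directive lives: inside the HTML, or in the HTTP response. The following sketch shows both under stated assumptions. The helper name extra_headers and the choice to noindex every .pdf are hypothetical, and how the returned headers get attached to a response depends on your server or framework.

```python
from pathlib import PurePosixPath

# Meta robots noindex: goes inside the <head> of an HTML page.
NOINDEX_META = '<meta name="robots" content="noindex">'


def extra_headers(url_path, noindex_suffixes=(".pdf",)):
    """Return response headers to add for a path; noindex chosen file types."""
    suffix = PurePosixPath(url_path).suffix.lower()
    if suffix in noindex_suffixes:
        # X-Robots-Tag carries the directive for files with no HTML head.
        return {"X-Robots-Tag": "noindex"}
    return {}


print(extra_headers("/downloads/price-list.pdf"))  # {'X-Robots-Tag': 'noindex'}
print(extra_headers("/blog/hello"))                # {}
```

Either form only works if the URL is not disallowed in robots.txt, since a crawler has to fetch the response to see the directive.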

Generate your robots.txt right now — free

The Rebrixe robots.txt Generator builds a clean, correctly formatted file with the rules you choose — no account, no watermark, and nothing to write by hand.


Frequently asked questions

Does robots.txt remove a page from search results?
No. robots.txt asks crawlers not to visit a URL, but a page blocked this way can still appear in search results without a snippet if other sites link to it. A noindex meta tag or header is the correct way to keep a page out of search results entirely, and it requires the page to be crawlable so the tag can be read.

Where does the robots.txt file need to go?
It must sit at the root of the host, such as https://example.com/robots.txt. A copy placed in a subfolder is ignored, and a file on one subdomain does not apply to any other host; each subdomain needs its own robots.txt if it needs rules at all.

Does robots.txt keep a page private or secure?
No. The file is publicly readable by anyone, and it only carries weight with crawlers that choose to respect it. It is not a security or privacy tool, and sensitive URLs should never be listed there as a way to hide them.

Do all bots obey robots.txt?
Major search engines like Google and Bing follow the standard, but obedience is voluntary by design, and some bots ignore the file entirely. Blocking abusive or unwanted bots reliably usually requires server-level rules in addition to robots.txt.

What happens if a site has no robots.txt?
Crawlers request the URL, receive a 404, and treat the missing file as an open invitation: the entire site is assumed to be crawlable. That's a valid, safe state for many small sites and doesn't cause an error on its own.

Can I test a robots.txt file?
Yes. Google Search Console's robots.txt report shows whether Google fetched and parsed the file successfully, its URL Inspection tool reports whether a given URL is blocked by robots.txt, and third-party testers let you check specific URLs against the current rules before you rely on them in production.

Should I block CSS and JavaScript files?
Generally no. Search engines need to fetch these files to render the page the way a visitor sees it, and blocking them can lead to a page being misjudged as broken or lower quality than it actually is.
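
The placement rule from the answers above is mechanical enough to sketch: for any page URL, the governing robots.txt is the one at the root of that same scheme and host. The function name robots_txt_url is an assumption for this example; the URL handling comes from Python's standard urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=1"))
# https://example.com/robots.txt
print(robots_txt_url("https://shop.example.com/cart/"))
# https://shop.example.com/robots.txt
```

The second call shows why a subdomain needs its own file: its pages never resolve to the robots.txt on the main domain.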
