Robots.txt Best Practices
Your robots.txt is a plain-text directive file served at yourdomain.com/robots.txt, and it is the first thing a well-behaved crawler requests. Getting it wrong can waste crawl budget, advertise the paths of private pages to anyone who reads the file, or accidentally block Googlebot from crawling your site entirely.
Where does it go?
Upload it to your site's root directory. It must be accessible at yourdomain.com/robots.txt; crawlers only request that one path on each host, so a copy placed in a subfolder is never read.
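To make the location rule concrete, here is a minimal Python sketch that derives the robots.txt address for any page URL. The helper name robots_txt_url is made up for this illustration; it only shows that the path is always /robots.txt at the root of the host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location that applies to the given page's host."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/2024/post.html?ref=home"))
# prints https://yourdomain.com/robots.txt
```

Because the host is part of the address, a subdomain such as shop.yourdomain.com needs its own file at its own root.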
Blocking AI Bots
As of 2024–2025, 15+ AI companies run training crawlers. Blocking them by user-agent tells compliant crawlers not to collect your content for LLM training without permission or compensation; crawlers that ignore robots.txt are unaffected.
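As a rough sketch of what such a block looks like, the Python snippet below embeds a sample robots.txt and checks it with the standard library's urllib.robotparser. GPTBot and CCBot are used as example user-agent tokens; the list is illustrative, not complete, and the can_crawl helper and the sample URL are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: shut out two AI training crawlers, allow everyone else.
AI_BLOCK_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether a compliant crawler with this user-agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://yourdomain.com/articles/1"
print(can_crawl(AI_BLOCK_RULES, "GPTBot", page))     # prints False
print(can_crawl(AI_BLOCK_RULES, "Googlebot", page))  # prints True
```

Each crawler needs its own User-agent group, so the list has to be updated as new crawlers appear.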
Crawl Delay
If your server is under load, Crawl-delay: 10 asks crawlers to wait 10 seconds between requests. Google ignores the directive, but Bing and some other crawlers honour it.
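The sketch below shows how a crawler that honours the directive might read it, using crawl_delay() from Python's urllib.robotparser (available since Python 3.6). The sample rules and the delay_for helper are assumptions for illustration. A result of None here only means the file sets no delay for that user-agent.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: ask Bing's crawler to pause 10 seconds between requests.
CRAWL_DELAY_RULES = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay in seconds for user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


print(delay_for(CRAWL_DELAY_RULES, "bingbot"))    # prints 10
print(delay_for(CRAWL_DELAY_RULES, "Googlebot"))  # prints None
```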
robots.txt ≠ Security
This file is publicly readable. Never rely on it to hide sensitive data — use server-level authentication for truly private pages. It controls cooperative crawlers only.
Multiple Sitemaps
Large sites split sitemaps by content type, for example sitemap-posts.xml and sitemap-products.xml. List each one on its own Sitemap: line using the full URL; Google will discover and queue each one.
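As an illustration, this Python sketch parses a sample robots.txt with two Sitemap: lines and lists what a crawler would discover. It relies on site_maps() from urllib.robotparser, which needs Python 3.8 or later; the URLs and the list_sitemaps helper are placeholders invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: one Sitemap line per file, each with an absolute URL.
SITEMAP_RULES = """\
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap-posts.xml
Sitemap: https://yourdomain.com/sitemap-products.xml
"""


def list_sitemaps(robots_txt: str) -> list:
    """Return every sitemap URL declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


for sitemap_url in list_sitemaps(SITEMAP_RULES):
    print(sitemap_url)
```

Sitemap lines are independent of any User-agent group, so they can sit anywhere in the file.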
Test Before Deploying
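Confirm your rules before going live. Google Search Console (linked above) retired its standalone robots.txt tester in 2023; its robots.txt report now shows how Google fetched and parsed your live file. A single typo, such as Disallow: / in place of Disallow: /admin/, can block your entire site.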
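A local pre-deploy check can catch the worst mistakes before the file is uploaded. The sketch below uses Python's urllib.robotparser to list which must-stay-crawlable pages a draft would block. The blocked_urls helper, the sample rules and the page list are assumptions for illustration. Python's parser applies the first matching rule, which can differ from Google's matching on files that mix Allow and Disallow, so treat this as a sanity check rather than a faithful Googlebot simulation.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, must_be_crawlable: list) -> list:
    """Return the URLs from must_be_crawlable that the draft rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in must_be_crawlable if not parser.can_fetch(user_agent, url)]


INTENDED_DRAFT = "User-agent: *\nDisallow: /admin/\n"
TYPO_DRAFT = "User-agent: *\nDisallow: /\n"  # "/admin/" truncated to "/"

KEY_PAGES = [
    "https://yourdomain.com/",
    "https://yourdomain.com/products/widget",
]

print(blocked_urls(INTENDED_DRAFT, "Googlebot", KEY_PAGES))  # prints []
print(len(blocked_urls(TYPO_DRAFT, "Googlebot", KEY_PAGES)))  # prints 2
```

Running a check like this in a deploy script, and failing the deploy when the list is not empty, keeps a truncated path from reaching production.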