Definition
Robots.txt is a plain text file located at the root of a domain (example.com/robots.txt) that communicates directives to search engine crawlers. It implements the Robots Exclusion Protocol (REP), allowing or disallowing access to parts of the site via Allow and Disallow directives. The file is essential for managing crawl budget, preventing crawlers from exploring unnecessary pages (admin pages, faceted filters, duplicates), and keeping sensitive areas out of crawl paths. Important: robots.txt blocks crawling, not necessarily indexing. A page blocked by robots.txt can still appear in search results if other pages link to it. To prevent indexing, a noindex meta tag is the appropriate tool.
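A minimal robots.txt combining these directives might look like this (the paths and sitemap URL are illustrative, not prescriptive):

```text
User-agent: *
Disallow: /admin/
Disallow: /search
Allow: /blog/

Sitemap: https://www.example.com/sitemap.xml
```

Directives apply to the user-agent group above them; `*` targets all crawlers that do not have a more specific group of their own.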
Key Points
- Robots.txt blocks crawling but not indexing: use noindex to prevent a page from appearing in results
- It must be placed at the domain root and be publicly accessible
- An error in robots.txt can block the entire site: always test before deployment
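The "test before deployment" point can be checked locally with Python's standard-library parser before the file ever goes live (the rules and URLs below are illustrative):

```python
import urllib.robotparser

# Parse a candidate robots.txt body directly, without fetching it over the network.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

# Blocked path: a compliant crawler must not fetch it.
print(rp.can_fetch("Googlebot", "https://example.com/wp-admin/edit.php"))  # False
# Public path: crawling is allowed.
print(rp.can_fetch("Googlebot", "https://example.com/blog/"))  # True
```

Note that Python's parser follows the original first-match REP semantics, while Google uses longest-match rules, so results can differ for files that mix overlapping Allow and Disallow lines.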
Practical Examples
Blocking the admin area
A WordPress site adds 'Disallow: /wp-admin/' in its robots.txt to prevent Googlebot from crawling back-office pages, saving crawl budget for public pages.
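In practice, WordPress's own default (virtual) robots.txt pairs this Disallow with an Allow exception, because front-end features rely on admin-ajax.php:

```text
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

Blocking /wp-admin/ wholesale without this exception can prevent crawlers from fetching AJAX responses that some themes and plugins expose publicly.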
Sitemap reference
By adding 'Sitemap: https://www.mysite.com/sitemap.xml' to the robots.txt, a webmaster makes the sitemap discoverable by any crawler that reads the file, without submitting it to each search engine individually.
Frequently Asked Questions
How do you create a robots.txt file?
Create a text file named 'robots.txt' at the root of your site. The basic syntax uses 'User-agent:' to target a specific bot (or * for all), 'Disallow:' to block access to a path, and 'Allow:' to authorize access. Also add a 'Sitemap:' line with your sitemap URL. Always test via the Search Console robots.txt testing tool before going live.
Does robots.txt prevent a page from being indexed?
No, robots.txt prevents crawling but not necessarily indexing. If other sites link to a blocked page, Google may still index it with a generic title and description. To truly prevent indexing, use the 'noindex' meta tag or the X-Robots-Tag HTTP header.
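Both mechanisms look like this; note that the page must remain crawlable (not blocked in robots.txt) for either directive to be seen and honored:

```html
<!-- In the page's <head> section -->
<meta name="robots" content="noindex">
```

The equivalent HTTP response header, useful for non-HTML resources such as PDFs, is `X-Robots-Tag: noindex`.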
Go Further with LemmiLink
Discover how LemmiLink can help you put these SEO concepts into practice.
Last updated: 2026-02-07