robots.txt is a standard used by websites to communicate with web crawlers and other web robots. This file, located in the root directory of a website (for example, https://www.example.com/robots.txt), provides instructions about which parts of the site should not be crawled by these automated agents.
The concept of robots.txt emerged in the mid-1990s as the web grew and webmasters needed a way to control the behavior of search engine bots. Martijn Koster, a software developer, proposed the original Robots Exclusion Standard in 1994 to help manage server load by preventing unnecessary crawling.
The robots.txt file uses a small set of simple directives:

User-agent: Specifies which crawler the rules apply to. Use "*" for all crawlers.
Disallow: Indicates paths that should not be crawled.
Allow: (Optional) Specifies paths that may be crawled, overriding a Disallow rule.
Crawl-delay: Suggests a delay, in seconds, between successive requests to reduce server load.
Sitemap: Points to the site's XML sitemap.

For example:
User-agent: *
Disallow: /private/
Allow: /private/public/
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
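Programs that respect these rules typically parse robots.txt before fetching pages. The following sketch uses Python's standard urllib.robotparser module to check rules like the example above; the host name and the "ExampleBot" user-agent string are placeholders, and a real crawler would substitute its own.

from urllib import robotparser

# Placeholder host and user-agent; a real crawler would use its own values.
ROBOTS_URL = "https://www.example.com/robots.txt"
USER_AGENT = "ExampleBot"

# Download and parse the site's robots.txt file.
parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()

# Ask whether specific URLs may be fetched under the parsed rules.
for path in ("/private/reports.html", "/index.html"):
    url = "https://www.example.com" + path
    allowed = parser.can_fetch(USER_AGENT, url)
    print(f"{url}: {'allowed' if allowed else 'disallowed'}")

# Crawl-delay, if present for this user-agent, suggests how many seconds
# to wait between requests (returns None when the directive is absent).
print("Suggested crawl delay:", parser.crawl_delay(USER_AGENT))

Note that implementations differ in how they resolve overlapping Allow and Disallow rules: RFC 9309 calls for the longest (most specific) match to win, while some older parsers simply apply the first rule that matches.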
The robots.txt standard has seen updates over the years to address new needs, such as the Clean-param directive for handling URL parameters (Google Developers Blog).
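As a rough illustration (Clean-param is an extension supported by Yandex rather than part of the original standard, and the parameter name and path below are hypothetical), the directive lists URL parameters the crawler should ignore when deciding whether two URLs point to the same page:

User-agent: Yandex
Clean-param: ref /catalog/

Here, /catalog/page?ref=newsletter and /catalog/page would be treated as the same document, so the crawler avoids fetching duplicate content.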