Toolypet - Developer Tools Collection

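What robots.txt does and where it falls short

robots.txt is a plain-text file at the root of a website that tells search engine crawlers which pages they may crawl. It is part of the Robots Exclusion Protocol, which dates from 1994, so the standard has been in use since the early days of the web.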

https://example.com/robots.txt

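One point must be clear, though: robots.txt is advisory. Well-behaved search engine bots honor it, but malicious bots and scrapers can simply ignore it. Do not use robots.txt as a security tool. Sensitive data must be protected with authentication and access control.

How crawlers process robots.txt

When a search engine crawler visits a site, it checks robots.txt first. The processing order is:

Request /robots.txt
If the file does not exist → crawling is allowed for all pages
If the file exists → crawl according to its rules
Find the User-agent block that applies to the crawler
Apply the most specific rule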
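
The "find the applicable User-agent block" step can be sketched in Python as below. This is a simplified illustration, not how any particular crawler is implemented: the function names `parse_groups` and `select_group` are hypothetical, and matching a crawler to a group by case-insensitive substring is an assumption that real crawlers may not share exactly.

```python
def parse_groups(robots_txt):
    """Map each lowercased User-agent token to its list of (directive, path) rules."""
    groups = {}
    current_agents = []
    in_rules = False  # True once the current group has seen an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()  # directive names are matched case-insensitively
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((key, value))
    return groups


def select_group(groups, crawler_name):
    """Pick the most specific matching group, falling back to the '*' group."""
    name = crawler_name.lower()
    matches = [agent for agent in groups if agent != "*" and agent in name]
    if matches:
        return groups[max(matches, key=len)]  # longest token = most specific
    return groups.get("*", [])


sample = """
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /private/
"""
groups = parse_groups(sample)
print(select_group(groups, "Googlebot"))
print(select_group(groups, "Bingbot"))
```

Note that a crawler with its own group uses only that group: in this sample, the Googlebot group does not inherit the `/admin/` rule from the `*` group.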

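An important point about rule matching: when an Allow rule and a Disallow rule both match the same path, the more specific rule (the one with the longer path) wins. If they are equally specific, Allow wins.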
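
A minimal Python sketch of this precedence rule follows. It assumes rules are plain path prefixes (no wildcards) and uses a hypothetical helper name, `is_allowed`; it illustrates the "longest match wins, Allow wins ties" logic rather than a complete parser.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    rules: list of (directive, path_prefix) pairs, e.g. ("Disallow", "/admin/").
    The longest matching prefix wins; on equal length, Allow wins.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]
for path in ("/admin/settings", "/admin/public/page.html", "/blog/"):
    print(path, is_allowed(rules, path))
```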

Syntax in detail

Basic directives

User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 10

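Directive	Description	Example
User-agent	Crawler the rules apply to (`*` means all bots)	`User-agent: Googlebot`
Disallow	Path that must not be crawled	`Disallow: /private/`
Allow	Path that may be crawled (exception to a Disallow)	`Allow: /private/open/`
Sitemap	Sitemap location (absolute URL)	`Sitemap: https://...`
Crawl-delay	Delay between requests in seconds; only some bots support it (Googlebot, for example, ignores it)	`Crawl-delay: 10`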

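Wildcards and the end-of-path marker

robots.txt supports limited pattern matching:

* - any sequence of characters (zero or more)
$ - end of the path

# Block all .pdf files
Disallow: /*.pdf$

# Block URLs with query parameters
Disallow: /*?

# Block specific parameters
Disallow: /*?sort=
Disallow: /*?filter=

# Block URLs containing a session ID
Disallow: /*sessionid

# Block all PHP files in a specific directory
Disallow: /scripts/*.php$

Note: wildcards are for path matching only. Regular expressions are not supported.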
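
To see how `*` and `$` behave, one approach is to translate a robots.txt pattern into a regular expression for local testing. The sketch below is an illustration under that assumption; `pattern_to_regex` and `matches` are hypothetical helper names, and real crawlers implement their own matchers.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")  # a trailing $ pins the match to the end of the path
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece and join the pieces with ".*" for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, url_path):
    """Return True if the pattern matches from the start of url_path."""
    return pattern_to_regex(pattern).match(url_path) is not None


for pattern, path in [
    ("/*.pdf$", "/docs/file.pdf"),
    ("/*.pdf$", "/docs/file.pdf?download=1"),
    ("/*?", "/list?page=2"),
    ("/scripts/*.php$", "/scripts/tools/run.php"),
]:
    print(pattern, path, matches(pattern, path))
```

Without the `$`, a pattern behaves as a prefix match, which is why `/*.pdf$` stops matching once a query string follows the extension.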

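Major search engine and AI crawlers

It helps to know the main bots that visit websites:

Search engine bots

User-agent	Service	Description
Googlebot	Google Search	Web search
Googlebot-Image	Google Images	Image search
Bingbot	Bing	Microsoft search
Yeti	Naver	Korean search engine
Baiduspider	Baidu	Chinese search engine
DuckDuckBot	DuckDuckGo	Privacy-focused search

AI crawlers (sharp increase since 2024)

User-agent	Service	Description
GPTBot	OpenAI	ChatGPT training data
ChatGPT-User	OpenAI	ChatGPT browsing feature
CCBot	Common Crawl	Open dataset collection
anthropic-ai	Anthropic	Claude training data
Claude-Web	Anthropic	Claude web search
Google-Extended	Google	Gemini training data

To block AI training crawlers:

# Block AI training crawlers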
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
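
One way to sanity-check a file like this offline is Python's standard-library `urllib.robotparser`. The sketch below parses the rules from a string, so no network access is needed; `build_parser` is a hypothetical helper name. Note that `urllib.robotparser` does plain prefix matching and does not understand the `*` and `$` wildcards, so it is only suitable for simple rules like these.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for bot in ("GPTBot", "CCBot", "Google-Extended", "Googlebot"):
    # Bots without a matching group (and with no '*' group) are allowed.
    print(bot, parser.can_fetch(bot, "https://example.com/articles/"))
```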

Practical examples: settings by scenario

Allow everything (default)

User-agent: *
Allow: /

Block everything (development/test environments)

User-agent: *
Disallow: /

General website

User-agent: *
Allow: /

# Admin area
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /dashboard/

# Personal user areas
Disallow: /account/
Disallow: /profile/
Disallow: /my-*/

# Search results (avoid duplicate content)
Disallow: /search/
Disallow: /*?q=
Disallow: /*?s=

# Temporary/development files
Disallow: /tmp/
Disallow: /staging/
Disallow: /_*/

# Sitemap
Sitemap: https://example.com/sitemap.xml

E-commerce site

User-agent: *
Allow: /

# Purchase flow (no need to crawl)
Disallow: /cart/
Disallow: /checkout/
Disallow: /order/

# User accounts
Disallow: /my-account/
Disallow: /wishlist/

# Filtered/sorted product listings (avoid duplicates)
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*&

# Internal search
Disallow: /search/

# Allow price-comparison bots (optional)
User-agent: PriceSpider
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/product-sitemap.xml

Blog/media site

User-agent: *
Allow: /

# Tag/category pages (optional, if duplicate content is a concern)
Disallow: /tag/
Disallow: /category/page/

# Author archives
Disallow: /author/

# Attachment pages
Disallow: /attachment/

# Allow direct access to media files
Allow: /wp-content/uploads/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml

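Common mistakes and fixes

1. Missing slash

# Wrong - the path has no leading slash, so crawlers may ignore or misread the rule
Disallow: admin

# Correct - blocks only the /admin/ directory
Disallow: /admin/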
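
The trailing slash matters too, because a Disallow value without wildcards is matched as a path prefix. The short Python sketch below illustrates the difference between `/admin` and `/admin/`; `blocked_by` is a hypothetical helper that models plain prefix matching only.

```python
def blocked_by(disallow_path, url_path):
    """Model a wildcard-free Disallow value as a simple prefix match."""
    return url_path.startswith(disallow_path)


# Compare "Disallow: /admin" with "Disallow: /admin/" on a few paths.
for candidate in ("/admin/", "/admin/users", "/administrator/", "/admin.html"):
    print(candidate, blocked_by("/admin", candidate), blocked_by("/admin/", candidate))
```

With `/admin`, unrelated paths such as `/administrator/` and `/admin.html` are blocked as well; with `/admin/`, only the directory and its contents are.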

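2. Wrong letter case

Directive names are not case-sensitive, but path values are:

# Wrong - does not block /admin/ because paths are case-sensitive
User-agent: *
Disallow: /Admin/

# Correct - the case matches the real URL
User-agent: *
Disallow: /admin/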

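3. Rendering problems caused by blocking CSS/JS

Since the mid-2010s, Google has rendered pages to understand their content. Blocking CSS and JavaScript can prevent Google from understanding a page correctly:

# Wrong - blocks resources needed for rendering
Disallow: /css/
Disallow: /js/
Disallow: /*.css$
Disallow: /*.js$

# Correct - allow static resources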
Allow: /css/
Allow: /js/
Allow: /images/

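4. Accidentally blocking the entire site

# Very dangerous! Blocks the entire site
User-agent: *
Disallow: /

Deploying this configuration to production can make the site disappear from search results.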
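
A deployment pipeline can guard against this mistake with a small check. The Python sketch below is one hypothetical way to do it: `blocks_entire_site` is an invented helper, and the check is deliberately conservative (it flags any bare `Disallow: /` in a group that applies to `*`, without weighing Allow rules).

```python
def blocks_entire_site(robots_txt):
    """Return True if a group for '*' contains a bare 'Disallow: /'."""
    agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a new group begins
                agents, in_rules = [], False
            agents.append(value)
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow" and value == "/" and "*" in agents:
                return True
    return False


print(blocks_entire_site("User-agent: *\nDisallow: /\n"))
print(blocks_entire_site("User-agent: *\nDisallow: /admin/\n"))
```

Run against the production robots.txt before release, a check like this could fail the build when the staging configuration is about to ship.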

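robots.txt vs meta robots vs X-Robots-Tag

The three methods have different purposes and priorities:

Method	Location	Purpose	Controls crawling	Controls indexing
robots.txt	Site root	Crawl control	Yes	No
meta robots	HTML head	Index control	No	Yes
X-Robots-Tag	HTTP header	Index control for non-HTML resources	No	Yes

Important: a page blocked by robots.txt can still appear in search results. If other sites link to it, Google learns the URL and may list the bare URL without any content.

To remove a page from search results completely, add a noindex directive and make sure the page is not blocked in robots.txt, because a crawler that cannot fetch the page never sees the directive:

<!-- Add inside the page's head -->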
<meta name="robots" content="noindex, nofollow">
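
For non-HTML resources such as PDFs, the equivalent is the X-Robots-Tag response header. How to set it depends on the server or framework, so the Python sketch below only shows the decision logic; `extra_headers` and the extension list are hypothetical choices for illustration.

```python
# File types that should stay out of the index (an assumed, illustrative list).
NOINDEX_EXTENSIONS = (".pdf", ".doc", ".docx")


def extra_headers(url_path):
    """Return response headers to add for the given URL path."""
    if url_path.lower().endswith(NOINDEX_EXTENSIONS):
        return {"X-Robots-Tag": "noindex, nofollow"}
    return {}


print(extra_headers("/downloads/report.pdf"))
print(extra_headers("/blog/post.html"))
```

A web application could merge the returned dictionary into its response headers before sending a file.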

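Validating robots.txt

Google Search Console

Open Search Console
In the left menu, choose "Settings" → "robots.txt"
Use the "live test" feature to check whether a URL is blocked

Checking from the command line

# Show the current robots.txt
curl https://example.com/robots.txt

# Simulate a specific bot's view (requires Python)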
pip install robotexclusionrulesparser
python -c "
import robotexclusionrulesparser as rerp
rp = rerp.RobotExclusionRulesParser()
rp.fetch('https://example.com/robots.txt')
print(rp.is_allowed('Googlebot', '/admin/'))
"

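Crawl budget optimization

For large sites, crawl budget matters. It is the amount of crawling resources Google allocates to a site. Blocking unnecessary pages with robots.txt lets important pages be crawled more often.

Pages that waste crawl budget:

Filtered/sorted product listings
URLs containing session IDs
Infinite calendars
Internal search results
Print pages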
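
To estimate how much of a crawl goes to such pages, URLs from server logs can be classified with a small script. The Python sketch below is a rough illustration: the parameter names and path prefixes are assumptions that would need to be adapted to the actual site, and `wastes_crawl_budget` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed query parameters and path prefixes; adjust them to the real site.
WASTEFUL_PARAMS = {"sort", "filter", "sessionid", "print"}
WASTEFUL_PREFIXES = ("/search/", "/calendar/")


def wastes_crawl_budget(url):
    """Return True if the URL looks like a low-value page for crawling."""
    parts = urlsplit(url)
    if parts.path.startswith(WASTEFUL_PREFIXES):
        return True
    params = {name.lower() for name in parse_qs(parts.query, keep_blank_values=True)}
    return bool(params & WASTEFUL_PARAMS)


for url in (
    "https://example.com/products?sort=price",
    "https://example.com/products/42",
    "https://example.com/search/?q=shoes",
):
    print(url, wastes_crawl_budget(url))
```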

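Toolypet Robots.txt Generator

Even a complex robots.txt is easy to generate:

Select the path patterns to block
Set up blocking for specific bots
Enter the sitemap URL
Check the result in the preview
Download the finished file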
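
For readers who prefer a script, the same inputs can be turned into a file with a few lines of Python. This is a hypothetical sketch of the idea, not how the Toolypet generator is implemented; `generate_robots_txt` and its parameters are invented for illustration.

```python
def generate_robots_txt(disallow_paths, blocked_bots=(), sitemaps=()):
    """Build robots.txt content from blocked paths, blocked bots, and sitemap URLs."""
    lines = ["User-agent: *"]
    if disallow_paths:
        lines += [f"Disallow: {path}" for path in disallow_paths]
    else:
        lines.append("Allow: /")  # nothing to block: allow everything
    for bot in blocked_bots:
        # Each blocked bot gets its own group that disallows the whole site.
        lines += ["", f"User-agent: {bot}", "Disallow: /"]
    if sitemaps:
        lines.append("")
        lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines) + "\n"


print(generate_robots_txt(
    ["/admin/", "/search/"],
    blocked_bots=["GPTBot"],
    sitemaps=["https://example.com/sitemap.xml"],
))
```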

With a correct robots.txt, you can spend crawl budget efficiently and help important content show up well in search results.