Controlling How Google Crawls and Indexes Your Site

In Google SEO, knowing how to control the way Google crawls and indexes your site is an essential skill. Here are some notes on the subject.

There are generally three ways to control Google's crawlers:

robots.txt
robots meta tag
x-robots-tag

The rules and usage of each method are described below.

1. The robots.txt file:

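A robots.txt file is built from four fields: user-agent, disallow, allow, and sitemap. The user-agent lines divide the file into groups, disallow and allow are the members of a group, and sitemap is not bound to any group; it is shared by all of them. On small sites the robots.txt file is simple and rarely goes wrong. On larger sites it becomes more complicated, and the most common sources of problems are the priority between groups and the priority between the rules inside a group.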
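The structure described above can be sketched in Python. This is a minimal illustration, not Google's parser: the function name `parse_robots_txt`, the sample file contents, and the example.com URL are all made up for the example.

```python
def parse_robots_txt(text):
    """Split robots.txt text into groups plus a global sitemap list.

    Hypothetical helper for illustration only.
    """
    groups = []    # each group: {"agents": [...], "rules": [(field, path), ...]}
    sitemaps = []  # sitemap lines belong to no group
    current = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # A user-agent line after rules starts a new group;
            # consecutive user-agent lines share one group.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value.lower())
        elif field in ("allow", "disallow") and current is not None:
            current["rules"].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return groups, sitemaps


# Made-up sample file.
sample = """\
user-agent: googlebot
disallow: /private/
allow: /private/public.html

user-agent: *
disallow: /tmp/

sitemap: http://example.com/sitemap.xml
"""

groups, sitemaps = parse_robots_txt(sample)
print(len(groups))  # 2
print(sitemaps)     # ['http://example.com/sitemap.xml']
```

The sitemap line ends up outside both groups, which is the point of the data layout: groups carry their own rules, while sitemaps are file-wide.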

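First, the priority between groups. For any given crawler, only one user-agent group takes effect. The more specific the user-agent, the higher its priority. Once a group applies to a crawler, every other group is ignored for that crawler. Groups whose user-agent does not match the crawler are ignored. The order in which groups appear in the file has no effect on priority. Here is an example: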

user-agent: googlebot-news
(group 1)

user-agent: *
(group 2)

user-agent: googlebot
(group 3)

Crawler	Group that applies	Comment
Googlebot News	(group 1)	The more specific the user-agent, the higher its priority.
Googlebot (web)	(group 3)
Googlebot Images	(group 3)	There is no group specifically for Googlebot Images, so the more general group applies.
Googlebot News (when crawling images)	(group 1)	These images are crawled for and by Googlebot News, therefore only the Googlebot News group is followed.
Otherbot (web)	(group 2)
Otherbot (News)	(group 2)
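The selection logic in the table can be sketched roughly as follows. The function `select_group` and the idea of giving each crawler an ordered list of tokens (most specific first) are assumptions made for this illustration, not a description of Google's implementation.

```python
def select_group(crawler_tokens, groups):
    """Return the one group that applies to a crawler.

    crawler_tokens: user-agent tokens the crawler obeys, most specific
                    first, e.g. ["googlebot-image", "googlebot"].
    groups: mapping of user-agent value -> group label, in any order.
    """
    normalized = {agent.lower(): label for agent, label in groups.items()}
    for token in crawler_tokens:
        if token.lower() in normalized:
            return normalized[token.lower()]  # first hit wins, others ignored
    return normalized.get("*")  # fall back to the wildcard group, if any


# The three groups from the example above; file order does not matter.
groups = {
    "googlebot-news": "group 1",
    "*": "group 2",
    "googlebot": "group 3",
}

print(select_group(["googlebot-news", "googlebot"], groups))   # group 1
print(select_group(["googlebot"], groups))                     # group 3
print(select_group(["googlebot-image", "googlebot"], groups))  # group 3
print(select_group(["otherbot"], groups))                      # group 2
```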

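Next, the priority between the rules inside a group. A rule has the form disallow: path or allow: path. The path should start with "/", which means it is relative to the site root. If the leading "/" is missing, the crawler still assumes the path starts at the root. Paths are case-sensitive. Two wildcard characters are supported (this is not full regular-expression syntax):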

*: matches zero or more characters.
$: marks the end of the URL.

Here are some examples of path matching:

Path pattern	Matches	Does not match	Comment
/	any valid URL		Matches the root and any lower level URL
`/*`	equivalent to /	equivalent to /	Equivalent to “/” — the trailing wildcard is ignored.
`/fish`	/fish /fish.html /fish/salmon.html /fishheads /fishheads/yummy.html /fish.php?id=anything	/Fish.asp /catfish /?id=fish	Note the case-sensitive matching.
`/fish*`	/fish /fish.html /fish/salmon.html /fishheads /fishheads/yummy.html /fish.php?id=anything	/Fish.asp /catfish /?id=fish	Equivalent to “/fish” — the trailing wildcard is ignored.
`/fish/`	/fish/ /fish/?id=anything /fish/salmon.htm	/fish /fish.html /Fish/Salmon.asp	The trailing slash means this matches anything in this folder.
`fish/`	equivalent to /fish/	equivalent to /fish/	equivalent to /fish/
`/*.php`	/filename.php /folder/filename.php /folder/filename.php?parameters /folder/any.php.file.html /filename.php/	/ (even if it maps to /index.php) /windows.PHP
`/*.php$`	/filename.php /folder/filename.php	/filename.php?parameters /filename.php/ /filename.php5 /windows.PHP
`/fish*.php`	/fish.php /fishheads/catfish.php?parameters	/Fish.PHP
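The matching behavior in the table can be approximated by translating each pattern into a regular expression. This is a sketch under the rules stated above (prefix matching, `*` as a wildcard, a trailing `$` as an end anchor); the helper names `pattern_to_regex` and `matches` are invented for the example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    if not pattern.startswith("/"):
        pattern = "/" + pattern  # a missing leading slash is assumed
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then turn each "*" into ".*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the pattern matches the path, starting at the root."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/fish", "/fish.html"))                 # True
print(matches("/fish", "/catfish"))                   # False
print(matches("fish/", "/fish/salmon.htm"))           # True
print(matches("/*.php", "/folder/any.php.file.html")) # True
print(matches("/*.php$", "/filename.php?parameters")) # False
```

Because `re.match` only anchors at the start, an unanchored pattern behaves as a prefix match, which is why `/fish` and `/fish*` give the same results.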

Rule priority inside a group works much like group priority: the more specific rule wins. In general, a longer path outranks a shorter one. Here is an example:

URL	allow:	disallow:	Verdict
http://example.com/page	`/p`	`/`	allow
http://example.com/folder/page	`/folder/`	`/folder`	allow
http://example.com/page.htm	`/page`	`/*.htm`	undefined
http://example.com/	`/$`	`/`	allow
http://example.com/page.htm	`/$`	`/`	disallow
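The "longer path wins" rule can be sketched as below. Two things here are assumptions rather than statements from the table: ties are resolved in favor of allow, and a path that matches no rule is treated as allowed. The wildcard row marked "undefined" is left out of the demo on purpose, since the sketch's answer for it should not be relied on. The names `rule_matches` and `verdict` are hypothetical.

```python
import re


def rule_matches(pattern, path):
    """True if a robots.txt path pattern matches the start of path."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None


def verdict(path, allow, disallow):
    """Pick between one allow and one disallow pattern for a path."""
    allow_len = len(allow) if rule_matches(allow, path) else -1
    disallow_len = len(disallow) if rule_matches(disallow, path) else -1
    if allow_len < 0 and disallow_len < 0:
        return "allow"  # assumption: no matching rule means not blocked
    # Longer (more specific) pattern wins; assumption: ties go to allow.
    return "allow" if allow_len >= disallow_len else "disallow"


print(verdict("/page", "/p", "/"))                    # allow
print(verdict("/folder/page", "/folder/", "/folder")) # allow
print(verdict("/", "/$", "/"))                        # allow
print(verdict("/page.htm", "/$", "/"))                # disallow
```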

2. The robots meta tag

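The robots meta tag goes between <head> and </head> in an HTML file and looks like <meta name="robots" content="noindex" />. The name attribute identifies the crawler the tag is addressed to, and content lists the restrictions placed on that crawler. The following values are allowed in content: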

Directive	Meaning
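`all`	No restrictions. This is the default when the tag is absent.
`noindex`	Do not show this page in search results, and do not show a cached copy of it.
`nofollow`	Do not follow any of the links on this page.
`none`	Equivalent to `noindex, nofollow`.
`noarchive`	Do not show a cached copy in search results.
`nosnippet`	Do not show a snippet for this page in search results.
`noodp`	Do not use this page's title and description from DMOZ as its title and description in search results.
`notranslate`	Do not offer a translation of this page in search results.
`noimageindex`	Do not index the images on this page.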

A single page can contain more than one robots meta tag, for example:

<meta name="robots" content="nofollow">
<meta name="googlebot" content="noindex">

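These cases are generally simple and need little explanation: the restrictions add up, so in this example Googlebot would follow both noindex and nofollow.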
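As a rough illustration of how the tags for one crawler add up, the following sketch collects the directives addressed either to all crawlers (`robots`) or to a named crawler. It uses only Python's standard `html.parser`; the class `RobotsMetaCollector` and the function `directives_for` are invented for the example and do not model how Google processes pages internally.

```python
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collect directives from <meta> tags addressed to the given names."""

    def __init__(self, names):
        super().__init__()
        self.names = {name.lower() for name in names}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        if name not in self.names:
            return
        content = attributes.get("content") or ""
        for value in content.split(","):
            if value.strip():
                self.directives.add(value.strip().lower())


def directives_for(html, crawler):
    """Union of the generic 'robots' directives and the crawler's own."""
    collector = RobotsMetaCollector(["robots", crawler])
    collector.feed(html)
    collector.close()
    return collector.directives


page = """<head>
<meta name="robots" content="nofollow">
<meta name="googlebot" content="noindex">
</head>"""

print(sorted(directives_for(page, "googlebot")))  # ['nofollow', 'noindex']
print(sorted(directives_for(page, "otherbot")))   # ['nofollow']
```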

3. X-Robots-Tag

X-Robots-Tag is less common. It is a header sent in the HTTP response for a file, and it has the same effect as the robots meta tag. For example,

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)

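tells crawlers not to index the page. It can also target a specific crawler, as in "X-Robots-Tag: googlebot: nofollow". The biggest advantage of X-Robots-Tag is that the header can be generated automatically through the Apache server configuration. If you want to change the robots settings for every site on an Apache server, or for files that cannot carry a meta tag, such as pdf, gif, or jpg files, this is the method to use. For example, adding the following to an .htaccess or httpd.conf file applies noindex and noarchive to every PDF file served by Apache (the Header directive requires mod_headers to be enabled).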

<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</Files>
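For sites not served by Apache, the same idea can be approximated in application code. The sketch below is a small WSGI middleware in Python that adds the header to any response whose path ends in .pdf. It is an illustration only: `add_x_robots_tag` and `demo_app` are hypothetical names, and the pattern and header value simply mirror the Apache snippet above.

```python
import re

PDF_PATH = re.compile(r"\.pdf$")


def add_x_robots_tag(app, pattern=PDF_PATH, value="noindex, noarchive"):
    """Wrap a WSGI app so that matching paths get an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            if pattern.search(environ.get("PATH_INFO", "")):
                # Replace any existing value instead of sending it twice.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application that returns a fixed body for every path."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"demo"]


application = add_x_robots_tag(demo_app)
```

Any WSGI server can then serve `application` in place of the unwrapped app; requests for paths such as /docs/report.pdf receive the extra header, while other paths are left unchanged.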

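Finally, here is a list of common Google crawlers. Google certainly runs other crawlers as well, such as ones for fetching real-time content and ones for detecting spam, but those are generally unpublished or change frequently.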

Crawler	User-agents (+ less-specific alternatives) followed in robots.txt, robots meta tags & X-Robots-Tag	User-agent in HTTP(S) requests	Comments + Documentation URL
Googlebot (web)	`Googlebot`	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Alternate (rarely used): Googlebot/2.1 (+http://www.google.com/bot.html)	Generic Googlebot crawler for web-search http://www.google.com/bot.html
Googlebot News	`Googlebot-News` (`Googlebot`)	Googlebot-News	For Google News
Googlebot Images	`Googlebot-Image` (`Googlebot`)	Googlebot-Image/1.0	For Image Search
Googlebot Video	`Googlebot-Video` (`Googlebot`)	Googlebot-Video/1.0	For Video Search
Googlebot Mobile	`Googlebot-Mobile` (`Googlebot`)	[various mobile device types] (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)	For Google Mobile web-search results http://www.google.com/support/mobile/bin/answer.py?answer=37425
Google Mobile AdSense	`Mediapartners-Google` `Mediapartners` (`Googlebot`)	[various mobile device types] (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html)
Google AdSense	`Mediapartners-Google` `Mediapartners` (`Googlebot`)	Mediapartners-Google
Google AdsBot landing page quality check	`AdsBot-Google` (Given the special nature of this crawler, only directives for this user-agent are followed.)	AdsBot-Google (+http://www.google.com/adsbot.html)	Only visits landing pages used in AdWords campaigns. See http://www.google.com/adsbot.html
