Methods for Blocking Web Crawlers on an Nginx Server

Posted by 唐伯虎 on 2021-8-11 13:51:33

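Almost every website is visited by many crawlers that do not belong to a search engine. Most of them are content scrapers or scripts written by beginners. Unlike search-engine crawlers, they have no rate limiting, so they often consume a lot of server resources and waste bandwidth.
Nginx can easily filter requests by User-Agent. All we need is a simple regular expression in the location that serves as the URL entry point, which rejects requests from unwanted crawlers: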

location / {
    if ($http_user_agent ~* "python|curl|java|wget|httpclient|okhttp") {
        return 503;
    }
    # other normal configuration
    ...
}
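Note: $http_user_agent is an Nginx variable that can be referenced directly inside a location block. The ~* operator performs a case-insensitive regular-expression match, and the pattern is unanchored, so it matches anywhere in the User-Agent header. According to the author, matching on python alone filters out about 80% of Python crawlers.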
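The same User-Agent filter can also be written with a map block, which keeps the pattern in one place so that several location blocks can share it. The following is only a minimal sketch and not part of the original post: the variable name $is_unwanted_bot is a hypothetical choice, and the server block simply reuses the placeholder host www.xxx.com. The map directive has to sit in the http context, outside any server block.

```nginx
# Goes in the http {} context.
# $is_unwanted_bot is a hypothetical variable name chosen for this sketch.
map $http_user_agent $is_unwanted_bot {
    default 0;
    # "~*" marks a case-insensitive regular expression, as in the if condition
    "~*(python|curl|java|wget|httpclient|okhttp)" 1;
}

server {
    listen 80;
    server_name www.xxx.com;

    location / {
        # An if condition is false when the variable is empty or "0"
        if ($is_unwanted_bot) {
            return 503;
        }
        # other normal configuration
    }
}
```

A map variable is evaluated only when it is actually used, so requests that never reach the if do not pay for the regular-expression match.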
Blocking specific web crawlers in Nginx

server {
    listen 80;
    server_name www.xxx.com;
    #charset koi8-r;
    #access_log logs/host.access.log main;
    #location / {
    #    root html;
    #    index index.html index.htm;
    #}
    # return 403 to the listed crawlers for every request to this server
    if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
        return 403;
    }
    location ~ ^/(.*)$ {
        proxy_pass http://localhost:8080;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        client_max_body_size 10m;
        client_body_buffer_size 128k;
        proxy_connect_timeout 90;
        proxy_send_timeout 90;
        proxy_read_timeout 90;
        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
    }
    #error_page 404 /404.html;
    # redirect server error pages to the static page /50x.html
    #
    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root html;
    }
    # proxy the PHP scripts to Apache listening on 127.0.0.1:80
    #
    #location ~ \.php$ {
    #    proxy_pass http://127.0.0.1;
    #}
    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
    #
    #location ~ \.php$ {
    #    root html;
    #    fastcgi_pass 127.0.0.1:9000;
    #    fastcgi_index index.php;
    #    fastcgi_param SCRIPT_FILENAME /scripts$fastcgi_script_name;
    #    include fastcgi_params;
    #}
    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    #
    #location ~ /\.ht {
    #    deny all;
    #}
}
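You can test it with curl. A request that sends one of the blocked User-Agent strings should receive a 403 response: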

curl -I -A "qihoobot" www.xxx.com
Original link: http://www.codetc.com/article-353-1.html

Source: 服务器之家 http://www.zzvips.com/article/45597.html
