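Hello teacher, when I ran the sample code from your video I got the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. After googling, I found this advice online: "Check whether the headers contain 'Accept-Encoding': 'gzip, deflate'; if they do, deleting it solves the problem." I deleted it and the code ran normally, but I'm not sure what the cause was. Could you help explain?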

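Author's reply: Checking a search engine first when you hit a problem is a good habit. If other people have run into the same problem, an answer usually already exists; if nobody else has ever run into it, most of the time your own line of thinking is heading in the wrong direction. Because of the video's time limit, I did not explain the HTTP protocol in detail there. First, the definition of Accept-Encoding: Accept-Encoding in the HTTP header is sent by the browser to the server to declare the encoding types the browser supports (source: Baidu Baike). In plain terms, it tells the server which compression formats the client can accept for the response body, and gzip and deflate are compression algorithms commonly supported by both clients and servers. So why does the UnicodeDecodeError above occur? Because Python is decoding the page as utf-8, and the bytes it received are not text (they have been compressed), so it raises an error. The byte 0x8b at position 1 is the second byte of the gzip signature 0x1f 0x8b, which indicates that the body is gzip data. It is the same as compressing a txt file with RAR and then opening the archive in Notepad: you see garbage. The conclusion: if the request sends 'Accept-Encoding': 'gzip, deflate', the server may return a compressed body, which must be decompressed before it is decoded as text. Whether decompression is needed depends on how the server encoded the response (check its Content-Encoding header). Deleting the header works because the server then normally returns the page uncompressed.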
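To make the reply above concrete, here is a minimal offline sketch that reproduces the error and then avoids it. It makes no network request: the "response body" is built locally with gzip.compress, and the helper name decode_body and the sample page are assumptions made for illustration, not part of the course code. The deflate branch assumes the server sends zlib-wrapped data, which is what the HTTP specification describes but not what every server does.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> str:
    """Decode an HTTP body to text, decompressing it first when needed.

    Hypothetical helper: content_encoding stands for the value of the
    response's Content-Encoding header.
    """
    encoding = content_encoding.strip().lower()
    # gzip data always starts with the two signature bytes 0x1f 0x8b.
    if encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        # Assumption: zlib-wrapped deflate, as the HTTP spec describes.
        raw = zlib.decompress(raw)
    return raw.decode("utf-8")


# Simulate a server that honoured 'Accept-Encoding': 'gzip, deflate'.
page = "<html><title>你好</title></html>"
compressed = gzip.compress(page.encode("utf-8"))

try:
    compressed.decode("utf-8")  # decoding compressed bytes as text fails
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

print(decode_body(compressed, "gzip"))  # decompress first, then decode
```

With a real response, the second argument would come from the Content-Encoding response header rather than being hard-coded.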

2019-03-03



7

Metamorphosis

Teacher, I couldn't find the code that goes with this course. Could you send me a link? Thanks.

Author's reply: Link: https://github.com/wilsonyin123/geekbangpython/tree/master/python_demo
The HTTP header values were captured from the browser's F12 developer tools.

    from bs4 import BeautifulSoup
    import requests

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "close",
        "Cookie": "_gauges_unique_hour=1; _gauges_unique_day=1; _gauges_unique_month=1; _gauges_unique_year=1; _gauges_unique=1",
        "Referer": "http://www.infoq.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"
    }

    url = 'http://www.infoq.com/cn/news'


    # Fetch the full page content
    def craw(url):
        response = requests.get(url, headers=headers)
        print(response.text)

    # craw(url)


    # Extract the news titles
    def craw2(url):
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        for title_href in soup.find_all('div', class_='news_type_block'):
            print([title.get('title') for title in title_href.find_all('a') if title.get('title')])

    # craw2(url)


    # Pagination: request the pages at offsets 15, 30 and 45
    for i in range(15, 46, 15):
        url = 'http://www.infoq.com/cn/news/' + str(i)
        # print(url)
        craw2(url)

2019-09-06



2

吴鹏飞

Teacher, a question: in the video you said the headers need to be utf-8 encoded, but in the code I only see the dict being encoded, not the headers?

Author's reply: Hello, I went back and checked. It was a slip of the tongue: the headers are not utf-8 encoded anywhere. Thanks for pointing it out.
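One possible reading of this exchange, sketched under assumptions: the comments do not show the code in question, so the sketch assumes it used the standard-library urllib.request and that the dict being encoded was POST form data. In that setting the request body must be bytes, so the dict is URL-encoded and then encoded as utf-8, while the headers stay an ordinary dict of strings. The URL, form fields, and header value below are hypothetical, and no request is actually sent.

```python
from urllib import parse, request

# Hypothetical form fields; the real dict from the video is not shown here.
form = {"name": "value"}

# urlencode() returns a str; urllib.request requires the body as bytes,
# so this is the step where utf-8 encoding is applied.
data = parse.urlencode(form).encode("utf-8")

# Headers are passed as plain strings and are not encoded by the caller.
headers = {"User-Agent": "Mozilla/5.0"}

# Placeholder URL; the Request object is only built, never opened.
req = request.Request("http://example.com/post", data=data, headers=headers)

print(type(req.data).__name__)  # the body is bytes
print(req.get_method())         # supplying data makes the request a POST
```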

2021-04-11





o0oi1i

Check-in 65

2020-02-28




