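Below is my code for fetching the InfoQ page after its redesign. The result is not the HTML I see in the browser: it is very short and contains garbled characters. What is going wrong?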
---
from bs4 import BeautifulSoup
import requests
header_i = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Cookie": "_ga=GA1.2.236308595.1542204557; _itt=1; GCID=7d9c08f-e052716-92486ca-1ef06ad-cd; GCESS=BAQEAC8NAAMEbUWUXAIEbUWUXAEEmV4PAAoEAAAAAAYE1hDl2gcEjYGeiwkBAQgBAwsCBAAMAQEFBAAAAAA-; Hm_lvt_094d2af1d9a57fd9249b3fa259428445=1553224053; Hm_lpvt_094d2af1d9a57fd9249b3fa259428445=1553227368; SERVERID=1fa1f330efedec1559b3abbcb6e30f50|1553227540|1553224054",
"DNT": "1",
"Host": "www.infoq.cn",
"Pragma": "no-cache",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
url = 'https://www.infoq.cn'
response = requests.get(url, headers=header_i)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())
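Author's reply: Hello. The InfoQ page was updated after the video was recorded, so the crawler code needs to be adjusted according to the specific error you get.
Garbled output is usually caused by the HTTP request header "Accept-Encoding": "gzip, deflate, br". Consider removing gzip from it and trying again.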
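
One way to try this suggestion is sketched below. It goes a step further than removing gzip: it drops the Accept-Encoding header entirely so that requests advertises only the encodings it can decode in the current environment. requests decodes gzip and deflate on its own, but decodes br only when a Brotli package is installed, so br is a plausible culprit too. That diagnosis is an assumption, not something confirmed for this page. The helper names `without_accept_encoding` and `fetch_html` are hypothetical and were made up for this illustration.

```python
def without_accept_encoding(headers):
    """Return a copy of headers with any Accept-Encoding entry removed.

    The comparison is case-insensitive because HTTP header names are.
    The original dict is left unchanged.
    """
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


def fetch_html(url, headers, timeout=10):
    """Fetch url and return the decoded body as text.

    Accept-Encoding is stripped from the caller's headers, so requests
    falls back to its own default value for that header.
    """
    # Third-party import kept local so the helper above has no dependencies.
    import requests

    response = requests.get(
        url, headers=without_accept_encoding(headers), timeout=timeout
    )
    response.raise_for_status()
    # If the server declared no charset, requests may assume ISO-8859-1;
    # in that case use the charset detected from the body instead.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.text
```

With the question's code, this would be called as `fetch_html(url, header_i)` and the returned text passed to `BeautifulSoup(..., 'lxml')` as before. This only addresses the garbled characters; it does not by itself explain why the response is shorter than what the browser shows.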