Geek Time (极客时间) - Learn easily, learn efficiently - Geekbang (极客邦)

鬼金阳

2019-02-02

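I recently wanted to crawl an ASPX site and found that crawling ASPX sites is fairly complicated, and the write-ups online are all quite vague. Could you point me to any more detailed tutorials on this?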

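Author's reply: If you want to crawl a whole site systematically, I recommend using a framework. The video covers how crawlers work and how to write one by hand; for this, take a look at the Scrapy framework. Here is a link to the Chinese documentation: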
https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
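
Part of what makes ASPX (ASP.NET WebForms) pages awkward to crawl is that paging links usually trigger a form postback, which expects hidden fields such as __VIEWSTATE and __EVENTVALIDATION to be sent back along with __EVENTTARGET. Below is a minimal sketch of collecting those fields with the standard library. The page fragment, the `pager$next` control name, and the helper names are made up for illustration and are not taken from any real site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Return the form data a WebForms postback for event_target would submit."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)
    data["__EVENTTARGET"] = event_target
    data["__EVENTARGUMENT"] = event_argument
    return data


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<form method="post" action="list.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgK" />
  <a href="javascript:__doPostBack('pager$next','')">Next</a>
</form>
"""

form_data = build_postback(SAMPLE_HTML, "pager$next")
print(form_data)
```

The resulting dictionary would then be sent as the body of a POST to the same page, for example with requests.post(url, data=form_data). Scrapy's FormRequest.from_response covers similar ground by pre-filling form fields from a response.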



 1
旭茂

2020-01-11

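The page has changed. The div now uses items__content, with a double underscore:
import threading

import requests
from bs4 import BeautifulSoup

# Assumed minimal headers; substitute the ones captured from your browser.
headers = {'User-Agent': 'Mozilla/5.0'}

def craw2(url):
    response = requests.get(url, headers=headers)
    # print(response.text)

    soup = BeautifulSoup(response.text, 'lxml')

    # Print the title attribute of every link inside each items__content div.
    for title_href in soup.find_all('div', class_='items__content'):
        for title in title_href.find_all('a'):
            if title.get('title'):
                print(title.get('title'))

def my_thread(current_url, page_number):
    craw2(current_url)
    print('Above is the data for page %s\n\n\n\n' % (page_number))

# Offsets 0, 15, 30, 45: one thread per listing page.
for i in range(0, 46, 15):
    if i == 0:
        url = 'http://www.infoq.com/news'
    else:
        # Later pages take the offset as a path segment, e.g. /news/15.
        url = 'http://www.infoq.com/news/' + str(i)
    t1 = threading.Thread(target=my_thread, args=(url, i))
    t1.start()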


Author's reply: You can practice on the English site, which has not changed much. For reference:
https://github.com/wilsonyin123/geekbangpython/tree/master/python_demo




程序员人生

2019-08-02

This page can no longer be crawled.

Author's reply: The site has been redesigned. Try another static page instead.




Lemon

2019-07-29

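The for ... in syntax there was written in a shortened form. Could you explain it in more detail? It was not covered in the earlier lessons.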

 1


🌟双子嘟🌟🙄...

2019-07-02

If a site's data is generated after the page opens, by JS calling an API, does that mean this approach will not work?

Author's reply: For dynamic pages, use Selenium + Chrome (or PhantomJS).
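
As an alternative to driving a browser with Selenium (a different technique from the one in the reply), it is often possible to find the API request in the browser's Network panel and fetch that URL directly, since the response is frequently JSON. A minimal sketch of the parsing side, assuming a hypothetical payload with a `data` list whose items carry a `title` key; a real endpoint will have its own structure.

```python
import json


def extract_titles(payload_text):
    """Pull non-empty article titles out of a JSON response body."""
    payload = json.loads(payload_text)
    return [item["title"] for item in payload.get("data", []) if item.get("title")]


# Hypothetical response body, invented for this example.
SAMPLE = '{"data": [{"title": "First article"}, {"title": ""}, {"title": "Second article"}]}'

print(extract_titles(SAMPLE))
```

With requests, the text passed to extract_titles would be response.text from a call to the API URL rather than to the HTML page.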




不麻烦

2019-04-23

Nothing is printed when I run it now. Has the site added anti-crawling measures? Newbie here, any help appreciated.

Author's reply: Split out the code that fetches the page and run it on its own to see whether it produces any output.
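
One way to act on that advice is to look at what the fetch returned before any parsing runs, so that a failed request can be told apart from a page whose markup has changed. A minimal sketch; `diagnose` is a hypothetical helper, and the sample strings stand in for a real response.

```python
def diagnose(status_code, html, marker):
    """Describe what a fetch returned before any parsing logic runs."""
    if status_code != 200:
        return "request failed with HTTP %s" % status_code
    if not html.strip():
        return "request succeeded but the body is empty"
    if marker not in html:
        return ("page fetched (%d chars) but %r is missing; the markup may have "
                "changed or the content may be rendered by JavaScript" % (len(html), marker))
    return "page fetched (%d chars) and %r is present" % (len(html), marker)


# Stand-in responses: the old markup, a JavaScript shell page, and a blocked request.
print(diagnose(200, '<div class="news_type_block"><a title="x"></a></div>', "news_type_block"))
print(diagnose(200, '<div id="app"></div>', "news_type_block"))
print(diagnose(403, "", "news_type_block"))
```

With requests, this would be called as diagnose(response.status_code, response.text, 'news_type_block') right after requests.get(url, headers=headers).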




硕杨Sxuya

2019-03-28

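Below is code that fetches content from the redesigned infoq page, but what I get back is not the HTML I see in the browser. There is very little content, and some of it is garbled. What is going on?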
---
from bs4 import BeautifulSoup
import requests

header_i = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Cookie": "_ga=GA1.2.236308595.1542204557; _itt=1; GCID=7d9c08f-e052716-92486ca-1ef06ad-cd; GCESS=BAQEAC8NAAMEbUWUXAIEbUWUXAEEmV4PAAoEAAAAAAYE1hDl2gcEjYGeiwkBAQgBAwsCBAAMAQEFBAAAAAA-; Hm_lvt_094d2af1d9a57fd9249b3fa259428445=1553224053; Hm_lpvt_094d2af1d9a57fd9249b3fa259428445=1553227368; SERVERID=1fa1f330efedec1559b3abbcb6e30f50|1553227540|1553224054",
    "DNT": "1",
    "Host": "www.infoq.cn",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}

url = 'https://www.infoq.cn'

response = requests.get(url, headers=header_i)

soup = BeautifulSoup(response.text, 'lxml')

print(soup.prettify())


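Author's reply: Hello. InfoQ updated its pages after the video was recorded, so the crawler code needs to be adjusted according to the specific errors you get.
Garbled text is usually caused by the HTTP header "Accept-Encoding": "gzip, deflate, br". Try removing gzip and running it again.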
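
A related point worth checking: as far as I know, requests decompresses gzip and deflate responses on its own, but decodes br only when a Brotli package (brotli or brotlicffi) is installed, so the br entry may be a more likely cause of garbled text than gzip. A minimal sketch of removing one content coding from the header; `drop_encoding` is a hypothetical helper, and the dictionary is a shortened copy of header_i.

```python
def drop_encoding(headers, unwanted="br"):
    """Return a copy of headers with one content coding removed from Accept-Encoding."""
    cleaned = dict(headers)
    codings = [c.strip() for c in cleaned.get("Accept-Encoding", "").split(",")]
    kept = [c for c in codings if c and c.lower() != unwanted]
    if kept:
        cleaned["Accept-Encoding"] = ", ".join(kept)
    else:
        # Nothing left to advertise, so omit the header entirely.
        cleaned.pop("Accept-Encoding", None)
    return cleaned


# Shortened copy of the headers from the post above.
header_i = {"Accept-Encoding": "gzip, deflate, br", "Host": "www.infoq.cn"}
print(drop_encoding(header_i))
```

Passing unwanted="gzip" follows the reply's suggestion instead. Leaving Accept-Encoding out altogether is another option, since requests then sends its own default.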




Nick

2019-01-01

for title_href in soup.find_all('div', class_='news_type_block'):
    print([title.get('title')
           for title in title_href.find_all('a') if title.get('title')])
What syntax is used in the last two lines?

Author's reply: for ... in is Python's syntax for iterating over an object.
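
More specifically, the two lines inside print() form a list comprehension: a for ... in loop with an optional if filter, written inside square brackets so that it builds a list in one expression. A minimal sketch comparing it with the equivalent ordinary loop; the function names and the stand-in list (which plays the role of the title attributes found on the a tags, with None for a missing attribute) are made up for illustration.

```python
def titles_comprehension(anchors):
    """Comprehension form: keep every truthy title in one expression."""
    return [title for title in anchors if title]


def titles_loop(anchors):
    """The same logic written as an ordinary for loop."""
    result = []
    for title in anchors:
        if title:
            result.append(title)
    return result


# Stand-in for the title attributes of the links inside one div.
anchors = ["Post A", None, "Post B"]

print(titles_comprehension(anchors))
print(titles_loop(anchors))
```

Both calls produce the same list; the comprehension is simply the shorter spelling.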




不想当小白

2018-10-24

How do I get the contents of that headers={}?

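Author's reply: The header fields come from the standard HTTP protocol definition. I usually visit the target site in a browser and, before sending the first request, press F12 to open the browser's developer tools; the headers for the request can then be captured as it is made. You can capture plenty of other useful information there as well.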
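
Headers copied from the developer tools usually arrive as plain "Name: value" lines, which then have to be turned into the dictionary that requests expects. A minimal sketch of that conversion; `parse_raw_headers` is a hypothetical helper, and the sample lines are shortened values of the kind shown earlier in this thread.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the Network panel into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
            continue
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


# Sample text as it might be pasted from the browser.
RAW = """
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Host: www.infoq.cn
"""

headers = parse_raw_headers(RAW)
print(headers)
```

The result can be passed straight to requests.get(url, headers=headers).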



