• Greenery
    2023-07-25 from Singapore
    It probably won't be faster; fetching and saving the same image can only happen serially, unless you build a pipeline.

    Author's reply: Multithreading does not always improve performance; an asynchronous programming model (for example asyncio) may be a better choice. A few reasons multithreading can end up slower than a single thread, for your reference:
    GIL (Global Interpreter Lock): CPython (the standard Python implementation) has a mechanism called the Global Interpreter Lock that limits concurrent execution of threads in CPU-bound tasks. I/O-bound tasks (such as file downloads) are usually unaffected by the GIL, but in some cases it can still matter.
    Thread management overhead: creating and managing many threads has a cost. With too many threads, that overhead can outweigh the gains from concurrency.
    Network bandwidth limits: if a single-threaded download already saturates your bandwidth, adding threads won't speed anything up; contention between them may even slow the overall transfer down.
    Server limits: some servers cap the number of concurrent connections from one client. Too many threads can trip those limits, leading to refused connections or reduced speed.
    Disk I/O limits: if the downloaded files are written straight to disk, disk I/O can become the bottleneck, and concurrent writes from multiple threads can increase I/O wait time.
    Poor choice of thread count: picking the right number of threads is a hard problem; too few or too many can both be suboptimal. A common rule of thumb is to base the count on the number of cores and the task type.
    Implementation problems: synchronization bugs, deadlocks, or resource contention in the multithreaded code can also hurt performance.
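    The asyncio model mentioned in the reply can be sketched roughly as follows. This is a minimal sketch with simulated latency: simulated_fetch and asyncio.sleep stand in for a real HTTP request (a real version would use an async client such as aiohttp), so it runs without network access.

    ```python
    import asyncio
    import time

    # simulated_fetch is a stand-in for an HTTP request;
    # asyncio.sleep models the network wait.
    async def simulated_fetch(url):
        await asyncio.sleep(0.1)  # pretend this is network latency
        return f"content of {url}"

    async def fetch_all(urls):
        # gather schedules all coroutines concurrently on a single thread,
        # so the total wall time is roughly one latency, not len(urls) of them
        return await asyncio.gather(*(simulated_fetch(u) for u in urls))

    urls = [f"https://example.com/img{i}.jpg" for i in range(4)]
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} fetches in {elapsed:.2f}s")
    ```

    Because all four waits overlap, the total time stays close to a single 0.1 s latency rather than 0.4 s.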

    
    
  • Cy23
    2023-01-31 from Liaoning
    In my own tests, the single process pool runs faster than the two-pool version, though I can't rule out bugs in my code:
    1. Single pool: after an image finishes downloading, the save can only run once the response is retrieved via result().
    2. Two pools: the save pool can only start after the download pool has returned all of the images.
    3. As I see it, with two pools, wouldn't a very large download volume cause memory pressure? The single pool doesn't have to wait for everything.
    4. But here is what puzzles me: with either setup, we must wait for the downloaded data to come back before continuing, and each download waits for the previous one to finish. So no matter how many workers are configured, only one ever actually does any work.

    Author's reply: A single process pool running faster than two pools can indeed happen. First, a process does not, as it might appear, occupy the CPU continuously from start to finish. Once resources run short, the operating system saves the current process's state and switches to another process; when the needed resources become available again, it switches back. This is why multiple processes can be slower than a single one, and why opening more processes can even make things progressively slower. I only listed one case that comes up often; this is exactly where a programmer's understanding of the system and coding ability are tested.

    
    
  • Greenery
    2023-07-25 from Singapore
    In my test, the two-thread-pool version was indeed faster than the single pool:

    # %% single thread pool: imports & definitions
    def fetch_save(url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
            "Referer": "https://time.geekbang.org"}
        r = requests.get(url, headers=headers)
        with open(f'img/s_{url.split("/")[-1].split(".")[0]}.jpg', 'wb') as f:
            f.write(r.content)

    # %% single thread pool: run
    start = perf_counter()
    with ThreadPoolExecutor(max_workers=4) as e:
        for i in images:
            e.submit(fetch_save, i)
    print(f'time latency: {perf_counter() - start}')
    # time latency: 0.36856459999398794

    # %% two thread pools: imports & definitions
    url2r = {}

    def fetch(url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
            "Referer": "https://time.geekbang.org"}
        url2r[url] = requests.get(url, headers=headers)

    def save(url):
        r = url2r[url]
        with open(f'img/d_{url.split("/")[-1].split(".")[0]}.jpg', 'wb') as f:
            f.write(r.content)

    # %% two thread pools: run
    start = perf_counter()
    with ThreadPoolExecutor(max_workers=4) as e:
        for i in images:
            e.submit(fetch, i)
    with ThreadPoolExecutor(max_workers=4) as e:
        for i in images:
            e.submit(save, i)
    print(f'time latency: {perf_counter() - start}')
    # time latency: 0.29097709999768995
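    A variant of the two-pool idea above that avoids the shared url2r dict: fetch returns its result, and the save pool consumes the futures directly. This is a sketch under assumptions: fetch and save here use time.sleep to simulate the network and disk I/O (standing in for requests.get and the file writes), so it runs without network access.

    ```python
    import time
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        time.sleep(0.05)            # stands in for requests.get
        return url, b"image bytes"

    def save(result):
        url, content = result       # a real version would write content to img/
        return url

    urls = [f"https://example.com/{i}.jpg" for i in range(8)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as fetch_pool, \
         ThreadPoolExecutor(max_workers=4) as save_pool:
        fetch_futures = [fetch_pool.submit(fetch, u) for u in urls]
        # hand each fetched result straight to the save pool as it arrives
        save_futures = [save_pool.submit(save, f.result()) for f in fetch_futures]
        saved = [f.result() for f in save_futures]
    print(f"saved {len(saved)} in {time.perf_counter() - start:.2f}s")
    ```

    Passing results through futures instead of a module-level dict keeps the data flow explicit and avoids shared mutable state between the two pools.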
    
    
  • Francis
    2023-02-06 from Shanghai
    from concurrent.futures import ProcessPoolExecutor
    import requests
    import time

    def download(url):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36", "Referer": "https://time.geekbang.org"}
        response = requests.get(url, headers=headers)
        return response

    def save(url, response):
        with open(f'./imgs/{url.split("/")[-1]}', 'wb') as f:
            f.write(response.content)

    if __name__ == "__main__":
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36", "Referer": "https://time.geekbang.org"}
        data = {"ids": [100085301,100063601,100023001,100026901,100008801,100002201,100061901,100053201], "with_first_articles": False}
        r = requests.post('https://time.geekbang.org/serv/v3/product/infos', headers=headers, json=data)
        datas = r.json()
        images = []
        for d in datas["data"]["infos"]:
            print(d["author"]["avatar"])
            images.append(d["author"]["avatar"])

        # single process pool
        start = time.perf_counter()
        with ProcessPoolExecutor(max_workers=4) as pp:
            for url in images:
                image_download = pp.submit(download, url)
                res = image_download.result()
                image_save = pp.submit(save, url, res)
        end = time.perf_counter()
        print(f'Single process pool run time: {end-start} Seconds')

        # two process pools
        start = time.perf_counter()
        response_dict = {}
        with ProcessPoolExecutor(max_workers=4) as pp_double_d:
            for url in images:
                image_download = pp_double_d.submit(download, url)
                response_dict[url] = image_download.result()
        with ProcessPoolExecutor(max_workers=4) as pp_double_s:
            for url, res in response_dict.items():
                image_save = pp_double_s.submit(save, url, res)
        end = time.perf_counter()
        print(f'Two process pool run time: {end-start} Seconds')
    
    
  • Cy23
    2023-01-31 from Liaoning
    import requests
    from concurrent.futures import ProcessPoolExecutor
    import time

    def download(url):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36", "Referer": "https://time.geekbang.org"}
        response = requests.get(url, headers=headers)
        return response

    def save(url, response):
        with open(f'imgs/{url.split("/")[-1]}', 'wb') as f:
            f.write(response.content)

    if __name__ == '__main__':
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36", "Referer": "https://time.geekbang.org"}
        data = {"ids": [100085301,100063601,100023001,100026901,100008801,100002201,100061901,100053201], "with_first_articles": False}
        r = requests.post('https://time.geekbang.org/serv/v3/product/infos', headers=headers, json=data)
        datas = r.json()
        images = []
        # collect the avatar URLs
        for d in datas["data"]["infos"]:
            images.append(d["author"]["avatar"])

        # single process pool
        start = time.perf_counter()
        with ProcessPoolExecutor(max_workers=4) as processpool:
            for i in images:
                future_download = processpool.submit(download, i)
                res = future_download.result()
                future_save = processpool.submit(save, i, res)
        end = time.perf_counter()
        print('Single process pool run time: %s Seconds' % (end-start))

        # two process pools
        start = time.perf_counter()
        response_dict = {}
        with ProcessPoolExecutor(max_workers=4) as processpool_download:
            for i in images:
                future_download = processpool_download.submit(download, i)
                response_dict[i] = future_download.result()
        with ProcessPoolExecutor(max_workers=4) as processpool_save:
            for i, j in response_dict.items():
                future_save = processpool_save.submit(save, i, j)
        end = time.perf_counter()
        print('Two process pool run time: %s Seconds' % (end-start))
    
    
  • Matthew
    2023-01-20 from Jiangsu
    Putting the download and the file-save functions into two separate process pools does not improve efficiency.
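    One way around the serial hand-off described above, sketched under assumptions: fake_download and fake_save use time.sleep and an in-memory list to simulate the network and disk I/O. Instead of a second pool, each completed download schedules its save immediately via add_done_callback, so saves overlap with the remaining downloads inside a single pool.

    ```python
    import time
    from concurrent.futures import ThreadPoolExecutor

    saved = []  # a real version would write files instead of recording names

    def fake_download(url):
        time.sleep(0.05)           # stands in for requests.get
        return url, b"image bytes"

    def fake_save(future):
        url, content = future.result()
        saved.append(url)          # stands in for writing content to disk

    urls = [f"img{i}.jpg" for i in range(8)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for f in [pool.submit(fake_download, u) for u in urls]:
            # the callback runs right after its download finishes,
            # so saving overlaps with the downloads still in flight
            f.add_done_callback(fake_save)
    # exiting the with-block waits for all downloads and their callbacks

    print(len(saved))
    ```

    Note that completion order is not guaranteed, so the callback approach suits saves that are independent of each other, as they are here.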
    
    
  • Matthew
    2023-01-20 from Jiangsu
    import requests
    # import json
    from concurrent.futures import ThreadPoolExecutor

    # download an image
    def download(url):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36", "Referer": "https://time.geekbang.org"}
        response = requests.get(url, headers=headers)
        return response

    # save an image
    def save(url, response):
        with open(f'imgs/{url.split("/")[-1]}.jpg', 'wb') as f:
            f.write(response.content)

    ## test
    if __name__ == '__main__':
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36", "Referer": "https://time.geekbang.org"}
        data = {"ids": [100085301,100063601,100023001,100026901,100008801,100002201,100061901,100053201], "with_first_articles": False}
        r = requests.post('https://time.geekbang.org/serv/v3/product/infos', headers=headers, json=data)
        datas = r.json()
        # count = datas["data"]["infos"][0]["extra"]["sub"]["count"]
        # print(f"subscriber count {count}")
        images = []
        # collect the avatar URLs
        for d in datas["data"]["infos"]:
            images.append(d["author"]["avatar"])
            print(d["author"]["avatar"])

        # create a thread pool to download the images
        threadPool_download = ThreadPoolExecutor(max_workers=4, thread_name_prefix="down_")
        response_dict = {}
        for i in images:
            future_download = threadPool_download.submit(download, i)
            response_dict[i] = future_download.result()
            # print(future_download.result().content)
        threadPool_download.shutdown(wait=True)

        # create a thread pool to save the images
        threadPool_save = ThreadPoolExecutor(max_workers=4, thread_name_prefix="save_")
        for i, j in response_dict.items():
            future_save = threadPool_save.submit(save, i, j)
        threadPool_save.shutdown(wait=True)
    
    