FishC Forum

Views: 4762 | Replies: 1

How do I embed proxy IPs? The goal is to crawl a product's multi-page comments from multiple IPs. Could someone optimize the code below? Thanks.

Posted on 2018-1-2 18:30:17
Bounty: 14 fish coins
# Product link: http://detail.tmall.com/item.htm?id=41464129793
# Crawl multiple pages of product comments:
import requests
import re
import pandas as pd
from pandas import DataFrame
import time

datapj = DataFrame()
for i in range(1, 100):
    url = 'http://rate.tmall.com/list_detail_rate.htm?itemId=41464129793&sellerId=1652490016&currentPage=' + str(i)
    web = requests.get(url)
    # Pull the rateList JSON array out of the JSONP-style response body
    json = re.findall('rateList":(.*?),"searchinfo', web.text)[0]
    table = pd.read_json(json)
    datapj = pd.concat([datapj, table], axis=0, ignore_index=True)
    time.sleep(10)  # fixed delay between pages

datapj.to_excel('datapj.xls')
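To route requests like the ones above through a proxy, the requests library accepts a proxies dict on each call. A minimal sketch, assuming a plain HTTP proxy (the address below is a placeholder for illustration, not a working proxy):

import requests

proxy_ip = '127.0.0.1:8080'  # placeholder proxy address, for illustration only
# Map both schemes to the proxy so https:// URLs are tunneled through it too
proxies = {'http': 'http://' + proxy_ip, 'https': 'http://' + proxy_ip}
url = ('http://rate.tmall.com/list_detail_rate.htm'
       '?itemId=41464129793&sellerId=1652490016&currentPage=1')
web = requests.get(url, proxies=proxies, timeout=10)  # timeout guards against dead proxies
print(web.status_code)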




OP | Posted on 2018-1-5 17:55:09
# 1. Crawl multiple pages of product comments:
# Product link: https://detail.tmall.com/item.htm?id=555502261542
# url: https://rate.tmall.com/list_detail_rate.htm?itemId=555502261542&sellerId=813836783&currentPage=1
# user-agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36

import requests
import re
import pandas as pd
from pandas import DataFrame
import time
import random

tlist = [6, 8, 11, 15, 16, 18, 22]
#iplist = ['119.29.18.239:8888', '219.138.58.74:3128', '114.228.8.128:8118', '60.177.225.111:808', '180.156.95.80:8118']  # type: https
iplist = ['221.225.186.63:3128', '219.244.186.30:3128', '122.72.18.34:80', '58.220.95.107:8080', '116.31.75.100:3128']
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

datapj = DataFrame()
for i in range(1, 100):
    url = 'https://rate.tmall.com/list_detail_rate.htm?itemId=555502261542&sellerId=813836783&currentPage=' + str(i)
    ip = random.choice(iplist)  # pick a random proxy IP for this request
    # Bug fix: the original installed a urllib opener, which requests ignores;
    # the proxy and headers must be passed to requests.get directly.
    proxies = {'http': 'http://' + ip, 'https': 'http://' + ip}
    web = requests.get(url, headers=headers, proxies=proxies)
    # Pull the rateList JSON array out of the JSONP-style response body
    json = re.findall('rateList":(.*?),"searchinfo', web.text)[0]
    table = pd.read_json(json)
    datapj = pd.concat([datapj, table], axis=0, ignore_index=True)
    time.sleep(random.choice(tlist))  # random delay between pages

datapj.to_excel('datapj.xls')
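For reference, if the urllib approach from the attempt above is kept instead, the installed opener only affects urllib.request.urlopen, never requests.get. A minimal sketch of the urllib route, assuming the proxies speak plain HTTP:

import random
import urllib.request

iplist = ['221.225.186.63:3128', '219.244.186.30:3128', '122.72.18.34:80']
ip = random.choice(iplist)
# Map both schemes to the proxy so https:// pages are tunneled through it too
proxy = urllib.request.ProxyHandler({'http': 'http://' + ip, 'https': 'http://' + ip})
opener = urllib.request.build_opener(proxy)
opener.addheaders = [('user-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')]
urllib.request.install_opener(opener)  # from here on, urlopen goes through the proxy
url = 'https://rate.tmall.com/list_detail_rate.htm?itemId=555502261542&sellerId=813836783&currentPage=1'
web = urllib.request.urlopen(url, timeout=10)
text = web.read().decode('utf-8', errors='ignore')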


This is my follow-up on the original question. The code I wrote only manages a few dozen pages before it's game over. How can I optimize it further? I want to crawl all 99 pages of comments. Experts, please advise. All my fish coins are yours.
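One way to get through all 99 pages is to retry each failed page with a different proxy, rather than letting one dead proxy or anti-crawler page kill the whole run. A sketch under that approach (the retry count and delay range are arbitrary choices; iplist and headers reuse the values above):

import random
import re
import time

import pandas as pd
import requests

iplist = ['221.225.186.63:3128', '219.244.186.30:3128', '122.72.18.34:80',
          '58.220.95.107:8080', '116.31.75.100:3128']
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

def fetch_page(page, retries=3):
    # Try up to `retries` different proxies for one page; return a DataFrame or None.
    url = ('https://rate.tmall.com/list_detail_rate.htm'
           '?itemId=555502261542&sellerId=813836783&currentPage=' + str(page))
    for _ in range(retries):
        ip = random.choice(iplist)
        proxies = {'http': 'http://' + ip, 'https': 'http://' + ip}
        try:
            web = requests.get(url, headers=headers, proxies=proxies, timeout=10)
            match = re.findall('rateList":(.*?),"searchinfo', web.text)
            if match:  # anti-crawler pages have no rateList, so a miss means retry
                return pd.read_json(match[0])
        except requests.RequestException:
            pass  # dead proxy or timeout: fall through and try another
        time.sleep(random.uniform(5, 20))  # random delay before the next attempt
    return None  # give up on this page after exhausting the retries

frames = [df for df in (fetch_page(p) for p in range(1, 100)) if df is not None]
datapj = pd.concat(frames, ignore_index=True)
datapj.to_excel('datapj.xls')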
