I tried this in IDLE and it keeps failing. The source code is below:
- import urllib.request
- import re
- import os
- def open_url(url):
-     req = urllib.request.Request(url) # Build a complete request with the Request class; headers etc. can be added
-     req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 \
- (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0')
-     page = urllib.request.urlopen(req)
-     html = page.read().decode("utf-8")
-     return html
- def get_img(html):
-     p = r'<img class="BDE_Image".*? src="[^"]*\.jpg".*?>'
-     imglist = re.findall(p, html)
-     try:
-         os.mkdir("Newpictures")
-     except FileExistsError:
-         # If the folder already exists, save into it and overwrite
-         pass
-     os.chdir("Newpictures")
-     for each in imglist:
-         filename = each.split("/")[-1]
-         urllib.request.urlretrieve(each, filename, None)
- if __name__ == "__main__":
-     url = "http://tieba.baidu.com/p/3823765471"
-     get_img(open_url(url))
My first change was switching https -> http, but it still fails. The new folder gets created, but no images are downloaded into it. The current error is:
- ===== RESTART: E:/SOFT-files/Python/FishC/Basic Example/14_5/14_5_10.py =====
- Traceback (most recent call last):
- File "E:/SOFT-files/Python/FishC/Basic Example/14_5/14_5_10.py", line 31, in <module>
- get_img(open_url(url))
- File "E:/SOFT-files/Python/FishC/Basic Example/14_5/14_5_10.py", line 26, in get_img
- urllib.request.urlretrieve(each, filename, None)
- File "C:\Program Files\Python37\lib\urllib\request.py", line 247, in urlretrieve
- with contextlib.closing(urlopen(url, data)) as fp:
- File "C:\Program Files\Python37\lib\urllib\request.py", line 222, in urlopen
- return opener.open(url, data, timeout)
- File "C:\Program Files\Python37\lib\urllib\request.py", line 525, in open
- response = self._open(req, data)
- File "C:\Program Files\Python37\lib\urllib\request.py", line 548, in _open
- 'unknown_open', req)
- File "C:\Program Files\Python37\lib\urllib\request.py", line 503, in _call_chain
- result = func(*args)
- File "C:\Program Files\Python37\lib\urllib\request.py", line 1387, in unknown_open
- raise URLError('unknown url type: %s' % type)
- urllib.error.URLError: <urlopen error unknown url type: img class="bde_image" src="https>
Then I tried it in PyCharm. There the URLs get listed, but the new folder still does not show up in the corresponding directory. The code barely changed:
- # -*- coding: utf-8 -*-
- import urllib.request
- import re
- import os
- def open_url(url):
-     req = urllib.request.Request(url) # Build a complete request with the Request class; headers etc. can be added
-     req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 \
- (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0')
-     page = urllib.request.urlopen(req)
-     html = page.read().decode("utf-8")
-     return html
- def get_img(html):
-     p = r'<img class="BDE_Image".*? src="[^"]*\.jpg".*?>'
-     imglist = re.findall(p, html)
-     try:
-         os.mkdir("Newpictures")
-     except FileExistsError:
-         # If the folder already exists, save into it and overwrite
-         pass
-     os.chdir("Newpictures")
-     for each in imglist:
-         filename = each.split("/")[-1]
-         urllib.request.urlretrieve(each, filename, None)
- if __name__ == "__main__":
-     url = "https://tieba.baidu.com/p/3823765471"
-     get_img(open_url(url))
Thanks a lot for any help, everyone!
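I pasted your code and ran it. Nothing downloads because you did not put a capturing group in the regular expression. Without a group, findall returns the whole matched <img ...> tag, and that whole tag is what gets passed to urlretrieve as the URL, which is why the error says unknown url type: img class="bde_image" src="https. The handy thing about findall is that if you wrap the contents of the src attribute in a group, it returns only what is inside the group. So change the pattern to this -> p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>' and the images download.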
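To make the difference visible without hitting the network, here is a small sketch. The HTML fragment and the example.com URLs are made up for illustration (they are not the real Tieba page source), and the variable names are mine, so treat it as a demonstration of how findall behaves rather than a drop-in fix.

```python
import re

# Made-up HTML fragment standing in for the page source (an assumption,
# not real Tieba data).
sample_html = (
    '<img class="BDE_Image" src="https://example.com/pics/a1.jpg" width="560">'
    '<img class="BDE_Image" src="https://example.com/pics/b2.jpg" width="560">'
)

# Original pattern, no group: findall returns each whole <img ...> tag.
no_group = r'<img class="BDE_Image".*? src="[^"]*\.jpg".*?>'
whole_tags = re.findall(no_group, sample_html)

# Pattern with one capturing group: findall returns only the group's text,
# which here is the value of the src attribute.
with_group = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'
urls = re.findall(with_group, sample_html)

# The same filename logic as in get_img, applied to the clean URLs.
filenames = [u.split("/")[-1] for u in urls]

print(whole_tags[0])  # a full tag, not usable as a URL
print(urls)           # just the image addresses
print(filenames)
```

With the grouped pattern, each item in the list is a plain address, so the existing loop in get_img can hand it to urlretrieve unchanged.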
You can see that the run output on the right shows no problems.