Scraping img with Python (Scraping NetEase Cloud Music with Python)

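Contents of this article:

1. How do I scrape every image in a Weibo album with Python?
2. How do I write a Python crawler on Linux to fetch images?
3. Scraping a page with Python but unable to get the image address
4. How does a Python crawler create an image folder?
5. Some images are skipped when scraping images with Python
6. How to grab photos from the web with Python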

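How do I scrape every image in a Weibo album with Python?

There are three options:

1. Scrape directly with Python's requests library. This leaves the most work to do by hand, so how far you get depends on your Python skills.

2. Use the Scrapy crawling framework. If you are not familiar with it, you will have to learn how the framework works first.

3. Use the automated testing framework Selenium to simulate the login and then fetch the images. For most people who can write a bit of Python this is the best choice, because you can see quite directly how the data is obtained.

All three options assume some basic programming ability; none of them is something just anyone can pick up and use.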
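As a starting point for option 1, here is a minimal sketch of the parsing step only, written with the standard library so it runs without extra installs. The sample HTML, the example.com base URL and the names ImgSrcParser and extract_img_urls are assumptions for illustration. A real Weibo album also needs a logged-in session and may build its image list in JavaScript, which this sketch does not cover.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # urljoin resolves relative and protocol-relative (//host/...) URLs
            self.urls.append(urljoin(self.base_url, src))


def extract_img_urls(html, base_url):
    """Return absolute URLs for all <img> tags that carry a src attribute."""
    parser = ImgSrcParser(base_url)
    parser.feed(html)
    return parser.urls


# Made-up page fragment standing in for a downloaded album page.
sample = '<div><img src="/a/1.jpg"><img src="//cdn.example.com/2.png"><img alt="no src"></div>'
print(extract_img_urls(sample, "https://example.com/album/"))
```

Each returned URL can then be downloaded with whichever HTTP client you prefer.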

How do I write a Python crawler on Linux to fetch images?

This has nothing to do with Linux; Python is cross-platform. The code for scraping images is as follows:

import os
import random
import urllib.request


def url_open(url):
    req = urllib.request.Request(url)
    # Set a User-Agent on the request so the program looks more like a human visitor
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0')
    # Proxy IPs let the requests come from different addresses, so the server is less likely to notice.
    # Disabled here (kept as a string literal):
    '''iplist=['1.193.162.123:8000','1.193.162.91:8000','1.193.163.32:8000']
    proxy_support=urllib.request.ProxyHandler({'http':random.choice(iplist)})
    opener=urllib.request.build_opener(proxy_support)
    opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.154 Safari/537.36 LBBROWSER')]
    urllib.request.install_opener(opener)'''
    response = urllib.request.urlopen(req)
    html = response.read()
    return html


def get_page(url):
    # Return the current page number as a string
    html = url_open(url).decode('utf-8')
    a = html.find('current-comment-page') + 23
    b = html.find(']', a)
    # print(html[a:b])
    return html[a:b]


def find_imgs(url):
    # Collect the addresses of all .jpg images on the page
    html = url_open(url).decode('utf-8')
    img_addrs = []
    a = html.find('img src=')
    while a != -1:
        b = html.find('.jpg', a, a + 140)
        if b != -1:
            if html[a + 9] != 'h':
                # Protocol-relative address (//host/...): prepend the scheme
                img_addrs.append('http:' + html[a + 9:b + 4])
            else:
                img_addrs.append(html[a + 9:b + 4])
        else:
            b = a + 9
        a = html.find('img src=', b)
    for each in img_addrs:
        print(each + ' (debug print)')
    return img_addrs


def save_imgs(folder, img_addrs):
    # Files go into the current directory; download_mm has already changed into folder
    for each in img_addrs:
        # print('one was saved')
        filename = each.split('/')[-1]
        with open(filename, 'wb') as f:
            img = url_open(each)
            f.write(img)


def download_mm(folder='ooxx', pages=10):
    os.makedirs(folder, exist_ok=True)  # unlike os.mkdir, does not fail if the folder exists
    os.chdir(folder)
    url = ""  # the site URL is blank in the source; fill it in before running
    page_num = int(get_page(url))
    for i in range(pages):
        page_num = page_num - 1
        page_url = url + 'page-' + str(page_num) + '#comments'
        img_addrs = find_imgs(page_url)
        save_imgs(folder, img_addrs)


if __name__ == '__main__':
    download_mm()
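The offset arithmetic in get_page (find plus 23) is easy to break. Below is a hedged sketch of the same lookup with a regular expression. It assumes the page marks the current page as `current-comment-page">[34]`, which is inferred from the +23 offset (20 characters of class name plus `">[`), not confirmed by the source. The name current_page is made up for this example.

```python
import re

# Assumed markup: <span class="current-comment-page">[34]</span>
PAGE_RE = re.compile(r'current-comment-page">\[(\d+)\]')


def current_page(html):
    """Return the current page number, or None if the marker is missing."""
    match = PAGE_RE.search(html)
    return int(match.group(1)) if match else None


print(current_page('<span class="current-comment-page">[34]</span>'))
print(current_page('<p>layout changed</p>'))
```

Unlike the find-based version, a missing marker yields None instead of a slice taken from an arbitrary position.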


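Scraping a page with Python but unable to get the image address

The large image is loaded by JS after you click.

Take a look at the file js/js.js, line 253:

function changeImg(){
    jQuery("#bitImg").attr('src','p/p'+pictID+'/'+indexNum+'.'+jpgPng);
}

The pattern for the large images is actually easy to find. The src values of the thumbnail list underneath can be selected with #variContent li img, which you can find at line 107 of the page source:

view-source:

The thumbnail addresses look like this:

/p/p0997/tn/1.jpg
/p/p0997/tn/2.jpg
/p/p0997/tn/3.jpg
...

To get the large image, just drop the "tn" segment:

/p/p0997/1.jpg
/p/p0997/2.jpg
/p/p0997/3.jpg
...

Then prepend the domain and GET the result to download the large image, for example the first one:

Address of the first large image

However, if all you want is to grab every asset on that site, enumerate the number in the "p0997" segment (for example change it to "p0098"; this is presumably the gallery ID) and iterate over the image number in the last segment. The extension may be jpg or png. Start from 1 ("1.jpg", "2.jpg", ...) and stop when a 404 comes back.

That is roughly the idea. That said, is scraping someone else's assets like this really ethical?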
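The path rewriting and the stop-at-404 loop described above can be sketched as follows. This is an illustration under assumptions: gallery IDs are taken to be zero-padded to four digits (as in "p0997" and "p0098"), the function names are made up, and the `exists` callable stands in for a real GET that reports whether the path returned something other than 404. The demonstration uses a fake set of paths so it runs offline.

```python
from itertools import count


def thumb_to_full(thumb_path):
    """Drop the '/tn/' segment: /p/p0997/tn/1.jpg becomes /p/p0997/1.jpg."""
    return thumb_path.replace("/tn/", "/", 1)


def candidate_paths(gallery_id, index):
    """Both possible paths for one image, since the extension may be jpg or png."""
    return [f"/p/p{gallery_id:04d}/{index}.{ext}" for ext in ("jpg", "png")]


def crawl_gallery(gallery_id, exists):
    """Walk image numbers from 1 and stop at the first number with no image.

    `exists` is a caller-supplied callable (hypothetical) that returns True
    when a request for the path does not come back as 404.
    """
    found = []
    for index in count(1):
        hit = next((p for p in candidate_paths(gallery_id, index) if exists(p)), None)
        if hit is None:
            break
        found.append(hit)
    return found


# Offline demonstration: a set of paths plays the role of the server.
fake_site = {"/p/p0997/1.jpg", "/p/p0997/2.png", "/p/p0997/3.jpg"}
print(thumb_to_full("/p/p0997/tn/1.jpg"))
print(crawl_gallery(997, fake_site.__contains__))
```

In real use, `exists` would wrap an HTTP request, ideally with a delay between calls so the site is not hammered.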

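How does a Python crawler create an image folder?

Some tools create the folder automatically, or you can create it in code.

1. Inspect the page and find the img tags.

2. Use requests and BS (BeautifulSoup) to extract the img tags from the page.

3. After grabbing the img tags, pull out the src of each one; then you can download the image.

4. Download the image with urllib's urlretrieve (urllib.urlretrieve in Python 2, urllib.request.urlretrieve in Python 3) and put it in the folder. The preparation before step 1 is to get the current path and create a new folder there.

5. If there are several images, keep repeating steps 3-4.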
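The folder preparation mentioned in step 4 might look like the sketch below. The helper names ensure_image_dir and local_path_for, the folder name "image" and the example URL are assumptions. The demonstration runs in a temporary directory and leaves the actual download as a comment, since it needs network access.

```python
import os
import tempfile
from urllib.parse import urlsplit


def ensure_image_dir(base_dir, name="image"):
    """Create base_dir/name if it is missing and return its path."""
    path = os.path.join(base_dir, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


def local_path_for(img_url, folder):
    """Map an image URL to a file path inside folder, ignoring any query string."""
    filename = os.path.basename(urlsplit(img_url).path)
    return os.path.join(folder, filename)


# Demonstration in a temporary directory instead of the real working directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = ensure_image_dir(tmp)
    ensure_image_dir(tmp)  # calling it again is harmless
    target = local_path_for("https://example.com/pics/cat.jpg?size=large", folder)
    print(os.path.isdir(folder), os.path.basename(target))
    # urllib.request.urlretrieve(img_url, target) would then save the file
```

For a real crawl, pass os.getcwd() as base_dir to create the folder under the current path.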

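Some images are skipped when scraping images with Python

The real image address is computed by JavaScript code on the client side.

You need to look for content like

<span class="img-hash">Ly93dzMuc2luYWltZy5jbi9tdzYwMC8wMDczdExQR2d5MWZ3Z3h6ajlrMGtqMzBpYjBramtnaS5qcGc=</span>

take out

Ly93dzMuc2luYWltZy5jbi9tdzYwMC8wMDczdExQR2d5MWZ3Z3h6ajlrMGtqMzBpYjBramtnaS5qcGc=

and base64-decode it to get the image address.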
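A minimal sketch of that decoding step, using the hash quoted above. The function name decode_img_hash and the choice to prepend "http:" to a protocol-relative result are assumptions for illustration.

```python
import base64

img_hash = "Ly93dzMuc2luYWltZy5jbi9tdzYwMC8wMDczdExQR2d5MWZ3Z3h6ajlrMGtqMzBpYjBramtnaS5qcGc="


def decode_img_hash(value, scheme="http:"):
    """Base64-decode an img-hash; prepend a scheme if the result starts with //."""
    url = base64.b64decode(value).decode("utf-8")
    return scheme + url if url.startswith("//") else url


print(decode_img_hash(img_hash))
```

The decoded value is a protocol-relative sinaimg.cn address, which is why a scheme has to be added before requesting it.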

The corresponding script is at

//cdn.jandan.net/static/min/91798e4c623fa60181a31d543488217eB2GDr79r.03100001.js

This reference is in the page you fetched with get_page(). The same page also contains HTML like this (reformatted for readability):

<div class="text">
    <span class="righttext">
        <a href="//jandan.net/ooxx/page-34#comment-4001800">4001800</a>
    </span>
    <img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />
    <span class="img-hash">Ly93dzMuc2luYWltZy5jbi9tdzYwMC8wMDczdExQR2d5MWZ3Z3h6ajlrMGtqMzBpYjBramtnaS5qcGc=</span>
</div>

The function called by this img's onload is in the js file given above:

function jandan_load_img(b){
    var d=$(b);
    var f=d.next("span.img-hash");
    var e=f.text();
    f.remove();
    var c=jdDw3Ldvi4NcbKboi4X19hCAmdC3Q3aZvN(e,"DGmLfT4H73yJdXXpXs3pw7uAiICcflZS");
    var a=$('<a href="'+c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/,"$1large$3")+
        '" target="_blank" class="view_img_link">[查看原图]</a>');
    d.before(a);
    d.before("<br>");
    d.removeAttr("onload");
    d.attr("src",location.protocol+c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.gif)/,"$1thumb180$3"));
    if(/\.gif$/.test(c)){
        d.attr("org_src",location.protocol+c);
        b.onload=function(){
            add_img_loading_mask(this,load_sina_gif)
        }
    }
}
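The "[查看原图]" (view original) link in this function swaps the size segment of the address for "large". If you want the full-size file rather than the mw600 version, the same substitution can be sketched in Python. The name to_large and the sample file name example.jpg are made up; the pattern is copied from the JavaScript above.

```python
import re

# Same pattern as in jandan_load_img: host, size segment, then a path ending in an image extension.
SINA_IMG = re.compile(r"(//\w+\.sinaimg\.cn/)(\w+)(/.+\.(gif|jpg|jpeg))")


def to_large(url):
    """Swap the size segment (e.g. mw600) for 'large', as the [查看原图] link does."""
    return SINA_IMG.sub(r"\1large\3", url, count=1)


print(to_large("//ww3.sinaimg.cn/mw600/example.jpg"))
print(to_large("//img.jandan.net/img/blank.gif"))
```

Addresses that do not match the sinaimg.cn pattern are returned unchanged.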

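jandan_load_img calls jdDw3Ldvi4NcbKboi4X19hCAmdC3Q3aZvN to decode the img-hash content. That function is in the same js file:

var jdDw3Ldvi4NcbKboi4X19hCAmdC3Q3aZvN=function(o,y,g){
    var d=o;var l="DECODE";
    var y=y?y:"";
    var g=g?g:0;
    var h=4;
    y=md5(y);
    var x=md5(y.substr(0,16));
    var v=md5(y.substr(16,16));
    ... middle part omitted ...
    if(l=="DECODE"){
        m=base64_encode(m);
        var c=new RegExp("=","g");
        m=m.replace(c,"");
        m=u+m;
        m=base64_decode(d)
    }
    return m
};

Note that the last assignment in the DECODE branch overwrites m with base64_decode(d), where d is the original input. Everything computed before that line is discarded, so the result is a plain base64 decode of the img-hash.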

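In Python you only need to decode the scraped img-hash content with the corresponding library (base64) to get the image address.

You used str.find to locate positions in the text. That is too cumbersome and involves too many code details. Regular-expression matching with the re module is much simpler, and using an existing scraping library is quicker still.

With re, the pattern '<span class="img-hash">(.+?)<' is all you need to extract every encoded image address on the page.

import re
import base64

pat = re.compile('<span class="img-hash">(.+?)<')

...

def get_imgurls(url):
    urls = []
    for imgurl in pat.findall(url_open(url).decode('utf-8')):
        urls.append(str(base64.b64decode(imgurl), 'utf-8'))
    return urls

You can then iterate over the list returned by get_imgurls and hand each item to save_img.

With a scraping library, you likewise only need to find the span elements, pick those with class='img-hash', and read their text.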
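To show the parser-based route without extra dependencies, the sketch below swaps in the standard library's html.parser for a third-party scraping library; with BeautifulSoup the lookup would be along the lines of soup.find_all("span", class_="img-hash"). The class ImgHashParser, the function img_urls_from_html and the trimmed sample HTML are assumptions for illustration.

```python
import base64
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collects the text of every <span class="img-hash"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "img-hash" in classes:
            self._in_hash_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_hash_span = False

    def handle_data(self, data):
        if self._in_hash_span and data.strip():
            self.hashes.append(data.strip())


def img_urls_from_html(html):
    """Return the base64-decoded address of every img-hash span in the page."""
    parser = ImgHashParser()
    parser.feed(html)
    return [base64.b64decode(h).decode("utf-8") for h in parser.hashes]


# Trimmed version of the HTML fragment quoted earlier in this answer.
sample = (
    '<div class="text"><span class="righttext"><a href="#">4001800</a></span>'
    '<img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />'
    '<span class="img-hash">Ly93dzMuc2luYWltZy5jbi9tdzYwMC8wMDczdExQR2d5MWZ3Z3h6ajlrMGtqMzBpYjBramtnaS5qcGc=</span></div>'
)
print(img_urls_from_html(sample))
```

The decoded addresses are protocol-relative, so prepend "http:" or "https:" before downloading.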

How to grab photos from the web with Python

# Download images from yxdown (游迅网) with Python. BY: Eastmount
# Ported to Python 3 here; the original post targeted Python 2
# (urllib.urlopen, print statements, an explicit "# coding=utf-8" line).
import os
import re
import time
import urllib.request

# The output folders must exist, otherwise: No such file or directory
os.makedirs('yxdown', exist_ok=True)
os.makedirs('E:\\Picture3', exist_ok=True)

# **************************************************
# Step 1: walk the list pages and collect the URL of each topic
# **************************************************
fileurl = open('yxdown_url.txt', 'w', encoding='utf-8')
fileurl.write('****************yxdown image URLs*************\n\n')
# Suggestion: handle one list page per run (num=3 with while num<=3,
# next time num=4 with while num<=4) rather than pages 1-75 in one go
num = 3
while num <= 3:
    temp = '' + str(num) + '.html'  # the list URL prefix is blank in the source
    content = urllib.request.urlopen(temp).read().decode('utf-8')
    open('yxdown_' + str(num) + '.html', 'w+', encoding='utf-8').write(content)
    print(temp)
    fileurl.write('****************page ' + str(num) + '*************\n\n')
    # Collect the topic URLs:
    # inside <div class="cbmiddle"></div>, <a target="_blank" href="/html/5533.html">
    count = 1  # number of topic pages handled on this list page
    res_div = r'<div class="cbmiddle">(.*?)</div>'
    m_div = re.findall(res_div, content, re.S | re.M)
    for line in m_div:
        # fileurl.write(line+'\n')
        if "_blank" in line:  # skips list links such as list/1_0_1.html, list/2_0_1.html
            # Topic title
            fileurl.write('\n\n********************************************\n')
            title_pat = r'<b class="imgname">(.*?)</b>'
            title_ex = re.compile(title_pat, re.M | re.S)
            title_obj = re.search(title_ex, line)
            title = title_obj.group()
            print(title)
            fileurl.write(title + '\n')
            # Topic URL
            res_href = r'<a target="_blank" href="(.*?)"'
            m_linklist = re.findall(res_href, line)
            # print(m_linklist)
            for link in m_linklist:
                fileurl.write(str(link) + '\n')  # e.g. "/html/5533.html"
                # Step 2: download the HTML of the topic page.
                # The downloaded HTML has no original image; appending '#p=1' fails with
                # HTTP Error 400. The request URL is invalid.
                html_url = '' + str(link)  # the site root is blank in the source
                print(html_url)
                html_content = urllib.request.urlopen(html_url).read().decode('utf-8')
                # Optional: comment this out to skip saving the static HTML
                open('yxdown/yxdown_html' + str(count) + '.html', 'w+', encoding='utf-8').write(html_content)
                # Step 3: find the image URLs.
                # The "view original" (查看原图) link is driven by JavaScript:
                # <a href="javascript:;" style="" onclick="return false;">查看原图</a>
                # and the page stores all image links between <script></script>
                html_script = r'<script>(.*?)</script>'
                m_script = re.findall(html_script, html_content, re.S | re.M)
                for script in m_script:
                    res_original = r'"original":"(.*?)"'  # original-size image
                    m_original = re.findall(res_original, script)
                    for pic_url in m_original:
                        print(pic_url)
                        fileurl.write(str(pic_url) + '\n')
                        # Step 4: download the image.
                        # If the site checks the client (Wikipedia does), install an opener
                        # that sends a User-Agent such as "Mozilla/5.0" before this call.
                        filename = os.path.basename(pic_url)  # strip the directory part, keep the file name
                        urllib.request.urlretrieve(pic_url, 'E:\\Picture3\\' + filename)
                        # A timeout showed up as IOError: [Errno socket error] [Errno 10060]
                # Handle only the first link; otherwise the same URL is output twice
                break
            # One more topic page handled on this list page
            count = count + 1
            time.sleep(0.1)
        else:
            print('no url about content')
    time.sleep(1)
    num = num + 1
else:
    print('Download Over!!!')
fileurl.close()
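Step 3 of the script is the part most worth testing in isolation, so here is a self-contained sketch of it. The function name original_urls and the sample script content are assumptions. The handling of `\/` assumes the page may escape slashes in its embedded JSON, which the source does not confirm.

```python
import re

ORIGINAL_RE = re.compile(r'"original":"(.*?)"')


def original_urls(html):
    """Return the "original" image URLs found inside <script> blocks, without duplicates."""
    urls = []
    for script in re.findall(r'<script[^>]*>(.*?)</script>', html, re.S):
        for raw in ORIGINAL_RE.findall(script):
            url = raw.replace('\\/', '/')  # undo \/ escaping if the page uses it
            if url not in urls:
                urls.append(url)
    return urls


# Made-up page fragment: one entry with escaped slashes, one without.
sample = '''
<script>var images = [{"big":"http:\\/\\/example.com\\/b\\/1.jpg","original":"http:\\/\\/example.com\\/o\\/1.jpg"},
{"big":"http://example.com/b/2.jpg","original":"http://example.com/o/2.jpg"}];</script>
'''
print(original_urls(sample))
```

Keeping the extraction in a pure function like this lets you check the regular expressions against a saved HTML file before running the full download loop.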
