Scraping Qiubai with Python (crawling Qiushibaike)

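Contents:

1. Python crawler project: scraping all of a user's information, such as gender and age
2. How do I scrape data with Python?
3. Python 3.4 + requests + re: a question about a Qiushibaike crawler clone
4. Python crawler: how do I scrape a joke's attached image along with the joke?
5. Beginner help with a Python crawler for Qiushibaike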

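Python crawler project: scraping all of a user's information, such as gender and age

A hands-on Python crawler project:

Scrape all the information about Qiushibaike users, including username, gender, age, and post content.

The project takes 10 steps. The walkthrough follows: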

1. Import the modules

import re
import urllib.request
from bs4 import BeautifulSoup

2. Add request headers so the server does not refuse the connection

def qiuShi(url, page):
    # ---- Imitate a real browser's request headers ----
    heads = {
        'Connection': 'keep-alive',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Accept': 'text/html,application/xhtml+xml,application/xml;'
                  'q=0.9,image/webp,image/apng,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    }
    # addheaders expects a list of (name, value) tuples
    headall = list(heads.items())
    opener = urllib.request.build_opener()
    opener.addheaders = headall
    urllib.request.install_opener(opener)
    # read() returns bytes; decode() turns them into str (UTF-8 by default)
    data = opener.open(url).read().decode()
    # ---- end ----

3. Create the soup parser object

    soup = BeautifulSoup(data, 'lxml')
    x = 0  # index of the current user on this page

4. Extract the usernames with the BeautifulSoup4 parser

    # ---- Usernames ----
    name = []
    unames = soup.find_all('h2')
    for uname in unames:
        name.append(uname.get_text())
    # ---- end ----

5. Extract the post content

    # ---- Post content ----
    cont = []
    data4 = soup.find_all('div', class_='content')
    data4 = str(data4)  # serialize the matches so they can be parsed again
    soup3 = BeautifulSoup(data4, 'lxml')
    contents = soup3.find_all('span')
    for content in contents:
        cont.append(content.get_text())
    # ---- end ----

6. Extract the funny score

    # ---- Funny score ----
    happy = []
    data2 = soup.find_all('span', class_="stats-vote")
    data2 = str(data2)  # convert the list to a string so it can be parsed again
    soup1 = BeautifulSoup(data2, 'lxml')
    happynumbers = soup1.find_all('i', class_="number")
    for happynumber in happynumbers:
        happy.append(happynumber.get_text())
    # ---- end ----

7. Extract the comment count

    # ---- Comment count ----
    comm = []
    data3 = soup.find_all('a', class_='qiushi_comments')
    data3 = str(data3)  # convert the list to a string so it can be parsed again
    soup2 = BeautifulSoup(data3, 'lxml')
    comments = soup2.find_all('i', class_="number")
    for comment in comments:
        comm.append(comment.get_text())
    # ---- end ----

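8. Extract gender and age with a regular expression

    # ---- Gender and age ----
    # Group 1 captures the gender word before "Icon"; group 2 captures the age digits.
    pattern1 = r'<div class="articleGender (\w*?)Icon">(\d*?)</div>'
    sexages = re.compile(pattern1).findall(data)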
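The gender and age expression in step 8 can be checked offline before it is pointed at a live page. The sketch below runs it against a hand-written HTML fragment. The pattern is written here as `<div class="articleGender (\w*?)Icon">(\d*?)</div>`, which is an assumption about its intended form, and the fragment and the names `sample_html` and `extract_gender_age` are made up for illustration; the real page markup may differ.

```python
import re

# Hypothetical fragment imitating the markup the pattern expects.
sample_html = (
    '<div class="articleGender manIcon">28</div>'
    '<div class="articleGender womenIcon">23</div>'
)

# Group 1: the word before "Icon" (gender). Group 2: the digits (age).
GENDER_AGE = re.compile(r'<div class="articleGender (\w*?)Icon">(\d*?)</div>')

def extract_gender_age(html):
    """Return a list of (gender, age) string tuples found in html."""
    return GENDER_AGE.findall(html)

pairs = extract_gender_age(sample_html)
print(pairs)  # [('man', '28'), ('women', '23')]
```

Because `findall` returns one tuple per match, posts without a gender block produce no tuple at all, so the result can be shorter than the list of usernames.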

9. Format the output of each user's information

    # ---- Print every user's information in bulk ----
    print()
    for sexage in sexages:
        sa = sexage
        print('*' * 17, '= = Page', page, '- user ' + str(x + 1) + ' = =', '*' * 17)
        print('[Username]:', name[x], end='')
        print('[Gender]:', sa[0], ' [Age]:', sa[1])
        print('[Content]:', cont[x])
        print('[Funny score]:', happy[x], ' [Comments]:', comm[x])
        print('*' * 25, ' divider ', '*' * 25)
        x += 1
    # ---- end ----

10. Loop over 13 pages of user information

for i in range(1, 14):
    # the page URL prefix inside the quotes is blank in the source
    url = ' ' + str(i) + '/'
    qiuShi(url, i)
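Building the page URLs separately from the fetching makes it easy to inspect them before any request is sent. This is only a sketch: `BASE_URL` is a placeholder, since the source does not show the real address, and `page_urls` is a hypothetical helper.

```python
# Placeholder address: the real site prefix is not given in the source.
BASE_URL = 'https://example.com/list/page/'

def page_urls(base, first=1, last=13):
    """Build the URL of every page from first to last, inclusive."""
    return [base + str(i) + '/' for i in range(first, last + 1)]

urls = page_urls(BASE_URL)
print(len(urls))  # 13
print(urls[0])    # https://example.com/list/page/1/
print(urls[-1])   # https://example.com/list/page/13/
```

Each URL can then be passed to `qiuShi` together with its page number, for example with `enumerate(urls, start=1)`, ideally with a short `time.sleep` between requests.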

Output (partial screenshot):


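How do I scrape data with Python?

Method / steps

Before scraping, you need two libraries: urllib and python-docx. urllib ships with Python's standard library, so only python-docx has to be installed.

Then import both libraries in your Python editor.

urllib fetches the page data. Fetching a page by itself is simple: pass the link to urllib's URL-opening function (urllib.request.urlopen in Python 3).

Fetching alone is not enough. The response must also be read, or there is no data to work with.

Next comes decoding, without which the save step fails. Decode the bytes returned by read() and assign the result to a variable, for example XA.

Finally, add three statements. The first creates a new blank Word document.

The second adds a body paragraph to the document containing the text held in XA.

The third saves the document as a .docx file; the file name goes inside the parentheses.

What you get this way is the page source. To filter it further, add your own regular expressions.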
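The steps above can be sketched as follows. The function names `fetch_text` and `save_docx` and the file name in the usage note are assumptions made for illustration, and a `data:` URL stands in for a real web page so that the example runs without a network connection.

```python
import base64
import urllib.request

def fetch_text(url, encoding='utf-8'):
    """Open the URL, read the raw bytes, and decode them to str."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()        # bytes
    return raw.decode(encoding)      # str

def save_docx(text, filename):
    """Save text as one paragraph of a new Word document (needs python-docx)."""
    from docx import Document        # imported here so fetch_text works without it
    document = Document()            # 1) new blank document
    document.add_paragraph(text)     # 2) add the text as a body paragraph
    document.save(filename)          # 3) write the .docx file

# Offline demonstration: a data: URL plays the role of the web page.
page = '<html><h2>demo</h2></html>'
payload = base64.b64encode(page.encode('utf-8')).decode('ascii')
XA = fetch_text('data:text/html;base64,' + payload)
print(XA)  # <html><h2>demo</h2></html>
```

With python-docx installed, `save_docx(XA, 'page.docx')` would write the fetched source into a Word file.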

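Python 3.4 + requests + re: a question about a Qiushibaike crawler clone

Python has proven powerful in many fields, including bioinformatics, statistics, web development, and computing. Like other scripting languages such as R and Perl, Python scripts can be run directly from the command line.

Tools / materials

Python; the CMD command line; the Windows operating system

Method / steps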

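1. First download and install Python. Python 3 is not backward compatible with Python 2, and Python 2 reached end of life in January 2020, so install a current Python 3 release.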

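2. Open a text editor (EditPlus and Notepad++ are good choices, since both recognize Python syntax) and save the file with a .py extension.

On Unix-like systems, the first line of the script should be #!/usr/bin/python

It marks the file as an executable Python script. When the script is run as "python script.py", as in step 4, the line is treated as an ordinary comment.

If the Python interpreter is not in /usr/bin, replace the path with the interpreter's actual location.

3. After writing the script, debug it; EditPlus can run it directly. Then open the CMD command line. This requires that Python has already been added to the PATH environment variable; if it has not, add it first.

4. In CMD, type "python" followed by a space, drag the finished script file to the cursor position, and press Enter to run it.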
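A minimal script for trying these steps might look like this. The file name `hello.py` and the function `greet` are made up for the example, and the first line uses the `env` form of the interpreter line, which looks Python up on the PATH instead of hard-coding /usr/bin.

```python
#!/usr/bin/env python3
"""hello.py: run from the command line with `python hello.py [name]`."""
import sys

def greet(name):
    """Return the greeting that the script prints."""
    return 'Hello, ' + name + '!'

if __name__ == '__main__':
    # sys.argv[0] is the script path; any extra words are arguments.
    target = sys.argv[1] if len(sys.argv) > 1 else 'world'
    print(greet(target))
```

Saved as hello.py, it runs with `python hello.py` and prints `Hello, world!`; `python hello.py Qiubai` prints `Hello, Qiubai!`.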

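Python crawler: how do I scrape a joke's attached image along with the joke?

First: get the URL of the image.

Then: download it.

① Use urllib: urllib.request.urlretrieve(url, path) downloads the file and saves it (in Python 2 the function was urllib.urlretrieve).

② Use open() to write the file in binary mode.

The first method is recommended.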
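Both download methods can be sketched as below. The helper names, the regular expression for `<img>` tags, and the sample HTML are assumptions made for illustration; a `data:` URL carrying four fake bytes stands in for a real image so that the example runs offline.

```python
import base64
import os
import re
import tempfile
import urllib.request

def find_image_urls(html):
    """Collect the src attribute of every <img> tag in html."""
    return re.findall(r'<img[^>]+src="([^"]+)"', html)

def download_with_urlretrieve(url, path):
    """Method 1: let urllib fetch the URL and write the file."""
    urllib.request.urlretrieve(url, path)

def download_with_open(url, path):
    """Method 2: read the bytes ourselves and write them in binary mode."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(path, 'wb') as f:
        f.write(data)

# Offline demonstration: the "image" is 4 fake bytes embedded in a data: URL.
fake_bytes = b'\x89PNG'
img_url = 'data:image/png;base64,' + base64.b64encode(fake_bytes).decode('ascii')
html = '<div class="thumb"><img alt="joke" src="' + img_url + '"></div>'

folder = tempfile.mkdtemp()
for n, url in enumerate(find_image_urls(html)):
    download_with_urlretrieve(url, os.path.join(folder, 'a%d.png' % n))
    download_with_open(url, os.path.join(folder, 'b%d.png' % n))
print(sorted(os.listdir(folder)))  # ['a0.png', 'b0.png']
```

On a real page, the image URLs are often relative or protocol-relative, in which case `urllib.parse.urljoin` with the page address turns them into full URLs before downloading.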

Beginner help with a Python crawler for Qiushibaike

Try BeautifulSoup. It parses the fetched page and lets you look up elements directly by tag, id, or class, which is convenient.
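As a rough illustration of looking elements up by tag, id, and class, the sketch below parses a hand-written fragment. The markup and the names `sample_html` and `posts` are assumptions; the real page structure may differ. It uses Python's built-in `html.parser` backend, so only the bs4 package itself needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
sample_html = """
<div id="content">
  <div class="article">
    <h2>user_a</h2>
    <div class="content"><span>First joke.</span></div>
    <span class="stats-vote"><i class="number">120</i></span>
  </div>
  <div class="article">
    <h2>user_b</h2>
    <div class="content"><span>Second joke.</span></div>
    <span class="stats-vote"><i class="number">45</i></span>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

posts = []
# Walk one article at a time so each user's fields stay together.
for article in soup.find(id='content').find_all('div', class_='article'):
    posts.append({
        'name': article.find('h2').get_text(strip=True),            # by tag
        'text': article.find('div', class_='content').span.get_text(strip=True),  # by class
        'votes': int(article.select_one('span.stats-vote i.number').get_text()),  # by CSS selector
    })

print(posts[0])  # {'name': 'user_a', 'text': 'First joke.', 'votes': 120}
```

Iterating per article, instead of collecting each field into its own page-wide list, keeps the fields aligned even when one post lacks an element.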
