
Techniques for Scraping WeChat Official Account Articles: Tips for Extracting Useful Content

WeChat Official Accounts are one of today's major channels for publishing information, and a large number of articles appear on the platform every day. Few of us have time to read them one by one, so this article shows how to fetch Official Account articles programmatically and extract the content we actually need.

1. Fetching Article URLs

To scrape an Official Account article you first need its URL. Article URLs can be obtained from the article-list interface of the Official Account admin console:

```python
import requests

# Interface used by the Official Account admin console to list articles
article_url_api = 'https://mp.weixin.qq.com/cgi-bin/appmsg'

# Given your cookie and an offset into the account's article list,
# fetch one page of article URLs
def get_article_url_cookie(offset):
    # Replace with your own cookie string copied from the admin console
    cookie = 'xxx'
    headers = {
        'Cookie': cookie,
    }
    params = (
        ('action', 'list_ex'),
        ('begin', str(offset)),        # pagination offset
        ('count', '5'),                # articles per page
        ('fakeid', '123456'),          # replace with the target account's fakeid
        ('type', '9'),
        ('query', ''),
        ('token', '123456789'),        # replace with your own login token
        ('lang', 'zh_CN'),
        ('f', 'json'),
        ('ajax', '1'),
        ('random', '0.12345678901234567'),
        ('created', '7'),
        ('scene', '124'),
        ('devicetype', 'Windows 10'),
        ('appmsg_token', '123456789'),
    )
    response = requests.get(article_url_api, headers=headers, params=params)
    results = response.json().get('app_msg_list') or []
    urls = []
    for res in results:
        urls.append(res.get('link'))
    return urls

# Page through the first 50 articles, 5 at a time, printing their URLs
def get_article_url_by_offset():
    for i in range(0, 50, 5):
        urls = get_article_url_cookie(i)
        print(urls)
```

The code above uses Python's requests module to send the HTTP request and reads the article links out of the JSON response. Replace the cookie, fakeid and token placeholders with the values from your own logged-in session.
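In practice this interface may throttle a session that pages too quickly, so it is worth pausing between requests and stopping once an empty page comes back. Below is a minimal sketch that reuses get_article_url_cookie from above; the 3-second pause, the page size of 5 and the empty-page stop condition are assumptions, not values required by the interface.

```python
import time

# Rate-limited paging over the article list, reusing get_article_url_cookie.
# Pause length and page size are assumptions; tune them to your own session.
def collect_article_urls(pages=10, page_size=5, pause_seconds=3):
    all_urls = []
    for page in range(pages):
        urls = get_article_url_cookie(page * page_size)
        if not urls:               # an empty page usually means we've reached the end
            break
        all_urls.extend(urls)
        time.sleep(pause_seconds)  # pause so the interface is less likely to throttle us
    return all_urls
```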

2. Fetching Article Content

With the article URLs in hand, we send another HTTP request to download each article page. Here we use Python's BeautifulSoup library, which makes it easy to parse the HTML and pull out the fields we need:

```python
import requests
from bs4 import BeautifulSoup

cookies = {'key': 'value'}  # Replace with your own cookies
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 UBrowser/6.2.3964.2 Safari/537.36'
}

def get_article_content(url):
    response = requests.get(url, headers=headers, cookies=cookies)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'html.parser')
    # Title, author and publish date sit in fixed elements of the article page
    title = soup.select('#activity-name')[0].get_text(strip=True)
    author = soup.select('#meta_content > span.rich_media_meta.rich_media_meta_text.rich_media_meta_nickname')[0].get_text(strip=True)
    date = soup.select('#meta_content > span.rich_media_meta.rich_media_meta_text')[1].get_text(strip=True)
    # Keep the body as raw HTML so formatting and images are preserved
    content = str(soup.select('#js_content')[0])
    return title, author, date, content

url = 'https://mp.weixin.qq.com/s/xxxxxx'
title, author, date, content = get_article_content(url)
print("title:", title)
print("author:", author)
print("date:", date)
print("content:", content)
```

The CSS selectors #activity-name, #meta_content and #js_content pick out the article's title, author, publish date and body.
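Beyond the raw HTML of #js_content, you often want the plain text or the image addresses. The following sketch builds on the same BeautifulSoup parsing; note that WeChat article pages typically lazy-load images via a data-src attribute rather than src, so treat that attribute name as an assumption and verify it against the actual page source.

```python
from bs4 import BeautifulSoup

# Pull plain text and image addresses out of the #js_content HTML returned by
# get_article_content. The data-src attribute is an assumption based on how
# article pages usually lazy-load images; fall back to src just in case.
def extract_text_and_images(content_html):
    soup = BeautifulSoup(content_html, 'html.parser')
    text = soup.get_text('\n', strip=True)
    images = [img.get('data-src') or img.get('src') for img in soup.find_all('img')]
    return text, [u for u in images if u]
```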

3. Saving Article Content

Once the content has been fetched, we can save it to a local file for later processing, using Python's built-in open function:

```python
import requests
from bs4 import BeautifulSoup

cookies = {'key': 'value'}  # Replace with your own cookies
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 UBrowser/6.2.3964.2 Safari/537.36'
}

def get_article_content(url):
    response = requests.get(url, headers=headers, cookies=cookies)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.select('#activity-name')[0].get_text(strip=True)
    author = soup.select('#meta_content > span.rich_media_meta.rich_media_meta_text.rich_media_meta_nickname')[0].get_text(strip=True)
    date = soup.select('#meta_content > span.rich_media_meta.rich_media_meta_text')[1].get_text(strip=True)
    content = str(soup.select('#js_content')[0])
    return title, author, date, content

# Wrap the extracted fields in a simple HTML page and write it to a local file
def save_article_html(title, author, date, content, filename):
    html = f"""<html>
<head><meta charset="utf-8"><title>{title}</title></head>
<body>
<p>title: {title}</p>
<p>author: {author}</p>
<p>date: {date}</p>
{content}
</body>
</html>"""
    with open(filename, mode='w', encoding='utf-8') as f:
        f.write(html)
```
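Putting the three steps together, a small driver loop can walk the account's article list, parse each article, and write one HTML file per article. Below is a minimal sketch that reuses get_article_url_cookie, get_article_content and save_article_html from the sections above; the file-naming scheme and the 3-second pause between articles are assumptions, not part of any fixed recipe.

```python
import re
import time

# Fetch the account's articles page by page, parse each one, and save it
# as a local HTML file named after the article title.
def crawl_account(pages=10):
    for page in range(pages):
        urls = get_article_url_cookie(page * 5)
        if not urls:
            break
        for url in urls:
            title, author, date, content = get_article_content(url)
            safe_title = re.sub(r'[\\/:*?"<>|\s]+', '_', title)[:50]  # file-name-safe title
            save_article_html(title, author, date, content, f'{safe_title}.html')
            time.sleep(3)  # assumed pause to avoid hammering mp.weixin.qq.com
```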