A Crawler for the List of Chinese Foundation Annual Reports

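To loop over every annual report on the China Social Organization Public Service Platform (中国社会组织公共服务平台), I first need the list of all the reports. A quick look at how the official site requests its data shows that it mainly POSTs the id of each annual report to fetch the related information, so all I need is the id of every report.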

import re
import multiprocessing as mp

import pymongo
import requests
from bs4 import BeautifulSoup


def get_list(page):
    # Fetch one page of the report list and store every entry in MongoDB.
    url = 'http://www.chinanpo.gov.cn/bgsindex.html'
    headers = {
        'Host': 'www.chinanpo.gov.cn',
        'Origin': 'http://www.chinanpo.gov.cn',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.chinanpo.gov.cn/bgsindex.html',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
    }

    form_data = {
        'title': '',
        'websitId': '1000001',
        'page_flag': 'true',
        'goto_page': 'next',
        'current_page': page,
        'total_count': '2229'
    }
    r = requests.post(url, headers=headers, data=form_data)
    soup = BeautifulSoup(r.text, 'lxml')
    links = soup.find_all('td', align='left', valign='middle')

    # Connect inside the worker: a global created under __main__ is not
    # visible to workers started with "spawn", and a MongoClient should
    # not be shared across processes.
    client = pymongo.MongoClient('mongodb://localhost:27017')
    li = client['fund']['li']
    for link in links:
        href_ids = re.findall('[0-9]+', link.a.get('href'))
        title = link.a.string
        piece = {
            'report_id': href_ids[0],
            'diction_id': href_ids[1],
            'fund_name': re.findall('(.*?)[0-9]', title)[0],
            'year': re.findall('[0-9]+', title)[0]
        }
        li.insert_one(piece)
    client.close()
    print('Finished page {}.'.format(page))


if __name__ == '__main__':
    pages = list(range(0, 113))
    # map() blocks until every page is done and re-raises worker errors;
    # map_async() without .get() would drop them silently.
    with mp.Pool() as p:
        p.map(get_list, pages)
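
The field extraction in the loop can be tried out without touching the site. Below is a minimal sketch that pulls the same four fields out of one link's `href` and text using only the standard library. The helper name `parse_entry` and the sample `href` and title are made up for illustration; the real markup on the site may look different.

```python
import re


def parse_entry(href, title):
    """Split one list entry into the four fields the crawler stores.

    Returns None when the entry does not have the expected shape
    (two digit groups in the href, a name followed by a year in the title).
    """
    ids = re.findall(r'[0-9]+', href)
    name = re.match(r'(.*?)[0-9]', title)   # text before the first digit
    year = re.search(r'[0-9]+', title)      # first run of digits
    if len(ids) < 2 or name is None or year is None:
        return None
    return {
        'report_id': ids[0],
        'diction_id': ids[1],
        'fund_name': name.group(1),
        'year': year.group(0),
    }


# Hypothetical sample values, not copied from the real site.
sample = parse_entry("javascript:toHref('12345','678');",
                     '示例基金会2018年度工作报告')
print(sample)
```

Returning `None` for an unexpected entry lets the caller skip it instead of crashing on an `IndexError`. One limitation carries over from the original regexes: a foundation name that itself contains a digit would be cut off at that digit.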

The data is stored in the li collection of the fund database; I'll pull it out and keep working on it when I have time.
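
Because the crawler writes with `insert_one`, running it a second time adds the same entries again. Before going further with the data it may be worth dropping duplicates. The sketch below assumes that the pair (`report_id`, `diction_id`) identifies a report, which I have not verified against the site; `dedupe_reports` is a hypothetical helper, and the records are illustrative.

```python
def dedupe_reports(records):
    """Keep the first record seen for each (report_id, diction_id) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec['report_id'], rec['diction_id'])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Illustrative records; in practice they could come from list(li.find()).
records = [
    {'report_id': '1', 'diction_id': '10', 'fund_name': 'A', 'year': '2018'},
    {'report_id': '2', 'diction_id': '10', 'fund_name': 'B', 'year': '2018'},
    {'report_id': '1', 'diction_id': '10', 'fund_name': 'A', 'year': '2018'},
]
unique_records = dedupe_reports(records)
print(len(unique_records))
```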

First, though, I have to wait for the maintenance to finish.
