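So that I don't completely forget what I know about web scraping, I wrote a small program for practice. It is actually much simpler than any of the programs I've written with the Scrapy framework; the only real progress is probably that it uses the MySQL knowledge I picked up a few days ago.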
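import re

import pymysql
import requests
from lxml import etree

# Request headers; fill in your own User-Agent string.
headers = {
    "user-agent": "your User-Agent"
}

# Connect to the local MySQL server; the `animes` database must already exist.
connection = pymysql.connect(
    host='localhost',
    user='root',
    password='your password',
    database='animes',
    charset='utf8mb4',
)
cursor = connection.cursor()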
for year in range(2010, 2020):
    for season in ['winter', 'fall', 'summer', 'spring']:
        url = 'https://myanimelist.net/anime/season/{}/{}'.format(year, season)
        r = requests.get(url, headers=headers)
        response = etree.HTML(r.content)
        seasonal_anime_list = response.xpath('//div[@class="seasonal-anime js-seasonal-anime"]')
        for anime in seasonal_anime_list:
            # Quotes are not escaped by hand; the parameterized queries below handle them.
            title = anime.xpath('.//a[@class="link-title"]/text()')[0]
            producer = anime.xpath('.//span[@class="producer"]//text()')[0]
            # Numeric fields fall back to 0 when they are missing or cannot be parsed.
            try:
                eps = re.findall(r'\d+', ''.join(anime.xpath('.//div[@class="eps"]//text()')))[0]
                eps = int(eps)
            except (IndexError, ValueError):
                eps = 0
            try:
                score = re.sub(r'\s', '', anime.xpath('.//span[@title="Score"]//text()')[0])
                score = float(score)
            except (IndexError, ValueError):
                score = 0
            try:
                members = re.sub(r'\s|,', '', anime.xpath('.//span[@title="Members"]//text()')[0])
                members = int(members)
            except (IndexError, ValueError):
                members = 0
            source = anime.xpath('.//span[@class="source"]//text()')[0]
            genre = ', '.join(re.findall(r'\S+', ''.join(anime.xpath('.//span[@class="genre"]//text()'))))
            anime_json = {
                'title': title,
                'producer': producer,
                'eps': eps,
                'score': score,
                'members': members,
                'source': source,
                'genre': genre,
                'season': season,
                'year': year
            }
            # Skip titles that are already stored.
            query_sql = "SELECT count(*) FROM `MyAnimeList` WHERE `title`=%s;"
            cursor.execute(query_sql, (title,))
            if_exist = cursor.fetchone()[0]
            if if_exist == 0:
                # Column names come from the dict keys; values are passed separately as parameters.
                insert_sql = 'INSERT INTO `MyAnimeList`({}) VALUES({})'.format(
                    ', '.join('`{}`'.format(i) for i in anime_json.keys()),
                    ', '.join(['%s'] * len(anime_json)))
                cursor.execute(insert_sql, list(anime_json.values()))
                connection.commit()
connection.close()
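The numeric fields (episodes, score, members) all follow the same pattern: clean up the scraped text, convert it, and fall back to 0 if that fails. As a rough sketch, that pattern could be pulled into two small helpers. The names `parse_int` and `parse_float` are hypothetical and not part of the script above, and the sample strings are made up.

```python
import re


def parse_int(text, default=0):
    """Return the first run of digits in text as an int, ignoring commas."""
    match = re.search(r'\d+', text.replace(',', ''))
    return int(match.group()) if match else default


def parse_float(text, default=0.0):
    """Return text as a float after stripping whitespace, or default on failure."""
    try:
        return float(re.sub(r'\s', '', text))
    except ValueError:
        return default


# Made-up sample strings shaped like the scraped text.
print(parse_int('12 eps'))
print(parse_int('\n 1,234,567 '))
print(parse_float(' 8.52\n'))
print(parse_float('N/A'))
```

With helpers like these, each `try`/`except` block in the loop would shrink to a single call on the joined XPath text.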
Before running this program, you need to create the database and table in MySQL:
DROP DATABASE IF EXISTS animes;
CREATE DATABASE animes;
USE animes;
DROP TABLE IF EXISTS MyAnimeList;
CREATE TABLE MyAnimeList(
    `id` INT AUTO_INCREMENT,
    `title` TEXT,
    `producer` TEXT,
    `eps` INT,
    `score` FLOAT,
    `members` INT,
    `source` TEXT,
    `genre` TEXT,
    `season` VARCHAR(6),
    `year` VARCHAR(4),
    PRIMARY KEY (`id`)
);
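One way to build an INSERT for this table without formatting the values into the SQL text is to generate `%s` placeholders and hand the values to the driver separately, so titles containing quotes need no manual escaping. The helper name `build_insert` and the sample row are assumptions made for illustration.

```python
def build_insert(table, row):
    """Return (sql, params) for a parameterized INSERT of one row.

    Column names come from the dict keys and are backtick-quoted;
    values stay as %s placeholders for the driver to fill in.
    """
    columns = ', '.join('`{}`'.format(key) for key in row)
    placeholders = ', '.join(['%s'] * len(row))
    sql = 'INSERT INTO `{}`({}) VALUES({})'.format(table, columns, placeholders)
    return sql, list(row.values())


# Made-up row; note the apostrophe in the title.
sql, params = build_insert('MyAnimeList', {'title': "Don't Panic", 'eps': 12})
print(sql)
print(params)
```

With pymysql the pair would be passed as `cursor.execute(sql, params)`. The table and column names are still interpolated into the string, so they should only ever come from trusted code, never from scraped data.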
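Honestly, I can't really say what the point of doing this is; let's just call it practice.
Oh, and while we're at it, let's take a look at how Americans rate anime.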
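To look at the ratings, the stored rows can be sorted by score. The sketch below uses the standard-library `sqlite3` module as an in-memory stand-in so it runs without a MySQL server; the rows are placeholders, not real scores, and `top_rated` is a hypothetical helper. With pymysql the same SELECT would go through `cursor.execute`, using `%s` instead of `?` as the placeholder.

```python
import sqlite3


def top_rated(conn, limit=5):
    """Return (title, score) pairs, highest score first."""
    cur = conn.execute(
        'SELECT title, score FROM MyAnimeList ORDER BY score DESC LIMIT ?',
        (limit,))
    return cur.fetchall()


# In-memory stand-in for the MySQL table, filled with placeholder rows.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE MyAnimeList (title TEXT, score REAL)')
conn.executemany('INSERT INTO MyAnimeList VALUES (?, ?)',
                 [('Show A', 7.1), ('Show B', 8.9), ('Show C', 6.4)])
print(top_rated(conn, 2))
```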




