Web 2022: My Web Crawler Learning Journey


This article records how I learned Python and successfully scraped data from 诗词吾爱 (诗词吾爱网, www.52shici.com). Along the way I also tried to crawl 周到上海 and 国际在线, but neither attempt succeeded, so there is still a long road ahead.

Over this semester's course I also learned to use many tools, such as Anaconda, PyCharm, and MySQL. Python can do far more than crawling, which is only a small part of it, so there is still plenty of room to grow.

1. Don't rush for quick results. Programming is not that hard, but it is not that easy either; don't expect to master it overnight, especially if your computer science fundamentals are weak.

2. You will run into many difficulties while learning, and much of it may be completely new to you. Make good use of Google, solve problems one at a time, and fill in whatever knowledge you are missing.

3. For beginners, the most important thing is not to chase every flashy framework or the newest technology. Frameworks and technologies are endless and new ones will keep appearing; what matters most is a solid grasp of the fundamentals. Very often, a problem you cannot solve comes down to a gap in your basic knowledge. Take it slowly; as you learn more and come back to old problems, things will suddenly become clear.

4. Always get hands-on and build something. A sense of achievement is a strong motivator to keep going; otherwise it is easy to give up at the first difficulty.

5. Keep at it.

requests

requests is a concise third-party library for handling HTTP requests. It is simple to use and is one of the most commonly used libraries for web crawling in Python.
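A minimal sketch of a GET request with requests (the target here is the site scraped later in this article):

import requests

# Fetch the 诗词吾爱 homepage and show the status code and page size.
resp = requests.get("https://www.52shici.com/", timeout=10)
resp.encoding = resp.apparent_encoding  # let requests guess the page encoding
print(resp.status_code, len(resp.text))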

BS4

BS4 is short for Beautiful Soup. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. Installation: pip install beautifulsoup4.
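A small sketch of parsing HTML with Beautiful Soup; the HTML string below is a made-up fragment just for illustration:

from bs4 import BeautifulSoup

html = "<ul><li><a href='/a'>Poem A</a></li><li><a href='/b'>Poem B</a></li></ul>"
soup = BeautifulSoup(html, "lxml")  # "html.parser" also works if lxml is not installed
for a in soup.find_all("a"):
    print(a.get("href"), a.get_text())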

pandas

pandas is a tool built on top of NumPy, created for data-analysis tasks. It incorporates a large number of libraries and some standard data models, provides the tools needed to work with large datasets efficiently, and offers many functions and methods that make data handling fast and convenient.
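A small sketch of collecting scraped records into a pandas DataFrame and saving them; the two records below are placeholders, not real data:

import pandas as pd

rows = [
    {"title": "Poem A", "date": "2022-06-01"},
    {"title": "Poem B", "date": "2022-06-02"},
]
df = pd.DataFrame(rows)
print(df.head())
df.to_csv("poems.csv", index=False, encoding="utf-8-sig")  # utf-8-sig keeps Chinese text readable in Excel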

Selenium

Selenium is a browser-based automation tool that provides a cross-platform, cross-browser, end-to-end web automation solution; few other testing tools cover as many platforms. Running tests in a real browser with Selenium has many other benefits as well. Its main use is compatibility testing: checking whether your application works correctly across different browsers and operating systems.
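A minimal Selenium 4 sketch, assuming Chrome and a matching driver are available; the target URL is the site used later in this article:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4 can locate the driver automatically via Selenium Manager
try:
    driver.get("https://www.52shici.com/")
    print(driver.title)  # title as rendered by the real browser
    links = driver.find_elements(By.TAG_NAME, "a")
    print(len(links), "links on the page")
finally:
    driver.quit()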

Scrapy

Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
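A minimal one-file spider sketch, runnable with scrapy runspider, just to show the idea of yielding structured items; it only extracts the page title:

import scrapy


class TitleSpider(scrapy.Spider):
    # Fetch one page and yield its <title> text as a structured item.
    name = "title_demo"
    start_urls = ["https://www.52shici.com/"]

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}

# Run with: scrapy runspider title_demo.py -o titles.json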

gerapy_auto_extractor

Gerapy is a distributed crawler management framework. It supports Python 3 and is built on Scrapy, Scrapyd, Scrapyd-Client, Scrapy-Redis, Scrapyd-API, Scrapy-Splash, Jinja2, Django, and Vue.js. Gerapy helps us control crawler runs more conveniently, view crawler status more intuitively, see crawl results closer to real time, deploy projects more easily, and manage hosts in a more unified way.
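The gerapy_auto_extractor package named in this heading is Gerapy's companion library for rule-free extraction; as far as I know it exposes extract_list and extract_detail helpers (the spider code later in this article even has a commented-out extract_detail call). A hedged sketch of typical usage:

import requests
from gerapy_auto_extractor import extract_list, extract_detail

html = requests.get("https://www.52shici.com/audios.php", timeout=10).text
print(extract_list(html))  # tries to guess the list items (titles and URLs) on a list page
# For a single article page, extract_detail(html) tries to guess the title, datetime and content.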

I. Implementing a Simple Crawler

Press Win+R, type cmd to open a terminal, and run the code below.
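The original post only showed this step as a screenshot, so here is a minimal sketch of what such a simple crawler could look like with requests and BeautifulSoup, saved as simple_crawler.py and run with python simple_crawler.py; it targets the 52shici audios page used later in this article:

# simple_crawler.py -- minimal sketch; install dependencies first: pip install requests beautifulsoup4 lxml
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.52shici.com/audios.php", timeout=10)
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, "lxml")
# Print the title of every work in the recitation list (the same element the Scrapy XPath selects later).
for span in soup.select("#listWorks li span.l a span"):
    print(span.get_text(strip=True))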

Result:

II. Using the Scrapy Crawler Framework

Task: use the Scrapy framework to crawl the content of the specified pages and store the data in a MongoDB database.

Creating the Scrapy project

Download and install Scrapy

The command is as follows:

pip install scrapy==2.6.1

Press Win+R, type cmd to open a terminal, and run the command above in it.

Create the crawler project

First, go to the folder where the crawler project should be created, right-click and open a PowerShell terminal there, and create a new project with the following command.

The command is as follows:

scrapy startproject PoemData
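The command generates the standard Scrapy project skeleton (shown here for the PoemData name that the rest of the code in this article uses):

PoemData/
    scrapy.cfg
    PoemData/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py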

Configuring the Scrapy framework

The IDE used is PyCharm.

Configuring items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PoemdataItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()        # title of the work
    url = scrapy.Field()          # URL of the work
    date = scrapy.Field()         # publication date
    content = scrapy.Field()      # article body
    site = scrapy.Field()         # site name
    item = scrapy.Field()         # column / section label
    student_id = scrapy.Field()   # student ID

Configuring middlewares.py

# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class PoemdataSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class PoemdataDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from scrapy.utils.project import get_project_settings
import random
 
settings = get_project_settings()
 
class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Downloader middleware that sets a Referer header and a random User-Agent on every request."""
    def process_request(self, request, spider):
        # Use the request's own URL as the Referer header.
        referer = request.url
        if referer:
            request.headers["referer"] = referer
        # Pick a random User-Agent from USER_AGENT_LIST defined in settings.py.
        USER_AGENT_LIST = settings.get('USER_AGENT_LIST')
        user_agent = random.choice(USER_AGENT_LIST)
        if user_agent:
            request.headers.setdefault('user-Agent', user_agent)
            print(f"user-Agent:{user_agent}")

Configuring pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymongo
from scrapy.utils.project import get_project_settings
 
settings = get_project_settings()

class PoemdataPipeline:
    def __init__(self):
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DATABASE"]
        sheetname = settings["MONGODB_TABLE"]
        # username = settings["MONGODB_USER"]
        # password = settings["MONGODB_PASSWORD"]
        # Create the MongoDB client connection
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection that stores the scraped data
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # Write each scraped item into MongoDB
        self.post.insert_one(data)
        return item

Configuring settings.py

# Scrapy settings for PoemData project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'PoemData'

SPIDER_MODULES = ['PoemData.spiders']
NEWSPIDER_MODULE = 'PoemData.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'PoemData (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'PoemData.middlewares.PoemdataSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'PoemData.middlewares.RotateUserAgentMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'PoemData.pipelines.PoemdataPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

USER_AGENT_LIST = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
 
 
# MongoDB storage settings
MONGODB_HOST = "localhost"  # MongoDB host
MONGODB_PORT = 27017  # MongoDB port
MONGODB_DATABASE = "PoemData"  # database name
MONGODB_TABLE = "Poem_Process_A"  # collection name
HTTPERROR_ALLOWED_CODES = [404]  # allow these HTTP error codes through; if the crawl reports 403, add 403 to this list

Creating the spider file

scrapy genspider poem " "

Editing the spider code

# -*- coding: utf-8 -*-
import scrapy
from PoemData.items import PoemdataItem
from bs4 import BeautifulSoup

class PoemSpider(scrapy.Spider):
    name = 'poem'
    allowed_domains = [' ']
    start_urls = [
        ['https://www.52shici.com/audios.php','诗词吾爱网','朗诵1','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=2','诗词吾爱网','朗诵2','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=3','诗词吾爱网','朗诵3','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=4','诗词吾爱网','朗诵4','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=5','诗词吾爱网','朗诵5','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=6','诗词吾爱网','朗诵6','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=7','诗词吾爱网','朗诵7','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=8','诗词吾爱网','朗诵8','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=9','诗词吾爱网','朗诵9','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=10','诗词吾爱网','朗诵10','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=11','诗词吾爱网','朗诵11','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=12','诗词吾爱网','朗诵12','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=13','诗词吾爱网','朗诵13','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=14','诗词吾爱网','朗诵14','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=15','诗词吾爱网','朗诵15','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=16','诗词吾爱网','朗诵16','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=17','诗词吾爱网','朗诵17','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=18','诗词吾爱网','朗诵18','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=19','诗词吾爱网','朗诵19','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=20','诗词吾爱网','朗诵20','20201928张文睿'],
        ['https://www.52shici.com/audios.php?page=21','诗词吾爱网','朗诵21','20201928张文睿'],
    ]
    def start_requests(self):
        # Attach the site name, column label and student ID to each request via meta.
        for url in self.start_urls:
            item = PoemdataItem()
            item["site"] = url[1]
            item["item"] = url[2]
            item["student_id"] = url[3]
            #self._monkey_patching_HTTPClientParser_statusReceived()
            yield scrapy.Request(url=url[0],meta={"item": item},callback=self.parse,dont_filter=True)
 
    def parse(self, response):
        # Recover the metadata attached in start_requests.
        item = response.meta["item"]
        site_ = item["site"]
        item_ = item["item"]
        student_id_ = item["student_id"]

        # Extract the title, link and date of every work on the list page.
        title_list = response.xpath('//*[@id="listWorks"]/li/span[@class="l"]/a/span/text()').extract()
        url_list = response.xpath('//*[@id="listWorks"]/li/span[@class="l"]/a/@href').extract()
        date_list = response.xpath('//*[@id="listWorks"]/li/span[@class="r"]/text()').extract()
        for each in range(len(title_list)):
            item = PoemdataItem()
            item["title"] = title_list[each]
            # item["url"] = "https://www.diyifanwen.com/zuowen/yilunwen/" + str(url_list[each])
            item["url"] = "https://www.52shici.com/" + url_list[each]
            item["date"] = date_list[each]
            item["site"] = site_
            item["item"] = item_
            item["student_id"]= student_id_
            yield scrapy.Request(url=item["url"],meta={"item": item},callback=self.parse_detail,dont_filter=True)
 
    def parse_detail(self, response):
        #data = extract_detail(response.text)
        item = response.meta["item"]
        # The body of the work lives in <div class="works-content">; strip the HTML tags with BeautifulSoup.
        strs = response.xpath('//div[@class="works-content"]').extract_first()
        item["content"] = BeautifulSoup(strs, 'lxml').text
        return item
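Before launching the full crawl, the XPath selectors above can be checked interactively in the Scrapy shell (a quick sketch; the output depends on the live page):

scrapy shell "https://www.52shici.com/audios.php"
>>> response.xpath('//*[@id="listWorks"]/li/span[@class="l"]/a/span/text()').extract()[:3]
>>> response.xpath('//*[@id="listWorks"]/li/span[@class="l"]/a/@href').extract()[:3]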

Running the spider

scrapy crawl poem

The stored records can be viewed in Navicat.
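If Navicat is not at hand, the same check can be done with a few lines of pymongo, using the database and collection names from settings.py:

import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["PoemData"]["Poem_Process_A"]
print(collection.count_documents({}))  # number of scraped works
print(collection.find_one({}, {"_id": 0}))  # peek at one record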

III. Setting Up Gerapy

Open cmd and install Gerapy and Scrapyd:

pip install gerapy==0.9.11
pip install scrapyd

Open a terminal in the target folder and run the following commands there:

gerapy init
gerapy migrate

Then run the following command in the terminal:

gerapy initadmin

The username is admin and the password is admin.

Then run the following command in the terminal:

gerapy runserver 0.0.0.0:8000

Find where the scrapyd executable is installed, run scrapyd.exe from PowerShell, and keep that window open.

Then open 127.0.0.1:8000 in the browser and log in.

Editing the project

(1) Host management

(2) Project management: copy the crawler project folder into the projects folder under the gerapy directory, then refresh the page in the browser.

(3) Task management: set the schedule interval as needed. The Hong_Kong timezone is recommended.

Summary

That is what I wanted to share: a brief introduction to using Scrapy and Gerapy.
