Big Data Collection with Scrapy: A Crawler Example That Scrapes Douban Group Information and Saves It to MongoDB
沉沙 · 2019-04-25

Abstract: This article walks through a big data collection example with Scrapy: a crawler that scrapes Douban group information and saves it into MongoDB. Hopefully reading it gives you something useful and deepens your understanding of the topic.


I'm using Scrapy 0.24 here.

Let's start with the finished product, so you can get a feel for how much convenience this framework brings.

I've recently wanted to learn Git, so I put the code on git-osc:

https://git.oschina.net/1992mrwang/doubangroupspider

First, a word about what this toy crawler is for: it crawls the groups listed on the seed URL pages and extracts, for each group, the links to related groups, the number of members, the group name, and so on.

The resulting data looks roughly like this:

{   'RelativeGroups': [u'http://www.douban.com/group/10127/',
                       u'http://www.douban.com/group/seventy/',
                       u'http://www.douban.com/group/lovemuseum/',
                       u'http://www.douban.com/group/486087/',
                       u'http://www.douban.com/group/lovesh/',
                       u'http://www.douban.com/group/NoAstrology/',
                       u'http://www.douban.com/group/shanghaijianzhi/',
                       u'http://www.douban.com/group/12658/',
                       u'http://www.douban.com/group/shanghaizufang/',
                       u'http://www.douban.com/group/gogo/',
                       u'http://www.douban.com/group/117546/',
                       u'http://www.douban.com/group/159755/'],
    'groupName': u'\u4e0a\u6d77\u8c46\u74e3',
    'groupURL': 'http://www.douban.com/group/Shanghai/',
    'totalNumber': u'209957'}

What is this good for? With this data you can, for example, analyze how closely related any two groups are, and with a bit more effort you could scrape far more information. I won't go down that path here, since this article is mainly meant as a quick hands-on taste; a small sketch of the relatedness idea follows below.
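
As a rough illustration of that relatedness idea, here is a minimal sketch (my own, not part of the article's code) that scores two stored groups by the Jaccard similarity of their RelativeGroups lists. It assumes the data already sits in the douban/doubanGroup collection that the pipeline below writes to, and that a reasonably recent pymongo is installed.

import pymongo

# Connect to the same database/collection the pipeline below writes to.
client = pymongo.MongoClient('localhost', 27017)
collection = client['douban']['doubanGroup']

def jaccard(a, b):
    """Jaccard similarity of two iterables of group URLs."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / float(len(sa | sb))

def group_similarity(url_a, url_b):
    """Look up two groups by groupURL and score how related they are."""
    doc_a = collection.find_one({'groupURL': url_a})
    doc_b = collection.find_one({'groupURL': url_b})
    if not doc_a or not doc_b:
        return None
    return jaccard(doc_a.get('RelativeGroups', []),
                   doc_b.get('RelativeGroups', []))

print(group_similarity('http://www.douban.com/group/Shanghai/',
                       'http://www.douban.com/group/shanghaizufang/'))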


First, start a new project named douban:

# scrapy startproject douban

# cd douban

This is the complete directory layout of the finished project.
(P.S. I renamed the top-level project directory for neatness when putting it on git-osc; it makes no difference once you clone it.)
mrwang@mrwang-ubuntu:~/student/py/douban$ tree
.
├── douban
│   ├── __init__.py
│   ├── items.py                 # item (entity) definitions
│   ├── pipelines.py             # item pipeline (data persistence)
│   ├── settings.py              # settings
│   └── spiders
│       ├── BasicGroupSpider.py  # the spider that actually does the crawling
│       └── __init__.py
├── nohup.out                    # log file produced by running under nohup
├── scrapy.cfg
├── start.sh                     # a very simple convenience script to start the crawl
├── stop.sh                      # a very simple convenience script to stop it
└── test.log                     # crawl log, referenced in the start script


Write the item definitions in items.py, mainly so that the scraped data can be persisted conveniently.

mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field
 
class DoubanItem(Item):
    # define the fields for your item here like:
    # name = Field()
    groupName = Field()
    groupURL = Field()
    totalNumber = Field()
    RelativeGroups = Field()
    ActiveUesrs = Field()
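
As a quick orientation (my own sketch, not from the article): a Scrapy Item behaves like a constrained dict, and dict(item) is exactly what the MongoDB pipeline later hands to pymongo.

# -*- coding: utf-8 -*-
from douban.items import DoubanItem

item = DoubanItem()
# Only the fields declared in DoubanItem may be assigned.
item['groupName'] = u'上海豆瓣'
item['groupURL'] = 'http://www.douban.com/group/Shanghai/'
item['totalNumber'] = u'209957'
item['RelativeGroups'] = ['http://www.douban.com/group/gogo/']

# dict(item) is what the pipeline inserts into MongoDB.
print(dict(item))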


Write the spider, with a few custom rules for processing the data.

mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/spiders/BasicGroupSpider.py
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from douban.items import DoubanItem
import re


class GroupSpider(CrawlSpider):
    # spider name
    name = "Group"

    allowed_domains = ["douban.com"]
    # seed URLs
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3"
    ]

    # Rules: when an extracted link matches, the response is handled by the
    # function named in callback; process_request lets us tweak each request
    # (here, to attach cookies) before it is scheduled.
    rules = [
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )),
             callback='parse_group_home_page', process_request='add_cookie'),
        Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )), follow=True,
             process_request='add_cookie'),
    ]

    def __get_id_from_group_url(self, url):
        m = re.search("^http://www.douban.com/group/([^/]+)/$", url)
        if m:
            return m.group(1)
        else:
            return 0

    def add_cookie(self, request):
        # Attach cookies to the request; left empty here, fill in your own
        # douban cookies if anonymous crawling gets blocked.
        request.replace(cookies=[])
        return request

    def parse_group_topic_list(self, response):
        # Not wired to any rule yet; placeholder for crawling topic lists.
        self.log("Fetch group topic list page: %s" % response.url)
        pass

    def parse_group_home_page(self, response):
        self.log("Fetch group home page: %s" % response.url)

        # HtmlXPathSelector is Scrapy 0.24's XPath-based selector.
        hxs = HtmlXPathSelector(response)
        item = DoubanItem()

        # get group name
        item['groupName'] = hxs.select('//h1/text()').re("^\s+(.*)\s+$")[0]

        # get group id
        item['groupURL'] = response.url
        groupid = self.__get_id_from_group_url(response.url)

        # get group members number
        members_url = "http://www.douban.com/group/%s/members" % groupid
        members_text = hxs.select('//a[contains(@href, "%s")]/text()' % members_url).re("\((\d+)\)")
        item['totalNumber'] = members_text[0]

        # get relative groups
        item['RelativeGroups'] = []
        groups = hxs.select('//div[contains(@class, "group-list-item")]')
        for group in groups:
            url = group.select('div[contains(@class, "title")]/a/@href').extract()[0]
            item['RelativeGroups'].append(url)
        return item
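
If you want to sanity-check the XPath and regex logic without running the whole crawl, you can exercise it against a hand-written HTML snippet. A minimal sketch (mine, not the article's): it uses the current scrapy.selector.Selector API rather than the 0.24-era HtmlXPathSelector, and the HTML below is a made-up stand-in for a real group page.

# -*- coding: utf-8 -*-
# Quick offline check of the spider's XPath/regex logic against fake HTML.
from scrapy.selector import Selector

html = u"""
<html><body>
  <h1> 上海豆瓣 </h1>
  <a href="http://www.douban.com/group/Shanghai/members">成员 (209957)</a>
  <div class="group-list-item">
    <div class="title"><a href="http://www.douban.com/group/gogo/">gogo</a></div>
  </div>
</body></html>
"""

sel = Selector(text=html)

group_name = sel.xpath('//h1/text()').re(r"^\s+(.*)\s+$")[0]
members = sel.xpath('//a[contains(@href, "/group/Shanghai/members")]/text()').re(r"\((\d+)\)")[0]
related = sel.xpath('//div[contains(@class, "group-list-item")]'
                    '/div[contains(@class, "title")]/a/@href').extract()

print(group_name, members, related)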


Write the item pipeline. At this stage I store the data the spider collects into MongoDB.

mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem
class DoubanPipeline(object):
    def __init__(self):
        self.server = settings['MONGODB_SERVER']
        self.port = settings['MONGODB_PORT']
        self.db = settings['MONGODB_DB']
        self.col = settings['MONGODB_COLLECTION']
        connection = pymongo.Connection(self.server, self.port)
        db = connection[self.db]
        self.collection = db[self.col]
    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg('Item written to MongoDB database %s/%s' % (self.db, self.col),level=log.DEBUG, spider=spider)
        return item
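
Note that pymongo.Connection, scrapy.log, and scrapy.conf only exist in the old versions used here (pymongo 2.x, Scrapy 0.24). On a current stack the pipeline would look roughly like the following sketch; this is my adaptation, not the article's code, and it assumes the same MONGODB_* setting names as above.

# -*- coding: utf-8 -*-
# Rough modernized equivalent of the pipeline above (assumes pymongo 3+ and a
# recent Scrapy): MongoClient/insert_one replace Connection/insert, and
# settings are read from the crawler instead of scrapy.conf.
import pymongo


class DoubanPipeline(object):
    def __init__(self, server, port, db, col):
        self.client = pymongo.MongoClient(server, port)
        self.collection = self.client[db][col]

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(s.get('MONGODB_SERVER', 'localhost'),
                   s.getint('MONGODB_PORT', 27017),
                   s.get('MONGODB_DB', 'douban'),
                   s.get('MONGODB_COLLECTION', 'doubanGroup'))

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        spider.logger.debug('Item written to MongoDB: %s', item.get('groupURL'))
        return item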


In the settings file, configure the item pipeline to use, the MongoDB connection parameters, and a user-agent to make the crawler less likely to be banned.

mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for douban project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
# Add a download delay to reduce the load on the server and keep a lower profile
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True
# Configure which item pipelines to use
ITEM_PIPELINES = ['douban.pipelines.DoubanPipeline']
MONGODB_SERVER='localhost'
MONGODB_PORT=27017
MONGODB_DB='douban'
MONGODB_COLLECTION='doubanGroup'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
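
A side note: the list form of ITEM_PIPELINES shown above works in Scrapy 0.24, but newer Scrapy versions expect a dict mapping each pipeline class path to an order value, roughly like this (my sketch, not from the article):

# Newer Scrapy wants a dict: pipeline path -> order (0-1000, lower runs first).
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}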


OK, that's a simple toy crawler done.

Launch it with:

nohup scrapy crawl Group --logfile=test.log &


===========================  Update 2014/12/02  ===================================

I found that someone on GitHub had already had the same idea as me and rewritten the scheduler to store the pages still to be visited in MongoDB, so I followed their example and wrote one of my own to use.

mrwang@mrwang-ThinkPad-Edge-E431:~/student/py/douban$ cat douban/scheduler.py
from scrapy.utils.reqser import request_to_dict, request_from_dict
import pymongo
import datetime


class Scheduler(object):
    def __init__(self, mongodb_server, mongodb_port, mongodb_db, persist, queue_key, queue_order):
        self.mongodb_server = mongodb_server
        self.mongodb_port = mongodb_port
        self.mongodb_db = mongodb_db
        self.queue_key = queue_key
        self.persist = persist
        self.queue_order = queue_order

    def __len__(self):
        return self.client.size()

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        mongodb_server = settings.get('MONGODB_QUEUE_SERVER', 'localhost')
        mongodb_port = settings.get('MONGODB_QUEUE_PORT', 27017)
        mongodb_db = settings.get('MONGODB_QUEUE_DB', 'scrapy')
        persist = settings.get('MONGODB_QUEUE_PERSIST', True)
        queue_key = settings.get('MONGODB_QUEUE_NAME', None)
        queue_type = settings.get('MONGODB_QUEUE_TYPE', 'FIFO')
        if queue_type not in ('FIFO', 'LIFO'):
            raise ValueError('MONGODB_QUEUE_TYPE must be FIFO (default) or LIFO')
        if queue_type == 'LIFO':
            # ... (the listing is cut off at this point in the original article;
            # the complete scheduler is in the git-osc repository linked above)
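
To actually use a scheduler like this, you point Scrapy at it in settings.py and add the queue settings its from_crawler reads. A minimal sketch, assuming the class lives at douban/scheduler.py as above; the setting names come from the code, the values are just examples of mine:

# settings.py additions (sketch): use the MongoDB-backed scheduler and
# configure the queue it reads in from_crawler().
SCHEDULER = 'douban.scheduler.Scheduler'

MONGODB_QUEUE_SERVER = 'localhost'
MONGODB_QUEUE_PORT = 27017
MONGODB_QUEUE_DB = 'scrapy'
MONGODB_QUEUE_PERSIST = True            # keep the queue between runs
MONGODB_QUEUE_NAME = 'douban_requests'  # hypothetical queue name
MONGODB_QUEUE_TYPE = 'FIFO'             # or 'LIFO'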
