Big Data Collection: Scraping 58.com (58同城) Resume Data with Python - 职坐标

沉沙, 2018-10-11

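Abstract: This tutorial shows how to collect data by scraping resume pages from 58.com with Python. It is meant as a hands-on introduction to the data-collection side of big data and cloud computing work.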

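My first idea was to build the crawler with Python's scrapy framework. While building it, though, I found that the page content was not being stored in the local variable response. When I loaded the page through the scrapy shell, the content did end up in response, but every xpath query for the data I needed returned an empty value. Since the data is all present in the page source, I switched to downloading the source and parsing it with Python's BeautifulSoup, then inserting the results into a database.
Required Python packages: urllib2, BeautifulSoup, MySQLdb, re
Step 1: Fetch the whole page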
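# coding:utf-8
import urllib2
from BeautifulSoup import BeautifulSoup

# The source shows this URL without a scheme; "http:" is assumed here because urlopen() needs one
url = 'http://jianli.58.com/resume/91655325401100'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup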

url is the page to download.
urllib2.urlopen() opens the page at that URL.
read() reads the data returned from it.
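For readers on Python 3, where urllib2 has been merged into urllib.request, the same fetch step might look roughly like the sketch below. The helper names build_request and fetch_html, the User-Agent value, the timeout and the assumption that the page is UTF-8 are all illustrative choices, not part of the original script.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a GET request that carries a browser-like User-Agent header."""
    # The header value is an assumption; some sites reject the default one.
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8 (assumed), replacing bad bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("https://example.com/")
```

Splitting request construction from the download keeps the network call in one place, which makes the rest of a scraper easier to try out against saved pages.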
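Step 2: Extract the data you want
This step relies on regular expressions, which Python supports well through the re module. If you are unfamiliar with them, see this reference (//www.runoob.com/regexp/regexp-syntax.html).
Suppose we want the name.
The browser's developer console shows where the name sits in the page.
A regular expression can match it, as follows: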
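# The closing tag inside the lookahead was lost in the source; "<" (the start of the next tag) stands in for it
name = re.findall(r'(?<=class="name">).*?(?=<)', str(soup))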

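Running the program returns an empty result.
The regular expression itself checks out, but the soup printed earlier shows that the source the server returns differs from the source displayed in the browser. A regular expression written against the browser's view therefore cannot match anything in the returned source, so the expression has to be written against the returned source instead.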
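The mismatch can be reproduced in miniature. The sketch below uses two made-up HTML fragments, one standing in for what the browser displays and one for what the server returns; the markup and the name in them are purely hypothetical.

```python
import re

# Hypothetical markup: what the browser shows versus what the server returns.
browser_html = '<span class="name">Zhang San</span>'
served_html = '<li class="item">Zhang San</li>'

# The lookbehind anchors on the opening tag; the lazy match stops at the next "<".
pattern = re.compile(r'(?<=class="name">).*?(?=<)')

print(pattern.findall(browser_html))  # ['Zhang San']
print(pattern.findall(served_html))   # []
```

The pattern is correct for the first fragment and still finds nothing in the second, which is why it helps to print the downloaded source before writing any expression.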
In the source held in soup, this person's basic details are easy to find: they all sit in <li class="item"> tags, so the findAll() method retrieves them easily:
data = soup.findAll('li', attrs={'class': 'item'})

The code above returns a list.
That gives us this person's name, sex, age, work experience and education.
The same approach retrieves any other data you need from the page.
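To show the idea without installing anything, here is a rough sketch that collects the text of every <li class="item"> element using the standard library's html.parser in place of BeautifulSoup. The ItemCollector class and the sample markup are hypothetical and only illustrate the selection logic.

```python
from html.parser import HTMLParser


class ItemCollector(HTMLParser):
    """Collect the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "item" in classes:
            self._inside = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.items[-1] += data.strip()


# Made-up sample markup, not taken from the real page.
sample = ('<ul><li class="item">Zhang San</li><li class="item">Male</li>'
          '<li class="other">skip</li><li class="item">25</li></ul>')
collector = ItemCollector()
collector.feed(sample)
collector.close()
print(collector.items)  # ['Zhang San', 'Male', '25']
```

As with findAll(), the result is a list in document order, so individual fields are picked out by position.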
Step 3: Save the data to a database
I use MySQL, so MySQL serves as the example here.
Connect to the database:
import MySQLdb
conn = MySQLdb.Connect(
    host = '127.0.0.1',
    port = 3306,
    user = 'root',
    passwd = 'XXXXX',
    db = 'XXXXX',
    charset = 'utf8')
cursor = conn.cursor()

The data contains Chinese text, so the connection charset is set to utf8.
Build the insert statement:
sql_insert = """insert into resume(
ID,name,sex,age,experience,education,pay,ad
,job,job_experience,education_experience)
values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""

Insert the data:
cursor.execute(sql_insert,(ID,name,sex,age,experience,education
,pay,ad,job,job_experience,education_experience))
conn.commit()

Close the database connection:
cursor.close()
conn.close()
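The connect, insert, commit and close sequence can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 module with an in-memory database; the shortened table and the sample row are made up for illustration. Note that sqlite3 uses "?" placeholders where MySQLdb uses "%s".

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE resume (ID TEXT, name TEXT, sex TEXT, age TEXT)")

# Placeholders keep the values out of the SQL text; the driver fills them in.
sql_insert = "INSERT INTO resume (ID, name, sex, age) VALUES (?, ?, ?, ?)"
row = ("demo-1", "Zhang San", "Male", "25")  # made-up sample values

try:
    cursor.execute(sql_insert, row)
    conn.commit()
except sqlite3.Error:
    # Undo the partial write before re-raising the error
    conn.rollback()
    raise

cursor.execute("SELECT name, age FROM resume WHERE ID = ?", (row[0],))
stored = cursor.fetchone()
print(stored)  # ('Zhang San', '25')

cursor.close()
conn.close()
```

Each placeholder expects a single scalar value, so it is worth checking that every field has been reduced to a plain string before the insert.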

Run the program.
It raised an error:
(1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '))' at line 1")

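In my experience, when this error appears and the SQL itself is correct, the inserted data is the likely culprit, most often its encoding.
Our database uses utf8, so the data being inserted presumably had an encoding problem.
We re-encode the returned data with the decode() and encode() methods:
name = data[0].decode('utf-8').encode('utf-8')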

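This simple step resolved the error caused by the mismatch between the database encoding and the data encoding.
Why would the encodings differ?
As far as I can tell, Python 2 handles the scraped byte strings with its default ASCII codec unless told otherwise, while our database expects utf8, so the insert fails. Explicitly decoding and re-encoding the scraped data as UTF-8 avoids this.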
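The difference between text and its encoded bytes is easier to see in Python 3, where the two are separate types. The short sketch below is only an illustration of encode() and decode(), using a made-up sample name; it is not the original script's fix.

```python
text = "张三"  # a str holds Unicode characters, not bytes

utf8_bytes = text.encode("utf-8")        # str -> bytes
round_trip = utf8_bytes.decode("utf-8")  # bytes -> str
print(len(text), len(utf8_bytes))        # 2 6
print(round_trip == text)                # True

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # ASCII has no representation for Chinese characters
    ascii_ok = False
print(ascii_ok)  # False
```

Each of the two characters takes three bytes in UTF-8, and the ASCII attempt fails outright, which is the kind of mismatch that surfaces when Chinese text passes through an ASCII default.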
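Result
This is the result of my scrape, and it works quite well. The speed is about one page per second. That is much slower than scrapy, but BeautifulSoup and urllib2 are simple to use and make good practice for beginners.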
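Appendix: code
# coding:utf-8
import urllib2
from BeautifulSoup import BeautifulSoup
import re
import MySQLdb

# The source shows this URL without a scheme; "http:" is assumed because urlopen() needs one
url = 'http://jianli.58.com/resume/91655325401100'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)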
# Note: asterisks and the HTML tags inside several patterns below were lost in the source.
# ".*" / ".*?" are restored and "<" / ">" stand in for the missing tags; adjust them to the real markup.
# Basic details: name, sex, age, experience, education
basedata = str(soup.findAll('li', attrs={'class': 'item'}))
basedata = re.findall(r'(?<=class="item">).*?(?=<)', basedata)
# Resume ID, taken from an inline script
ID = str(soup.findAll('script', attrs={'type': 'text/javascript'}))
ID = re.findall(r'(?<=global.ids = ").*?(?=";)', ID)
ID = ID[0].decode('utf-8').encode('utf-8')
name = basedata[0].decode('utf-8').encode('utf-8')
sex = basedata[1].decode('utf-8').encode('utf-8')
age = basedata[2].decode('utf-8').encode('utf-8')
experience = basedata[3].decode('utf-8').encode('utf-8')
education = basedata[4].decode('utf-8').encode('utf-8')
# Pay
pay = str(soup.findAll('dd', attrs={None: None}))
pay = re.findall(r'(?<=>)\d+.*?(?=<)', pay)
pay = pay[0].decode('utf-8').encode('utf-8')
expectdata = str(soup.findAll('dd', attrs={None: None}))
expectdata = re.findall(r'''(?<=["']>)[^<].*?(?=<)''', expectdata)
ad = expectdata[0].decode('utf-8').encode('utf-8')
job = expectdata[1].decode('utf-8').encode('utf-8')
# Work history: join all text fragments inside the "employed" block
job_experience = str(soup.findAll('div', attrs={'class': 'employed'}))
job_experience = re.findall(r'(?<=>)[^<].*?(?=<)', job_experience)
job_experience = ''.join(job_experience).decode('utf-8').encode('utf-8')
# Education history
education_experience = str(soup.findAll('dd', attrs={None: None}))
education_experience = re.findall(r'(?<=>).*\n.*\n?.*', education_experience)
education_experience = ''.join(education_experience).decode('utf-8').encode('utf-8')
conn = MySQLdb.Connect(
    host = '127.0.0.1',
    port = 3306,
    user = 'root',
    passwd = 'XXXXX',
    db = 'XXXX',
    charset = 'utf8')
cursor = conn.cursor()
sql_insert = """insert into resume(ID, name,sex,age,experience,education,pay,ad,job,job_experience,education_experience)
values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
try:
    cursor.execute(sql_insert, (ID, name,sex,age,experience,education,pay,ad,job,job_experience,education_experience))
    conn.commit()
except Exception as e:
    # Report the error and undo the partial write
    print e
    conn.rollback()
finally:
    cursor.close()
    conn.close()