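What to scrape
Save a Renren (人人网) personal profile page to local disk. Register a Renren account first. It is rare to find a site with such a simple login flow: you only need to enter a username and password.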
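How to do it
To fetch a page that is only available after login, the request must carry cookies. A requests session object stores the cookies returned by the login response and sends them with later requests.
Usage:
- Instantiate a session object
- Have the session send GET or POST requests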
```python
session = requests.session()
# headers must be passed by keyword; Session.get takes only url positionally
response = session.get(url, headers=headers)
```
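To see what "the session carries cookies" means without touching the network, here is a minimal sketch. It places a cookie in the session's jar by hand (standing in for what a login response would do) and then inspects the request the session would send. The cookie name `sid`, the host `example.com`, and the helper `cookie_header_for` are made-up placeholders, not anything Renren uses.

```python
import requests


def cookie_header_for(session, url):
    """Return the Cookie header the session would attach to a GET for url."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


session = requests.Session()

# Stand-in for a login response: put a cookie into the session's jar.
# "sid" and example.com are hypothetical placeholders.
session.cookies.set("sid", "abc123", domain="example.com", path="/")

# A request to the same host picks the cookie up automatically.
print(cookie_header_for(session, "http://example.com/profile"))

# A request to an unrelated host does not receive it.
print(cookie_header_for(session, "http://other.example.org/"))
```

In the real flow the jar is filled by `session.post(...)` to the login URL, so nothing needs to be set by hand.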
Code:
```python
import requests

# The session keeps the cookies set by the login response
session = requests.session()

# Login form data
data = {
    "email": "3321647547@qq.com",
    "key_id": "1",
    "captcha_type": "web_login",
    "password": "77badfdb18634f82f4c27d9f7c2fee561a5167d23858e99ae7c7c4a63d53474e",
    "rkey": "33ce5d635f3ff6984d4952d303196766",
}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}

# Log in; the returned cookies are stored on the session
url = "http://www.renren.com/ajaxLogin/login"
session.post(url, headers=headers, data=data)

# Request the profile page with the logged-in session
my_url = "http://www.renren.com/973482150/profile"
r = session.get(my_url, headers=headers)

# Save the page locally
with open("renren_profile.html", "w", encoding="utf8") as f:
    f.write(r.content.decode())
```
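The login form above hard-codes the account details. One possible variation, shown only as a sketch, reads them from environment variables so the script can be shared without them. The form field names come from the code above; the environment variable names (`RENREN_EMAIL`, `RENREN_PASSWORD_HASH`, `RENREN_RKEY`) and the helper `build_login_payload` are hypothetical.

```python
import os


def build_login_payload(env=os.environ):
    """Assemble the login form from environment variables instead of literals.

    The variable names are hypothetical; a missing one raises KeyError.
    """
    return {
        "email": env["RENREN_EMAIL"],
        "key_id": "1",
        "captcha_type": "web_login",
        "password": env["RENREN_PASSWORD_HASH"],
        "rkey": env["RENREN_RKEY"],
    }


# Intended use with the session from the code above:
#   session.post(url, headers=headers, data=build_login_payload())
```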
Result: the profile page is saved locally as renren_profile.html.
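A quick gripe
This site really isn't great: you can't register directly. I logged in with QQ, then changed the password once I was in, and only after that could I log in with a username and password...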
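Scraping the profile page with a cookie
You can also use the post-login cookie directly and have the requests library send a plain GET request for the profile page. This only works if the site's cookie lifetime is long enough; once the cookie expires, it is useless.
There are two ways:
- Pass the cookies parameter to the get method, with the cookies as a dict (the code below uses this approach)
- Add a Cookie entry to headers, with the cookies as a single string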
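The two options above differ only in the shape of the cookie data: a dict for the `cookies` parameter, a single string for the `Cookie` header. As a rough sketch, the two shapes can be converted into each other like this. The helper names and sample values are made up for illustration.

```python
def cookie_string_to_dict(cookie_string):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}.

    Each pair is split on the first '=' only, so values that
    themselves contain '=' are kept whole.
    """
    pairs = (item.split("=", 1) for item in cookie_string.split("; ") if "=" in item)
    return {name: value for name, value in pairs}


def cookie_dict_to_string(cookie_dict):
    """Turn {'a': '1', 'b': '2'} back into 'a=1; b=2'."""
    return "; ".join(f"{name}={value}" for name, value in cookie_dict.items())


raw = "id=42; token=abc=def"  # placeholder values, not real cookies
as_dict = cookie_string_to_dict(raw)
as_string = cookie_dict_to_string(as_dict)

# Option 1: requests.get(url, headers=headers, cookies=as_dict)
# Option 2: requests.get(url, headers={**headers, "Cookie": as_string})
```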
Code:
```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}

# Cookie string obtained after logging in
cookies = "JSESSIONID=abcfPuhawk64MhQgDK4_w; anonymid=k5jjy9gb-p4f3be; depovince=SC; _r01_=1; taihe_bi_sdk_uid=41402119b11006539defa847b6465809; taihe_bi_sdk_session=c04acc71f9216e7983489d1dcf2ad3c1; ick_login=ffaf6e04-f058-459c-8b5c-975cfe140f36; springskin=set; jebe_key=538aac78-0044-476f-980c-0e53071b02d6%7C42d7b478811716336baa94c4523e9833%7C1579351511075%7C1%7C1579351510136; jebe_key=538aac78-0044-476f-980c-0e53071b02d6%7C42d7b478811716336baa94c4523e9833%7C1579351511075%7C1%7C1579351510138; vip=1; wp_fold=0; __utma=151146938.1215165037.1579351626.1579351626.1579351626.1; __utmc=151146938; __utmz=151146938.1579351626.1.1.utmcsr=mail.qq.com|utmccn=(referral)|utmcmd=referral|utmcct=/; ick=4d7cf42c-526e-4a82-baac-d9891e7f89bc; first_login_flag=1; ln_uact=3321647547@qq.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn221/20200118/2045/h_main_TCCN_9c6c00011d14195a.jpg; jebecookies=92ee8c5d-b2ed-4d44-9998-03fc28212dca|||||; _de=04F2D88119EA2B63E16F2C7283EEE4526DEBB8C2103DE356; p=cec0ea11946dbbcf95fd2ce1f2e632ae0; t=7a2144f5aba19975660213d3701d7ec30; societyguester=7a2144f5aba19975660213d3701d7ec30; id=973482150; xnsid=4743cec2; ver=7.0; loginfrom=null"

# Convert to a dict; split on the first "=" only, because some values
# (such as __utmz) contain "=" themselves
cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookies.split("; ")}
print(cookies)

# Request the profile page, passing the cookies as a dict
my_url = "http://www.renren.com/973482150/profile"
r = requests.get(my_url, headers=headers, cookies=cookies)

# Save the page locally
with open("renren_profile_cookie.html", "w", encoding="utf8") as f:
    f.write(r.content.decode())
```
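Summary
To scrape pages behind a login, either use the Session class to keep the session alive, or use an unexpired cookie and finish scraping before it expires. A more advanced approach is to write two programs: one dedicated to obtaining cookies, the other dedicated to sending requests and scraping pages. For now, it is enough to know that this approach exists.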
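The two-program idea can be sketched with a shared file: the cookie-fetching program writes the cookies to disk, and the scraping program reads them back and refuses ones that look too old. This is only an illustration under assumptions: the JSON file layout, the function names, and the `max_age_seconds` cutoff are invented here, and the real expiry is decided by the site. On the producer side, the dict could come from `requests.utils.dict_from_cookiejar(session.cookies)` after a successful login.

```python
import json
import os
import tempfile
import time


def save_cookies(path, cookies):
    """Called by the cookie-fetching program after it logs in."""
    with open(path, "w", encoding="utf8") as f:
        json.dump({"saved_at": time.time(), "cookies": cookies}, f)


def load_cookies(path, max_age_seconds):
    """Called by the scraping program; returns None if the file is too old."""
    with open(path, encoding="utf8") as f:
        record = json.load(f)
    if time.time() - record["saved_at"] > max_age_seconds:
        return None
    return record["cookies"]


# Small demonstration with a placeholder cookie in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "cookies.json")
save_cookies(path, {"sid": "abc123"})
fresh = load_cookies(path, max_age_seconds=3600)  # still within the cutoff
stale = load_cookies(path, max_age_seconds=-1)    # forces the "too old" branch
```

When `load_cookies` returns None, the scraping program would wait for, or trigger, the cookie-fetching program instead of sending requests that are bound to fail.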