What to scrape
Crawl the pages of the 李毅 tieba (a Baidu Tieba forum) and save them locally.
How to do it
The URL follows a simple pattern: by the programmer's habit of counting from zero, the first page is page 0, and the pn parameter advances by 50 per page.

- page 0: pn=0
- page 1: pn=50
- page 2: pn=100
- ……
- page i: pn=50*i
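To sanity-check the pattern, here is a minimal sketch that prints the first few page URLs; the template string is the same one the spider below builds:

    url_temp = "https://tieba.baidu.com/f?kw=李毅&ie=utf-8&pn={}"
    for i in range(3):
        print(url_temp.format(i * 50))
    # ...&pn=0
    # ...&pn=50
    # ...&pn=100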
Code:
import requests


class tieba_spider:
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        # pn advances by 50 per page, so page i uses pn = 50 * i
        self.url_temp = "https://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
        # Send a browser User-Agent so the server returns a normal page
        self.headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Mobile Safari/537.36"}

    def get_url_list(self):
        # Build the URLs of the first 1000 pages
        url_list = []
        for i in range(1000):
            url_list.append(self.url_temp.format(i * 50))
        return url_list

    def parse_url(self, url):
        # Fetch one page and decode the response body to a string
        r = requests.get(url, headers=self.headers)
        print(url)
        return r.content.decode()

    def save_html(self, html_str, page_num):
        # Write the page to disk, e.g. "李毅-第1页.html"
        file_path = "{}-第{}页.html".format(self.tieba_name, page_num)
        with open(file_path, "w", encoding="utf8") as f:
            f.write(html_str)

    def run(self):
        url_list = self.get_url_list()
        for url in url_list:
            html_str = self.parse_url(url)
            page_num = url_list.index(url) + 1  # position of this URL in the list
            self.save_html(html_str, page_num)


if __name__ == '__main__':
    tieba_name = "李毅"
    my_tieba_spider = tieba_spider(tieba_name)
    my_tieba_spider.run()
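Running it saves files named 李毅-第1页.html, 李毅-第2页.html, and so on into the working directory. Note that get_url_list hard-codes range(1000), i.e. 1000 requests; shrink that number when testing.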
- The mojibake problem: when writing the file, pass the encoding explicitly, encoding="utf8", otherwise the saved HTML can come out garbled.
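A minimal sketch of the difference, assuming a platform whose default encoding is not UTF-8 (e.g. GBK on Chinese Windows):

    html_str = "<html>贴吧</html>"

    # Without encoding=..., open() uses the platform default codec; a UTF-8
    # viewer then shows the Chinese characters as mojibake, and characters the
    # default codec cannot represent may even raise UnicodeEncodeError.
    with open("demo.html", "w", encoding="utf8") as f:
        f.write(html_str)  # explicit utf8: bytes on disk match what browsers expect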
Summary
- Apply the versatile str.format flexibly (see the sketch after this list).
- The position of a list element can be obtained with list.index: given list = ["元素1", "元素2"], list.index("元素1") returns 0.
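Both points in one runnable sketch (lst and the sample strings are illustrative; note that run() above calls url_list.index(url) once per URL, which rescans the list every time, so enumerate is the more idiomatic way to pair position and element):

    lst = ["元素1", "元素2"]
    print("{}-第{}页.html".format("李毅", 1))  # str.format fills the {} slots -> 李毅-第1页.html
    print(lst.index("元素1"))                  # 0: index of the first matching element

    # list.index() rescans the list on every call; in a loop, enumerate()
    # yields the position directly:
    for page_num, item in enumerate(lst, start=1):
        print(page_num, item)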