What to scrape
Download the pages of the 李毅 tieba and save them locally.
 
How to do it
The URL follows a simple pattern. By the programmer's counting convention, the first page is page 0:
```
Page 0 content: pn=0
Page 1 content: pn=50
Page 2 content: pn=100
...
Page i content: pn=50*i
```
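In code, this just means substituting pn = 50 * i into the query string. A quick sketch (the kw value here is the target tieba's name):

```python
# Sketch: generate the first few page URLs from the pn = 50 * i pattern.
url_temp = "https://tieba.baidu.com/f?kw=李毅&ie=utf-8&pn={}"
for i in range(3):
    print(url_temp.format(i * 50))
# prints the URLs with pn=0, pn=50, pn=100
```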

Code:
```python
import requests


class tieba_spider:
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        self.url_temp = "https://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
        self.headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Mobile Safari/537.36"}

    def get_url_list(self):
        # Page i corresponds to pn = 50 * i.
        url_list = []
        for i in range(1000):
            url_list.append(self.url_temp.format(i * 50))
        return url_list

    def parse_url(self, url):
        # Fetch one page and return the decoded HTML.
        r = requests.get(url, headers=self.headers)
        print(url)
        return r.content.decode()

    def save_html(self, html_str, page_num):
        # encoding="utf8" is what prevents mojibake in the saved file.
        file_path = "{}-第{}页.html".format(self.tieba_name, page_num)
        with open(file_path, "w", encoding="utf8") as f:
            f.write(html_str)

    def run(self):
        # 1. Build the list of page URLs.
        url_list = self.get_url_list()
        # 2. Fetch each page and save it, numbering pages from 1.
        for url in url_list:
            html_str = self.parse_url(url)
            page_num = url_list.index(url) + 1
            self.save_html(html_str, page_num)


if __name__ == '__main__':
    tieba_name = "李毅"
    my_tieba_spider = tieba_spider(tieba_name)
    my_tieba_spider.run()
```
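The parse_url method above decodes with r.content.decode(), which assumes the body is UTF-8 (Tieba pages normally are). For reference, a slightly more defensive sketch; the timeout value, the status check, and the fallback to requests' detected encoding are my own additions, not part of the original code:

```python
import requests

def parse_url(url, headers):
    # Fetch one page; fail loudly on HTTP errors instead of saving an error page.
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    try:
        return r.content.decode()  # assume UTF-8 first, as the original does
    except UnicodeDecodeError:
        # Fall back to the encoding requests detects from the response body.
        return r.content.decode(r.apparent_encoding)
```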
- Mojibake problem
 The file-write step needs an explicit encoding argument, encoding="utf8"; without it the saved HTML can come out garbled (a minimal sketch of the fix follows).
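A minimal sketch, with a hypothetical file name: on some platforms open() defaults to a non-UTF-8 locale encoding (e.g. GBK on Chinese Windows), so writing a UTF-8 page without an explicit encoding can garble the output or raise UnicodeEncodeError:

```python
html_str = "<title>李毅吧</title>"
# An explicit encoding keeps the file valid UTF-8 regardless of the
# platform's default (locale) encoding.
with open("demo.html", "w", encoding="utf8") as f:
    f.write(html_str)
```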
Summary
- Apply the all-purpose format method flexibly.
- A list element's position can be obtained with list.index(): given lst = ["item1", "item2"], lst.index("item1") returns 0 (but see the note below).
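One caveat on that last point: list.index scans the list on every call, so in run() it is cleaner to let enumerate carry the page number. A sketch, reusing the method names from the spider above:

```python
def run(self):
    # enumerate(..., start=1) yields (page_num, url) pairs directly,
    # avoiding an O(n) url_list.index(url) lookup per page.
    for page_num, url in enumerate(self.get_url_list(), start=1):
        html_str = self.parse_url(url)
        self.save_html(html_str, page_num)
```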