
Crawling Chain Home House Price Information with Python

The housing problem has been drawing more and more attention in recent years. To understand how house prices have changed, we first need to collect price data from the web. Taking the second-hand house listings published by Chain Home as an example, we will crawl the data and store it.

As before, we follow the requests + Beautiful Soup route to crawl the house listings on the Chain Home website. You need to have Anaconda installed, and make sure the requests, Beautiful Soup 4, and csv libraries are available on your system.


Web page analysis

The page we are going to crawl is the listing page at /ershoufang/; the information we need is the name and the total price of each house, as shown below:


Let's analyze where the information we want to extract sits in the page. Open the browser's developer tools and inspect the elements to locate the house name and the price, as shown below:


We can see that the house name we need is inside the div with class "title", the price information is inside the div with class "totalPrice", and each listing is wrapped in an li tag.
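
To make the extraction concrete before touching the live site, here is a minimal sketch that runs Beautiful Soup over a tiny hand-written fragment mimicking the structure just described (the fragment and its values are made up for illustration):

    from bs4 import BeautifulSoup

    # A made-up fragment with the same nesting as the listing page described above.
    sample = '''
    <ul class="sellListContent">
      <li>
        <div class="title"><a href="#">Example listing name</a></div>
        <div class="priceInfo"><div class="totalPrice"><span>500</span></div></div>
      </li>
    </ul>
    '''

    soup = BeautifulSoup(sample, 'html.parser')
    for li in soup.find('ul', {'class': 'sellListContent'}).find_all('li'):
        name = li.find('div', {'class': 'title'}).find('a').get_text()
        price = li.find('div', {'class': 'totalPrice'}).find('span').get_text()
        print(name, price)   # Example listing name 500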

Having looked at the structure within a single page, we also need to see how the pages themselves are organized so that we can crawl the other pages:

Link to the first page: /ershoufang/pg1

Link to the second page: /ershoufang/pg2

Link to the third page: /ershoufang/pg3

Each page's link ends with pg followed by the page number, and this is the pattern we use to construct the URLs.
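
As a quick sketch, the per-page URLs can be generated directly from this pattern (the host is omitted here, just as it is elsewhere in this article; prepend the real site's scheme and domain when actually requesting the pages):

    # Build page URLs from the "pg" + page-number pattern; pages are numbered from 1.
    start_url = "/ershoufang/pg"
    depth = 20
    urls = [start_url + str(page) for page in range(1, depth + 1)]
    print(urls[:3])   # ['/ershoufang/pg1', '/ershoufang/pg2', '/ershoufang/pg3']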


program structure

Next we write the main structure of the program:

    import requests
    from bs4 import BeautifulSoup
    import csv

    def getHTMLText(url):            # get the page source
        return ""

    def get_data(info_list, html):   # extract the data and store it
        pass

    def main():
        start_url = "/ershoufang/pg"
        depth = 20
        info_list = []
        for i in range(depth):
            url = start_url + str(i + 1)   # pages are numbered from 1
            html = getHTMLText(url)
            get_data(info_list, html)

    main()

We then fill in each function according to its purpose. The first function retrieves the page source:

    def getHTMLText(url):
        try:
            r = requests.get(url)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return "Exception thrown"
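
A quick way to check the function is to call it on the first listing page. Note that requests.get() needs an absolute URL including the scheme and host; the host is omitted throughout this article, so the example.com domain below is only a placeholder:

    # example.com is a placeholder; substitute the real site's domain.
    html = getHTMLText("https://example.com/ershoufang/pg1")
    print(len(html))   # length of the page source, or of the error string on failure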

The second function extracts the data and stores it in a file:

    def get_data(info_list, html):
        soup = BeautifulSoup(html, 'html.parser')
        infos = soup.find('ul', {'class': 'sellListContent'}).find_all('li')   # find all li tags
        # the original path is truncated; point this at your output file
        with open(r'/Users/11641/Desktop/', 'a', encoding='utf-8') as f:       # create the file and append to it
            for info in infos:
                name = info.find('div', {'class': 'title'}).find('a').get_text()
                price = info.find('div', {'class': 'priceInfo'}).find('div', {'class': 'totalPrice'}).find('span').get_text()
                f.write("{},{}\n".format(name, price))

With that, we have finished writing the crawler program.

Note, however, that the CSV file saved by the finished program may appear garbled if you open it directly in Excel. This is usually fixed with the following steps (an alternative that avoids them entirely is sketched after the list):

  1. Open the file with Notepad, choose Save As, select ANSI as the encoding, and save the file;
  2. Open the newly saved file in Excel and save it in an Excel file format; the file is then stored permanently.
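
Alternatively, these manual steps can usually be avoided by writing the file with the utf-8-sig encoding, which prefixes a byte-order mark that Excel uses to detect UTF-8. A minimal sketch with made-up sample data (house.csv is a placeholder file name):

    # Writing with utf-8-sig (BOM-prefixed UTF-8) lets Excel open the CSV without garbling.
    rows = [("Example listing name", "500")]   # made-up sample data
    with open('house.csv', 'w', encoding='utf-8-sig') as f:
        for name, price in rows:
            f.write("{},{}\n".format(name, price))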

Let's take a look at the stored file; the result is shown below:


program code

We have now written the complete code for the Chain Home second-hand house crawler and produced roughly the output we wanted. The full code is as follows:

    # Crawl Chain Home second-hand house information
    import requests
    from bs4 import BeautifulSoup
    import csv

    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return 'Exception thrown'

    def get_data(info_list, html):
        soup = BeautifulSoup(html, 'html.parser')
        infos = soup.find('ul', {'class': 'sellListContent'}).find_all('li')
        # the original path is truncated; point this at your output file
        with open(r'/Users/11641/Desktop/', 'a', encoding='utf-8') as f:
            for info in infos:
                name = info.find('div', {'class': 'title'}).find('a').get_text()
                price = info.find('div', {'class': 'priceInfo'}).find('div', {'class': 'totalPrice'}).find('span').get_text()
                f.write("{},{}\n".format(name, price))

    def main():
        start_url = '/ershoufang/pg'
        depth = 20
        info_list = []
        for i in range(depth):
            url = start_url + str(i + 1)   # pages are numbered from 1
            html = getHTMLText(url)
            get_data(info_list, html)

    main()

Summary:

1. With the requests library and the Beautiful Soup library, a crawler can fetch web page information quite concisely;

2. Set up the main framework of the program first, then fill in each function according to what it needs to do: get the page source > extract the data > store the data;

3. When the information is spread across multiple pages, observe the page structure and construct the URL of each page from the pattern;

4. The most important part is parsing the structure of the page; it is best to use the tag tree to determine which tags hold the fields you want, then traverse those tags and store the data.


I have not been learning Python for long, so there may be mistakes in understanding or writing in this article; I hope you will bear with me, thank you!
In addition, the Python series of courses by Mr. Songtian of Beijing Institute of Technology is highly recommended.
For more information about me, please visit my website at / (currently under maintenance).