
Sharing a Small Crawler for House Price Data from Lianjia's Map House Search

I. Preface

I was asked to crawl the data behind the map house-search page on Lianjia (链家): /ditu/
The page shows the average price of used homes and the number of units for sale in each area, and our task is to grab that data.
[Screenshot]

II. Starting to work

2.1 Failed once

As usual, I pressed F12 in Chrome and opened the Network tab in DevTools to watch the requests. Data that refreshes like this is most likely returned by a backend interface. Unfortunately, no such interface showed up under XHR, which was a little discouraging.
I kept looking and found a suspicious request under JS: /map/search/ershoufang/
[Screenshot]
Hee hee, got you. Next came the familiar routine of simulating the request, carrying over both the Headers and the Query String Parameters. After a flurry of moves as fierce as a tiger:
[Screenshot]
What on earth? It is a simple GET request and I replicated every request header and parameter, yet it cannot be replayed successfully, not even from the browser itself. My guess is that Lianjia has some restriction in place. If an expert happens to pass by and can offer some guidance, that would be great. Thank you very much.
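For reference, the replay attempt described above amounts to something like the following. This is only a minimal sketch that builds the request with the standard library without sending it; the host `example.invalid`, the header values, and the query parameter names are placeholders, not the real ones the site uses.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_replay_request(base_url, params, headers):
    """Build (but do not send) a GET request that mirrors a browser request."""
    query = urlencode(params)
    url = f"{base_url}?{query}" if query else base_url
    return Request(url, headers=headers, method="GET")


# Placeholder values: copy the real ones from the DevTools Network tab.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.invalid/ditu/",
}
params = {"callback": "jsonp_123", "city_id": "0"}  # hypothetical parameter names

req = build_replay_request(
    "https://example.invalid/map/search/ershoufang/", params, headers
)
print(req.full_url)
print(req.header_items())
```

To actually send it, pass `req` to `urllib.request.urlopen`, or hand the same dicts to `requests.get(url, headers=headers, params=params)`. If the server also checks something the browser computes per request, copying static headers will not be enough, which may be what happened here.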

2.2 Setting sail again

That route was a dead end, so I had to look elsewhere. Lianjia's map house search is really a search over used homes, so I went to the used-home price page and found a similar page displaying the data: /fangjia/
[Screenshot]
This page is more interesting: it has a straightforward, plainly visible interface.
Request URL: /fangjia/priceMap/
The numbers are right there in plain sight:
[Screenshot]
Unfortunately this interface does not include the number of units for sale. Never mind, let's crawl it first.

2.3 A Brief Walkthrough of the Code

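import requests
from bs4 import BeautifulSoup
from datetime import datetime

def get_city_list():
    # Map each city's href (which carries the city code) to the city name.
    city_list = {}
    # The host is elided in the source; prepend the site's domain before running.
    city_from_url = '/city/'
    mhtml = requests.get(city_from_url)
    mobj = BeautifulSoup(mhtml.text,'lxml')
    city_block = mobj.find_all('div',{'class':'block city_block'})
    for cb in city_block:
        for cba in cb.find_all('a'):
            city_list[cba.get('href')] = cba.get_text()
    return city_list

if __name__ == '__main__':
    cityd = get_city_list()
    f = open('houing_price_bycity.csv','w',encoding = 'utf-8')
    for citycode,city in cityd.items():
        # citycode[1:-1] drops the first and last characters of the href.
        # The domain after the city code is elided in the source.
        url = 'https://{}./fangjia/priceMap/'.format(citycode[1:-1])
        try:
            r = requests.get(url)
            if r.status_code == 200:
                res = r.json()
            else:
                continue
        except (requests.RequestException, ValueError):
            # Network error or a body that is not JSON: skip this city.
            continue
        for k,v in res.items():
            if isinstance(v,(int,float)):
                # Plain numeric fields are not district entries.
                pass
            else:
                cont = ','.join([city,v['name'],str(v['transPrice']),datetime.now().strftime("%Y-%m-%d %H:%M:%S")])
                cont = cont + '\n'
                f.write(cont)
    f.close()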

First, we grab the list of all cities and their codes from one of Lianjia's mobile pages. Substituting each code into the city part of the request URL lets us loop over the cities and grab the house prices by district for each one. In practice, not every city has this interface; it is mainly the big cities that do.
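The two data transformations in that loop can be pulled out as small pure functions. The sketch below assumes, based on the script above, that each city href looks like `/xx/` and that the interface returns a JSON object whose district entries are dicts with `name` and `transPrice` keys, mixed with some plain numeric fields. The helper names and the sample payload are made up for illustration.

```python
import csv
import io


def city_code_from_href(href):
    """Strip the surrounding slashes, e.g. '/xx/' -> 'xx'."""
    return href.strip("/")


def rows_from_price_map(city, payload, timestamp):
    """Turn one priceMap response into CSV rows, keeping only district dicts."""
    rows = []
    for value in payload.values():
        if isinstance(value, dict) and "name" in value and "transPrice" in value:
            rows.append([city, value["name"], str(value["transPrice"]), timestamp])
    return rows


# Made-up payload in the assumed shape, not real data.
sample = {
    "total": 2,
    "101": {"name": "District A", "transPrice": 50000},
    "102": {"name": "District B", "transPrice": 42000.5},
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows_from_price_map("City X", sample, "2024-01-01 00:00:00"))
print(buffer.getvalue())
```

Writing through `csv.writer` rather than joining with commas keeps the file valid even if a district name happens to contain a comma.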
Because this is a small crawler for personal use, the error handling is simple and crude: requests are never retried, nothing limits the access rate to cope with any anti-crawling mechanism Lianjia may have (this interface most likely has none), and the results are written straight to a CSV file in the current directory. Clearly there is plenty of room for optimization. Never mind; a small crawler for one's own use doesn't need that much care.
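The missing retry and rate limiting could be added along these lines. This is a sketch, not part of the original script: `fetch_with_retry` and `crawl` are hypothetical helpers, the fetch function is passed in so the logic can be tried without a network, and the delay values are arbitrary.

```python
import time


def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff.

    Returns the result of fetch, or None once all attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                return None
            sleep(backoff * (2 ** attempt))
    return None


def crawl(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)
        results[url] = fetch_with_retry(fetch, url, sleep=sleep)
    return results


# Demo with a fake fetcher that fails twice before succeeding (no network).
calls = {"count": 0}
pauses = []


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"url": url}


result = fetch_with_retry(flaky_fetch, "/fangjia/priceMap/", sleep=pauses.append)
print(result, pauses)
```

In the real script, `fetch` could be something like `lambda u: requests.get(u, timeout=10).json()`, with the default `time.sleep` left in place so the pauses actually happen.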
The final result looks like this:
[Screenshot]

III. Wrapping Up

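This is the end of the article, but careful readers may have noticed that something seems to be missing.
That's right: Lianjia's map house search originally showed two figures, the house price and the number of units for sale, and so far we have only grabbed one of them. The unit count was nowhere to be found in the interface, but I did find a place where it can be obtained.
On Lianjia's listing page for used homes for sale, searching by region shows this figure directly on the page:
[Screenshot]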
This figure is written directly into the page's HTML, so it can be scraped region by region and matched against the house price data collected earlier. It also does not change very quickly. I will leave this step to those who need it; with the approach laid out, implementing it should not be difficult.
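The matching step could look roughly like this once both data sets are in hand. Scraping the count out of the HTML is not shown, since the page structure is not given here; the sketch only covers the join, assuming both sides are keyed by a (city, region) pair. The function name and the sample values are made up.

```python
def merge_price_and_count(prices, counts):
    """Join two {(city, region): value} mappings on their shared keys.

    Regions missing from `counts` get None so that no price row is lost.
    """
    merged = []
    for key, price in prices.items():
        city, region = key
        merged.append((city, region, price, counts.get(key)))
    return merged


# Made-up values for illustration only.
prices = {("City X", "District A"): 50000, ("City X", "District B"): 42000}
counts = {("City X", "District A"): 1234}

for row in merge_price_and_count(prices, counts):
    print(row)
```

Keeping rows with a missing count makes it easy to see which regions still need to be scraped or whose names did not match between the two sources.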
That's it for this post. Bye-bye!