Airbnb 2019 New York Housing Situation Analysis Report
Experimental Background:
Since 2008, more and more people have chosen Airbnb when booking travel or vacation accommodation. The platform offers not only ordinary city apartments but also chalets, villas and other unique properties, so travelers can experience the culture of different cities and enjoy a wider range of stays.
Now that Airbnb operates globally with millions of listings, analyzing its data has become an important way for a company of this size to keep tabs on its operations.
Purpose of the experiment:
The following is a brief analysis of the data provided on Kaggle, which allows us to understand the data from multiple dimensions.
For the website's operators, it can serve as a guide or inspiration for subsequent marketing programs or creative featured services (e.g. photography add-ons for homeowners).
For users, i.e. us, looking at Airbnb's 2019 prices and geographic locations for New York City lets us choose a listing better and faster the next time we vacation in New York.
For landlords, it offers a glimpse of the overall direction of the market, the types of properties users are interested in and their preferred price points, which they can use to plan their own rentals.
Experimental Procedures:
1. Data sources:
The data was obtained from Kaggle, link below: /dgomonov/new-york-city-airbnb-open-data/data
2. Import data:
The data has nearly 40,000 rows and 16 feature columns, and many null values appear in the later columns. A few rows are excerpted below:
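As a rough sketch of the import step, assuming the Kaggle download is a CSV with the column names used below (the file name AB_NYC_2019.csv and the helper load_listings are my own assumptions, and the three rows are invented so the snippet runs without the real file):

```python
import io
import pandas as pd

# Toy stand-in for the Kaggle CSV (assumed file name: AB_NYC_2019.csv).
# These rows are invented for illustration, not taken from the dataset.
SAMPLE_CSV = """id,name,host_id,host_name,neighbourhood_group,room_type,price,number_of_reviews,last_review,reviews_per_month
1,Cozy room,10,Ann,Brooklyn,Private room,60,12,2019-05-01,0.8
2,,11,Bob,Manhattan,Entire home/apt,200,0,,
3,Loft,10,,Manhattan,Entire home/apt,180,3,2019-06-15,0.2
"""

def load_listings(source):
    """Read the listings CSV from a file path or a file-like object."""
    return pd.read_csv(source)

df = load_listings(io.StringIO(SAMPLE_CSV))
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows, nulls show up as NaN
```

With the real file, load_listings("AB_NYC_2019.csv") would replace the io.StringIO call.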
3. Initial observation, cleaning and organizing:
First, we check exactly which columns contain null values:
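A minimal sketch of this check, assuming the data is already in a pandas DataFrame (the toy frame below is invented and only mimics the columns discussed in this section):

```python
import numpy as np
import pandas as pd

# Invented rows that imitate the null pattern described in the text.
df = pd.DataFrame({
    "name": ["Cozy room", None, "Loft"],
    "host_name": ["Ann", "Bob", None],
    "last_review": ["2019-05-01", None, "2019-06-15"],
    "reviews_per_month": [0.8, np.nan, 0.2],
    "price": [60, 200, 180],
})

# Count nulls per column, then keep only the columns that have any.
null_counts = df.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```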
Name: Fill with 0. Each listing name corresponds to a host_id, so we can use the id for analysis; filling the missing names with 0 is simply the quickest way to handle them.
Host_name: Delete. This column is private information, so it is better to delete it to protect privacy. It is also not very useful for analysis: the id is the best way to identify an individual, since it is unique and never null.
Last review & Reviews-per-month: Delete & Replace with 0. These values are null because the listing has no reviews, which in turn leaves reviews per month null. We therefore delete the last review column and fill reviews per month with 0 to show that the listing receives no reviews per month, which is consistent with reality.
Id: Delete. This analysis focuses on exploring the listings, so keeping HOST_ID is enough; the listing id is mostly used when analyzing customer behavior, where every step of a customer's session needs to be followed.
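The four choices above could be sketched as follows. The helper clean_listings is hypothetical, the column names are assumed to match the Kaggle CSV, and the sample rows are invented:

```python
import numpy as np
import pandas as pd

def clean_listings(df):
    """Apply the null-handling choices listed above to a listings frame."""
    # Drop the listing id, the private host name, and the last-review date.
    cleaned = df.drop(columns=["id", "host_name", "last_review"]).copy()
    # Cast names to object so the placeholder 0 can sit beside the strings.
    cleaned["name"] = cleaned["name"].astype(object).fillna(0)
    # No reviews means a review rate of 0 per month.
    cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
    return cleaned

# Invented rows for illustration only.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cozy room", None, "Loft"],
    "host_id": [10, 11, 10],
    "host_name": ["Ann", "Bob", None],
    "last_review": ["2019-05-01", None, "2019-06-15"],
    "reviews_per_month": [0.8, np.nan, 0.2],
})

clean = clean_listings(raw)
print(clean.isnull().sum())  # every remaining column should report 0 nulls
```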
After deleting and replacing the null values, we check the data again:
The data is now clean, so we can proceed to the next step of the analysis.
Start by looking at the values in the room_type column:
You can see that there are 3 main types of listing: single rooms, whole houses, and shared rooms.
Next is a look at the values of neighborhood_group:
These also provide entry points for the later dimensions of the analysis.
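A small sketch of how these two feature views might be produced. I am assuming the CSV spells the column neighbourhood_group and labels the room types as shown; the rows themselves are invented:

```python
import pandas as pd

# Invented rows; the category labels are assumed to match the Kaggle file.
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Queens", "Manhattan"],
})

# Distinct categories in each column.
room_types = sorted(df["room_type"].unique())
boroughs = sorted(df["neighbourhood_group"].unique())

# How many listings fall into each room type.
room_counts = df["room_type"].value_counts()

print(room_types)
print(boroughs)
print(room_counts)
```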
4. Characterization:
The data were first analyzed in terms of individual features:
a. How many listings does each host have? List the top 10
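One way to get this ranking, sketched with a hypothetical helper top_hosts and invented host ids (each row of the data is assumed to be one listing):

```python
import pandas as pd

def top_hosts(df, n=10):
    """Count listings per host_id and return the n largest counts."""
    # value_counts() sorts from most to fewest listings by default.
    return df["host_id"].value_counts().head(n)

# Invented host ids: one row per listing.
df = pd.DataFrame({"host_id": [101, 101, 101, 202, 202, 303]})

top = top_hosts(df, n=2)
print(top)
```

On the full data the default n=10 gives the top 10 asked for here.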
Visualize the results:
It can be seen:
- The first- and second-place hosts own far more listings than the next few, around 300. But is this host also the most popular, or the one with the most reviews? We can use indexing to pull up the host's listings for analysis.
We took the host_id with the most listings, i.e. 107434423, and selected a segment of its listings:
As you can see, the review counts are not very high: summed across all of this host's listings, the total is 29. All of the listings are whole-home rentals with a minimum stay of 1 month. So although this host ranks first by listing count, such listings are probably not Airbnb's main source of income, nor what most users choose.
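The indexing step could look roughly like this. The host id is the one quoted in the text, but every row below is invented, so the printed totals are not the real figures:

```python
import pandas as pd

HOST_ID = 107434423  # host id quoted in the text

# Invented rows for illustration; not the host's real listings.
df = pd.DataFrame({
    "host_id": [HOST_ID, HOST_ID, 999],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room"],
    "minimum_nights": [30, 30, 2],
    "number_of_reviews": [1, 0, 45],
})

# Boolean indexing keeps only this host's listings.
host_listings = df[df["host_id"] == HOST_ID]

total_reviews = host_listings["number_of_reviews"].sum()
room_types = list(host_listings["room_type"].unique())
min_nights = host_listings["minimum_nights"].min()

print(total_reviews, room_types, min_nights)
```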
Next, the relationships between multiple features are considered:
b. Whether districts affect the pricing of homes, and if so, which districts specifically
We first extracted the rows from different regions and later combined them for comparative analysis:
It can be seen that Manhattan has the highest average price. This confirms that, as a business center and tourist destination, it drives rents to a high level within New York. Extreme values have a large impact on these averages, so to study the overall trend the following charts exclude prices above 500 and focus on the range where most listings fall. I use two visualizations, a box plot and a violin plot, to show the average and the variance of each district.
From the charts:
- Manhattan's prices exceed those of the rest of the city on every measure shown.
- The violin plot shows that Manhattan and Brooklyn are tall and narrow, indicating a wide distribution of prices, while prices in Queens, the Bronx and Staten Island are all more concentrated.
- Brooklyn and the Bronx both have lower medians, i.e. most of their listings are lower-priced, which pulls the median down, but their high maxima indicate that listings also extend well into the higher end of the price range.
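A sketch of the comparison and the outlier cut, using a groupby instead of extracting each region by hand (a swapped-in shortcut, not necessarily how the original code did it). The column name neighbourhood_group is assumed and the prices are invented:

```python
import pandas as pd

# Invented prices, including one extreme value in Manhattan.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Bronx"],
    "price": [150, 250, 2000, 80, 120, 60],
})

# Average price per district, highest first.
mean_price = (
    df.groupby("neighbourhood_group")["price"].mean().sort_values(ascending=False)
)

# Keep only listings priced at 500 or below before plotting.
trimmed = df[df["price"] <= 500]
trimmed_mean = trimmed.groupby("neighbourhood_group")["price"].mean()

print(mean_price)
print(trimmed_mean)
```

In this toy data the single 2000 listing lifts Manhattan's mean from 200 to 800, which is the kind of distortion the cut is meant to remove.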
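The numbers behind a box plot (quartiles, median, maximum) can also be read directly as a table. A minimal sketch with invented prices, assuming the data has already been cut to prices of 500 or below:

```python
import pandas as pd

# Invented prices, already limited to 500 or below.
trimmed = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 3 + ["Bronx"] * 3,
    "price": [100, 200, 300, 50, 60, 70],
})

# describe() reports count, mean, std, min, quartiles and max per district;
# the columns kept here are the ones a box plot draws.
summary = (
    trimmed.groupby("neighbourhood_group")["price"]
    .describe()[["25%", "50%", "75%", "max"]]
)
print(summary)
```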
c. Comparison of house prices with respect to area (latitude and longitude) and housing density characteristics
You can see that the red area with the highest prices corresponds to Manhattan, where listings are dense and prices are also on the high side. High-priced listings also appear in the part of Brooklyn close to Manhattan, so it cannot be ruled out that Manhattan drives prices up in that area.
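The map itself is a plot, but the same idea can be approximated without one by binning coordinates into a coarse grid and computing the average price and listing count per cell. This is my own stand-in for the chart, with invented coordinates and prices and an arbitrary cell size of 0.01 degrees:

```python
import pandas as pd

# Invented coordinates and prices for illustration.
df = pd.DataFrame({
    "latitude": [40.751, 40.754, 40.681],
    "longitude": [-73.981, -73.984, -73.951],
    "price": [200, 300, 90],
})

# Round coordinates to 2 decimals so nearby listings share a grid cell.
cells = df.assign(
    lat_bin=df["latitude"].round(2),
    lon_bin=df["longitude"].round(2),
)

# Mean price shows how expensive a cell is, count shows how dense it is.
grid = cells.groupby(["lat_bin", "lon_bin"])["price"].agg(["mean", "count"])
print(grid.sort_values("mean", ascending=False))
```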
d. Comparison of house prices and types of housing
Extract the region and listing type for pivot table observations:
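A minimal sketch of such a pivot table, with regions as rows, listing types as columns and the mean price in each cell. Column names are assumed from the Kaggle CSV and the rows are invented:

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200, 300, 100, 150, 60, 80],
})

# Mean price for every (region, room type) combination.
pivot = pd.pivot_table(
    df,
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="mean",
)
print(pivot)
```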
It can be seen that whole-home rentals are generally priced high. Visualizing this:
It is clear that renting a whole home costs about twice as much as a single room, and more than twice as much as a shared room. The price of a single room, however, is about the same as that of a shared room.
But price alone does not show whether there are enough listings, so the chart below compares the number of listings on the "supply side", i.e. by housing type, across the different areas:
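The counts behind such a chart could be tabulated like this. It is a sketch using pd.crosstab with invented rows; the original may have counted them differently:

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Shared room"],
})

# Number of listings for each (area, room type) pair; missing pairs become 0.
counts = pd.crosstab(df["neighbourhood_group"], df["room_type"])
print(counts)
```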
By number of listings, Manhattan and Brooklyn have the most, mostly whole homes and single rooms. There are very few shared rooms, which may be the result of user preferences or local listing constraints.
To summarize from the user's side: if you are traveling to New York and care most about value for money, a single room may be a good choice.
e. Next, let's look at the average price of the most-reviewed listings and explore whether a high price point goes together with the most reviews.
We first ranked the listings by number of reviews and extracted their price points as well:
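A sketch of the ranking and the average that follows from it. The helper most_reviewed is hypothetical and the rows are invented, so the printed average is not the figure from the real data:

```python
import pandas as pd

def most_reviewed(df, n=10):
    """Return the n listings with the highest number_of_reviews."""
    return df.nlargest(n, "number_of_reviews")

# Invented rows for illustration.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Private room"],
    "price": [50, 70, 160, 90],
    "number_of_reviews": [600, 550, 400, 10],
})

top = most_reviewed(df, n=3)
avg_price = top["price"].mean()              # average price of the top listings
room_share = top["room_type"].value_counts() # room types among the top listings

print(top[["name", "room_type", "price", "number_of_reviews"]])
print(avg_price)
print(room_share)
```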
As you can see, the price points are all within an acceptable range and not as high as we expected, except for #9, which is on the high side. The average price point was calculated as:
It can be concluded that the average price of the 10 most-reviewed listings is around $65.4. Users can therefore consider listings in this price range, in addition to preferring single rooms, when making a selection.
And since 9 of these 10 are single rooms, it is conceivable that, although there are slightly fewer single rooms than whole homes (a conclusion drawn from the previous visualization), most people choose single rooms as their travel accommodation.
From the user's point of view, then, it makes sense to look for a single room in this price range. Such listings have more feedback and better value for money, which makes it relatively easy to pick a temporary place to stay.
From a homeowner's perspective, a large number of listings does not necessarily mean a proportional increase in revenue, and what most visitors to New York actually want needs to be taken into account. For example, if a home is divided into single rooms, the chances of renting it out may increase considerably.
Experimental Conclusion:
From this initial analysis, we can first see how latitude and longitude, i.e. region, relate to the number of listings and to prices: Manhattan, and the part of Brooklyn near Manhattan, have higher price points and more listings.
This suggests high demand in this area, so the company may be able to focus on promoting or sourcing listings there. Secondly, from the pricing and listing counts, most areas mainly offer whole homes and single rooms. The price difference between the two is large, so the company can target different customers with precise marketing strategies to maximize returns.
Furthermore, the analysis of the most-reviewed listings shows that single rooms dominate and are moderately priced. A user focused on value for money can use this price as a baseline when searching, which raises the probability of finding a suitable property.
However, given the inevitable limitations of the data, and although the analysis is as comprehensive as possible, many areas could still be explored in more depth. For example, on reviews, the content of reviews, summarized as positive or negative, could be added to judge a listing's quality more accurately than the review count alone, since paid fake reviewers cannot be ruled out. User RFM scores could be included to determine which customers the site most needs to retain and how to segment them. Listing turnover could also be analyzed with the classic AARRR framework to maximize user conversion.
So this report serves more as inspiration, because it is unrealistic to base a company's or project's strategic decisions on a single dataset without comparisons or clearer, more detailed information. This is also something to keep in mind in future data analysis work.
The links to the code used for the experiments reported above are below:
/twelve417/Airbnb-2019-NYV/tree/master
Thanks, guys.
If you have any new ideas for this report, please leave a comment; I will also add to it if I come up with new ideas in the course of later study.