Yandex.Realty has an archive of property ads for St. Petersburg and surrounding areas covering the past few years. Your mission is to figure out how to determine the market value of real estate. You'll need to define parameters that make it possible to create an automated system capable of detecting anomalies and fraud.
There are two types of data available for every apartment for sale. The first is user input. The second was calculated automatically based on map data. This includes the distance from the city center, the airport, and the nearest park or body of water.
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
# read all input data
data_real_estate = pd.read_csv("real_estate_data_us.csv", sep='\t')
# brief information about input data
data_real_estate.info()
data_real_estate.head()
There are 23699 rows and 22 columns. A few general points follow, without close inspection.
data_real_estate['date_posted'] = pd.to_datetime(data_real_estate['date_posted'], format='%Y-%m-%dT%H:%M:%S')
The days_listed column has missing values and cannot be cast to integer as is. Two options:
a. Use fillna() with, for instance, a 0 value; zeros aren't otherwise used here, so the placeholder stays unambiguous; then use astype('int').
b. Convert the None values to np.nan; then use astype('int').
I choose option (a).
data_real_estate['days_listed'].fillna(value=0, inplace=True)
print('Number of zeros is {:}'.format(data_real_estate[data_real_estate['days_listed'] == 0]['days_listed'].count()))
# convert to integer
data_real_estate['days_listed'] = data_real_estate['days_listed'].astype('int')
Only a small number of cells are empty in these columns.
I could replace the missing values with the median for groups defined by total_area. Why total_area? There should be a good correlation between total_area and both kitchen_area and living_area: the larger the area of a part of the apartment, the larger the area of the whole apartment.
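A quick numeric check of this assumption (a sketch; DataFrame.corr ignores missing values pairwise):
# pairwise correlation between the area columns
print(data_real_estate[['total_area', 'living_area', 'kitchen_area']].corr())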
# let's see whether the correlation looks good or not
plt.figure()
x = data_real_estate.living_area
y = data_real_estate.total_area
plt.plot(x,y,'.')
plt.xlabel('living_area')
plt.ylabel('total_area')
plt.figure()
x = data_real_estate.kitchen_area
y = data_real_estate.total_area
plt.plot(x,y,'.')
plt.xlabel('kitchen_area')
plt.ylabel('total_area')
We can see that the total_area vs living_area scatter is less dispersed than the total_area vs kitchen_area one.
I think the best approach would be to fit a linear function that gives the best approximation; that would be a good tool to recover the missing living_area and kitchen_area values. But we have not studied that yet, so I will simply create groups by total_area and take the median within each group.
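For reference, a minimal sketch of the linear-fit idea using numpy.polyfit; it is computed only on rows where both areas are known and is not applied here:
import numpy as np
# fit living_area ~ a * total_area + b on rows where both values are known
known = data_real_estate.dropna(subset=['total_area', 'living_area'])
a, b = np.polyfit(known['total_area'], known['living_area'], deg=1)
print('living_area ~ {:.2f} * total_area + {:.2f}'.format(a, b))
# the missing living_area values could then be filled with a * total_area + b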
# define a function to use with apply() to create a new total_area category
def fun_area_cathegory(row):
    area = row.total_area
    step = max_total_area / num_area_category
    for k in range(num_area_category):
        if k*step <= area < (k+1)*step:
            return 'area_' + str(k)
    return 'area_' + str(num_area_category)
# create a new area_category describing total_area
num_area_category = 20
max_total_area = max(data_real_estate['total_area'])
data_real_estate['area_category'] = data_real_estate.apply(fun_area_cathegory, axis=1)
This approach has a clear disadvantage: because the data are so scattered, the bins near zero, where most apartments fall, are coarse. I could (a) increase the number of area categories or (b) make the steps finer close to zero. Still, I think this solution is good enough.
Another point: I used two global parameters, num_area_category and max_total_area, inside the function. I am not sure whether that is acceptable or whether I should find a way to avoid them.
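One way to avoid the globals is to pass the parameters explicitly: DataFrame.apply accepts extra positional arguments through args. A minimal sketch (area_category_v2 is just an illustrative column name):
# same categorisation, but the parameters are passed in rather than read from globals
def fun_area_category_args(row, n_categories, max_area):
    step = max_area / n_categories
    k = int(row['total_area'] // step)
    return 'area_' + str(min(k, n_categories))

data_real_estate['area_category_v2'] = data_real_estate.apply(
    fun_area_category_args, axis=1, args=(num_area_category, max_total_area))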
# save median values for each group for living area
living_area_cathegory_median = data_real_estate.groupby('area_category')['living_area'].median().sort_values()
# define a function for apply() that fills missing living_area values
def area_living_update(row):
    area = row['living_area']
    area_cathegory = row['area_category']
    if not pd.isnull(area):
        return area
    # missing value: use the median living_area of the row's area group
    return living_area_cathegory_median[area_cathegory]
# update the living_area column
data_real_estate['living_area'] = data_real_estate.apply(area_living_update, axis=1)
# save median values for each group for kitchen area
kitchen_area_cathegory_median = data_real_estate.groupby('area_category')['kitchen_area'].median().sort_values()
# define a function for apply() that fills missing kitchen_area values
def area_kitchen_update(row):
    area = row['kitchen_area']
    area_cathegory = row['area_category']
    if not pd.isnull(area):
        return area
    # missing value: use the median kitchen_area of the row's area group
    return kitchen_area_cathegory_median[area_cathegory]
# update the kitchen_area column
data_real_estate['kitchen_area'] = data_real_estate.apply(area_kitchen_update, axis=1)
# if the user doesn’t enter the number of balconies, then there probably aren’t any.
# The correct course of action here is to replace these missing values with 0.
data_real_estate['balconies'].fillna(0, inplace=True)
# convert to integer
data_real_estate['balconies'] = data_real_estate['balconies'].astype('int')
Ceiling height does not strongly depend on the other parameters, and approximately 50% of the values are absent. I will not update this column; there is no good reason to do so.
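A quick check of how much is actually missing (a sketch):
# share of rows with a missing ceiling_height
print('Share of missing ceiling_height values: {:.1%}'.format(
    data_real_estate['ceiling_height'].isna().mean()))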
data_real_estate.ceiling_height.unique()
print('Number of unique values: {:}'.format(len(data_real_estate.ceiling_height.unique())))
# show unique values
data_real_estate['floors_total'].unique()
# There are no zeros among the values, so I will temporarily fill NaN with zeros in order to convert the floats to integers.
data_real_estate['floors_total'].fillna(0, inplace=True)
# convert to integer
data_real_estate['floors_total'] = data_real_estate['floors_total'].astype('int')
# show all None (now zeros) values, maybe I will see how to update this column
data_real_estate.query('floors_total == 0').head(10)
Nothing special to see here; it is hard to reveal dependencies from the table alone. Maybe something interesting will show up later.
There is nothing to do here: some of the data is simply missing, and I can't fill it with random values.
print(data_real_estate['parks_within_3000'].unique())
print(data_real_estate['ponds_within_3000'].unique())
# The value 10 does not occur, so I temporarily fill NaN with 10 in order to convert the floats to integers.
data_real_estate['parks_within_3000'].fillna(10, inplace=True)
data_real_estate['ponds_within_3000'].fillna(10, inplace=True)
data_real_estate['parks_within_3000'] = data_real_estate['parks_within_3000'].astype('int')
data_real_estate['ponds_within_3000'] = data_real_estate['ponds_within_3000'].astype('int')
data_real_estate[data_real_estate['parks_within_3000'] == 10]['parks_within_3000'] = None
data_real_estate[data_real_estate['ponds_within_3000'] == 10]['ponds_within_3000'] = None
FAIL! Chained indexing like this does not modify the original DataFrame, and a plain integer column cannot hold None anyway.
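If keeping the missing values matters, pandas' nullable integer dtype can hold both integers and NA. A non-destructive sketch, assuming pandas >= 1.0 (parks_nullable is just an illustrative name; 10 is the placeholder used above):
# nullable Int64 keeps an integer dtype while allowing missing values
parks_nullable = data_real_estate['parks_within_3000'].astype('Int64')
parks_nullable = parks_nullable.mask(parks_nullable == 10)  # turn the placeholder back into <NA>
print(parks_nullable.dtype, parks_nullable.isna().sum())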
I noticed that several apartments have a 0 value in the bedrooms column.
I think I can do something with this data.
data_real_estate.query('bedrooms == 0').head()
# I noticed that the total_area of these apartments is rather small
# build a histogram
data_real_estate.query('bedrooms == 0')['total_area'].hist(bins=30)
Indeed, most of these apartments have a relatively small total area.
# group by number of bedrooms and define median total area in each group
bedrooms_area_const = data_real_estate.query('bedrooms != 0').groupby('bedrooms')['total_area'].median()
print(bedrooms_area_const)
bedrooms_area_const[1] * 1.1
# As expected, apartments with only 1 bedroom have a small total area, below 40 m2
# Now I have a basis for updating the zero values
def update_bedrooms(row):
    beds = row['bedrooms']
    area = row['total_area']
    if beds != 0:
        return beds
    else:
        if area < (bedrooms_area_const[1] + bedrooms_area_const[2]) / 2:
            return 1
        elif area < (bedrooms_area_const[2] + bedrooms_area_const[3]) / 2:
            return 2
        elif area < (bedrooms_area_const[3] + bedrooms_area_const[4]) / 2:
            return 3
        elif area < (bedrooms_area_const[4] + bedrooms_area_const[5]) / 2:
            return 4
    return 0
data_real_estate['bedrooms'] = data_real_estate.apply(update_bedrooms, axis=1)
data_real_estate.query('bedrooms == 0')
The last remaining apartment has 371 m². It is hard to tell how many bedrooms it actually has, but it is a single row, so assigning an approximate value should not be a serious mistake.
data_real_estate.loc[19392, 'bedrooms'] = 19
I have converted the types of some columns and filled in missing values in several of them.
# the price per square meter
data_real_estate['price2sq'] = data_real_estate['last_price'] / data_real_estate['total_area']
# the day of the week, month, and year that the ad was published
data_real_estate['day_of_week'] = data_real_estate['date_posted'].dt.weekday
data_real_estate['month'] = data_real_estate['date_posted'].dt.month
data_real_estate['year'] = data_real_estate['date_posted'].dt.year
# which floor the apartment is on (first, last, or other)
def fun_floor_cathegory(row):
    floor = row['floor']
    total_floor = row['floors_total']
    if floor == 1:
        return 'first'
    elif floor == total_floor:
        return 'last'
    else:
        return 'other'
data_real_estate['floor_cathegory'] = data_real_estate.apply(fun_floor_cathegory, axis=1)
# the ratio between the living space and the total area, as well as between the kitchen space and the total area
data_real_estate['ratio_living_area'] = data_real_estate['living_area'] / data_real_estate['total_area']
data_real_estate['ratio_kitchen_area'] = data_real_estate['kitchen_area'] / data_real_estate['total_area']
data_real_estate.head(5)
I have created new parameters to analyze them below.
Carefully investigate the following parameters: square area, price, number of rooms, and ceiling height. Plot a histogram for each parameter.
# investigate total_area; histogram of total_area
data_real_estate['total_area'].hist(bins=100)
# I can see some extremely high total_area values
# find the top 10 highest values
data_real_estate['total_area'].sort_values().tail(10)
data_real_estate.sort_values('total_area').tail(10)
It seems these high values are genuine, not typos: the price per square meter for these apartments varies from 1900 to 9500, which is plausible.
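A narrower view of the same rows, showing just the price per square meter (a sketch using the price2sq column created above):
# price per square meter for the ten largest apartments
print(data_real_estate.sort_values('total_area').tail(10)['price2sq'])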
# Let's see how many apartments are under 200 m^2
print('Share of apartments with area under 200 m2: {:.2f}'.format(
data_real_estate.query('total_area < 200').shape[0] / data_real_estate.shape[0]))
# let's look at a detailed histogram of total_area, excluding extreme values
data_real_estate['total_area'].hist(bins=20, range=(0,200))
The histogram resembles a Poisson-like distribution: a high peak at the beginning and a long tail.
This is not surprising, since apartment areas cluster around typical values and we record data over a period of time.
Most values fall in the range (37.5, 50.0]. I don't see any extraordinary values.
# a more detailed histogram of total_area, excluding extreme values
data_real_estate['total_area'].hist(bins=200, range=(0,200))
The saw-tooth shape of this histogram is nothing alarming: buildings follow typical designs, so apartments have typical areas.
The extremely high values appear to be genuine: the price per square meter for these apartments varies from 1900 to 9500, which is reasonable.
The overwhelming majority (99%) of apartments have a total area under 200 m².
The histogram resembles a Poisson-like distribution, and most values fall in the range (37.5, 50.0].
# investigate a price
data_real_estate['last_price'].hist(bins=100)
print('Maximum price is:',data_real_estate['last_price'].max())
print('The apartment with the highest total area has a price of:',
data_real_estate.sort_values('total_area').tail(1)['last_price'].values[0])
# It is hard to believe that the maximum price is nearly twice the price of the apartment with the greatest total area
data_real_estate.sort_values('last_price').tail()
# I think that real price is 1 526 000 rather than 15 260 000.
print('Price per square meter for this extra-expensive apartment is:',
      data_real_estate.sort_values('last_price').tail(1)['price2sq'].values[0],
      '. It is too high.')
print('Median price per square meter:', data_real_estate['price2sq'].median())
print('Average price per square meter: {:.1f}'.format(data_real_estate['price2sq'].mean()))
data_real_estate['last_price'].hist(bins=100, range=[0, 1e6])
data_real_estate['last_price'].hist(bins=100, range=[0, 4e5])
I think I should remove the rows with these extraordinary prices. The best way is to filter the data by price per square meter.
# number of rooms
print('Maximum number of rooms is:',data_real_estate['bedrooms'].max())
data_real_estate.sort_values('bedrooms').tail()
# histogram
data_real_estate['bedrooms'].hist(bins=19,range=[0,19])
# histogram
data_real_estate['bedrooms'].hist(bins=10,range=[0,10])
I don't see anything unusual. The most common numbers of rooms are 1, 2, and 3, as expected.
# ceiling height
print('Minimum ceiling height is:',data_real_estate['ceiling_height'].min())
I know these guys: https://en.wikipedia.org/wiki/Hobbit :) But they live in New Zealand, not Saint Petersburg. It must be a mistake or a typo.
print('Maximum ceiling height is:',data_real_estate['ceiling_height'].max())
Without any doubt, the Ents have found their homes: https://en.wikipedia.org/wiki/Ent :) It is a mistake or a typo.
print('Median ceiling height:',data_real_estate['ceiling_height'].median())
print('Average ceiling height: {:.2f}'.format(data_real_estate['ceiling_height'].mean()))
data_real_estate.query('ceiling_height == ceiling_height').sort_values('ceiling_height', ascending=False).head()
Most likely 27.5 should be 2.75 and 27.0 should be 2.7, especially since these apartments have a total area typical of 1-room apartments.
I will update these values using apply(); this way is easier to automate.
def ceiling_height_update(row):
    ceiling_height = row['ceiling_height']
    if ceiling_height == 32.0:
        return 3.20
    elif ceiling_height == 27.5:
        return 2.75
    elif ceiling_height == 27.0:
        return 2.70
    elif ceiling_height == 26.5:
        return 2.65
    elif ceiling_height == 26.0:
        return 2.60
    elif ceiling_height == 25.0:
        return 2.50
    elif ceiling_height == 24.0:
        return 2.40
    else:
        return ceiling_height
data_real_estate['ceiling_height'] = data_real_estate.apply(ceiling_height_update, axis=1)
data_real_estate.query('ceiling_height == ceiling_height').sort_values('ceiling_height', ascending=False).head()
Values such as 22.6, 20.0, 14.0, and 10.3 are mistakes too, but it is hard to determine the correct values.
Maybe it is better to divide the data into total_area groups and then estimate approximate values per group.
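A minimal sketch of that idea, reusing the area_category groups created earlier (the 20 m cut-off for "implausible" heights is an arbitrary choice, and the actual replacement is left commented out):
# median ceiling height per total_area group, computed on plausible values only
plausible = data_real_estate.query('ceiling_height < 20')
ceiling_by_area = plausible.groupby('area_category')['ceiling_height'].median()
print(ceiling_by_area.head())
# the implausible heights could then be replaced by their group's median, e.g.:
# mask = data_real_estate['ceiling_height'] >= 20
# data_real_estate.loc[mask, 'ceiling_height'] = data_real_estate.loc[mask, 'area_category'].map(ceiling_by_area)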
# histogram
data_real_estate['ceiling_height'].hist(bins=10,range=[2,3.5])
I corrected some apparent typos, such as 27.5 -> 2.75.
I also found extremely strange values, such as a ceiling height of 100 m for an apartment whose total area is only 25.0 m² and whose price exceeds 300000.0.
There are also some extremely small values.
# remember that zero values stand for missing data
print('Share of rows with a non-empty days_listed: {:.2f}'.format(
len(data_real_estate[data_real_estate['days_listed'] != 0]) /
len(data_real_estate)))
print('Minimal days listed:',data_real_estate[data_real_estate['days_listed'] != 0]['days_listed'].min())
print('Maximum days listed:',data_real_estate[data_real_estate['days_listed'] != 0]['days_listed'].max())
# so the minimum number of days an apartment was listed is 1
# build the histogram starting from 1 to exclude the zero placeholders
data_real_estate['days_listed'].hist(
range=[data_real_estate[data_real_estate['days_listed'] != 0]['days_listed'].min(),
data_real_estate[data_real_estate['days_listed'] != 0]['days_listed'].max()],
bins=30
)
From the histogram, roughly 6500 + 4500 + 2500 + 1500 = 15000 of the 23700 apartments were listed (and most likely sold) within 200 days, which is about 63%.
print('Median days listed:', data_real_estate.query('0 < days_listed')['days_listed'].median())
print('Average days listed:', data_real_estate.query('0 < days_listed')['days_listed'].mean())
The average is nearly twice the median, which means the distribution of days_listed has a long and/or heavy tail.
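A quick look at the upper quantiles confirms the tail (zeros are excluded because they stand for missing values):
# upper quantiles of days_listed, excluding the zero placeholders
print(data_real_estate.query('days_listed > 0')['days_listed'].quantile([0.5, 0.75, 0.9, 0.95, 0.99]))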
It is interesting to find out why some apartments sold faster than others.
I think several factors could play a role, primarily price and number of rooms. Let's check.
# price should be the main factor; first let's check globally, without dividing into groups
# select apartments sold quickly (the faster 50%) and find their median and mean prices
print('--- quickly sold apartments ---')
print('Median last price for quickly sold apartments:',
      data_real_estate.query('0 < days_listed <= 95')['last_price'].median())
print('Average last price for quickly sold apartments:',
      data_real_estate.query('0 < days_listed <= 95')['last_price'].mean())
# select apartments sold slowly (the slower 50%)
print('--- slowly sold apartments ---')
print('Median last price for slowly sold apartments:',
      data_real_estate.query('days_listed > 95')['last_price'].median())
print('Average last price for slowly sold apartments:',
      data_real_estate.query('days_listed > 95')['last_price'].mean())
So the difference is about 8%, which is substantial when we are dealing with apartments.
BUT! This contrast could be explained by the fact that bigger apartments may sell more slowly, and of course they are more expensive.
Let's see how many bedrooms these apartments have.
# group apartments by number of bedrooms
data_real_estate.query('0 < days_listed <= 95 and bedrooms > 0').groupby('bedrooms')['days_listed'].count()
# It is better to declare a new category, rooms_cat, with values '1-room', '2-room', '3-room', 'many-room'
def fun_rooms_cathegory(row):
    room = row['bedrooms']
    if room == 1:
        return '1-room'
    elif room == 2:
        return '2-room'
    elif room == 3:
        return '3-room'
    elif room > 3:
        return 'many-room'
    else:
        return None
data_real_estate['rooms_cat'] = data_real_estate.apply(fun_rooms_cathegory, axis=1)
# number of apartments in each group
data_real_estate.query('0 < days_listed <= 95 and bedrooms > 0').groupby('rooms_cat')['days_listed'].count()
print('Share of quickly sold apartments (%):',
      data_real_estate.query('0 < days_listed <= 95 and bedrooms > 0').groupby('rooms_cat')['days_listed'].count() /
      data_real_estate.query('0 < days_listed <= 95 and bedrooms > 0').groupby('rooms_cat')['days_listed'].count().sum() * 100)
# the same table for slowly sold apartments
print('Share of slowly sold apartments (%):',
      data_real_estate.query('days_listed > 95').groupby('rooms_cat')['days_listed'].count() /
      data_real_estate.query('days_listed > 95').groupby('rooms_cat')['days_listed'].count().sum() * 100)
# a detailed price table for quickly sold apartments
print('Median price of quickly sold apartments:',
      data_real_estate.query('0 < days_listed <= 95 and bedrooms > 0').groupby('rooms_cat')['last_price'].median())
# the same for slowly sold apartments
print('Median price of slowly sold apartments:',
      data_real_estate.query('days_listed > 95 and bedrooms > 0').groupby('rooms_cat')['last_price'].median())
As I expected, the share of 1-room apartments is higher in the quickly sold group than in the slowly sold group. But it was a mistake to think that the price of 1-room apartments differs: the medians are the same. At the same time, many-room apartments, which are more expensive, sell more slowly.
# what about apartments sold super fast?
# 2 weeks is really fast
print('Median price of apartments sold within 2 weeks:',
      data_real_estate.query('0 < days_listed <= 14 and bedrooms > 0').groupby('rooms_cat')['last_price'].median())
# what about apartments sold super slowly?
# 180 days (~half a year) is really slow
print('Median price of apartments listed for more than 180 days:',
      data_real_estate.query('days_listed > 180').groupby('rooms_cat')['last_price'].median())
So the total price does not rule the game. Maybe it is better to look at the price per square meter?
# what about apartments sold super fast?
# 2 weeks is really fast
print('Price per square meter of apartments sold within 2 weeks:',
      data_real_estate.query('0 < days_listed <= 14 and bedrooms > 0').groupby('rooms_cat')['price2sq'].median())
# what about apartments sold super slowly?
# 180 days (~half a year) is really slow
print('Price per square meter of apartments listed for more than 180 days:',
      data_real_estate.query('days_listed > 180').groupby('rooms_cat')['price2sq'].median())
Surprisingly, the price per square meter does not influence how fast an apartment sells.
What if this result is due to inflation?
# what about apartments sold super fast? group by year
# 2 weeks is really fast
print('Price per square meter of apartments sold within 2 weeks, by year:',
      data_real_estate.query('0 < days_listed <= 14 and bedrooms > 0').groupby('year')['price2sq'].median())
# all apartments, by year
print('Price per square meter for all apartments, by year:',
      data_real_estate.groupby('year')['price2sq'].median())
It looks like magic: the quickly sold apartments actually have an even higher price per square meter! Maybe they have good locations and people really want to buy them.
Conclusion:
I found that price hardly influences how fast apartments sell. 1-room apartments sell faster, while many-room apartments sell more slowly.
Am I right that only now do I remove the rare data? I don't fully understand why I should remove it; I don't see a problem here like the one in the gas-station dataset. I am doing it only because the project requires it.
The one good reason is to study typical apartments.
# how many apartments were sold very quickly? are there any anomalies?
data_real_estate['days_listed'].hist(
range=[data_real_estate[data_real_estate['days_listed'] != 0]['days_listed'].min(), 30],
bins=30
)
No, there are no anomalies at the low end. It looks OK.
# I will cut tail of days listed
# mean and standard deviation
print('Average days listed:', data_real_estate['days_listed'].mean())
print('Standard deviation days listed:', data_real_estate['days_listed'].std())
mean_days_listed = data_real_estate['days_listed'].mean()
std_days_listed = data_real_estate['days_listed'].std()
# 3-sigma filter
data_real_estate_filt_days = data_real_estate.query('days_listed < @mean_days_listed + 3*@std_days_listed')
print('Share of deleted rows (%):', (len(data_real_estate) - len(data_real_estate_filt_days)) / len(data_real_estate) * 100)
# outlying values: bedrooms
data_real_estate_filt_days['bedrooms'].hist()
# filter out apartments with more than 5 bedrooms
data_real_estate_filt_beds = data_real_estate_filt_days.query('bedrooms <= 5')
print('Share of deleted rows (%):',
(len(data_real_estate_filt_days) - len(data_real_estate_filt_beds)) / len(data_real_estate_filt_days) * 100)
# outlying values: price per square
data_real_estate_filt_beds['price2sq'].hist(bins=30)
# filter out apartments with a price per square meter above 5000
data_real_estate_filt_price = data_real_estate_filt_beds.query('price2sq <= 5000')
print('Share of deleted rows (%):',
(len(data_real_estate_filt_beds) - len(data_real_estate_filt_price)) / len(data_real_estate_filt_beds) * 100)
data_real_estate_final = data_real_estate_filt_price
I have filtered out the outliers, so now I can analyse typical apartments.
Examine whether the value depends on the total square area, number of rooms, floor (top or bottom), or the proximity to the city center area. Also check whether the publication date has any effect on the price: specifically, day of the week, month, and year. Note that using scatter plot is preferable to hexbin. If you do decide to use hexbin, please use scatter plot too, and then compare the results. It is also recommended to check the hexbin documentation and carefully study its parameters.
I have already analysed some of these factors (total area and number of rooms); I don't expect the results to change after filtering.
data_real_estate_final.plot(x='total_area', y='last_price', style='.', grid=True, legend=True, alpha=0.1)
plt.xlabel('total_area')
plt.ylabel('last_price')
The greater the total_area, the greater the last_price. After filtering, this trend is clearly visible on the plot.
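For comparison with the scatter plot above, here is a hexbin view of the same relationship (a sketch using pandas' plot.hexbin; gridsize=40 is an arbitrary choice):
# hexbin of price vs total area; darker cells contain more listings
data_real_estate_final.plot.hexbin(x='total_area', y='last_price', gridsize=40, cmap='Blues')
plt.xlabel('total_area')
plt.ylabel('last_price')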
data_real_estate_final.plot(x='bedrooms', y='last_price', style='o', grid=True, legend=True, alpha=0.003)
plt.xlabel('bedrooms')
plt.ylabel('last_price')
The results are the same as before.
# floor
data_real_estate_final['floor'].hist()
data_real_estate_final.plot(x='floor', y='last_price', style='o', grid=True, legend=True, alpha=0.01)
plt.xlabel('floor')
plt.ylabel('last_price')
The floor number does not influence the price as strongly as the total area does.
data_real_estate_final.query('bedrooms == 1').groupby('floor_cathegory')['last_price'].median()
data_real_estate_final.groupby('floor_cathegory')['last_price'].median()
So, the first and last floors have a lower median price than the others.
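The same comparison is easier to read as a pivot table (a sketch; the rooms_cat and floor_cathegory columns were created earlier):
# median price by floor category and room category
data_real_estate_final.pivot_table(index='floor_cathegory', columns='rooms_cat', values='last_price', aggfunc='median')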
# city_center_dist
data_real_estate_final.plot(x='city_center_dist', y='last_price', style='o', grid=True, legend=True, alpha=1)
plt.xlabel('city_center_dist')
plt.ylabel('last_price')
print(data_real_estate_final['city_center_dist'].corr(data_real_estate_final['last_price']))
So, there are no extremely high-priced apartments far from the city center, and I see a negative correlation between these parameters.
# day_of_week
data_real_estate_final.plot(x='day_of_week', y='last_price', style='o', grid=True, legend=True, alpha=1)
plt.xlabel('day_of_week')
plt.ylabel('last_price')
print(data_real_estate_final['day_of_week'].corr(data_real_estate_final['last_price']))
The day of the week has no influence at all.
data_real_estate_final.plot(x='month', y='last_price', style='o', grid=True, legend=True, alpha=1)
plt.xlabel('month')
plt.ylabel('last_price')
print(data_real_estate_final['month'].corr(data_real_estate_final['last_price']))
data_real_estate_final.plot(x='year', y='last_price', style='o', grid=True, legend=True, alpha=1)
plt.xlabel('year')
plt.ylabel('last_price')
print(data_real_estate_final['year'].corr(data_real_estate_final['last_price']))
data_real_estate_final.groupby('year')['last_price'].median()
After the crisis, people could not pay high prices for apartments. Since 2016, inflation has pushed apartment prices up every year.
Finally: I have cleaned the data of mistakes and outliers and converted the types of some columns.
The price of an apartment depends mostly on the total area, on whether it is on the first or last floor, and on the distance from the city center. I have created pivot tables and plots to show this. Prices also change from year to year.
It would be interesting to make a more detailed investigation within each year.