The effect of weather on taxi services: forming hypotheses

You're working as an analyst for Zuber, a new ride-sharing company that's launching in Chicago. Your task is to find patterns in the available information. You want to understand passenger preferences and the impact of external factors on rides.

You'll study a database, analyze data from competitors, and test a hypothesis about the impact of weather on ride duration.

Data Description

You have three CSV files:

project_sql_result_01.csv. It contains the following data:

  • company_name: taxi company name
  • trips_amount: the number of rides for each taxi company on November 15-16, 2017.

project_sql_result_04.csv. It contains the following data:

  • dropoff_location_name: Chicago neighborhoods where rides ended
  • average_trips: the average number of rides that ended in each neighborhood in November 2017.

project_sql_result_07.csv. It contains data on rides from the Loop to O'Hare International Airport:

  • start_ts: ride start date and time
  • weather_conditions: weather conditions at the moment the ride started
  • duration_seconds: ride duration in seconds

For the first two datasets you now need to:

  • import the files
  • study the data they contain
  • make sure the data types are correct
  • identify the top 10 neighborhoods in terms of drop-offs
  • make graphs: taxi companies and number of rides; top 10 neighborhoods by number of drop-offs
  • draw conclusions based on each graph and explain the results
In [84]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind

Exploratory data analysis

Checklist:

  • [x] the files were opened
  • [x] the files have been studied (the first rows printed, the info() method used)
  • [x] data types were checked for correctness
  • [x] the top 10 neighborhoods with the highest number of drop-offs were selected
  • [x] the graph 'taxi companies and the number of rides' was plotted
  • [x] the graph 'top 10 neighborhoods and the number of rides' was plotted
In [18]:
trips_amount = pd.read_csv('project_sql_result_01.csv')
dropoff_locations = pd.read_csv('project_sql_result_04.csv')
weather_trips = pd.read_csv('project_sql_result_07.csv')
In [24]:
print(trips_amount.info())
trips_amount.head(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
company_name    64 non-null object
trips_amount    64 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
None
Out[24]:
company_name trips_amount
0 Flash Cab 19558
1 Taxi Affiliation Services 11422
2 Medallion Leasin 10367
3 Yellow Cab 9888
4 Taxi Affiliation Service Yellow 9299
In [25]:
print(dropoff_locations.info())
dropoff_locations.head(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
dropoff_location_name    94 non-null object
average_trips            94 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.5+ KB
None
Out[25]:
dropoff_location_name average_trips
0 Loop 10727.466667
1 River North 9523.666667
2 Streeterville 6664.666667
3 West Loop 5163.666667
4 O'Hare 2546.900000
In [26]:
print(weather_trips.info())
weather_trips.head(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
start_ts              1068 non-null object
weather_conditions    1068 non-null object
duration_seconds      1068 non-null float64
dtypes: float64(1), object(2)
memory usage: 25.1+ KB
None
Out[26]:
start_ts weather_conditions duration_seconds
0 2017-11-25 16:00:00 Good 2410.0
1 2017-11-25 14:00:00 Good 1920.0
2 2017-11-25 12:00:00 Good 1543.0
3 2017-11-04 10:00:00 Good 2512.0
4 2017-11-11 07:00:00 Good 1440.0
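
One thing the type check turns up: start_ts is stored as object (strings). Only duration_seconds is needed for the test below, but for completeness, a minimal sketch of the conversion (assuming the timestamps parse as shown):

In [ ]:
# Convert ride start timestamps from strings to datetime
weather_trips['start_ts'] = pd.to_datetime(weather_trips['start_ts'])
print(weather_trips['start_ts'].dtype)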

Select the top 10 neighborhoods by the number of drop-offs.

In [30]:
top10_dropoff_locations = (dropoff_locations
                           .sort_values(by='average_trips', ascending=False)
                           .head(10))
top10_dropoff_locations
Out[30]:
dropoff_location_name average_trips
0 Loop 10727.466667
1 River North 9523.666667
2 Streeterville 6664.666667
3 West Loop 5163.666667
4 O'Hare 2546.900000
5 Lake View 2420.966667
6 Grant Park 2068.533333
7 Museum Campus 1510.000000
8 Gold Coast 1364.233333
9 Sheffield & DePaul 1259.766667

Plot the graphs: taxi companies and the number of rides; the top 10 neighborhoods by the number of drop-offs.

In [73]:
plt.figure(figsize=(15, 5))

# Pie chart: trip shares of the largest cab operators; companies outside the top 10 are grouped as 'Other'
report = trips_amount.copy().sort_values(by='trips_amount', ascending=False)
top10_companies = report.head(10)['company_name'].unique()
report.loc[~report['company_name'].isin(top10_companies), 'company_name'] = 'Other'
report = report.groupby('company_name').agg({'trips_amount': 'sum'}).reset_index()
plt.subplot(1, 2, 1).pie(report['trips_amount'],
                         labels=report['company_name'],
                         autopct='%1.1f%%')
plt.title('Taxi companies by % of total trips, Nov 15-16, 2017')

# Bar chart: top 10 drop-off locations by average daily trips
report = top10_dropoff_locations.copy()
plt.subplot(1, 2, 2).bar(report['dropoff_location_name'],
                         report['average_trips'])
plt.title('Top 10 dropoff locations by average daily trips, Nov 2017')
plt.xlabel('Neighborhood')
plt.ylabel('Avg daily trips')
plt.xticks(rotation=45)
plt.tight_layout()

Draw conclusions for each of the graphs, provide arguments for the results.

Conclusion:

  • The market leaders by the number of rides are Flash Cab (14%), Taxi Affiliation Services (8.3%) and Medallion Leasing. The remaining relatively large competitors hold shares of 5-7% each. The smaller taxi companies put together accounted for only 28% of all trips on November 15-16, 2017 (the sketch after this list recomputes these shares);
  • Loop, River North, Streeterville and West Loop top the list of neighborhoods by the number of drop-offs. No wonder: these neighborhoods make up downtown Chicago.
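
The shares quoted above can be recomputed directly from trips_amount (a quick sketch, assuming the frame loaded in In [18] is still in memory):

In [ ]:
# Each company's share of all trips on November 15-16, 2017
shares = trips_amount.sort_values('trips_amount', ascending=False).copy()
shares['share_pct'] = shares['trips_amount'] / shares['trips_amount'].sum() * 100
print(shares.head(10))
print('Combined share outside the top 10: {:.1f}%'.format(shares['share_pct'].iloc[10:].sum()))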

Testing hypotheses

Checklist:

  • [x] the hypothesis "On rainy Saturdays the average duration of rides from the Loop neighborhood to O'Hare International Airport changes" was tested.
  • [x] explanations for "how did you form the null and alternative hypotheses" were provided
  • [x] explanation for "what criterion did you use for testing the hypotheses and why" was provided
  • [x] there are conclusions at each stage
  • [x] there's an overall conclusion

Let's start with some visual exploration:

In [83]:
good_weather_trip_durations = weather_trips.query('weather_conditions == "Good"')['duration_seconds']
bad_weather_trip_durations = weather_trips.query('weather_conditions == "Bad"')['duration_seconds']

# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the modern equivalent
sns.histplot(good_weather_trip_durations, kde=True, stat='density', label='Good', color='green')
sns.histplot(bad_weather_trip_durations, kde=True, stat='density', label='Bad', color='red')
plt.legend()
plt.title("Trips from Loop to O'Hare.\nDuration distributions by weather conditions.")
plt.xlabel('Trip duration, seconds')
plt.show()

We can observe a noticeable rise in ride duration on rainy days. Let's check that this difference isn't just a statistical fluke.
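
First, a quick numeric comparison of the two groups (a minimal sketch over the weather_trips frame loaded above):

In [ ]:
# Summary statistics per weather group, to put numbers on the visual difference
print(weather_trips.groupby('weather_conditions')['duration_seconds']
                   .agg(['count', 'mean', 'median', 'std']))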

Let's form the hypotheses:

  • H0: the average ride duration on fine and on rainy days is the same;
  • H1: the average ride duration on fine and on rainy days is different.

We'll use an independent two-sample t-test, since it compares the means of two independent samples. Specifically, we'll run the Welch variant (equal_var=False), which doesn't assume the two groups have equal variances.

We'll choose 0.05 as the alpha level. Since we're testing only one hypothesis, there is no need to adjust it.
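
As an optional sanity check on the unequal-variance choice, the two groups' variances can be compared with Levene's test (a sketch, not executed here):

In [ ]:
# Levene's test: H0 is that the two groups have equal variances
from scipy.stats import levene
print(levene(good_weather_trip_durations, bad_weather_trip_durations).pvalue)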

In [87]:
alpha = 0.05
p_val = ttest_ind(good_weather_trip_durations,
                  bad_weather_trip_durations,
                  equal_var=False).pvalue
print(p_val)
if p_val <= alpha:
    print('P-value ({:.2e}) is below the significance level ({}). H0 should be rejected'.format(p_val, alpha))
else:
    print('P-value ({:.2e}) is above the significance level ({}). H0 cannot be rejected'.format(p_val, alpha))
6.738994326108734e-12
P-value (6.74e-12) is below the significance level (0.05). H0 should be rejected

Conclusion:

The statistical test confirms that the difference seen in the plot is very unlikely to be random. The average duration of rides from the Loop to O'Hare on rainy days does indeed differ from that on fine days.
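
To put the detected difference in practical terms, a short follow-up sketch (hypothetical, assuming the series defined above are still in memory):

In [ ]:
# Size of the detected effect, expressed in minutes
diff = bad_weather_trip_durations.mean() - good_weather_trip_durations.mean()
print('Mean duration difference (Bad - Good): {:.0f} s (~{:.1f} min)'.format(diff, diff / 60))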