The effect of weather on taxi services: forming hypotheses

You're working as an analyst for Zuber, a new ride-sharing company that's launching in Chicago. Your task is to find patterns in the available information. You want to understand passenger preferences and the impact of external factors on rides.

You'll study a database, analyze data from competitors, and test a hypothesis about the impact of weather on ride frequency.

Data Description

You have these two CSVs:

project_sql_result_01.csv. It contains the following data:

  • company_name: taxi company name
  • trips_amount: the number of rides for each taxi company on November 15-16, 2017.

project_sql_result_04.csv. It contains the following data:

  • dropoff_location_name: Chicago neighborhoods where rides ended
  • average_trips: the average number of rides that ended in each neighborhood in November 2017.

For these two datasets, you now need to:

  • import the files
  • study the data they contain
  • make sure the data types are correct
  • identify the top 10 neighborhoods in terms of drop-offs
  • make graphs: taxi companies and number of rides, top 10 neighborhoods by number of dropoffs
  • draw conclusions based on each graph and explain the results
In [84]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, ttest_rel

Exploratory data analysis


  • [x] the files were opened
  • [x] the files have been studied (the first rows printed, the info() method used)
  • [x] data types are checked for correctness
  • [x] top-10 neighbourhoods with the highest drop-off rates selected;
  • [x] the graph 'taxi companies and the number of rides' was plotted;
  • [x] the graph 'top-10 neighbourhoods and the number of rides' was plotted;
In [18]:
trips_amount = pd.read_csv('project_sql_result_01.csv')
dropoff_locations = pd.read_csv('project_sql_result_04.csv')
weather_trips = pd.read_csv('project_sql_result_07.csv')
In [24]:
trips_amount.info()
trips_amount.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
company_name    64 non-null object
trips_amount    64 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
                      company_name  trips_amount
0                        Flash Cab         19558
1        Taxi Affiliation Services         11422
2                 Medallion Leasin         10367
3                       Yellow Cab          9888
4  Taxi Affiliation Service Yellow          9299
In [25]:
dropoff_locations.info()
dropoff_locations.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
dropoff_location_name    94 non-null object
average_trips            94 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.5+ KB
   dropoff_location_name  average_trips
0                   Loop   10727.466667
1            River North    9523.666667
2          Streeterville    6664.666667
3              West Loop    5163.666667
4                 O'Hare    2546.900000
In [26]:
weather_trips.info()
weather_trips.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
start_ts              1068 non-null object
weather_conditions    1068 non-null object
duration_seconds      1068 non-null float64
dtypes: float64(1), object(2)
memory usage: 25.1+ KB
              start_ts  weather_conditions  duration_seconds
0  2017-11-25 16:00:00                Good            2410.0
1  2017-11-25 14:00:00                Good            1920.0
2  2017-11-25 12:00:00                Good            1543.0
3  2017-11-04 10:00:00                Good            2512.0
4  2017-11-11 07:00:00                Good            1440.0
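
One type worth a second look is start_ts, which info() reports as object rather than a datetime. Below is a minimal sketch of the conversion, rebuilt from the five rows printed above so that it runs on its own. The sample frame and the weekday column are illustrative and are not part of the original notebook.

```python
import pandas as pd

# Toy frame copied from the five rows shown above (not the full dataset).
sample = pd.DataFrame({
    'start_ts': ['2017-11-25 16:00:00', '2017-11-25 14:00:00',
                 '2017-11-25 12:00:00', '2017-11-04 10:00:00',
                 '2017-11-11 07:00:00'],
    'weather_conditions': ['Good'] * 5,
    'duration_seconds': [2410.0, 1920.0, 1543.0, 2512.0, 1440.0],
})

# Parse the text timestamps into real datetimes.
sample['start_ts'] = pd.to_datetime(sample['start_ts'], format='%Y-%m-%d %H:%M:%S')

# Hypothetical helper column: the weekday name of each ride.
sample['weekday'] = sample['start_ts'].dt.day_name()

print(sample.dtypes)
print(sample['weekday'].unique())
```

On the full file, the same conversion could be done at load time by passing parse_dates = ['start_ts'] to pd.read_csv. The weekday column is a quick way to confirm which day of the week the sample covers.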

Select top-10 neighbourhoods by the number of drop-offs.

In [30]:
top10_dropoff_locations = dropoff_locations.sort_values(by = 'average_trips', ascending = False).head(10)
top10_dropoff_locations
   dropoff_location_name  average_trips
0                   Loop   10727.466667
1            River North    9523.666667
2          Streeterville    6664.666667
3              West Loop    5163.666667
4                 O'Hare    2546.900000
5              Lake View    2420.966667
6             Grant Park    2068.533333
7          Museum Campus    1510.000000
8             Gold Coast    1364.233333
9     Sheffield & DePaul    1259.766667
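
The same kind of selection can also be written with DataFrame.nlargest, which sorts and truncates in one call. Below is a small sketch on a toy frame; the location names and numbers are made up for illustration.

```python
import pandas as pd

# Made-up data with the same column names as dropoff_locations.
toy = pd.DataFrame({
    'dropoff_location_name': ['A', 'B', 'C', 'D'],
    'average_trips': [120.5, 980.0, 45.2, 310.0],
})

# Two equivalent ways to keep the rows with the highest average_trips.
top2_sorted = toy.sort_values(by='average_trips', ascending=False).head(2)
top2_nlargest = toy.nlargest(2, 'average_trips')

print(top2_nlargest)
```

Both approaches keep the original index labels, so the results can be compared directly with DataFrame.equals.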

Plot the graphs: taxi companies and the number of rides, top-10 neighbourhoods by the number of drop-offs.

In [73]:
plt.figure(figsize = (15, 5))

# pie chart of taxi companies: keep the 10 largest, pool the rest into 'Other'
report = trips_amount.copy().sort_values(by = 'trips_amount', ascending = False)
top10_companies = report.head(10)['company_name'].unique()
report.loc[~report['company_name'].isin(top10_companies), 'company_name'] = 'Other'
report = report.groupby('company_name').agg({'trips_amount': 'sum'}).reset_index()
plt.subplot(1, 2, 1).pie(report['trips_amount'],
                         labels = report['company_name'],
                         autopct = '%1.1f%%')  # assumed: print each slice's share
plt.title('Taxi companies by % of total trips, Nov. 15-16, 2017')

# bar chart of the top 10 drop-off locations
report = top10_dropoff_locations.copy()
plt.subplot(1, 2, 2).bar(report['dropoff_location_name'],
                         report['average_trips'])
plt.title('Top 10 dropoff locations by average daily trips, Nov. 2017')
plt.ylabel('Avg daily trips')
plt.xticks(rotation = 45)

Draw conclusions for each of the graphs, provide arguments for the results.


  • The market leaders by number of rides are Flash Cab (14%), Taxi Affiliation Services (8.3%) and Medallion Leasing. The other relatively large competitors each hold a 5-7% share. The smaller taxi companies combined account for only 28% of all trips in the period covered (November 15-16, 2017);
  • Loop, River North, Streeterville and West Loop are the top neighbourhoods by number of drop-offs. This is no surprise, as these neighbourhoods make up downtown Chicago.
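
Shares like the ones quoted above can be computed directly rather than read off the pie chart. Below is a minimal sketch on a toy table. The company names and counts are made up, and top_n is a hypothetical parameter, set to 2 here where the analysis above uses 10.

```python
import pandas as pd

# Made-up data with the same column names as trips_amount.
toy = pd.DataFrame({
    'company_name': ['A', 'B', 'C', 'D', 'E'],
    'trips_amount': [500, 300, 100, 60, 40],
})

top_n = 2  # hypothetical cut-off for this toy example
ranked = toy.sort_values(by='trips_amount', ascending=False).reset_index(drop=True)

# Pool every company below the top_n rows into a single 'Other' bucket.
ranked.loc[ranked.index >= top_n, 'company_name'] = 'Other'

# Sum rides per label and convert to percentages of the total.
shares = ranked.groupby('company_name')['trips_amount'].sum()
shares = (shares / shares.sum() * 100).round(1).sort_values(ascending=False)

print(shares)
```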

Testing hypotheses


  • [x] the hypothesis «On rainy Sundays the average duration of rides from Loop neighbourhood to O'Hare International Airport changes» was tested.
  • [x] explanations for "how did you form the null and alternative hypotheses" were provided
  • [x] explanation for "what criterion did you use for testing the hypotheses and why" was provided
  • [x] there are conclusions at each stage
  • [x] there's an overall conclusion

Let's start by comparing the two distributions visually:

In [83]:
good_weather_trip_durations = weather_trips.query('weather_conditions == "Good"')['duration_seconds']
bad_weather_trip_durations = weather_trips.query('weather_conditions == "Bad"')['duration_seconds']

# density histograms with KDE curves, one per weather group
sns.histplot(good_weather_trip_durations, kde = True, stat = 'density', label = 'Good', color = 'green')
sns.histplot(bad_weather_trip_durations, kde = True, stat = 'density', label = 'Bad', color = 'red')
plt.legend()
plt.title("Trips from Loop to O'Hare.\nDuration distributions by weather conditions.")
label = plt.xlabel('Trip duration, seconds')

The plot suggests that rides take noticeably longer on rainy days. Let's check that this difference is not just the result of random variation in the samples.

Let's form the hypotheses:

  • H0: the average ride duration on fine and on rainy days is the same;
  • H1: the average ride duration on fine and on rainy days is different.

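We'll use a t-test for two independent samples, since it tests whether the means of two samples differ. The two groups may have different variances, so we'll use the unequal-variance (Welch) form of the test by setting equal_var = False.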
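
For reference, the unequal-variance (Welch) form of the t-test divides the difference in sample means by a standard error built from each sample's own variance. Below is a stdlib-only sketch of the statistic and the Welch-Satterthwaite degrees of freedom. The durations are made up, and the function name welch_t is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up ride durations in seconds, for illustration only.
good = [1800.0, 2000.0, 2200.0, 1900.0, 2100.0]
bad = [2300.0, 2600.0, 2500.0, 2800.0]

t_stat, dof = welch_t(good, bad)
print(round(t_stat, 3), round(dof, 2))
```

scipy.stats.ttest_ind with equal_var = False reports this statistic together with its two-sided p-value.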

We'll choose 0.05 as the alpha level. Since we're testing only one hypothesis, there is no need to adjust the alpha level for multiple comparisons.

In [87]:
alpha = 0.05
p_val = ttest_ind(good_weather_trip_durations,
                  bad_weather_trip_durations,
                  equal_var = False).pvalue
if p_val <= alpha:
    print('P-value ({}) is below the significance level ({}). H0 should be rejected'.format(round(p_val, 2), alpha))
else:
    print('P-value ({}) is above the significance level ({}). H0 cannot be rejected'.format(round(p_val, 2), alpha))
P-value (0.0) is below the significance level (0.05). H0 should be rejected


The statistical test shows that the difference seen in the plot is statistically significant at the 5% level. The average ride duration from Loop to O'Hare on rainy days differs from the average on fine days.
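
The test indicates that the means differ, but not by how much or in which direction. Below is a sketch of a per-group summary that could accompany the p-value. The frame is made up and only borrows the column names of weather_trips.

```python
import pandas as pd

# Made-up data with the same column names as weather_trips.
toy = pd.DataFrame({
    'weather_conditions': ['Good', 'Good', 'Good', 'Bad', 'Bad'],
    'duration_seconds': [1800.0, 2000.0, 2200.0, 2400.0, 2600.0],
})

# Count, mean and median duration for each weather group.
summary = (toy.groupby('weather_conditions')['duration_seconds']
              .agg(['count', 'mean', 'median']))

# Difference between the group means, converted from seconds to minutes.
diff_minutes = (summary.loc['Bad', 'mean'] - summary.loc['Good', 'mean']) / 60

print(summary)
print(diff_minutes)
```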