You're working as an analyst for Zuber, a new ride-sharing company that's launching in Chicago. Your task is to find patterns in the available information. You want to understand passenger preferences and the impact of external factors on rides.
You'll study a database, analyze data from competitors, and test a hypothesis about the impact of weather on ride frequency.
You have these CSV files:

project_sql_result_01.csv. It contains the following data:
company_name: the taxi company name;
trips_amount: the number of rides for each company.
project_sql_result_04.csv. It contains the following data:
dropoff_location_name: the Chicago neighborhood where the ride ended;
average_trips: the average number of rides that ended in each neighborhood in November 2017.
project_sql_result_07.csv (loaded below). It contains data on rides from the Loop to O'Hare International Airport, including weather_conditions ("Good"/"Bad") and duration_seconds (ride duration in seconds).
For these datasets, the first step is to import the libraries and load the files.
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind  # t-test for two independent samples
trips_amount = pd.read_csv('project_sql_result_01.csv')       # rides per taxi company
dropoff_locations = pd.read_csv('project_sql_result_04.csv')  # average drop-offs per neighborhood
weather_trips = pd.read_csv('project_sql_result_07.csv')      # Loop to O'Hare rides with weather data
print(trips_amount.info())
trips_amount.head(5)
print(dropoff_locations.info())
dropoff_locations.head(5)
print(weather_trips.info())
weather_trips.head(5)
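Before analysing the data, it's also worth checking for fully duplicated rows and missing values. A minimal sketch on a hypothetical miniature DataFrame (the real check would run on the loaded DataFrames; the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for dropoff_locations
df = pd.DataFrame({
    'dropoff_location_name': ['Loop', 'Loop', 'River North', None],
    'average_trips': [10727.5, 10727.5, 9523.7, 100.0],
})

n_dupes = df.duplicated().sum()    # fully duplicated rows (the repeated 'Loop' row)
n_missing = df.isna().sum().sum()  # missing values across all columns (the None)
print(n_dupes, n_missing)
```

If either count is non-zero, the rows would need to be dropped or investigated before plotting.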
Select the top 10 neighbourhoods by the number of drop-offs.
top10_dropoff_locations = (
    dropoff_locations
    .sort_values(by='average_trips', ascending=False)
    .head(10)
)
top10_dropoff_locations
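As a side note, pandas offers `DataFrame.nlargest`, which expresses the same top-N selection more directly. A small sketch on toy data (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'dropoff_location_name': ['A', 'B', 'C', 'D'],
    'average_trips': [5.0, 20.0, 15.0, 1.0],
})

# Both approaches pick the 2 rows with the highest average_trips
top2_sort = df.sort_values(by='average_trips', ascending=False).head(2)
top2_nlargest = df.nlargest(2, 'average_trips')
print(top2_sort.equals(top2_nlargest))
```

With distinct values the two selections are identical; `nlargest` simply avoids sorting the whole frame.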
Plot the graphs: taxi companies by the number of rides, and the top 10 neighbourhoods by the number of drop-offs.
plt.subplots(figsize=(15, 5))

# Pie chart: share of rides per taxi company (top 10 named, the rest grouped as 'Other')
report = trips_amount.copy().sort_values(by='trips_amount', ascending=False)
top10_companies = report.head(10)['company_name'].unique()
report.loc[~report['company_name'].isin(top10_companies), 'company_name'] = 'Other'
report = report.groupby('company_name').agg({'trips_amount': 'sum'}).reset_index()
plt.subplot(1, 2, 1).pie(report['trips_amount'],
                         labels=report['company_name'],
                         autopct='%1.1f%%')
plt.title('Taxi companies by % of total trips, Nov 2017')

# Bar chart: top 10 drop-off locations
report = top10_dropoff_locations.copy()
plt.subplot(1, 2, 2).bar(report['dropoff_location_name'],
                         report['average_trips'])
plt.title('Top 10 drop-off locations by average daily trips, Nov 2017')
plt.xlabel('Neighborhood')
plt.ylabel('Avg daily trips')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Draw conclusions for each of the graphs and support them with arguments.
Conclusion:
Let's start with a visual exploration of the data:
good_weather_trip_durations = weather_trips.query('weather_conditions == "Good"')['duration_seconds']
bad_weather_trip_durations = weather_trips.query('weather_conditions == "Bad"')['duration_seconds']
sns.histplot(good_weather_trip_durations, kde=True, stat='density', label='Good', color='green')
sns.histplot(bad_weather_trip_durations, kde=True, stat='density', label='Bad', color='red')
plt.legend()
plt.title("Trips from Loop to O'Hare.\nDuration distributions by weather conditions.")
plt.xlabel('Trip duration, seconds')
plt.show()
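Numeric summaries per weather group complement the plot. A sketch on a toy stand-in for `weather_trips` (the column names follow the real dataset; the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for weather_trips
weather_trips = pd.DataFrame({
    'weather_conditions': ['Good', 'Good', 'Bad', 'Bad'],
    'duration_seconds': [1980.0, 2100.0, 2400.0, 2520.0],
})

# Mean, median and sample size per weather condition
summary = (weather_trips
           .groupby('weather_conditions')['duration_seconds']
           .agg(['mean', 'median', 'count']))
print(summary)
```

Comparing the group means this way also shows how large the difference is, not just whether it exists.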
We can observe noticeably longer ride durations in bad weather. Let's check whether this observation could be due to chance.
Let's formulate the hypotheses:

H0: the average ride duration from the Loop to O'Hare is the same on good-weather and bad-weather days.
H1: the average ride durations differ between good-weather and bad-weather days.
We'll use a t-test to test our hypotheses, as it checks whether the means of two independent samples differ.
We'll choose 0.05 as the alpha level. Since we're testing only one hypothesis, there is no need to adjust it for multiple comparisons.
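The t-test below is run with equal_var=False (Welch's variant), which is appropriate when the group variances may differ; `scipy.stats.levene` can check that assumption. A sketch on synthetic durations (the distributions and sample sizes are made up for illustration):

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(42)
good = rng.normal(2000, 300, size=500)  # synthetic "good weather" durations
bad = rng.normal(2400, 500, size=200)   # synthetic "bad weather" durations

# Levene's test: H0 is that the two groups have equal variances
stat, p = levene(good, bad)
print(p)
```

A small p-value here indicates unequal variances, which justifies preferring Welch's t-test over the pooled-variance version.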
alpha = 0.05
p_val = ttest_ind(good_weather_trip_durations,
                  bad_weather_trip_durations,
                  equal_var=False).pvalue
print(p_val)
if p_val <= alpha:
    print('P-value ({}) is below the significance level ({}). H0 should be rejected'.format(round(p_val, 4), alpha))
else:
    print('P-value ({}) is above the significance level ({}). H0 cannot be rejected'.format(round(p_val, 4), alpha))
Conclusion:
The statistical test shows that the difference observed in the plot is unlikely to be random: the average ride duration from the Loop to O'Hare on bad-weather days does differ from that on good-weather days.