You work for the online store Ice, which sells video games all over the world. User and expert reviews, genres, platforms (e.g. Xbox or PlayStation), and historical data on game sales are available from open sources. You need to identify patterns that determine whether a game succeeds or not. This will allow you to spot potential big winners and plan advertising campaigns.
In front of you is data going back to 2016. Let’s imagine that it’s December 2016 and you’re planning a campaign for 2017.
(The important thing is to get experience working with data. It doesn't really matter whether you're forecasting 2017 sales based on data from 2016 or 2027 sales based on data from 2026.)
The dataset contains the abbreviation ESRB. The Entertainment Software Rating Board evaluates a game's content and assigns an age rating such as Teen or Mature.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import figure
from scipy import stats as st
#games_df = pd.read_csv(r'C:\Users\RoG\Desktop\Yandex_Course\games.csv')
games_df = pd.read_csv(r'/datasets/games.csv')
pd.options.display.max_rows = 500
games_df.tail()
games_df.info()
games_df.describe()
print('NA:', len(games_df.query('NA_sales == 0')), 'EU:', len(games_df.query('EU_sales == 0')),
'JP:', len(games_df.query('JP_sales == 0')), 'Other:', len(games_df.query('Other_sales == 0')))
games_df.duplicated().sum()
The data contains 16,715 rows and 11 columns. Most of the missing values are in the score and rating columns (more than a third of the values are missing). The year of release also has some missing values, and there are two suspicious rows with missing names and genres. The sales columns use 0 either as a missing value or as genuine zero sales in a region. No duplicated rows were found.
games_df.columns = games_df.columns.str.lower()
Dropping the two nameless games:
noname = games_df.query('genre != genre')
games_df.drop(noname.index, inplace=True)
games_df.query('user_score == "tbd"')['year_of_release'].value_counts().sort_index()
games_df['user_score'] = games_df['user_score'].replace('tbd', np.nan)
games_df['user_score'] = games_df['user_score'].astype(np.float64)
User score is measured on a scale from 1 to 10 with decimals. I've replaced the "tbd" ("to be determined") values with NaNs: we don't expect any data update soon, especially for games released many years ago, so they are effectively NaNs already.
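A more compact alternative (a sketch, not what was used above) is `pd.to_numeric` with `errors='coerce'`, which folds the replace and the cast into a single step:

```python
import pandas as pd

# Hypothetical miniature of the user_score column: numeric strings mixed with "tbd".
scores = pd.Series(['8.2', 'tbd', '7.5', 'tbd', '9.0'])

# errors='coerce' turns anything non-numeric (including "tbd") into NaN,
# so no separate replace() call is needed.
numeric_scores = pd.to_numeric(scores, errors='coerce')

print(numeric_scores.isna().sum())  # 2: both "tbd" values became NaN
print(numeric_scores.dtype)         # float64
```

This behaves the same as replace-then-astype as long as "tbd" is the only non-numeric marker in the column.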
games_df.query('rating == rating')['rating'].value_counts()
Rating K-A is the same as E, we should change it to E:
games_df["rating"].replace({'K-A': 'E'}, inplace=True)
games_df.query('rating == rating')['rating'].value_counts()
games_df[games_df['year_of_release'].isna()]
As we can see, some of the games contain the year of release in their name; let's extract it and fix at least a couple of rows:
games_df['year_of_release'].isna().sum()
year_dict = (games_df.query('year_of_release != year_of_release')['name']
             .str.extract(r'(\d+)').reset_index().set_index('index')[0].to_dict())
games_df['year_of_release'] = games_df.apply(lambda x: year_dict[x.name] if pd.isna(x['year_of_release'])
else x['year_of_release'], axis=1)
fltr = pd.to_numeric(games_df['year_of_release']) > 1979
games_df['year_of_release'].where(fltr, 0, inplace=True)
len(games_df.query('year_of_release == 0'))
games_df['year_of_release'] = games_df['year_of_release'].astype(np.int64)
I've changed the year column's type to integer, which is more appropriate for a year: no decimals.
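Here the missing years were encoded with the sentinel 0 before the cast. As a side note, pandas' nullable `Int64` dtype can keep the integer look without a sentinel; a minimal sketch on hypothetical values:

```python
import pandas as pd
import numpy as np

# Toy year column with a missing value (hypothetical data, not the real dataset).
years = pd.Series([2006.0, np.nan, 1985.0])

# The nullable Int64 dtype stores missing values as <NA> instead of requiring
# a sentinel like 0, while still displaying years without decimals.
years_int = years.astype('Int64')
print(years_int.isna().sum())  # 1
```

The trade-off is that some older libraries expect plain `int64`, so the sentinel approach used above is also a reasonable choice.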
Let's continue investigating the 'year_of_release' column and check it for consistency, i.e. that each console has plausible years:
year_games_info = (games_df.query('year_of_release != 0').groupby(['platform'])['year_of_release']
.value_counts().sort_index().to_frame())
year_games_info.columns = ['count']
year_games_info.reset_index(level=[1], inplace=True)
year_games_info['diff'] = year_games_info['year_of_release'].diff()
year_games_info
year_games_info = year_games_info.query('diff != 1')
year_games_info[year_games_info.index.duplicated(keep=False)].query('count != 1')[['year_of_release','diff']]
The table above shows platform/year pairs where the gap from the previous release year on that platform looks suspicious.
The wrong 1985 value for DS should be deleted:
games_df.drop((games_df.loc[(games_df['year_of_release'] == 1985)
& (games_df['platform'] == "DS"), :]).index.item(), inplace=True)
Let's investigate ratings and scores:
len(games_df.query('rating != rating'))
have_rating = games_df.query('rating == rating').set_index('name')['rating'].to_dict()
games_df['rating'] = games_df.apply(lambda x: have_rating[x['name']] if x['name'] in have_rating.keys()
else x['rating'], axis=1)
len(games_df.query('rating != rating'))
I was able to fill ~400 missing ratings by looking up the same game's rating on a different platform. Let's fill NaNs the same way for the critic and user scores:
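The same dictionary-style fill can also be written with a groupby; a sketch on a hypothetical mini-catalogue (not the real data):

```python
import pandas as pd
import numpy as np

# Hypothetical multi-platform catalogue: a title rated on one platform only.
df = pd.DataFrame({
    'name': ['FIFA 14', 'FIFA 14', 'Tetris'],
    'platform': ['PS3', 'X360', 'DS'],
    'rating': ['E', np.nan, np.nan],
})

# transform('first') broadcasts the first non-null rating of each title to its
# other rows; titles with no rating anywhere (Tetris here) stay NaN.
df['rating'] = df['rating'].fillna(df.groupby('name')['rating'].transform('first'))
print(df['rating'].tolist())  # ['E', 'E', nan]
```

This avoids building an intermediate dict and a row-wise `apply`, which can be slow on a 16k-row frame.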
len(games_df.query('user_score != user_score'))
have_uscore = games_df.query('user_score == user_score').set_index('name')['user_score'].to_dict()
games_df['user_score'] = games_df.apply(lambda x: have_uscore[x['name']] if x['name'] in have_uscore.keys()
else x['user_score'], axis=1)
len(games_df.query('user_score != user_score'))
len(games_df.query('critic_score != critic_score'))
have_cscore = games_df.query('critic_score == critic_score').set_index('name')['critic_score'].to_dict()
games_df['critic_score'] = games_df.apply(lambda x: have_cscore[x['name']] if x['name'] in have_cscore.keys()
else x['critic_score'], axis=1)
len(games_df.query('critic_score != critic_score'))
About a thousand values for each score were restored.
games_df['total_sales'] = games_df['na_sales'] + games_df['eu_sales'] + games_df['jp_sales'] + games_df['other_sales']
games_df.head()
Let's fill the missing values for critic/user scores with mean values for the same year/genre:
pivot_score = games_df.pivot_table(index='genre', columns='year_of_release', values=['critic_score', 'user_score'])
pivot_score
games_df['critic_score'] = (games_df.groupby(['genre',
'year_of_release'])['critic_score']
.transform(lambda group:
group.fillna(group.mean())))
games_df['user_score'] = (games_df.groupby(['genre',
'year_of_release'])['user_score']
.transform(lambda group:
group.fillna(group.mean())))
len(games_df.query('critic_score != critic_score'))
len(games_df.query('user_score != user_score'))
Almost all the missing values were filled.
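The NaNs that survive are exactly the genre/year groups where every score is missing, since the mean of an all-NaN group is NaN. A toy sketch of the transform above (hypothetical values) makes this visible:

```python
import pandas as pd
import numpy as np

# Toy frame mirroring the genre/year mean-fill used above (hypothetical values).
df = pd.DataFrame({
    'genre': ['Action', 'Action', 'Action', 'Puzzle'],
    'year':  [2010, 2010, 2010, 2011],
    'score': [6.0, 8.0, np.nan, np.nan],
})

df['score'] = (df.groupby(['genre', 'year'])['score']
               .transform(lambda g: g.fillna(g.mean())))

# The Action/2010 NaN becomes the group mean 7.0; the lone Puzzle/2011 row has
# no neighbours, so its NaN survives - which is why a few NaNs remain.
print(df['score'].tolist())  # [6.0, 8.0, 7.0, nan]
```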
Let's investigate rating/genre relations:
games_df.pivot_table(index='genre', columns='rating', values='name',
aggfunc='count').plot(kind='bar', style='o', figsize=(15,12))
games_df.query('genre == "Sports" and rating != rating').count()
games_df.query('genre == "Puzzle" and rating != rating').count()
games_df.query('genre == "Fighting" and rating != rating').count()
The three genres above can be safely filled with most popular rating:
games_df.loc[(games_df['genre'] == 'Sports'),'rating'] = games_df.loc[(games_df['genre'] == 'Sports'),'rating'].fillna('E')
games_df.loc[(games_df['genre'] == 'Puzzle'),'rating'] = games_df.loc[(games_df['genre'] == 'Puzzle'),'rating'].fillna('E')
games_df.loc[(games_df['genre'] == 'Fighting'),'rating'] = games_df.loc[(games_df['genre'] == 'Fighting'),'rating'].fillna('T')
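The three manual fills above could be generalized to any genre with a mode-based transform; a sketch on hypothetical ratings (whether a blanket mode fill is safe for every genre remains a judgment call):

```python
import pandas as pd
import numpy as np

# Hypothetical ratings: fill each genre's missing ratings with that genre's mode.
df = pd.DataFrame({
    'genre':  ['Sports', 'Sports', 'Sports', 'Fighting', 'Fighting'],
    'rating': ['E', 'E', np.nan, 'T', np.nan],
})

def fill_with_mode(group):
    # mode() can be empty if the whole group is NaN, so guard for that case.
    modes = group.mode()
    return group.fillna(modes.iloc[0]) if not modes.empty else group

df['rating'] = df.groupby('genre')['rating'].transform(fill_with_mode)
print(df['rating'].tolist())  # ['E', 'E', 'E', 'T', 'T']
```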
The missing values in the score columns were filled with means per genre/year; ratings for some genres were filled with the most popular rating for that genre. The 'year_of_release' column was cleared of weird values and some NaNs were restored.
games_df[games_df['year_of_release'] != 0]['year_of_release'].hist(bins='auto', figsize=(12,8))
The histogram above shows the distribution of games released per year. It increases significantly after the year 2000, so it makes sense to use only these years in the following research.
new_games_df = games_df.query('year_of_release >= 2000')
Let's take a look at sales on different platforms:
sales_sum = (new_games_df.pivot_table(index='platform', values='total_sales', aggfunc='sum')
.reset_index().sort_values('total_sales', ascending=False))
sales_sum.plot(y='total_sales', x='platform', kind='bar', figsize=(12,8))
top_sales_platform = sales_sum.query('total_sales > 400')['platform'].tolist()
top_sales_platform
For these five platforms, total sales are more than twice as high as for the others, with the most successful, PS2, pushing over 1.2 billion US dollars. Let's take a closer look:
new_games_df.query('platform == @top_sales_platform')
Almost half of our data is for these 5 top sellers. Let's compare their sales per year on a graph:
top_sales_pivot = (new_games_df.query('platform == @top_sales_platform and year_of_release != 0')
.pivot_table(index=['platform','year_of_release'], values='total_sales', aggfunc='sum'))
years_list = top_sales_pivot.index.get_level_values(1).unique()
top_sales_pivot
figure(figsize=(12,8))
def build_graph(platform):
    data = top_sales_pivot.loc[platform]
    plt.plot(data.index, data.values)

for r in top_sales_platform:
    build_graph(r)
plt.legend(top_sales_platform)
plt.xticks(years_list, rotation=20)
plt.grid(True)
Some common patterns are visible on the graph: the lifespan of each popular platform is around 10 years, so platforms released before 2005, like PS2 and DS, have zero sales now. Total sales peak 4-5 years after release, so to build a forecast for 2017 we should look at platforms released in 2012 or later.
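The lifespan estimate can also be computed directly instead of being read off the graph; a sketch with hypothetical release years:

```python
import pandas as pd

# Hypothetical sales history for two platforms; lifespan = last year with
# sales minus first year with sales.
sales = pd.DataFrame({
    'platform': ['PS2', 'PS2', 'PS2', 'DS', 'DS'],
    'year_of_release': [2000, 2005, 2011, 2004, 2013],
})

lifespan = (sales.groupby('platform')['year_of_release']
            .agg(lambda y: y.max() - y.min()))
print(lifespan.to_dict())  # {'DS': 9, 'PS2': 11}
```

Run against the real frame (filtering out the 0 sentinel years first), this would give a per-platform lifespan table to back the ~10-year claim.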
release_df = games_df.query('year_of_release != 0')
release_pivot = release_df.pivot_table(index='platform',
values='year_of_release', aggfunc='min').reset_index()
rp = release_df.pivot_table(index=['platform', 'year_of_release'],
values='total_sales', aggfunc='sum').query('year_of_release != 0')
release_pivot['total_sales'] = release_pivot.apply(lambda x: rp.loc[(x['platform'],x['year_of_release'])]
['total_sales'], axis=1)
release_pivot.set_index('platform', inplace=True)
figure(figsize=(12,8))
def build_graph(platform):
    data = release_pivot.loc[platform]
    plt.plot(data['year_of_release'], data['total_sales'], 'ro')
    plt.text(data['year_of_release']+0.1, data['total_sales']+1, platform, fontsize=9)

for r in release_pivot.index.get_level_values(0).unique():
    build_graph(r)
plt.xticks(release_pivot['year_of_release'].unique(), rotation=20)
plt.grid(True)
As we can see, new consoles come out after 2000 with gaps of 2-4 years; minor consoles come out 1-2 years before new versions from the big players, Xbox and PlayStation. We should not expect new consoles in 2017.
newest_df = new_games_df.query('year_of_release >= 2012')
newest_pivot = newest_df.pivot_table(index=['platform','year_of_release'], values='total_sales', aggfunc='sum')
years_list = newest_pivot.index.get_level_values(1).unique()
newest_pivot
figure(figsize=(12,8))
def bg(platform):
    data = newest_pivot.loc[platform]
    plt.plot(data.index, data.values)

for r in newest_pivot.index.get_level_values(0).unique():
    bg(r)
plt.legend(newest_pivot.index.get_level_values(0).unique())
plt.xticks(years_list, rotation=20)
plt.grid(True)
newest_mean = newest_pivot.query('year_of_release == 2016')['total_sales'].mean()
newest_pivot.query('year_of_release == 2016 and total_sales >= @newest_mean')
The most popular consoles in 2016 are Xbox One and PlayStation 4. Although sales did not rise in 2016, we should expect the highest sales on these two consoles. 3DS also had sales slightly above average in 2016, but its lifespan is close to its end, so we should expect it to die out in a year or two; for 2017, though, it may still be profitable.
grouped_by_platf = new_games_df.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
grouped_by_platf
order_by_platf = grouped_by_platf.groupby(['platform'])['total_sales'].sum().sort_values(ascending=False).reset_index()['platform']
order_by_platf
plt.figure(figsize=(12,8))
sns.boxplot(x='platform', y='total_sales', data=grouped_by_platf, order=order_by_platf)
There is a great difference in total sales between platforms. There are five leaders (PS2, X360, PS3, Wii, DS); all the others are either minor platforms or new ones that have yet to reach their peak, like PS4.
grouped_by_mean = new_games_df.groupby(['platform', 'year_of_release'])['total_sales'].mean().reset_index()
order_by_platf_mean = grouped_by_mean.groupby(['platform'])['total_sales'].mean().sort_values(ascending=False).reset_index()['platform']
plt.figure(figsize=(12,8))
sns.boxplot(x='platform', y='total_sales', data=grouped_by_mean, order=order_by_platf_mean, showmeans=True)
The difference between mean sales is not that great, and the mean sales have different leaders: for instance, PS4 is in the lead, which suggests it will join the top sellers in a couple of years. Mean values are high for several outdated platforms like the Game Boy, which was very popular in its time. The newest platforms have yet to reach that popularity.
Let's take a look at the most popular consoles of previous generation:
new_games_df.query('name == "Call of Duty: Modern Warfare 3"')
top_sales_platform = ['X360', 'PS3', 'PC', 'Wii']
We pick a top-selling title and check which consoles it came out on; those should be the most popular of its time.
Let's have a look at how PS3 sales depend on critic and user scores:
ps3_sales = new_games_df.query('platform == "PS3"')
ax = ps3_sales.plot(y='total_sales', x='critic_score', kind='scatter', alpha=0.5, figsize=(12,8), label='critic', ylim=(0,5))
The cloud of points curves upward as the critic score rises, which means overall sales rise with critic score. Most games get 60-80 points from critics, and most of them do not reach $1 million in sales. For games that get more than 80, sales are spread wider and reach several million more easily.
Conversely, if a game gets less than 60 from critics, it is unlikely to sell well.
ax = ps3_sales.plot(y='total_sales', x='user_score', kind='scatter', alpha=0.5, figsize=(12,8), label='user', ylim=(0,5))
The majority of games get 6 to 8 points. The user score/sales relation is more scattered, with more low-scored titles among the top sellers. This suggests some of these games were not that good, but proper advertising made them top sellers.
corr_check = ps3_sales[['critic_score', 'user_score', 'na_sales', 'eu_sales', 'jp_sales', 'total_sales']]
corr_check.corr()
sns.heatmap(corr_check.corr())
The correlation map above shows that critic scores have more influence on total sales than user scores (possibly because critics often publish their reviews before a game is officially released, while users review only after they have bought it). European sales are the most influenced by critic reviews. User reviews have almost no effect on sales, with the weakest effect in North America.
top_on_all_platf= (new_games_df.query('platform == @top_sales_platform')
.groupby('name').count().sort_values(by='platform', ascending=False)).reset_index().head(20)['name'].to_list()
top_on_all_platf
Above is the list of the top 20 games released on all of the top platforms we are looking at.
Let's compare some of the top game titles sales for different consoles:
ps3_sales = new_games_df.query('platform == "PS3" and name == @top_on_all_platf')
x360_sales = new_games_df.query('platform == "X360" and name == @top_on_all_platf')
pc_sales = new_games_df.query('platform == "PC" and name == @top_on_all_platf')
wii_sales = new_games_df.query('platform == "Wii" and name == @top_on_all_platf')
ps3_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=ps3_sales['name'], autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
pc_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=pc_sales['name'], autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
x360_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=x360_sales['name'], autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
wii_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=wii_sales['name'], autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
The list and order of the top-selling games is about the same across platforms, with minor differences. But sales of the same game differ between platforms; for example, "Harry Potter and the Deathly Hallows - Part 1" had far more success on PS3 than on the other platforms.
Let's have a look on sales distributions between genres:
new_games_df.pivot_table(index='genre', values='total_sales',
aggfunc='sum').sort_values('total_sales', ascending=False).plot(kind='bar', style='o', figsize=(15,12))
new_games_df.pivot_table(index='genre', values='total_sales',
aggfunc='sum').sort_values('total_sales', ascending=False).plot(y='total_sales', kind='pie', style='o', figsize=(10,7), legend=False)
The first three genres - Action, Sports and Shooter - are the most profitable, making up half of all sales. The more popular a genre, the more recreational and entertaining it is. The least popular ones - Strategy, Puzzle and Adventure - may be less entertaining, requiring thinking, planning and riddle solving; playing them may also require more spare time.
new_games_df.pivot_table(index='genre', values=['na_sales', 'eu_sales', 'jp_sales'],
aggfunc='sum').sort_values('na_sales', ascending=False).plot(kind='bar', style='o', figsize=(15,12))
The genre distribution across the biggest markets looks even, with a couple of exceptions: RPGs are extremely popular in Japan, where they are the most profitable genre, and every genre there outsells Shooter, which is in the top 3 for NA and EU. Racing games are also unpopular in Japan.
In the analysis we should use data from the year 2000 onward, when the industry started its rapid growth and some of the rules of the game changed.
We found the top 5 best-selling platforms: PS2, X360, PS3, Wii and DS. A successful platform brings profit for about 10 years, peaking 5 years after release, after which the profit starts to fade.
Accordingly, we pay attention to platforms released no earlier than 2012. No console on the market increased its sales in 2016, but the aim should be the top 3 leaders: Xbox One, PS4 and 3DS. As 3DS sales are fading, we may expect a new console from Nintendo, but not from the other big players.
There is a correlation between critic scores and sales, so games with top critic reviews should have higher sales, especially on the European market. The same game may have different success on different consoles.
Action, Sports and Shooter are the most profitable genres, but the Japanese market is different: the most profitable genre there is RPG, and the least profitable is Shooter.
def pivot_profile(reg, agr):
    sales = reg + '_sales'
    piv_t = (new_games_df.pivot_table(index=agr, values=sales, aggfunc='sum')
             .sort_values(sales, ascending=False).head(5).reset_index())
    return piv_t
reg_lst = ['na', 'eu', 'jp']
agr_lst = ['platform', 'genre', 'rating']
agr_df = pd.DataFrame()
reg_df = pd.DataFrame()
for r in reg_lst:
    agr_df = pd.DataFrame()
    for a in agr_lst:
        agr_df = pd.concat([agr_df, pivot_profile(r, a)], axis=1)
    agr_df.columns = ['by_platform', 'pl_sales', 'by_genre', 'genre_sales', 'by_rating', 'rating_sales']
    agr_df['region'] = r
    reg_df = pd.concat([reg_df, agr_df], axis=0)
reg_df
for r in reg_lst:
    df = reg_df.query('region == @r')
    df.plot(x='by_platform', y='pl_sales', kind='barh', figsize=(7,4), fontsize=10, legend=False)
    plt.ylabel('')
    plt.title(r.upper(), fontsize=15)
Xbox 360 holds first place only in North America; in Europe it is still PS2 (about to be overtaken by PS3 in 2017), and Xbox 360 is much less popular there. The Japanese market is completely different: only Japan-made consoles are in the top 5, and Sony and Nintendo dominate it.
for r in reg_lst:
    df = reg_df.query('region == @r')
    df.plot(x='by_genre', y='genre_sales', kind='barh', figsize=(7,4),
            fontsize=10, legend=False, color="orange")
    plt.ylabel('')
    plt.title(r.upper(), fontsize=15)
The NA and EU markets are very similar, with the exception that 5th place in Europe goes to Racing rather than RPG as in NA. The Japanese market is different: the absolute winner there is RPG, Shooter is not represented at all, and the Platform genre appears in 5th position.
for r in reg_lst:
    df = reg_df.query('region == @r')
    df.plot(x='by_rating', y='rating_sales', kind='pie', figsize=(7,4),
            labels=df['by_rating'], autopct='%1.1f%%', fontsize=10, legend=False)
    plt.ylabel('')
    plt.title(r.upper(), fontsize=15)
NA and EU have similar sales by rating, with rating M slightly bigger in Europe, winning it second place. In Japan the distribution is different: almost half is E-rated games, about a third is rating T, and a tiny 12% goes to M-rated games. It looks like adults in Japan have very little time to play video games.
na_s = new_games_df['na_sales'].sum()
eu_s = new_games_df['eu_sales'].sum()
jp_s = new_games_df['jp_sales'].sum()
df = [na_s, eu_s, jp_s]
df = pd.DataFrame(df)
df.columns = ['sales']
df['region'] = reg_lst
df.plot(x='region', y='sales', kind='pie', figsize=(7,4),
labels=df['region'], autopct='%1.1f%%', fontsize=10, legend=False)
In general, the North American and European markets have only minor differences, mostly in the popular consoles. The Japanese market is unique in all the parameters reviewed and needs an individual approach. We should also keep in mind that the NA market is almost twice as big as the EU market, and the Japanese market is almost half the size of the EU market.
xone_df = new_games_df.query('platform == "XOne"')
pc_df = new_games_df.query('platform == "PC" and year_of_release >= 2013')
xone_pivot = xone_df.pivot_table(index='year_of_release', values='user_score')
pc_pivot = pc_df.pivot_table(index='year_of_release', values='user_score')
print('Xbox One std:', xone_df['user_score'].std(), '\nPC std:', pc_df['user_score'].std())
We should compare the same years for the two platforms, since PC has existed for many years while Xbox One was only released in 2013.
We consider the standard deviations for the two platforms to be the same.
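If the equal-variance assumption ever looked doubtful, Welch's test (`equal_var=False`) would be a safer default; a sketch on synthetic samples (not the real scores) shows it costs almost nothing when the variances really do match:

```python
import numpy as np
from scipy import stats as st

# Synthetic user-score-like samples (NOT the real data): same mean, same spread.
rng = np.random.default_rng(42)
a = rng.normal(7.0, 1.0, 200)
b = rng.normal(7.0, 1.0, 200)

pooled = st.ttest_ind(a, b, equal_var=True)    # Student's t-test (pooled variance)
welch = st.ttest_ind(a, b, equal_var=False)    # Welch's t-test
print(pooled.pvalue, welch.pvalue)             # nearly identical here
```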
ax = xone_pivot.plot(style='-o', grid=True)
pc_pivot.plot(style='-o', grid=True, ax=ax)
ax.legend(['XOne', 'PC'])
The user score averages look approximately the same for the two platforms; the only big difference is in Xbox One's release year.
Let's compare the sample means for user score:
$H_{0}$ - the sample means have no difference.
$H_{a}$ - the sample means are different.
alpha = .05
results = st.ttest_ind(xone_pivot['user_score'], pc_pivot['user_score'], equal_var=True)
print('p-value: ', results.pvalue)
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
We have no grounds to say that the sample means of the user score differ between Xbox One and PC; they are most probably the same.
action_df = new_games_df.query('genre == "Action"')
sports_df = new_games_df.query('genre == "Sports"')
action_pivot = action_df.pivot_table(index='year_of_release', values='user_score')
sports_pivot = sports_df.pivot_table(index='year_of_release', values='user_score')
print('Action std:', action_df['user_score'].std(), '\nSports std:', sports_df['user_score'].std())
We may consider the standard deviations equal.
ax = action_pivot.plot(style='-o', grid=True)
sports_pivot.plot(style='-o', grid=True, ax=ax)
ax.legend(['Action', 'Sports'])
The two graphs above do look different, but let's test whether the difference is just sampling error.
Let's compare the sample means for user score:
$H_{0}$ - the sample means have no difference (the one in the graph is due to sampling error).
$H_{a}$ - the sample means are different.
alpha = .05
results = st.ttest_ind(action_pivot['user_score'], sports_pivot['user_score'], equal_var=True)
print('p-value: ', results.pvalue)
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
It looks like there is no big difference between the average user scores for Action and Sports games. For the latest years the difference is bigger, but it still might be due to sampling error.
The data contained about 16,000 entries on video game sales across platforms and markets. Based on the analysis, we can identify the likely top-selling platforms and genres for each market. Sales for 3DS, Xbox One and PS4 look promising, but we should also expect a new console from Nintendo as 3DS sales fade away. The top genres are Action, Shooter, Sports and RPG. The Japanese market is different: players there prefer Japan-made consoles, and RPG tops the genre chart. We should pay attention to critic scores, as they directly correlate with sales, especially in Europe. No difference in average user scores between the compared genres or platforms was proven.