Video games sales analysis

Project Description

You work for the online store Ice, which sells video games all over the world. User and expert reviews, genres, platforms (e.g. Xbox or PlayStation), and historical data on game sales are available from open sources. You need to identify patterns that determine whether a game succeeds or not. This will allow you to spot potential big winners and plan advertising campaigns.

In front of you is data going back to 2016. Let’s imagine that it’s December 2016 and you’re planning a campaign for 2017.

(The important thing is to get experience working with data. It doesn't really matter whether you're forecasting 2017 sales based on data from 2016 or 2027 sales based on data from 2026.)

The dataset contains the abbreviation ESRB. The Entertainment Software Rating Board evaluates a game's content and assigns an age rating such as Teen or Mature.

Step 1. General Information

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import figure
from scipy import stats as st

#games_df = pd.read_csv(r'C:\Users\RoG\Desktop\Yandex_Course\games.csv')
games_df = pd.read_csv(r'/datasets/games.csv')

pd.options.display.max_rows  = 500

games_df.tail()
Out[1]:
Name Platform Year_of_Release Genre NA_sales EU_sales JP_sales Other_sales Critic_Score User_Score Rating
16710 Samurai Warriors: Sanada Maru PS3 2016.0 Action 0.00 0.00 0.01 0.0 NaN NaN NaN
16711 LMA Manager 2007 X360 2006.0 Sports 0.00 0.01 0.00 0.0 NaN NaN NaN
16712 Haitaka no Psychedelica PSV 2016.0 Adventure 0.00 0.00 0.01 0.0 NaN NaN NaN
16713 Spirits & Spells GBA 2003.0 Platform 0.01 0.00 0.00 0.0 NaN NaN NaN
16714 Winning Post 8 2016 PSV 2016.0 Simulation 0.00 0.00 0.01 0.0 NaN NaN NaN
In [2]:
games_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
Name               16713 non-null object
Platform           16715 non-null object
Year_of_Release    16446 non-null float64
Genre              16713 non-null object
NA_sales           16715 non-null float64
EU_sales           16715 non-null float64
JP_sales           16715 non-null float64
Other_sales        16715 non-null float64
Critic_Score       8137 non-null float64
User_Score         10014 non-null object
Rating             9949 non-null object
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
In [3]:
games_df.describe()
Out[3]:
Year_of_Release NA_sales EU_sales JP_sales Other_sales Critic_Score
count 16446.000000 16715.000000 16715.000000 16715.000000 16715.000000 8137.000000
mean 2006.484616 0.263377 0.145060 0.077617 0.047342 68.967679
std 5.877050 0.813604 0.503339 0.308853 0.186731 13.938165
min 1980.000000 0.000000 0.000000 0.000000 0.000000 13.000000
25% 2003.000000 0.000000 0.000000 0.000000 0.000000 60.000000
50% 2007.000000 0.080000 0.020000 0.000000 0.010000 71.000000
75% 2010.000000 0.240000 0.110000 0.040000 0.030000 79.000000
max 2016.000000 41.360000 28.960000 10.220000 10.570000 98.000000
In [4]:
print('NA:', len(games_df.query('NA_sales == 0')), 'EU:', len(games_df.query('EU_sales == 0')),
      'JP:', len(games_df.query('JP_sales == 0')), 'Other:', len(games_df.query('Other_sales == 0')))
NA: 4508 EU: 5870 JP: 10514 Other: 6601
In [5]:
games_df.duplicated().sum()
Out[5]:
0

Conclusion

The data contains 16715 rows, 11 columns. Most of the missing values are in columns with scores and rating (more than third of values are missing). Also year of release has some missing values, and two suspicious rows with missing names and genres. Columns with information about sales have 0's as missing values or zero sales in region. No duplicated row were found.


Step 2. Preprocessing.

In [6]:
games_df.columns = games_df.columns.str.lower()

Dropping the two noname games:

In [7]:
noname = games_df.query('genre != genre')
games_df.drop(noname.index, inplace=True)
In [8]:
games_df.query('user_score == "tbd"')['year_of_release'].value_counts().sort_index()
Out[8]:
1997.0      1
1999.0      8
2000.0     43
2001.0     82
2002.0    192
2003.0     95
2004.0    107
2005.0    121
2006.0    124
2007.0    192
2008.0    326
2009.0    405
2010.0    331
2011.0    217
2012.0     24
2013.0     15
2014.0     21
2015.0     38
2016.0     34
Name: year_of_release, dtype: int64
In [9]:
games_df['user_score'] = games_df['user_score'].replace('tbd',np.NaN)
games_df['user_score'] = games_df['user_score'].astype(np.float64)

User score is measured in numbers 1 to 10 with decimals. I've replaced the "tbd" values with NaNs as we don't look forward to any data update soon, especially from the games released many years ago, so they are basically all Nans.

In [10]:
games_df.query('rating == rating')['rating'].value_counts()
Out[10]:
E       3990
T       2961
M       1563
E10+    1420
EC         8
K-A        3
RP         3
AO         1
Name: rating, dtype: int64

Rating K-A is the same as E, we should change it to E:

In [11]:
games_df["rating"].replace({'K-A': 'E'}, inplace=True)
games_df.query('rating == rating')['rating'].value_counts()
Out[11]:
E       3993
T       2961
M       1563
E10+    1420
EC         8
RP         3
AO         1
Name: rating, dtype: int64
In [12]:
games_df.query('year_of_release == "NaN"')
Out[12]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating
183 Madden NFL 2004 PS2 NaN Sports 4.26 0.26 0.01 0.71 94.0 8.5 E
377 FIFA Soccer 2004 PS2 NaN Sports 0.59 2.36 0.04 0.51 84.0 6.4 E
456 LEGO Batman: The Videogame Wii NaN Action 1.80 0.97 0.00 0.29 74.0 7.9 E10+
475 wwe Smackdown vs. Raw 2006 PS2 NaN Fighting 1.57 1.02 0.00 0.41 NaN NaN NaN
609 Space Invaders 2600 NaN Shooter 2.36 0.14 0.00 0.03 NaN NaN NaN
627 Rock Band X360 NaN Misc 1.93 0.33 0.00 0.21 92.0 8.2 T
657 Frogger's Adventures: Temple of the Frog GBA NaN Adventure 2.15 0.18 0.00 0.07 73.0 NaN E
678 LEGO Indiana Jones: The Original Adventures Wii NaN Action 1.51 0.61 0.00 0.21 78.0 6.6 E10+
719 Call of Duty 3 Wii NaN Shooter 1.17 0.84 0.00 0.23 69.0 6.7 T
805 Rock Band Wii NaN Misc 1.33 0.56 0.00 0.20 80.0 6.3 T
1131 Call of Duty: Black Ops PC NaN Shooter 0.58 0.81 0.00 0.23 81.0 5.2 M
1142 Rock Band PS3 NaN Misc 0.99 0.41 0.00 0.22 92.0 8.4 T
1301 Triple Play 99 PS NaN Sports 0.81 0.55 0.00 0.10 NaN NaN NaN
1506 Adventure 2600 NaN Adventure 1.21 0.08 0.00 0.01 NaN NaN NaN
1538 LEGO Batman: The Videogame PSP NaN Action 0.57 0.44 0.00 0.27 73.0 7.4 E10+
1585 Combat 2600 NaN Action 1.17 0.07 0.00 0.01 NaN NaN NaN
1609 LEGO Harry Potter: Years 5-7 Wii NaN Action 0.69 0.42 0.00 0.12 76.0 7.8 E10+
1650 NASCAR Thunder 2003 PS2 NaN Racing 0.60 0.46 0.00 0.16 84.0 8.7 E
1699 Hitman 2: Silent Assassin XB NaN Action 0.76 0.38 0.00 0.05 84.0 8.0 M
1840 Rock Band PS2 NaN Misc 0.71 0.06 0.00 0.35 82.0 6.8 T
1984 Legacy of Kain: Soul Reaver PS NaN Action 0.58 0.40 0.00 0.07 91.0 9.0 T
2010 Donkey Kong Land III GB NaN Platform 0.68 0.31 0.00 0.04 NaN NaN NaN
2106 Air-Sea Battle 2600 NaN Shooter 0.91 0.06 0.00 0.01 NaN NaN NaN
2108 Suikoden III PS2 NaN Role-Playing 0.29 0.23 0.38 0.08 86.0 7.7 T
2132 LEGO Harry Potter: Years 5-7 X360 NaN Action 0.51 0.37 0.00 0.09 77.0 7.9 E10+
2157 Wheel of Fortune PS2 NaN Misc 0.47 0.36 0.00 0.12 NaN NaN E
2169 Yakuza 4 PS3 NaN Action 0.15 0.13 0.63 0.05 78.0 8.0 M
2273 LEGO Harry Potter: Years 5-7 PS3 NaN Action 0.36 0.41 0.00 0.15 76.0 8.3 E10+
2281 Namco Museum XB NaN Misc 0.77 0.11 0.00 0.04 59.0 NaN E
2361 Rhythm Heaven Wii NaN Misc 0.11 0.00 0.77 0.01 NaN NaN NaN
2453 The Lord of the Rings: War in the North X360 NaN Action 0.52 0.24 0.00 0.08 61.0 7.4 M
2479 Madden NFL 07 PSP NaN Sports 0.77 0.03 0.00 0.04 78.0 6.6 E
2492 MLB SlugFest 20-03 PS2 NaN Sports 0.41 0.32 0.00 0.11 77.0 8.2 E
2522 The Lord of the Rings: War in the North PS3 NaN Action 0.25 0.42 0.01 0.13 63.0 7.0 M
2536 Shaun White Snowboarding X360 NaN Sports 0.48 0.25 0.00 0.08 60.0 7.6 T
2572 PES 2009: Pro Evolution Soccer PSP NaN Sports 0.04 0.33 0.26 0.17 NaN NaN NaN
2773 WarioWare: Twisted! GBA NaN Puzzle 0.16 0.06 0.50 0.02 NaN NaN NaN
2849 Madden NFL 11 Wii NaN Sports 0.68 0.00 0.00 0.04 75.0 5.4 E
2969 Test Drive Unlimited 2 X360 NaN Racing 0.30 0.31 0.00 0.07 68.0 6.4 T
3024 The Chronicles of Narnia: The Lion, The Witch ... GBA NaN Action 0.48 0.18 0.00 0.01 66.0 6.8 E
3081 LEGO Harry Potter: Years 5-7 DS NaN Action 0.34 0.25 0.00 0.07 69.0 NaN E10+
3187 Monster Hunter 2 PS2 NaN Role-Playing 0.00 0.00 0.63 0.00 NaN NaN NaN
3223 Metal Gear Solid 2: Substance XB NaN Action 0.38 0.22 0.00 0.03 87.0 8.5 M
3233 Test Drive Unlimited 2 PS3 NaN Racing 0.16 0.34 0.01 0.12 70.0 6.1 T
3289 Advance Wars: Days of Ruin DS NaN Strategy 0.43 0.12 0.00 0.05 86.0 8.7 E10+
3352 The Golden Compass Wii NaN Action 0.26 0.28 0.00 0.07 35.0 6.8 E10+
3413 Madden NFL 06 X360 NaN Sports 0.54 0.00 0.01 0.03 74.0 4.9 E
3459 NASCAR: Dirt to Daytona PS2 NaN Racing 0.28 0.22 0.00 0.07 84.0 8.8 E
3486 Madden NFL 2002 XB NaN Sports 0.53 0.02 0.00 0.03 90.0 8.1 E
3704 Def Jam: Fight for NY XB NaN Fighting 0.43 0.10 0.00 0.02 84.0 8.5 M
3739 NBA Street Vol. 2 GC NaN Sports 0.41 0.11 0.00 0.01 88.0 8.1 E
3883 Fishing Derby 2600 NaN Sports 0.48 0.03 0.00 0.01 NaN NaN NaN
3997 Wet X360 NaN Shooter 0.23 0.21 0.01 0.05 69.0 7.3 M
4127 Sonic the Hedgehog PS3 NaN Platform 0.00 0.48 0.00 0.00 43.0 4.1 E10+
4132 Karate 2600 NaN Fighting 0.44 0.03 0.00 0.00 NaN NaN NaN
4205 Tiger Woods PGA Tour 07 Wii NaN Sports 0.43 0.00 0.00 0.04 71.0 6.9 E
4220 Circus Atari 2600 NaN Action 0.43 0.03 0.00 0.00 NaN NaN NaN
4294 The Chronicles of Riddick: Escape from Butcher... XB NaN Shooter 0.32 0.11 0.00 0.02 89.0 8.9 M
4337 Maze Craze: A Game of Cops 'n Robbers 2600 NaN Action 0.42 0.02 0.00 0.00 NaN NaN NaN
4373 Silent Hill: Homecoming X360 NaN Action 0.25 0.15 0.00 0.04 70.0 6.9 M
4437 Super Breakout 2600 NaN Puzzle 0.41 0.03 0.00 0.00 NaN NaN NaN
4445 Robert Ludlum's The Bourne Conspiracy X360 NaN Action 0.26 0.13 0.00 0.04 71.0 7.2 T
4635 NHL Slapshot Wii NaN Sports 0.39 0.00 0.00 0.02 76.0 8.1 E
4648 TERA PC NaN Role-Playing 0.25 0.12 0.00 0.05 77.0 6.8 M
4775 NFL GameDay 2003 PS2 NaN Sports 0.20 0.15 0.00 0.05 60.0 NaN E
4818 LEGO Harry Potter: Years 5-7 3DS NaN Action 0.18 0.19 0.00 0.03 71.0 6.5 E10+
4848 Harvest Moon: Save the Homeland PS2 NaN Simulation 0.19 0.15 0.00 0.05 76.0 8.8 E
4945 Robert Ludlum's The Bourne Conspiracy PS3 NaN Action 0.18 0.14 0.00 0.06 70.0 7.6 T
4959 Silent Hill: Homecoming PS3 NaN Action 0.18 0.14 0.00 0.06 71.0 6.9 M
5041 Hangman 2600 NaN Puzzle 0.35 0.02 0.00 0.00 NaN NaN NaN
5070 The Golden Compass PSP NaN Action 0.11 0.16 0.00 0.10 28.0 4.1 E10+
5156 NBA Live 2003 XB NaN Sports 0.31 0.04 0.00 0.01 82.0 8.8 E
5194 Cubix Robots for Everyone: Clash 'n' Bash GBA NaN Action 0.26 0.10 0.00 0.01 NaN NaN NaN
5294 Dragon Ball Z: Budokai Tenkaichi 2 (JP sales) Wii NaN Action 0.15 0.05 0.14 0.01 NaN NaN NaN
5326 Tropico 4 X360 NaN Strategy 0.20 0.12 0.00 0.03 77.0 7.5 T
5336 Tomb Raider (2013) PC NaN Action 0.06 0.25 0.00 0.04 86.0 8.5 M
5443 Bejeweled 3 DS NaN Puzzle 0.29 0.02 0.00 0.02 75.0 NaN E
5492 Custom Robo N64 NaN Role-Playing 0.00 0.00 0.29 0.04 NaN NaN NaN
5588 Final Fantasy XI PS2 NaN Role-Playing 0.08 0.06 0.15 0.02 85.0 6.9 T
5624 Singularity X360 NaN Shooter 0.24 0.05 0.00 0.02 76.0 7.7 M
5642 Dragster 2600 NaN Racing 0.30 0.02 0.00 0.00 NaN NaN NaN
5655 All-Star Baseball 2005 PS2 NaN Sports 0.16 0.12 0.00 0.04 72.0 8.6 E
5769 Star Wars Jedi Knight II: Jedi Outcast XB NaN Shooter 0.23 0.07 0.00 0.01 81.0 6.8 T
5800 Slot Machine 2600 NaN Action 0.29 0.02 0.00 0.00 NaN NaN NaN
5857 The Dukes of Hazzard II: Daisy Dukes It Out PS NaN Racing 0.17 0.11 0.00 0.02 53.0 NaN E
5874 Harvest Moon: The Tale of Two Towns 3DS NaN Simulation 0.28 0.00 0.00 0.02 NaN NaN NaN
5889 NBA Live 2003 GC NaN Sports 0.23 0.06 0.00 0.01 82.0 8.2 E
5932 Shrek the Third DS NaN Action 0.26 0.01 0.00 0.02 70.0 6.5 E
6019 Nicktoons: Battle for Volcano Island GC NaN Action 0.22 0.06 0.00 0.01 NaN 7.0 E
6101 Haven: Call of the King PS2 NaN Platform 0.14 0.11 0.00 0.04 69.0 6.9 T
6133 Unreal Championship 2: The Liandri Conflict XB NaN Shooter 0.22 0.05 0.00 0.01 85.0 8.2 M
6180 The Chronicles of Narnia: The Lion, The Witch ... GC NaN Action 0.22 0.06 0.00 0.01 71.0 NaN T
6199 Pac-Man Fever GC NaN Misc 0.21 0.06 0.00 0.01 54.0 6.3 E
6255 The Legend of Zelda: The Minish Cap(weekly JP ... GBA NaN Action 0.00 0.00 0.27 0.01 NaN NaN NaN
6266 Indy 500 2600 NaN Racing 0.26 0.01 0.00 0.00 NaN NaN NaN
6293 Disgaea 3: Absence of Detention PSV NaN Role-Playing 0.11 0.05 0.07 0.03 78.0 7.6 T
6299 Flag Capture 2600 NaN Action 0.25 0.02 0.00 0.00 NaN NaN NaN
6342 Gun X360 NaN Shooter 0.24 0.01 0.00 0.02 75.0 7.5 M
6356 Rock Revolution PS3 NaN Misc 0.25 0.00 0.00 0.02 42.0 6.6 T
6476 LEGO Harry Potter: Years 5-7 PSP NaN Action 0.10 0.10 0.00 0.06 NaN NaN E10+
6624 College Hoops 2K6 PS2 NaN Sports 0.12 0.10 0.00 0.03 77.0 7.3 E
6636 Jonah Lomu Rugby Challenge PS3 NaN Sports 0.00 0.19 0.00 0.06 64.0 NaN E
6773 Mega Man X Collection PS2 NaN Misc 0.12 0.09 0.00 0.03 73.0 8.8 E
6818 BioShock 2 PC NaN Shooter 0.02 0.19 0.00 0.04 88.0 8.0 M
6861 Singularity PS3 NaN Shooter 0.17 0.04 0.01 0.03 77.0 7.7 M
6899 Danganronpa: Trigger Happy Havoc PSP NaN Misc 0.00 0.00 0.24 0.00 NaN NaN NaN
6954 Jet X20 PS2 NaN Racing 0.11 0.09 0.00 0.03 NaN NaN NaN
6999 Tony Hawk's Downhill Jam Wii NaN Sports 0.21 0.00 0.00 0.02 69.0 6.2 E10+
7008 Tribes: Aerial Assault PS2 NaN Shooter 0.11 0.09 0.00 0.03 73.0 8.8 T
7108 Big Beach Sports 2 Wii NaN Sports 0.09 0.11 0.00 0.02 NaN NaN E
7210 LEGO Harry Potter: Years 5-7 PC NaN Action 0.05 0.14 0.00 0.03 80.0 8.5 E10+
7332 Yu Yu Hakusho: Dark Tournament PS2 NaN Fighting 0.10 0.08 0.00 0.03 56.0 8.5 T
7353 Ghostbusters II 2600 NaN Action 0.20 0.01 0.00 0.00 NaN NaN NaN
7354 Breakaway IV 2600 NaN Puzzle 0.20 0.01 0.00 0.00 NaN NaN NaN
7367 Robotech: Battlecry XB NaN Shooter 0.16 0.05 0.00 0.01 74.0 7.0 T
7387 Move Fitness PS3 NaN Sports 0.00 0.16 0.00 0.05 NaN NaN NaN
7391 Valkyria Chronicles III: Unrecorded Chronicles PSP NaN Strategy 0.00 0.00 0.21 0.00 NaN NaN NaN
7406 DanceDanceRevolution II Wii NaN Misc 0.20 0.00 0.00 0.01 NaN NaN E10+
7584 WRC: FIA World Rally Championship PS3 NaN Racing 0.00 0.14 0.01 0.04 68.0 4.5 NaN
7605 Famista 64 N64 NaN Sports 0.00 0.00 0.17 0.03 NaN NaN NaN
7724 Dead Space 3 PC NaN Action 0.02 0.16 0.00 0.02 78.0 6.0 M
7735 Test Drive Unlimited 2 PC NaN Racing 0.05 0.11 0.00 0.03 72.0 5.7 T
7848 Pet Zombies 3DS NaN Simulation 0.18 0.00 0.00 0.01 NaN NaN T
8023 Star Trek: Legacy X360 NaN Simulation 0.14 0.02 0.00 0.01 64.0 5.5 E10+
8067 Backbreaker X360 NaN Sports 0.17 0.00 0.00 0.01 54.0 7.6 E
8150 Twisted Metal: Small Brawl PS NaN Action 0.10 0.07 0.00 0.01 51.0 7.6 T
8151 Otomedius Excellent X360 NaN Shooter 0.13 0.00 0.04 0.01 48.0 8.1 T
8197 NBA Starting Five PS2 NaN Sports 0.09 0.07 0.00 0.02 53.0 7.3 E
8212 Teen Titans GBA NaN Action 0.13 0.05 0.00 0.00 61.0 NaN E10+
8250 Trauma Team Wii NaN Simulation 0.14 0.00 0.02 0.01 82.0 8.8 T
8260 Backbreaker PS3 NaN Sports 0.16 0.00 0.00 0.01 58.0 7.0 E
8319 James Cameron's Dark Angel PS2 NaN Action 0.08 0.07 0.00 0.02 48.0 6.5 T
8364 Sword of the Samurai PS2 NaN Fighting 0.00 0.00 0.17 0.00 NaN NaN NaN
8500 Splatterhouse PS3 NaN Action 0.13 0.02 0.00 0.02 59.0 7.7 M
8547 Alone in the Dark: The New Nightmare PS NaN Adventure 0.09 0.06 0.00 0.01 77.0 8.1 M
8632 Vegas Party Wii NaN Misc 0.15 0.00 0.00 0.01 33.0 NaN T
8680 Jurassic Park: The Game X360 NaN Action 0.15 0.00 0.00 0.01 60.0 3.6 T
8740 Home Run 2600 NaN Sports 0.14 0.01 0.00 0.00 NaN NaN NaN
8884 eJay Clubworld PS2 NaN Misc 0.07 0.06 0.00 0.02 69.0 8.6 E
8918 All-Star Baseball 2005 XB NaN Sports 0.11 0.03 0.00 0.01 75.0 8.8 E
9043 Bejeweled 3 PS3 NaN Puzzle 0.13 0.00 0.00 0.02 76.0 8.2 E
9119 Our House Party! Wii NaN Simulation 0.13 0.00 0.00 0.01 NaN NaN NaN
9151 WCW Backstage Assault N64 NaN Action 0.11 0.03 0.00 0.00 NaN NaN NaN
9218 Bejeweled 3 X360 NaN Puzzle 0.13 0.00 0.00 0.01 78.0 8.4 E
9293 Disney's Cinderella: Magical Dreams GBA NaN Platform 0.10 0.04 0.00 0.00 69.0 NaN E
9380 Transworld Surf XB NaN Sports 0.10 0.03 0.00 0.00 76.0 NaN T
9485 Street Fighter IV PC NaN Fighting 0.07 0.05 0.00 0.02 91.0 8.0 T
9516 Nintendo Puzzle Collection GC NaN Puzzle 0.00 0.00 0.13 0.00 NaN NaN NaN
9664 Charm Girls Club: My Fashion Mall DS NaN Simulation 0.12 0.00 0.00 0.01 NaN NaN E
9703 Record of Agarest War Zero PS3 NaN Strategy 0.09 0.00 0.03 0.01 50.0 6.7 T
9706 Rocksmith PC NaN Misc 0.06 0.04 0.00 0.01 78.0 8.1 T
9738 Super Robot Wars OG Saga: Masou Kishin II - Re... PSP NaN Strategy 0.00 0.00 0.12 0.00 NaN NaN NaN
9761 Saru! Get You! Million Monkeys PS2 NaN Platform 0.00 0.00 0.12 0.00 NaN NaN NaN
9817 Street Hoops GC NaN Sports 0.09 0.02 0.00 0.00 56.0 7.3 T
9827 WRC: FIA World Rally Championship X360 NaN Racing 0.00 0.10 0.00 0.02 66.0 6.0 E10+
9830 Godzilla: Destroy All Monsters Melee XB NaN Fighting 0.09 0.03 0.00 0.00 71.0 6.8 T
9855 The Daring Game for Girls DS NaN Adventure 0.11 0.00 0.00 0.01 NaN NaN E
9876 Major League Baseball 2K6 PSP NaN Sports 0.11 0.00 0.00 0.01 69.0 NaN E
10050 Star Trek: Conquest PS2 NaN Strategy 0.06 0.04 0.00 0.01 54.0 9.1 E
10258 GiFTPiA GC NaN Role-Playing 0.00 0.00 0.11 0.00 NaN NaN NaN
10317 Happy Feet Two PS3 NaN Action 0.09 0.00 0.00 0.01 NaN NaN NaN
10465 Disney's Chicken Little: Ace In Action Wii NaN Shooter 0.09 0.00 0.00 0.01 72.0 NaN E10+
10486 Atsumare! Power Pro Kun no DS Koushien DS NaN Sports 0.00 0.00 0.10 0.00 NaN NaN NaN
10612 My Healthy Cooking Coach DS NaN Simulation 0.09 0.00 0.00 0.01 NaN NaN E
10686 Happy Feet Two DS NaN Action 0.08 0.01 0.00 0.01 NaN NaN E
10764 Luminous Arc 2 (JP sales) DS NaN Role-Playing 0.00 0.00 0.10 0.00 NaN NaN NaN
10790 Happy Feet Two X360 NaN Action 0.08 0.01 0.00 0.01 NaN NaN NaN
10837 Egg Monster Hero DS NaN Role-Playing 0.00 0.00 0.09 0.00 NaN NaN E
10993 The Daring Game for Girls Wii NaN Adventure 0.09 0.00 0.00 0.01 NaN NaN E
11000 Demon Chaos PS2 NaN Action 0.00 0.00 0.09 0.00 NaN NaN NaN
11016 Samurai Shodown Anthology PSP NaN Fighting 0.08 0.00 0.00 0.01 NaN NaN T
11092 Action Man-Operation Extreme PS NaN Action 0.05 0.03 0.00 0.01 NaN NaN NaN
11119 Super Puzzle Fighter II GBA NaN Puzzle 0.06 0.02 0.00 0.00 83.0 9.5 E
11338 Charm Girls Club: My Fashion Show DS NaN Simulation 0.08 0.00 0.00 0.01 NaN NaN E
11377 Face Racers: Photo Finish 3DS NaN Racing 0.08 0.00 0.00 0.01 NaN NaN E
11433 Zero: Tsukihami no Kamen Wii NaN Action 0.00 0.00 0.08 0.00 NaN NaN NaN
11455 The Hidden 3DS NaN Adventure 0.08 0.00 0.00 0.01 NaN 4.2 E10+
11550 Get Fit with Mel B X360 NaN Sports 0.00 0.06 0.00 0.01 57.0 NaN E
11565 Rock Revolution Wii NaN Misc 0.07 0.00 0.00 0.01 46.0 NaN T
11622 Happy Feet Two Wii NaN Action 0.06 0.01 0.00 0.01 NaN NaN NaN
11790 Mega Man Battle Network: Operation Shooting Star DS NaN Role-Playing 0.00 0.00 0.07 0.00 NaN NaN NaN
11832 Smashing Drive GC NaN Racing 0.06 0.01 0.00 0.00 47.0 5.2 T
11925 Port Royale 3 PS3 NaN Simulation 0.06 0.00 0.00 0.01 NaN 8.9 T
11943 Dream Trigger 3D 3DS NaN Shooter 0.04 0.02 0.00 0.01 54.0 6.3 E
11954 Dead Island: Riptide PC NaN Action 0.00 0.07 0.00 0.01 61.0 6.0 M
11957 Yoostar on MTV X360 NaN Misc 0.07 0.00 0.00 0.01 49.0 NaN T
12003 Tornado DS NaN Action 0.07 0.00 0.00 0.00 44.0 NaN E
12011 McFarlane's Evil Prophecy PS2 NaN Action 0.03 0.03 0.00 0.01 34.0 3.2 M
12046 Drake of the 99 Dragons XB NaN Shooter 0.05 0.01 0.00 0.00 22.0 1.8 T
12209 Build-A-Bear Workshop: Friendship Valley Wii NaN Misc 0.06 0.00 0.00 0.00 NaN NaN E
12217 Rayman Arena XB NaN Racing 0.05 0.01 0.00 0.00 50.0 NaN E
12221 Port Royale 3 X360 NaN Simulation 0.06 0.00 0.00 0.01 56.0 4.5 T
12235 National Geographic Challenge! X360 NaN Misc 0.06 0.00 0.00 0.00 NaN NaN NaN
12241 Alex Rider: Stormbreaker DS NaN Action 0.06 0.00 0.00 0.00 48.0 6.1 T
12524 Chou Soujuu Mecha MG DS NaN Simulation 0.00 0.00 0.06 0.00 NaN NaN NaN
12568 Prinny: Can I Really Be The Hero? (US sales) PSP NaN Action 0.06 0.00 0.00 0.00 NaN NaN NaN
12584 Combat Elite: WWII Paratroopers PS2 NaN Shooter 0.03 0.02 0.00 0.01 54.0 NaN T
12718 Flip's Twisted World Wii NaN Platform 0.05 0.00 0.00 0.00 47.0 NaN E
12734 Mobile Ops: The One Year War X360 NaN Simulation 0.00 0.00 0.06 0.00 NaN NaN NaN
12784 Tom Clancy's Rainbow Six: Critical Hour XB NaN Shooter 0.04 0.01 0.00 0.00 54.0 3.6 M
12880 Jewel Link Chronicles: Mountains of Madness DS NaN Puzzle 0.00 0.05 0.00 0.01 NaN NaN E10+
12896 Captain America: Super Soldier Wii NaN Action 0.05 0.00 0.00 0.00 57.0 6.0 T
12959 Mountain Bike Adrenaline PS2 NaN Sports 0.03 0.02 0.00 0.01 NaN NaN E
12970 Drill Dozer GBA NaN Platform 0.04 0.01 0.00 0.00 81.0 8.3 E
13060 Captain America: Super Soldier DS NaN Action 0.05 0.00 0.00 0.00 NaN NaN NaN
13142 GRID DS NaN Racing 0.04 0.00 0.00 0.00 79.0 8.1 E
13195 Tour de France 2011 X360 NaN Racing 0.00 0.04 0.00 0.01 46.0 7.6 NaN
13289 Reader Rabbit 2nd Grade Wii NaN Misc 0.04 0.00 0.00 0.00 NaN NaN E
13317 Monster Hunter Frontier Online PS3 NaN Role-Playing 0.00 0.00 0.05 0.00 NaN NaN NaN
13564 RollerCoaster Tycoon PC NaN Strategy 0.02 0.02 0.00 0.01 NaN NaN E
13609 Battle vs. Chess PS3 NaN Misc 0.00 0.03 0.00 0.01 NaN NaN T
13647 The History Channel: Great Battles - Medieval PS3 NaN Strategy 0.00 0.03 0.00 0.01 NaN NaN NaN
13672 Clockwork Empires PC NaN Strategy 0.00 0.04 0.00 0.00 58.0 3.8 RP
13711 B.L.U.E.: Legend of Water PS NaN Adventure 0.00 0.00 0.04 0.00 NaN NaN NaN
13781 GRID PC NaN Racing 0.00 0.03 0.00 0.01 87.0 7.9 E
13792 NHL Hitz Pro GC NaN Sports 0.03 0.01 0.00 0.00 81.0 7.8 E
13874 Luxor: Pharaoh's Challenge Wii NaN Puzzle 0.01 0.02 0.00 0.00 NaN NaN E
13929 Sega Rally 2006 PS2 NaN Racing 0.00 0.00 0.04 0.00 NaN NaN NaN
14105 Half-Minute Hero 2 PSP NaN Role-Playing 0.00 0.00 0.04 0.00 NaN NaN NaN
14126 Housekeeping DS NaN Action 0.00 0.00 0.04 0.00 NaN NaN NaN
14141 Major League Baseball 2K8 PSP NaN Sports 0.03 0.00 0.00 0.00 63.0 NaN E
14210 Sabre Wulf GBA NaN Platform 0.02 0.01 0.00 0.00 75.0 8.0 E
14252 Swords Wii NaN Fighting 0.03 0.00 0.00 0.00 NaN NaN T
14325 Beyond the Labyrinth 3DS NaN Role-Playing 0.00 0.00 0.03 0.00 NaN NaN NaN
14350 Bikkuriman Daijiten DS NaN Misc 0.00 0.00 0.03 0.00 NaN NaN NaN
14369 Majesty 2: The Fantasy Kingdom Sim X360 NaN Simulation 0.03 0.00 0.00 0.00 NaN NaN NaN
14449 Fullmetal Alchemist: Brotherhood PSP NaN Action 0.00 0.00 0.03 0.00 NaN NaN NaN
14546 Combat Elite: WWII Paratroopers XB NaN Shooter 0.02 0.01 0.00 0.00 56.0 NaN T
14597 Samurai Spirits: Tenkaichi Kenkakuden PS2 NaN Fighting 0.00 0.00 0.03 0.00 NaN NaN NaN
14685 World of Tanks X360 NaN Shooter 0.00 0.03 0.00 0.00 NaN 5.2 NaN
14746 Battle vs. Chess PC NaN Misc 0.00 0.02 0.00 0.00 69.0 5.9 T
14749 Tom and Jerry in War of the Whiskers XB NaN Fighting 0.02 0.01 0.00 0.00 NaN NaN NaN
14769 Super Duper Sumos GBA NaN Action 0.02 0.01 0.00 0.00 57.0 NaN E
14931 The King of Fighters: Maximum Impact - Maniax XB NaN Fighting 0.02 0.01 0.00 0.00 NaN NaN NaN
14950 Combat Wings: The Great Battles of WWII Wii NaN Simulation 0.02 0.00 0.00 0.00 NaN NaN NaN
15006 Tube Slider GC NaN Racing 0.02 0.00 0.00 0.00 62.0 NaN E
15022 Umineko no Naku Koro ni San: Shinjitsu to Gens... PS3 NaN Adventure 0.00 0.00 0.02 0.00 NaN NaN NaN
15079 Payout Poker & Casino PSP NaN Misc 0.02 0.00 0.00 0.00 NaN NaN NaN
15081 Wii de Asobu: Metroid Prime Wii NaN Shooter 0.00 0.00 0.02 0.00 NaN NaN NaN
15119 Legacy of Ys: Books I & II DS NaN Role-Playing 0.02 0.00 0.00 0.00 NaN NaN NaN
15267 Saint Wii NaN Shooter 0.02 0.00 0.00 0.00 NaN NaN E10+
15292 Steal Princess DS NaN Platform 0.02 0.00 0.00 0.00 60.0 NaN E10+
15338 Mario Tennis 3DS NaN Sports 0.00 0.00 0.02 0.00 NaN NaN NaN
15535 Runaway: A Twist of Fate DS NaN Adventure 0.00 0.02 0.00 0.00 66.0 NaN T
15557 Yu-Gi-Oh! 5D's Wheelie Breakers (JP sales) Wii NaN Racing 0.00 0.00 0.02 0.00 NaN NaN NaN
15675 Cabela's Alaskan Adventure PS2 NaN Sports 0.01 0.01 0.00 0.00 NaN NaN NaN
15704 Writing and Speaking Beautiful Japanese DS DS NaN Misc 0.00 0.00 0.02 0.00 NaN NaN NaN
15754 Virtua Quest PS2 NaN Role-Playing 0.01 0.01 0.00 0.00 53.0 7.6 T
15795 Shonen Jump's Yu-Gi-Oh! GX Card Almanac DS NaN Misc 0.00 0.00 0.02 0.00 NaN NaN NaN
15816 Without Warning XB NaN Shooter 0.01 0.00 0.00 0.00 45.0 2.0 M
15953 PDC World Championship Darts 2008 DS NaN Sports 0.01 0.00 0.00 0.00 NaN NaN NaN
15966 Dinotopia: The Sunstone Odyssey GC NaN Action 0.01 0.00 0.00 0.00 50.0 NaN T
15998 Jet Impulse DS NaN Simulation 0.00 0.00 0.02 0.00 NaN NaN NaN
16017 Dream Dancer DS NaN Misc 0.01 0.00 0.00 0.00 NaN NaN E
16059 Dance! It's Your Stage Wii NaN Misc 0.00 0.01 0.00 0.00 NaN NaN NaN
16079 Football Manager 2007 X360 NaN Sports 0.00 0.01 0.00 0.00 NaN NaN NaN
16080 Ferrari: The Race Experience Wii NaN Racing 0.00 0.01 0.00 0.00 54.0 NaN E
16157 Aquaman: Battle for Atlantis XB NaN Action 0.01 0.00 0.00 0.00 26.0 2.7 T
16180 WRC: FIA World Rally Championship PC NaN Racing 0.00 0.01 0.00 0.00 65.0 NaN RP
16277 Homeworld Remastered Collection PC NaN Strategy 0.00 0.01 0.00 0.00 86.0 8.2 E10+
16288 Shorts DS NaN Platform 0.01 0.00 0.00 0.00 NaN NaN E10+
16293 AKB1/48: Idol to Guam de Koishitara... X360 NaN Misc 0.00 0.00 0.01 0.00 NaN NaN NaN
16329 Brothers in Arms: Furious 4 X360 NaN Shooter 0.01 0.00 0.00 0.00 NaN NaN M
16348 Agarest Senki: Re-appearance PS3 NaN Role-Playing 0.00 0.00 0.01 0.00 NaN NaN NaN
16373 PDC World Championship Darts 2008 PSP NaN Sports 0.01 0.00 0.00 0.00 43.0 NaN E10+
16405 Freaky Flyers GC NaN Racing 0.01 0.00 0.00 0.00 69.0 6.5 T
16448 Inversion PC NaN Shooter 0.01 0.00 0.00 0.00 59.0 6.7 M
16458 Hakuouki: Shinsengumi Kitan PS3 NaN Adventure 0.01 0.00 0.00 0.00 NaN NaN NaN
16522 Virtua Quest GC NaN Role-Playing 0.01 0.00 0.00 0.00 55.0 5.5 T

As we can see, some of the games contain year of release in their name, let's extract them and fix at least a couple of rows:

In [13]:
games_df['year_of_release'].isna().sum()
Out[13]:
269
In [14]:
year_dict = (games_df.query('year_of_release != year_of_release')['name']
             .str.extract('(\d+)').reset_index().set_index('index')[0].to_dict())
In [15]:
games_df['year_of_release'] = games_df.apply(lambda x: year_dict[x.name] if pd.isna(x['year_of_release']) 
                                             else x['year_of_release'], axis=1)

fltr = pd.to_numeric(games_df['year_of_release']) > 1979
games_df['year_of_release'].where(fltr, 0, inplace=True)
In [16]:
len(games_df.query('year_of_release == 0'))
Out[16]:
252
In [17]:
games_df['year_of_release'] = games_df['year_of_release'].astype(np.int64)

I've decided to change the year column type to integer as it looks more proper for year column, without decimals.
Let's continue investigating the 'year_of_release' column. Let's check it for consistency, i.e. for every console the right years represented:

In [18]:
year_games_info = (games_df.query('year_of_release != 0').groupby(['platform'])['year_of_release']
                    .value_counts().sort_index().to_frame())
year_games_info.columns = ['count']
year_games_info.reset_index(level=[1], inplace=True)
year_games_info['diff'] = year_games_info['year_of_release'].diff().to_frame()
year_games_info
Out[18]:
year_of_release count diff
platform
2600 1980 9 NaN
2600 1981 46 1.0
2600 1982 36 1.0
2600 1983 11 1.0
2600 1984 1 1.0
2600 1985 1 1.0
2600 1986 2 1.0
2600 1987 6 1.0
2600 1988 2 1.0
2600 1989 2 1.0
3DO 1994 1 5.0
3DO 1995 2 1.0
3DS 2011 116 16.0
3DS 2012 93 1.0
3DS 2013 91 1.0
3DS 2014 80 1.0
3DS 2015 86 1.0
3DS 2016 46 1.0
DC 1998 7 -18.0
DC 1999 14 1.0
DC 2000 20 1.0
DC 2001 8 1.0
DC 2002 1 1.0
DC 2007 1 5.0
DC 2008 1 1.0
DS 1985 1 -23.0
DS 2004 23 19.0
DS 2005 118 1.0
DS 2006 201 1.0
DS 2007 376 1.0
DS 2008 493 1.0
DS 2009 403 1.0
DS 2010 323 1.0
DS 2011 153 1.0
DS 2012 23 1.0
DS 2013 8 1.0
GB 1988 1 -25.0
GB 1989 10 1.0
GB 1990 3 1.0
GB 1991 9 1.0
GB 1992 7 1.0
GB 1994 3 2.0
GB 1995 4 1.0
GB 1996 6 1.0
GB 1997 8 1.0
GB 1998 8 1.0
GB 1999 11 1.0
GB 2000 17 1.0
GB 2001 10 1.0
GBA 2000 1 -1.0
GBA 2001 107 1.0
GBA 2002 198 1.0
GBA 2003 150 1.0
GBA 2004 175 1.0
GBA 2005 135 1.0
GBA 2006 39 1.0
GBA 2007 6 1.0
GC 2001 22 -6.0
GC 2002 152 1.0
GC 2003 145 1.0
GC 2004 83 1.0
GC 2005 97 1.0
GC 2006 40 1.0
GC 2007 4 1.0
GEN 1990 1 -17.0
GEN 1991 1 1.0
GEN 1992 6 1.0
GEN 1993 7 1.0
GEN 1994 12 1.0
GG 1992 1 -2.0
N64 1996 18 4.0
N64 1997 49 1.0
N64 1998 77 1.0
N64 1999 102 1.0
N64 2000 60 1.0
N64 2001 9 1.0
N64 2002 1 1.0
NES 1983 6 -19.0
NES 1984 13 1.0
NES 1985 11 1.0
NES 1986 19 1.0
NES 1987 10 1.0
NES 1988 11 1.0
NES 1989 5 1.0
NES 1990 8 1.0
NES 1991 7 1.0
NES 1992 3 1.0
NES 1993 4 1.0
NES 1994 1 1.0
NG 1993 2 -1.0
NG 1994 4 1.0
NG 1995 4 1.0
NG 1996 2 1.0
PC 1985 1 -11.0
PC 1988 1 3.0
PC 1992 5 4.0
PC 1994 6 2.0
PC 1995 2 1.0
PC 1996 4 1.0
PC 1997 6 1.0
PC 1998 8 1.0
PC 1999 7 1.0
PC 2000 7 1.0
PC 2001 15 1.0
PC 2002 19 1.0
PC 2003 33 1.0
PC 2004 30 1.0
PC 2005 37 1.0
PC 2006 52 1.0
PC 2007 62 1.0
PC 2008 76 1.0
PC 2009 107 1.0
PC 2010 90 1.0
PC 2011 139 1.0
PC 2012 61 1.0
PC 2013 39 1.0
PC 2014 47 1.0
PC 2015 50 1.0
PC 2016 54 1.0
PCFX 1996 1 -20.0
PS 1994 17 -2.0
PS 1995 99 1.0
PS 1996 164 1.0
PS 1997 188 1.0
PS 1998 248 1.0
PS 1999 200 1.0
PS 2000 160 1.0
PS 2001 91 1.0
PS 2002 20 1.0
PS 2003 3 1.0
PS2 2000 82 -3.0
PS2 2001 185 1.0
PS2 2002 280 1.0
PS2 2003 258 1.0
PS2 2004 261 1.0
PS2 2005 261 1.0
PS2 2006 261 1.0
PS2 2007 214 1.0
PS2 2008 191 1.0
PS2 2009 96 1.0
PS2 2010 38 1.0
PS2 2011 7 1.0
PS3 2006 27 -5.0
PS3 2007 90 1.0
PS3 2008 138 1.0
PS3 2009 162 1.0
PS3 2010 181 1.0
PS3 2011 215 1.0
PS3 2012 148 1.0
PS3 2013 126 1.0
PS3 2014 108 1.0
PS3 2015 73 1.0
PS3 2016 38 1.0
PS4 2013 16 -3.0
PS4 2014 75 1.0
PS4 2015 137 1.0
PS4 2016 164 1.0
PSP 2004 15 -12.0
PSP 2005 95 1.0
PSP 2006 189 1.0
PSP 2007 133 1.0
PSP 2008 101 1.0
PSP 2009 162 1.0
PSP 2010 188 1.0
PSP 2011 139 1.0
PSP 2012 106 1.0
PSP 2013 54 1.0
PSP 2014 10 1.0
PSP 2015 3 1.0
PSV 2011 18 -4.0
PSV 2012 53 1.0
PSV 2013 63 1.0
PSV 2014 100 1.0
PSV 2015 110 1.0
PSV 2016 85 1.0
SAT 1994 19 -22.0
SAT 1995 52 1.0
SAT 1996 38 1.0
SAT 1997 34 1.0
SAT 1998 29 1.0
SAT 1999 1 1.0
SCD 1993 1 -6.0
SCD 1994 5 1.0
SNES 1990 4 -4.0
SNES 1991 24 1.0
SNES 1992 21 1.0
SNES 1993 46 1.0
SNES 1994 53 1.0
SNES 1995 54 1.0
SNES 1996 30 1.0
SNES 1997 4 1.0
SNES 1998 2 1.0
SNES 1999 1 1.0
TG16 1995 2 -4.0
WS 1999 2 4.0
WS 2000 2 1.0
WS 2001 2 1.0
Wii 2006 44 5.0
Wii 2007 185 1.0
Wii 2008 282 1.0
Wii 2009 325 1.0
Wii 2010 253 1.0
Wii 2011 143 1.0
Wii 2012 31 1.0
Wii 2013 12 1.0
Wii 2014 6 1.0
Wii 2015 4 1.0
Wii 2016 1 1.0
WiiU 2012 32 -4.0
WiiU 2013 42 1.0
WiiU 2014 31 1.0
WiiU 2015 28 1.0
WiiU 2016 14 1.0
X360 2005 18 -11.0
X360 2006 93 1.0
X360 2007 124 1.0
X360 2008 146 1.0
X360 2009 172 1.0
X360 2010 182 1.0
X360 2011 207 1.0
X360 2012 106 1.0
X360 2013 75 1.0
X360 2014 63 1.0
X360 2015 35 1.0
X360 2016 13 1.0
XB 2000 1 -16.0
XB 2001 33 1.0
XB 2002 159 1.0
XB 2003 190 1.0
XB 2004 177 1.0
XB 2005 180 1.0
XB 2006 62 1.0
XB 2007 3 1.0
XB 2008 1 1.0
XOne 2013 19 5.0
XOne 2014 61 1.0
XOne 2015 80 1.0
XOne 2016 87 1.0
In [19]:
year_games_info = year_games_info.query('diff != 1')
year_games_info[year_games_info.index.duplicated(keep=False)].query('count != 1')[['year_of_release','diff']]
Out[19]:
year_of_release diff
platform
DC 1998 -18.0
DS 2004 19.0
GB 1994 2.0
PC 1992 4.0
PC 1994 2.0

In the table above there are shown pairs of platform/difference between preivous game year of release:

  • There is a gap for DC games in 2002-2007
  • For DS the first game is from 1985, but the console was released in 2004. This row should be removed.
  • For GB data is missing for 1993 for some reason
  • For PC there are several gaps.

Wrong value for DS should be deleted:

In [20]:
games_df.drop((games_df.loc[(games_df['year_of_release'] == 1985) 
                            & (games_df['platform'] == "DS"), :]).index.item(), inplace=True)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: `item` has been deprecated and will be removed in a future version
  

Let's investigate ratings and scores:

In [22]:
len(games_df.query('rating != rating'))
Out[22]:
6763
In [23]:
have_rating = games_df.query('rating == rating').set_index('name')['rating'].to_dict()
games_df['rating'] = games_df.apply(lambda x: have_rating[x['name']] if x['name'] in have_rating.keys()
                                    else x['rating'], axis=1)
In [24]:
len(games_df.query('rating != rating'))
Out[24]:
6326

I've been able to fill ~400 missing ratings by finding a game rating on different platform. Let's fill NaNs the same way for critic and user score:

In [26]:
len(games_df.query('user_score != user_score'))
Out[26]:
9122
In [27]:
have_uscore = games_df.query('user_score == user_score').set_index('name')['user_score'].to_dict()
games_df['user_score'] = games_df.apply(lambda x: have_uscore[x['name']] if x['name'] in have_uscore.keys()
                                    else x['user_score'], axis=1)
In [28]:
len(games_df.query('user_score != user_score'))
Out[28]:
8023
In [29]:
len(games_df.query('critic_score != critic_score'))
Out[29]:
8575
In [30]:
have_cscore = games_df.query('critic_score == critic_score').set_index('name')['critic_score'].to_dict()
games_df['critic_score'] = games_df.apply(lambda x: have_cscore[x['name']] if x['name'] in have_cscore.keys()
                                    else x['critic_score'], axis=1)
In [31]:
len(games_df.query('critic_score != critic_score'))
Out[31]:
7614

About a thousand values for each score were restored.

In [32]:
games_df['total_sales'] = games_df['na_sales'] + games_df['eu_sales'] + games_df['jp_sales'] + games_df['other_sales']
games_df.head()
Out[32]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating total_sales
0 Wii Sports Wii 2006 Sports 41.36 28.96 3.77 8.45 76.0 8.0 E 82.54
1 Super Mario Bros. NES 1985 Platform 29.08 3.58 6.81 0.77 NaN NaN NaN 40.24
2 Mario Kart Wii Wii 2008 Racing 15.68 12.76 3.79 3.29 82.0 8.3 E 35.52
3 Wii Sports Resort Wii 2009 Sports 15.61 10.93 3.28 2.95 80.0 8.0 E 32.77
4 Pokemon Red/Pokemon Blue GB 1996 Role-Playing 11.27 8.89 10.22 1.00 NaN NaN NaN 31.38

Let's fill the missing values for critic/user scores with mean values for the same year/genre:

In [33]:
pivot_score = games_df.pivot_table(index='genre', columns='year_of_release', values=['critic_score', 'user_score'])
pivot_score
Out[33]:
critic_score ... user_score
year_of_release 0 1980 1981 1982 1985 1988 1989 1990 1991 1992 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
genre
Action 61.666667 NaN 87.0 NaN NaN NaN 56.0 NaN 95.0 NaN ... 6.559603 7.030675 6.798947 6.776761 6.492814 6.749650 7.185586 6.491870 6.795455 6.491304
Adventure 72.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 7.131250 6.425000 6.497368 7.176471 7.152000 7.220000 7.571429 6.605556 7.850000 7.208333
Fighting 75.200000 NaN NaN NaN NaN NaN NaN NaN NaN 85.0 ... 7.012500 7.243333 7.175000 7.470370 6.231579 6.625926 7.314286 6.946667 6.638889 6.554545
Misc 67.117647 69.0 NaN NaN NaN NaN NaN NaN NaN NaN ... 6.359649 6.374324 6.829508 6.078000 6.119565 7.100000 7.016667 7.061111 5.910526 7.691667
Platform 63.428571 NaN NaN NaN NaN NaN NaN NaN 60.5 NaN ... 6.875000 6.760870 6.745455 7.081818 7.562500 7.811111 7.167647 7.900000 7.263636 5.230769
Puzzle 82.250000 NaN NaN NaN 75.0 NaN NaN NaN NaN NaN ... 7.065385 7.582353 6.762500 6.811111 6.320000 8.200000 6.766667 8.600000 7.733333 NaN
Racing 66.666667 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 6.443284 6.570909 6.909434 6.697059 6.017949 6.412500 4.726667 6.024000 5.333333 6.835294
Role-Playing 68.857143 NaN NaN NaN NaN NaN NaN 77.0 NaN 58.0 ... 7.332143 7.628000 7.636667 7.394643 7.273171 6.970000 7.570588 6.866667 7.529167 7.528000
Shooter 67.181818 NaN NaN 57.0 NaN NaN NaN NaN NaN 85.0 ... 7.321333 6.925000 6.700000 6.663793 6.173077 6.764103 5.776471 6.612766 5.534615 6.497561
Simulation 66.571429 NaN NaN NaN 59.0 64.0 NaN NaN 64.0 62.0 ... 6.941304 7.191176 6.166667 6.332000 5.729167 7.490909 3.988889 6.500000 6.766667 6.581818
Sports 68.238095 NaN NaN NaN NaN NaN NaN NaN NaN 68.0 ... 6.608929 6.362712 6.504124 6.721649 6.119403 6.343902 5.155263 4.800000 4.351852 4.964286
Strategy 67.714286 NaN NaN NaN NaN NaN NaN NaN NaN 75.0 ... 7.327778 7.070833 7.280000 7.155000 6.529167 7.820000 3.241667 7.233333 6.850000 7.612500

12 rows × 67 columns

In [34]:
games_df['critic_score'] = (games_df.groupby(['genre',
                                'year_of_release'])['critic_score']
                                .transform(lambda group:
                                            group.fillna(group.mean())))
In [35]:
games_df['user_score'] = (games_df.groupby(['genre',
                                'year_of_release'])['user_score']
                                .transform(lambda group:
                                            group.fillna(group.mean())))
In [36]:
len(games_df.query('critic_score != critic_score'))
Out[36]:
623
In [37]:
len(games_df.query('user_score != user_score'))
Out[37]:
591

Almost all the missing values were filled.

Let's investigate rating/genre relations:

In [38]:
games_df.pivot_table(index='genre', columns='rating', values='name',
                     aggfunc='count').plot(kind='bar', style='o', figsize=(15,12))
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f582c9cc4d0>
In [39]:
games_df.query('genre == "Sports" and rating != rating').count()
Out[39]:
name               788
platform           788
year_of_release    788
genre              788
na_sales           788
eu_sales           788
jp_sales           788
other_sales        788
critic_score       619
user_score         619
rating               0
total_sales        788
dtype: int64
In [40]:
games_df.query('genre == "Puzzle" and rating != rating').count()
Out[40]:
name               229
platform           229
year_of_release    229
genre              229
na_sales           229
eu_sales           229
jp_sales           229
other_sales        229
critic_score       166
user_score         166
rating               0
total_sales        229
dtype: int64
In [41]:
games_df.query('genre == "Fighting" and rating != rating').count()
Out[41]:
name               390
platform           390
year_of_release    390
genre              390
na_sales           390
eu_sales           390
jp_sales           390
other_sales        390
critic_score       345
user_score         345
rating               0
total_sales        390
dtype: int64

The three genres above can be safely filled with most popular rating:

In [42]:
games_df.loc[(games_df['genre'] == 'Sports'),'rating'] = games_df.loc[(games_df['genre'] == 'Sports'),'rating'].fillna('E')
games_df.loc[(games_df['genre'] == 'Puzzle'),'rating'] = games_df.loc[(games_df['genre'] == 'Puzzle'),'rating'].fillna('E')
games_df.loc[(games_df['genre'] == 'Fighting'),'rating'] = games_df.loc[(games_df['genre'] == 'Fighting'),'rating'].fillna('T')

Conclusion

Score columns missing vales were filled with mean per genre/year; rating for some genres was filled with the most popular rating for that genre. 'year_of_release' column was cleared if weird values and some NaN's restored.


Step 3. Analysis

In [44]:
games_df[games_df['year_of_release'] != 0]['year_of_release'].hist(bins='auto', figsize=(12,8))
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f582cb51290>

On the histogram above we see the disdtribution of games released per year. It significally increases after the year 2000. It will be relevant to use only these years in the following research.

In [45]:
new_games_df = games_df.query('year_of_release >= 2000')

Let's take a look at sales on different platform:

In [46]:
sales_sum = (new_games_df.pivot_table(index='platform', values='total_sales', aggfunc='sum')
        .reset_index().sort_values('total_sales', ascending=False))
sales_sum.plot(y='total_sales', x='platform', kind='bar', figsize=(12,8))
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f582cb39f10>
In [48]:
top_sales_platform = sales_sum.query('total_sales > 400')['platform'].tolist()
top_sales_platform
Out[48]:
['PS2', 'X360', 'PS3', 'Wii', 'DS']

For these five platforms total sales more than twice higher than for others, with the most succesful PS2 pushing over 1.2 billion US dollars. Let's take a closer look:

In [49]:
new_games_df.query('platform == @top_sales_platform')
Out[49]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating total_sales
0 Wii Sports Wii 2006 Sports 41.36 28.96 3.77 8.45 76.000000 8.000000 E 82.54
2 Mario Kart Wii Wii 2008 Racing 15.68 12.76 3.79 3.29 82.000000 8.300000 E 35.52
3 Wii Sports Resort Wii 2009 Sports 15.61 10.93 3.28 2.95 80.000000 8.000000 E 32.77
6 New Super Mario Bros. DS 2006 Platform 11.28 9.14 6.50 2.88 89.000000 8.500000 E 29.80
7 Wii Play Wii 2006 Misc 13.96 9.18 2.93 2.84 58.000000 6.600000 E 28.91
... ... ... ... ... ... ... ... ... ... ... ... ...
16700 Mezase!! Tsuri Master DS DS 2009 Sports 0.00 0.00 0.01 0.00 69.750000 6.504124 E 0.01
16704 Plushees DS 2008 Simulation 0.01 0.00 0.00 0.00 64.105263 7.191176 E 0.01
16709 SCORE International Baja 1000: The Official Game PS2 2008 Racing 0.00 0.00 0.00 0.00 64.629630 6.570909 NaN 0.00
16710 Samurai Warriors: Sanada Maru PS3 2016 Action 0.00 0.00 0.01 0.00 69.471910 6.491304 NaN 0.01
16711 LMA Manager 2007 X360 2006 Sports 0.00 0.01 0.00 0.00 69.190000 7.361386 E 0.01

8081 rows × 12 columns

Almost a half of our data is for these 5 topsellers. Let's compare their sales per year on a graph:

In [50]:
top_sales_pivot = (new_games_df.query('platform == @top_sales_platform and year_of_release != 0')
                      .pivot_table(index=['platform','year_of_release'], values='total_sales', aggfunc='sum'))
years_list = top_sales_pivot.index.get_level_values(1).unique()
top_sales_pivot
Out[50]:
total_sales
platform year_of_release
DS 2004 17.27
2005 130.14
2006 119.81
2007 146.94
2008 145.32
2009 119.54
2010 85.02
2011 26.18
2012 11.01
2013 1.54
PS2 2000 39.17
2001 166.43
2002 205.38
2003 185.93
2004 220.55
2005 160.98
2006 106.46
2007 75.99
2008 53.90
2009 26.40
2010 5.64
2011 0.45
PS3 2006 20.96
2007 73.19
2008 118.52
2009 130.93
2010 142.17
2011 156.78
2012 107.36
2013 113.25
2014 47.76
2015 16.82
2016 3.60
Wii 2006 137.15
2007 152.77
2008 171.32
2009 206.97
2010 127.95
2011 59.65
2012 21.71
2013 8.59
2014 3.75
2015 1.14
2016 0.18
X360 2005 8.25
2006 51.62
2007 95.42
2008 135.26
2009 120.29
2010 170.03
2011 143.89
2012 99.74
2013 88.58
2014 34.74
2015 11.96
2016 1.52
In [51]:
figure(figsize=(12,8))
def build_graph(platform):
    data = top_sales_pivot.loc[platform]
    plt.plot(data.index, data.values)
    

for r in top_sales_platform:
    build_graph(r)
from matplotlib.pyplot import figure
plt.legend(top_sales_platform)
plt.xticks(years_list, rotation=20)
plt.grid(True)

Some common parameters we can see from the graph: a lifespan of each popular platform is around 10 years, so platforms released before 2005, like PS2 ans DS have 0 sales now. Total sales peaking 4-5 years after release - in this research we should look at platforms released in 2012 to build prognosis for 2017.

In [55]:
release_df = games_df.query('year_of_release != 0')

release_pivot = release_df.pivot_table(index='platform', 
                                     values='year_of_release', aggfunc='min').reset_index()

rp = release_df.pivot_table(index=['platform', 'year_of_release'], 
                                     values='total_sales', aggfunc='sum').query('year_of_release != 0')

release_pivot['total_sales'] = release_pivot.apply(lambda x: rp.loc[(x['platform'],x['year_of_release'])]
                                                                   ['total_sales'], axis=1)
release_pivot.set_index('platform', inplace=True)
In [56]:
figure(figsize=(12,8))
def build_graph(platform):
    data = release_pivot.loc[platform]
    plt.plot(data['year_of_release'], data['total_sales'], 'ro')
    plt.text(data['year_of_release']+0.1, data['total_sales']+1, platform, fontsize=9)

for r in release_pivot.index.get_level_values(0).unique():
    build_graph(r)
    
plt.xticks(release_pivot['year_of_release'].unique(), rotation=20)
plt.grid(True)

As we see, the new consoles come out after 2000 with gaps 2-4 years, the minor consoles come out 1-2 years before the new versions of big players - Xbox and PlayStation. We should not expect the new consoles in 2017.

In [57]:
newest_df = new_games_df.query('year_of_release >= 2012')
newest_pivot = newest_df.pivot_table(index=['platform','year_of_release'], values='total_sales', aggfunc='sum')
years_list = newest_pivot.index.get_level_values(1).unique()
newest_pivot
Out[57]:
total_sales
platform year_of_release
3DS 2012 51.36
2013 56.57
2014 43.76
2015 27.78
2016 15.14
DS 2012 11.01
2013 1.54
PC 2012 23.22
2013 12.73
2014 13.28
2015 8.52
2016 5.25
PS3 2012 107.36
2013 113.25
2014 47.76
2015 16.82
2016 3.60
PS4 2013 25.99
2014 100.00
2015 118.90
2016 69.25
PSP 2012 7.69
2013 3.14
2014 0.24
2015 0.12
PSV 2012 16.19
2013 10.59
2014 11.90
2015 6.25
2016 4.25
Wii 2012 21.71
2013 8.59
2014 3.75
2015 1.14
2016 0.18
WiiU 2012 17.56
2013 21.65
2014 22.03
2015 16.35
2016 4.60
X360 2012 99.74
2013 88.58
2014 34.74
2015 11.96
2016 1.52
XOne 2013 18.96
2014 54.07
2015 60.14
2016 26.15
In [58]:
figure(figsize=(12,8))
def bg(platform):
    data = newest_pivot.loc[platform]
    plt.plot(data.index, data.values)
    

for r in newest_pivot.index.get_level_values(0).unique():
    bg(r)
from matplotlib.pyplot import figure
plt.legend(newest_pivot.index.get_level_values(0).unique())
plt.xticks(years_list, rotation=20)
plt.grid(True)
In [59]:
newest_mean = newest_pivot.query('year_of_release == 2016')['total_sales'].mean()
newest_pivot.query('year_of_release == 2016 and total_sales >= @newest_mean')
Out[59]:
total_sales
platform year_of_release
3DS 2016 15.14
PS4 2016 69.25
XOne 2016 26.15

The most popular consoles in 2016 are Xbox One and Playstation 4. Although there were no sales rising in 2016, we should expcect the highest sales on these two consoles. Also 3DS had sales slightly above average in 2016, but it's lifespan is close to end, so we should expect it die out in a year or two, but for 2017 it still may be profitable.

In [60]:
grouped_by_platf = new_games_df.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
grouped_by_platf
Out[60]:
platform year_of_release total_sales
0 3DS 2011 63.20
1 3DS 2012 51.36
2 3DS 2013 56.57
3 3DS 2014 43.76
4 3DS 2015 27.78
5 3DS 2016 15.14
6 DC 2000 5.99
7 DC 2001 1.07
8 DC 2002 0.29
9 DC 2007 0.02
10 DC 2008 0.04
11 DS 2004 17.27
12 DS 2005 130.14
13 DS 2006 119.81
14 DS 2007 146.94
15 DS 2008 145.32
16 DS 2009 119.54
17 DS 2010 85.02
18 DS 2011 26.18
19 DS 2012 11.01
20 DS 2013 1.54
21 GB 2000 19.76
22 GB 2001 9.24
23 GBA 2000 0.07
24 GBA 2001 61.53
25 GBA 2002 74.16
26 GBA 2003 56.67
27 GBA 2004 77.91
28 GBA 2005 33.86
29 GBA 2006 5.28
30 GBA 2007 3.40
31 GC 2001 26.34
32 GC 2002 51.81
33 GC 2003 50.91
34 GC 2004 28.82
35 GC 2005 27.62
36 GC 2006 11.26
37 GC 2007 0.27
38 N64 2000 33.97
39 N64 2001 3.25
40 N64 2002 0.08
41 PC 2000 4.66
42 PC 2001 5.47
43 PC 2002 8.57
44 PC 2003 8.84
45 PC 2004 10.39
46 PC 2005 4.37
47 PC 2006 2.85
48 PC 2007 9.28
49 PC 2008 12.42
50 PC 2009 16.91
51 PC 2010 24.28
52 PC 2011 35.03
53 PC 2012 23.22
54 PC 2013 12.73
55 PC 2014 13.28
56 PC 2015 8.52
57 PC 2016 5.25
58 PS 2000 96.37
59 PS 2001 35.59
60 PS 2002 6.67
61 PS 2003 2.07
62 PS2 2000 39.17
63 PS2 2001 166.43
64 PS2 2002 205.38
65 PS2 2003 185.93
66 PS2 2004 220.55
67 PS2 2005 160.98
68 PS2 2006 106.46
69 PS2 2007 75.99
70 PS2 2008 53.90
71 PS2 2009 26.40
72 PS2 2010 5.64
73 PS2 2011 0.45
74 PS3 2006 20.96
75 PS3 2007 73.19
76 PS3 2008 118.52
77 PS3 2009 130.93
78 PS3 2010 142.17
79 PS3 2011 156.78
80 PS3 2012 107.36
81 PS3 2013 113.25
82 PS3 2014 47.76
83 PS3 2015 16.82
84 PS3 2016 3.60
85 PS4 2013 25.99
86 PS4 2014 100.00
87 PS4 2015 118.90
88 PS4 2016 69.25
89 PSP 2004 7.06
90 PSP 2005 43.84
91 PSP 2006 55.32
92 PSP 2007 46.93
93 PSP 2008 34.56
94 PSP 2009 38.58
95 PSP 2010 35.04
96 PSP 2011 17.82
97 PSP 2012 7.69
98 PSP 2013 3.14
99 PSP 2014 0.24
100 PSP 2015 0.12
101 PSV 2011 4.63
102 PSV 2012 16.19
103 PSV 2013 10.59
104 PSV 2014 11.90
105 PSV 2015 6.25
106 PSV 2016 4.25
107 WS 2000 0.68
108 WS 2001 0.28
109 Wii 2006 137.15
110 Wii 2007 152.77
111 Wii 2008 171.32
112 Wii 2009 206.97
113 Wii 2010 127.95
114 Wii 2011 59.65
115 Wii 2012 21.71
116 Wii 2013 8.59
117 Wii 2014 3.75
118 Wii 2015 1.14
119 Wii 2016 0.18
120 WiiU 2012 17.56
121 WiiU 2013 21.65
122 WiiU 2014 22.03
123 WiiU 2015 16.35
124 WiiU 2016 4.60
125 X360 2005 8.25
126 X360 2006 51.62
127 X360 2007 95.42
128 X360 2008 135.26
129 X360 2009 120.29
130 X360 2010 170.03
131 X360 2011 143.89
132 X360 2012 99.74
133 X360 2013 88.58
134 X360 2014 34.74
135 X360 2015 11.96
136 X360 2016 1.52
137 XB 2000 0.99
138 XB 2001 22.26
139 XB 2002 48.59
140 XB 2003 55.40
141 XB 2004 65.42
142 XB 2005 49.23
143 XB 2006 10.04
144 XB 2007 0.55
145 XB 2008 0.18
146 XOne 2013 18.96
147 XOne 2014 54.07
148 XOne 2015 60.14
149 XOne 2016 26.15
In [61]:
order_by_platf = grouped_by_platf.groupby(['platform'])['total_sales'].sum().sort_values(ascending=False).reset_index()['platform']
order_by_platf
Out[61]:
0      PS2
1     X360
2      PS3
3      Wii
4       DS
5      PS4
6      GBA
7      PSP
8      3DS
9       XB
10      PC
11      GC
12    XOne
13      PS
14    WiiU
15     PSV
16     N64
17      GB
18      DC
19      WS
Name: platform, dtype: object
In [62]:
plt.figure(figsize=(12,8))
sns.boxplot(x='platform', y='total_sales', data=grouped_by_platf, order=order_by_platf)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5825c32e10>

There is a great difference in total sales between platforms. There are five leaders (PS2, X360, PS3, Wii, DS), all others are either minor platforms or new ones and still have to reach their peak, like PS4.

In [63]:
grouped_by_mean = new_games_df.groupby(['platform', 'year_of_release'])['total_sales'].mean().reset_index()
order_by_platf_mean = grouped_by_mean.groupby(['platform'])['total_sales'].mean().sort_values(ascending=False).reset_index()['platform']
plt.figure(figsize=(12,8))
sns.boxplot(x='platform', y='total_sales', data=grouped_by_mean, order=order_by_platf_mean, showmeans=True)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5825c16b50>

The difference between mean sales is not that great. The mean sales have different leaders, for instance PS4 in in the lead, which means it'll join the topsellers in a couple of years. Mean values are great for several outdated platforms like GameBoy, which was very populat at it's time. The newest platforms have yet to rich it's popularity.


Let's take a look at the most popular consoles of previous generation:

In [64]:
new_games_df.query('name == "Call of Duty: Modern Warfare 3"')
Out[64]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating total_sales
29 Call of Duty: Modern Warfare 3 X360 2011 Shooter 9.04 4.24 0.13 1.32 70.0 1.8 M 14.73
37 Call of Duty: Modern Warfare 3 PS3 2011 Shooter 5.54 5.73 0.49 1.57 70.0 1.8 M 13.33
1040 Call of Duty: Modern Warfare 3 PC 2011 Shooter 0.41 0.98 0.00 0.32 70.0 1.8 M 1.71
2500 Call of Duty: Modern Warfare 3 Wii 2011 Shooter 0.55 0.20 0.00 0.08 70.0 1.8 M 0.83
In [65]:
top_sales_platform = ['X360', 'PS3', 'PC', 'Wii']

We pick a topseller title and check on which consoles it came out, those should be the most popular for it's time.

Let's have a look at PS3 sales dependance of critic and user scores:

In [66]:
ps3_sales = new_games_df.query('platform == "PS3"')
In [67]:
ax = ps3_sales.plot(y='total_sales', x='critic_score', kind='scatter', alpha=0.5, figsize=(12,8), label='critic', ylim=(0,5))

We see the dots location is curved upwards with critic score rise, which means overall sales rise with critic score. Most of the games get 60-80 points from critics, most of them do not reach $1 million sales. For those who get more than 80 from critics, the sales are spread more and may reach several millions more easily.
Also if the game get less than 60 from critics, fat chance it'll have great sales.

In [68]:
ax = ps3_sales.plot(y='total_sales', x='user_score', kind='scatter', alpha=0.5, figsize=(12,8), label='user', ylim=(0,5))

The majority of the games get 6 to 8 points. User score/sales relations are more scattered, with more low scored titles between topsellers. It means that probably the games were not that good, but proper advertising made them topsellers.

In [69]:
corr_check = ps3_sales[['critic_score', 'user_score', 'na_sales', 'eu_sales', 'jp_sales', 'total_sales']]
corr_check.corr()
Out[69]:
critic_score user_score na_sales eu_sales jp_sales total_sales
critic_score 1.000000 0.514816 0.348118 0.289852 0.196666 0.336766
user_score 0.514816 1.000000 0.102047 0.024210 0.129981 0.074040
na_sales 0.348118 0.102047 1.000000 0.813041 0.344322 0.936839
eu_sales 0.289852 0.024210 0.813041 1.000000 0.374296 0.958120
jp_sales 0.196666 0.129981 0.344322 0.374296 1.000000 0.458046
total_sales 0.336766 0.074040 0.936839 0.958120 0.458046 1.000000
In [70]:
sns.heatmap(corr_check.corr())
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5825d7bfd0>

On the correlation map above we see that critic score have more influence on total sales, than user scores (possibly because critics write their rewiews often before the game officially released, and users only after they bought the game). European sales are most influenced by critic reviews. User reviews have almost no effect on sales, the least is on users in North America.


In [71]:
top_on_all_platf= (new_games_df.query('platform == @top_sales_platform')
                .groupby('name').count().sort_values(by='platform', ascending=False)).reset_index().head(20)['name'].to_list()

top_on_all_platf
Out[71]:
['Need for Speed: Most Wanted',
 'Mini Ninjas',
 'Pro Evolution Soccer 2013',
 'Tomb Raider: Underworld',
 'Star Wars: The Force Unleashed II',
 'Kung Fu Panda',
 'Star Wars The Clone Wars: Republic Heroes',
 'Spider-Man: Web of Shadows',
 'Spider-Man 3',
 'Prince of Persia: The Forgotten Sands',
 'Transformers: Revenge of the Fallen (XBox 360, PS3, & PC Versions)',
 'Pro Evolution Soccer 2012',
 'Ice Age: Dawn of the Dinosaurs',
 'LEGO Indiana Jones 2: The Adventure Continues',
 'Ratatouille',
 'Harry Potter and the Order of the Phoenix',
 'Harry Potter and the Half-Blood Prince',
 'Harry Potter and the Deathly Hallows - Part 2',
 'Harry Potter and the Deathly Hallows - Part 1',
 "Skylanders: Spyro's Adventure"]

Above we have the list of top 20 games that have been on all top platforms we take a look at.

Let's compare some of the top game titles sales for different consoles:

In [72]:
ps3_sales = new_games_df.query('platform == "PS3" and name == @top_on_all_platf')
x360_sales = new_games_df.query('platform == "X360" and name == @top_on_all_platf')
pc_sales = new_games_df.query('platform == "PC" and name == @top_on_all_platf')
wii_sales = new_games_df.query('platform == "Wii" and name == @top_on_all_platf')
In [73]:
ps3_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=pc_sales['name'],  autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
Out[73]:
<matplotlib.legend.Legend at 0x7f5825e16b90>
In [74]:
pc_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=pc_sales['name'],  autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
Out[74]:
<matplotlib.legend.Legend at 0x7f5825e8a350>
In [75]:
x360_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=pc_sales['name'],  autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
Out[75]:
<matplotlib.legend.Legend at 0x7f5825ec9a50>
In [76]:
wii_sales.plot(y='total_sales', x='name', kind='pie', figsize=(18,8), labels=pc_sales['name'],  autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2,1), loc="upper right", fontsize=10)
Out[76]:
<matplotlib.legend.Legend at 0x7f5825fb8c10>

The list and order of top selling games for different platforms is about the same, with minor differences. But sales for the same game i=on different platforms were different, for example "Harry Potter and the Deadly Hallows part 1" had way more success on PS3, than on other platforms.


Let's have a look on sales distributions between genres:

In [77]:
new_games_df.pivot_table(index='genre',  values='total_sales',
                     aggfunc='sum').sort_values('total_sales', ascending=False).plot(kind='bar', style='o', figsize=(15,12))
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58254edbd0>
In [78]:
new_games_df.pivot_table(index='genre',  values='total_sales',
                     aggfunc='sum').sort_values('total_sales', ascending=False).plot(y='total_sales', kind='pie', style='o', figsize=(10,7), legend=False)
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5825466190>

The first three genres - Action, Sports and Shooter - are the most profitable, making a half of the sales. The more popular the genre, the more recreational and entertaining it is. The least popular ones, Strategy, Puzzle and Adventure might be less entertaining and require some thinking, planning and riddle solving. Also playing these games may require more spare time.

In [79]:
new_games_df.pivot_table(index='genre', values=['na_sales', 'eu_sales', 'jp_sales'],
                     aggfunc='sum').sort_values('na_sales', ascending=False).plot(kind='bar', style='o', figsize=(15,12))
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f58253f8290>

The genre distribution between thebiggest markets looks even with a couple of exclusion: RPG's are extremely popular in Japan, it is the most profitable genre there. All the genres are more profitable there, than Shooter genre, which is in top-3 for NA and EU. Racing games are also unpopular in Japan.

Conclusion

In the analysis we should use the data from the year 2000 and later, when the industry started it's rapid growth and some game rules were changed. We found the top 5 best selling platforms - PS2, X360, PS3, Wii and DS. The successful platform bring profit for 10 years, peaking at 5 years after the release and than the profit starts to fade. According to this information, we pay attention to the platforms, released not earlier than 2012. There is no console on the market that increased the sales in 2016, but the aim should be on the top 3 leaders - Xbox One, PS4 and 3DS. As 3DS sale are fading, we may expect the release of new console from Nintendo, but not for other big players.
There is a correlation between critic ratings and sales, so the games with top scored reviews from critics should have higher sales, especially on European market. Same games on different consoles might have a different success. The most profit bring Action, Sports and Shooter genres, but the situation on Japan market is different, the most profitable genre there is RPG., the least - Shooter.


Step 4. Profile for each region.

In [80]:
def pivot_profile(reg, agr):
    sales = reg + '_sales'
    piv_t = new_games_df.pivot_table(index=agr, values=sales,
                     aggfunc='sum').sort_values(sales, ascending=False).head(5).reset_index()
    return piv_t
reg_lst = ['na', 'eu', 'jp']
agr_lst = ['platform', 'genre', 'rating']
agr_df = pd.DataFrame()
reg_df = pd.DataFrame()
for r in reg_lst:
    #del dfff
    agr_df = pd.DataFrame()
    for a in agr_lst:
        agr_df = pd.concat([agr_df, pivot_profile(r, a)], axis=1)
    agr_df.columns = ['by_platform', 'pl_sales', 'by_genre', 'genre_sales', 'by_rating', 'rating_sales']
    agr_df['region'] = r
    reg_df = pd.concat([reg_df, agr_df], axis=0)
reg_df
Out[80]:
by_platform pl_sales by_genre genre_sales by_rating rating_sales region
0 X360 595.74 Action 752.33 E 1334.90 na
1 PS2 580.30 Sports 595.84 T 749.23 na
2 Wii 486.87 Shooter 495.23 M 709.59 na
3 PS3 390.13 Misc 365.27 E10+ 382.69 na
4 DS 380.32 Role-Playing 282.30 EC 2.04 na
0 PS2 337.00 Action 466.83 E 773.08 eu
1 PS3 327.21 Sports 342.96 M 468.75 eu
2 X360 268.37 Shooter 294.02 T 416.29 eu
3 Wii 258.32 Misc 196.71 E10+ 201.65 eu
4 DS 188.41 Racing 192.52 EC 0.11 eu
0 DS 175.00 Role-Playing 241.71 E 238.53 jp
1 PS2 137.63 Action 132.23 T 160.54 jp
2 3DS 100.62 Misc 85.43 M 58.62 jp
3 PS3 79.41 Sports 76.19 E10+ 40.09 jp
4 PSP 76.14 Platform 55.27 EC 0.00 jp
In [81]:
for r in reg_lst:
    df = reg_df.query('region == @r')
    df.plot(x='by_platform', y='pl_sales', kind='barh', figsize=(7,4), fontsize=10, legend=False)
    plt.ylabel('')
    plt.title(r.upper(), fontsize=15)

XBox 360 holds first place only in North America, for Europe it is still PS2 (about to be beaten by PS3 in 2017), and Xbox 360 is much less popular there. In Japan the market is completely different, only Japan-made consoles are in top 5. Sony and Nintendo dominate the japan market.

In [82]:
for r in reg_lst:
    df = reg_df.query('region == @r')
    df.plot(x='by_genre', y='genre_sales', kind='barh', figsize=(7,4), 
            fontsize=10, legend=False, color="orange")
    plt.ylabel('')
    plt.title(r.upper(), fontsize=15)

NA and EU markets are very similar with exception, that 5th place in Europe is Racing genre, not RPG as in NA. Japan market is different, the absolute winner there is RPG, and Shooter is not represented at all. Also, Platform genre is presented there on 5th position.

In [83]:
for r in reg_lst:
    df = reg_df.query('region == @r')
    df.plot(x='by_rating', y='rating_sales', kind='pie', figsize=(7,4), 
            labels=df['by_rating'], autopct='%1.1f%%', fontsize=10, legend=False)
    plt.ylabel('')
    plt.title(r.upper(), fontsize=15)

NA and EU have similar sales by rating, with rating M slightly bigger for Europe wiining it's second place. For Japan the distribution is different - almost a half is E rated games, a one third is for rating T, and tiny 12% for M rating games - looks like adults in Japan have very little time to play videogames.

In [84]:
na_s = new_games_df['na_sales'].sum()
eu_s = new_games_df['eu_sales'].sum()
jp_s = new_games_df['jp_sales'].sum()
df = [na_s, eu_s, jp_s]
df = pd.DataFrame(df)
df.columns = ['sales']
df['region'] = reg_lst
df.plot(x='region', y='sales', kind='pie', figsize=(7,4), 
            labels=df['region'], autopct='%1.1f%%', fontsize=10, legend=False)
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5825012a90>

Conclusion

In general, North America market and European market have only minor differences, most of them in popular consoles. Japan market is unique in all the parameters reviewed an should have individual approach. We also should keep in mind that NA market is almost twice bigger than EU, and Japan market is almost twice smaller, than EU.


Step 5. Testing theories

XBox One and PC user ratings
In [85]:
xone_df = new_games_df.query('platform == "XOne"')
pc_df = new_games_df.query('platform == "PC" and year_of_release >= 2013')
xone_pivot = xone_df.pivot_table(index='year_of_release', values='user_score')
pc_pivot = pc_df.pivot_table(index='year_of_release', values='user_score')
print('Xbox One std:', xone_df['user_score'].std(), '\nPC std:', pc_df['user_score'].std())
Xbox One std: 1.662151763417649 
PC std: 1.6758507947279013

We should compare the samem years for the two platform since PC exists many years and Xbox on was released in 2013.
We consider standart deviations for the two platforms are the same.

In [86]:
ax = xone_pivot.plot(style='-o', grid=True)
pc_pivot.plot(style='-o', grid=True, ax=ax)
ax.legend(['XOne', 'PC'])
Out[86]:
<matplotlib.legend.Legend at 0x7f5825090c50>

User score average looks approximately the same for the two platforms, with only big difference from the release year for XBox One.

Testing:

Let's compare the sample means for user score:
$H_{0}$ - the sample means have no difference.
$H_{a}$ - the sample means are different.

In [87]:
alpha = .05

results = st.stats.ttest_ind(xone_pivot['user_score'], pc_pivot['user_score'], equal_var=True)

print('p-value: ', results.pvalue)

if (results.pvalue < alpha):
        print("We reject the null hypothesis")
else:
        print("We can't reject the null hypothesis")
p-value:  0.4686870031236776
We can't reject the null hypothesis
Conclusion:

We have no ground to say that sample means for user score are different for XBox On and PC, they most probably are the same.


Action and Sport games user rating
In [88]:
action_df = new_games_df.query('genre == "Action"')
sports_df = new_games_df.query('genre == "Sports"')
action_pivot = action_df.pivot_table(index='year_of_release', values='user_score')
sports_pivot = sports_df.pivot_table(index='year_of_release', values='user_score')
print('Action std:', action_df['user_score'].std(), '\nSports std:', sports_df['user_score'].std())
Action std: 1.2402060915789985 
Sports std: 1.46482910279134

We may consider standart deviations equal.

In [89]:
ax = action_pivot.plot(style='-o', grid=True)
sports_pivot.plot(style='-o', grid=True, ax=ax)
ax.legend(['Action', 'Sports'])
Out[89]:
<matplotlib.legend.Legend at 0x7f58250b9750>

The two graphs above do look different, but let's assume that the difference is just a sampling error.

Testing:

Let's compare the sample means for user score:
$H_{0}$ - the sample means have no difference (the one in the graph is due to sampling error).
$H_{a}$ - the sample means are different.

In [90]:
alpha = .05

results = st.stats.ttest_ind(action_pivot['user_score'], sports_pivot['user_score'], equal_var=True)

print('p-value: ', results.pvalue)

if (results.pvalue < alpha):
        print("We reject the null hypothesis")
else:
        print("We can't reject the null hypothesis")
p-value:  0.18713077092879427
We can't reject the null hypothesis
Conclusion:

It looks like there is no big difference between average user scores for Action and Sport games. For the latest years the difference is bigger, but still might be because of sample error.


Step 6. Overall conclusion

The data contained about 16000 entries about the video game sales for different platforms and markets. After the analysis conducted we may determine the possible top seller platforms and genres for each market. The sales for 3DS, Xbox One and PS4 look promising, but we should also expect a new console from Nintendo, as 3DS sales fade away. The top genres are Action, Shooter, Sports, RPG. For the japanese market everything is different, as they prefer Japan-made consoles and RPG is on the first row in genre chart there. We should pay attention to critic scores for games, as they directly correlate to sales, especially in Europe. No dependece between user ratings for different genres or consoles was proved.