This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com. Fields include:

Rank - Ranking of overall sales
Name - The game's name
Platform - Platform of the game's release (i.e. PC, PS4, etc.)
Year - Year of the game's release
Genre - Genre of the game
Publisher - Publisher of the game
NA_Sales - Sales in North America (in millions)
EU_Sales - Sales in Europe (in millions)
JP_Sales - Sales in Japan (in millions)
Other_Sales - Sales in the rest of the world (in millions)
Global_Sales - Total worldwide sales (in millions)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
from scipy.stats import mode


df = pd.read_csv('/Users/s7c/Documents/Video_Games_Sales_as_at_22_Dec_2016.csv')

Exploring our data:

In [2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
Name               16717 non-null object
Platform           16719 non-null object
Year_of_Release    16450 non-null float64
Genre              16717 non-null object
Publisher          16665 non-null object
NA_Sales           16719 non-null float64
EU_Sales           16719 non-null float64
JP_Sales           16719 non-null float64
Other_Sales        16719 non-null float64
Global_Sales       16719 non-null float64
Critic_Score       8137 non-null float64
Critic_Count       8137 non-null float64
User_Score         10015 non-null object
User_Count         7590 non-null float64
Developer          10096 non-null object
Rating             9950 non-null object
dtypes: float64(9), object(7)
memory usage: 2.0+ MB
In [3]:
df.describe()
Out[3]:
Year_of_Release NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count User_Count
count 16450.000000 16719.000000 16719.000000 16719.000000 16719.000000 16719.000000 8137.000000 8137.000000 7590.000000
mean 2006.487356 0.263330 0.145025 0.077602 0.047332 0.533543 68.967679 26.360821 162.229908
std 5.878995 0.813514 0.503283 0.308818 0.186710 1.547935 13.938165 18.980495 561.282326
min 1980.000000 0.000000 0.000000 0.000000 0.000000 0.010000 13.000000 3.000000 4.000000
25% 2003.000000 0.000000 0.000000 0.000000 0.000000 0.060000 60.000000 12.000000 10.000000
50% 2007.000000 0.080000 0.020000 0.000000 0.010000 0.170000 71.000000 21.000000 24.000000
75% 2010.000000 0.240000 0.110000 0.040000 0.030000 0.470000 79.000000 36.000000 81.000000
max 2020.000000 41.360000 28.960000 10.220000 10.570000 82.530000 98.000000 113.000000 10665.000000

We can see that, among the four regional sales columns, North American sales have the highest mean and standard deviation.
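That comparison is easy to pull out directly; a minimal check on just the regional sales columns:

# mean and standard deviation of each regional sales column (figures in millions)
df[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].agg(['mean', 'std'])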

In [4]:
df.corr()
Out[4]:
Year_of_Release NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count User_Count
Year_of_Release 1.000000 -0.092562 0.003842 -0.168386 0.037700 -0.076433 0.011411 0.223407 0.175339
NA_Sales -0.092562 1.000000 0.765336 0.449598 0.638654 0.941010 0.240755 0.295413 0.246429
EU_Sales 0.003842 0.765336 1.000000 0.435068 0.722796 0.901239 0.220752 0.277533 0.283360
JP_Sales -0.168386 0.449598 0.435068 1.000000 0.291096 0.612300 0.152593 0.180219 0.075638
Other_Sales 0.037700 0.638654 0.722796 0.291096 1.000000 0.749242 0.198554 0.251639 0.238982
Global_Sales -0.076433 0.941010 0.901239 0.612300 0.749242 1.000000 0.245471 0.303571 0.265012
Critic_Score 0.011411 0.240755 0.220752 0.152593 0.198554 0.245471 1.000000 0.425504 0.264376
Critic_Count 0.223407 0.295413 0.277533 0.180219 0.251639 0.303571 0.425504 1.000000 0.362334
User_Count 0.175339 0.246429 0.283360 0.075638 0.238982 0.265012 0.264376 0.362334 1.000000

European sales are strongly correlated with North American sales and with sales from the rest of the world. Japanese sales correlate only moderately with North America and Europe, and most weakly with the rest of the world.
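The matrix is easier to scan as a heatmap; a small sketch (numeric_only=True is needed on newer pandas so that text columns such as Name are skipped):

plt.figure(figsize=(10, 8))
# annotate each cell with its correlation coefficient
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()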

In [5]:
df.head()
Out[5]:
Name Platform Year_of_Release Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count User_Score User_Count Developer Rating
0 Wii Sports Wii 2006.0 Sports Nintendo 41.36 28.96 3.77 8.45 82.53 76.0 51.0 8 322.0 Nintendo E
1 Super Mario Bros. NES 1985.0 Platform Nintendo 29.08 3.58 6.81 0.77 40.24 NaN NaN NaN NaN NaN NaN
2 Mario Kart Wii Wii 2008.0 Racing Nintendo 15.68 12.76 3.79 3.29 35.52 82.0 73.0 8.3 709.0 Nintendo E
3 Wii Sports Resort Wii 2009.0 Sports Nintendo 15.61 10.93 3.28 2.95 32.77 80.0 73.0 8 192.0 Nintendo E
4 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37 NaN NaN NaN NaN NaN NaN

Let's check for missing values:

In [6]:
sns.heatmap(df.isnull(),cbar=False,cmap='coolwarm',yticklabels=False)
Out[6]:
[Figure: seaborn heatmap of df.isnull(), one column per field; missing entries show up as contrasting bands]

We can observe that Critic_Score, Critic_Count, User_Score, User_Count, Developer, and Rating account for most of the missing values, while Year_of_Release has only a few missing entries. Let's quantify that as a percentage per column:

percent_missing = df[['Critic_Score', 'Critic_Count', 'User_Score', 'User_Count', 'Developer', 'Year_of_Release', 'Rating']].isnull().sum() * 100 / len(df)
percent_missing
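Sorting the same series makes the worst-affected columns stand out:

# highest percentage of missing values first
percent_missing.sort_values(ascending=False)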

In the Excel pivot tables referenced further below, we found that the same 17 platforms have no critic score, critic count, user count, or user score at all. Initially I was looking to replace those missing values with the average for the platform, but I later decided that is not the smartest idea, seeing that every genre on those platforms has missing values. So the missing values are going to be replaced by the average per genre instead; a quick pandas version of that pivot check is sketched below.
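This is only a sketch of the Excel check, not the original pivot tables: count() tallies non-null entries, so platforms whose games carry no score data at all end up with zeros across the board.

score_cols = ['Critic_Score', 'Critic_Count', 'User_Score', 'User_Count']
counts_per_platform = df.groupby('Platform')[score_cols].count()
# platforms where all four score fields are entirely missing
counts_per_platform[(counts_per_platform == 0).all(axis=1)].index.tolist()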

In [7]:
df["Critic_Score"] = df.groupby("Genre")['Critic_Score'].apply(lambda x: x.fillna(x.mean()))
df["Critic_Count"] = df.groupby("Genre")['Critic_Count'].apply(lambda x: x.fillna(x.mean()))

# we could see that User_Score is an object type so in order to operate the mean we need to change that to numeric
df["User_Score"] = pd.to_numeric(df["User_Score"], errors='coerce')

df["User_Score"] = df.groupby("Genre")['User_Score'].apply(lambda x: x.fillna(x.mean()))
df["User_Count"] = df.groupby("Genre")['User_Count'].apply(lambda x: x.fillna(x.mean()))

# because I cannot take the average of the years and replace the NaN being a float and I cannot convert to integer with nan, I will
# just drop the rows with NaN

df.drop('Developer', axis=1, inplace=True)
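A quick sanity check on the imputation (a sketch; a few NaNs may remain in rows whose Genre is itself missing, since those fall outside every group):

# remaining missing values in the imputed score columns
df[['Critic_Score', 'Critic_Count', 'User_Score', 'User_Count']].isnull().sum()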

Rating being a categorical column, its missing values will be replaced with the mode, i.e. the most common value:

In [8]:
# mode() returns a Series (there can be ties), so take its first element
df['Rating'] = df['Rating'].fillna(df['Rating'].mode()[0])
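One last look at what is still missing before we export:

# remaining missing values per column; only a few sparse fields should be left
df.isnull().sum()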

Export the cleaned DataFrame for further analysis:

In [9]:
df.to_csv('video.csv')
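When picking the analysis back up, note that to_csv as called above also writes the DataFrame index as an unnamed first column; a read-back sketch:

df_clean = pd.read_csv('video.csv', index_col=0)  # index_col=0 restores the saved index
df_clean.head()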