Este notebook lo he creado para hacer pruebas de visualización de datos. Como hasta ahora solamente había hecho cursos y ejercicios guiados he creado este notebook para ir respondiendo preguntas que me hago sobre un dataset por mi cuenta sin supervisión.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('./'):
for filename in filenames:
print(os.path.join(dirname, filename))
./videogames-sales.ipynb ./.ipynb_checkpoints/Untitled-checkpoint.ipynb ./.ipynb_checkpoints/videogames-sales-checkpoint.ipynb ./input/vgsales.csv ./input/.ipynb_checkpoints/vgsales-checkpoint.csv
filepath='./input/vgsales.csv'
df = pd.read_csv(filepath, index_col='Rank')
df.shape
(16598, 10)
df.head()
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|
Rank | ||||||||||
1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
platforms = ""
for platform in df['Platform'].unique():
platforms += platform + ", "
print(platforms[:-2]) # -2 in order to remove the last ',' character
Wii, NES, GB, DS, X360, PS3, PS2, SNES, GBA, 3DS, PS4, N64, PS, XB, PC, 2600, PSP, XOne, GC, WiiU, GEN, DC, PSV, SAT, SCD, WS, NG, TG16, 3DO, GG, PCFX
wii_games = df[df['Platform'] == 'Wii']
sales_per_year = wii_games.groupby('Year').Global_Sales.sum().to_frame()
sns.lineplot(x=sales_per_year.index, y='Global_Sales', data=sales_per_year)
plt.title('Total Wii sales')
Text(0.5, 1.0, 'Total Wii sales')
sales_per_year_and_platform = df.groupby(['Year', 'Platform']).Global_Sales.sum().to_frame().reset_index()
plt.figure(figsize=(20,10))
plt.title('Sales per year and platform')
plt.xlabel('Year')
plt.ylabel('Sales')
sns.lineplot(data=sales_per_year_and_platform, x='Year', y='Global_Sales', hue='Platform')
<AxesSubplot:title={'center':'Sales per year and platform'}, xlabel='Year', ylabel='Sales'>
La idea de esta grafica era buena. Ver una comparativa de ventas de todas las plataformas a lo largo del tiempo pero una vez calculado no se puede diferenciar cada plataforma. Los colores de la leyenda se repiten porque son muchas plataformas.
A pesar de esto, podemos observar que hubo tres puntos donde despuntaron las ventas, aproximadamente: 1997, entre 2002 y 2005 y 2010
Los mapas de calor necesitan un conjunto de datos el cual todo su contenido sean números para poder comparar sus valores y asignar colores en función de esto. Dónde único se puede tener valores no numéricos es en el índice y la cabecera.
heatmap_data = sales_per_year_and_platform.pivot(columns='Year', index='Platform', values='Global_Sales').fillna(0)
heatmap_data.head()
Year | 1980.0 | 1981.0 | 1982.0 | 1983.0 | 1984.0 | 1985.0 | 1986.0 | 1987.0 | 1988.0 | 1989.0 | ... | 2009.0 | 2010.0 | 2011.0 | 2012.0 | 2013.0 | 2014.0 | 2015.0 | 2016.0 | 2017.0 | 2020.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Platform | |||||||||||||||||||||
2600 | 11.38 | 35.77 | 28.86 | 5.83 | 0.27 | 0.45 | 0.66 | 1.98 | 0.75 | 0.62 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | 0.00 |
3DO | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | 0.00 |
3DS | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 62.53 | 51.14 | 55.88 | 43.14 | 26.99 | 6.6 | 0.0 | 0.00 |
DC | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | 0.00 |
DS | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 121.99 | 87.98 | 27.80 | 11.64 | 1.96 | 0.02 | 0.00 | 0.0 | 0.0 | 0.29 |
5 rows × 39 columns
plt.figure(figsize=(60,20))
plt.title('Sales per year and platform')
plt.xlabel('Year')
plt.ylabel('Sales')
sns.heatmap(heatmap_data, annot=True)
<AxesSubplot:title={'center':'Sales per year and platform'}, xlabel='Year', ylabel='Platform'>
Como podemos ver en el gráfico anterior el groso de ventas de videojuegos ha sido por parte de las plataformas Wii, DS, X360, PS3, PS2, PS, N64, 3DS. Además de esto, vemos como estas ventas se centran desde el año 1995 hasta el 2017.
Por esta razón vamos a hacer un nuevo mapa de calor solamente con estos datos.
reduced_sales_per_year_and_platform = sales_per_year_and_platform.query('Year>1994 and Year<2018 and (Platform == "Wii" or Platform == "DS" or Platform == "X360" or Platform == "PS3" or Platform == "PS2" or Platform == "PS" or Platform == "N64" or Platform == "3DS")')
heatmap_data = reduced_sales_per_year_and_platform.pivot(columns='Year', index='Platform', values='Global_Sales').fillna(0)
heatmap_data.head()
Year | 1995.0 | 1996.0 | 1997.0 | 1998.0 | 1999.0 | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | ... | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 | 2012.0 | 2013.0 | 2014.0 | 2015.0 | 2016.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Platform | |||||||||||||||||||||
3DS | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 62.53 | 51.14 | 55.88 | 43.14 | 26.99 | 6.6 |
DS | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 17.46 | ... | 149.36 | 147.89 | 121.99 | 87.98 | 27.80 | 11.64 | 1.96 | 0.02 | 0.00 | 0.0 |
N64 | 0.00 | 34.11 | 39.51 | 49.28 | 57.96 | 34.01 | 3.26 | 0.08 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
PS | 35.92 | 94.68 | 136.08 | 169.58 | 144.57 | 96.28 | 35.52 | 6.69 | 2.05 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
PS2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 39.11 | 166.43 | 205.40 | 184.29 | 211.78 | ... | 76.00 | 53.83 | 26.45 | 5.63 | 0.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
5 rows × 22 columns
heatmap_data['total_sales_to_sort'] = heatmap_data.T.sum()
heatmap_data = heatmap_data.sort_values('total_sales_to_sort', ascending=False)
del heatmap_data['total_sales_to_sort']
heatmap_data.head()
Year | 1995.0 | 1996.0 | 1997.0 | 1998.0 | 1999.0 | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | ... | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 | 2012.0 | 2013.0 | 2014.0 | 2015.0 | 2016.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Platform | |||||||||||||||||||||
PS2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39.11 | 166.43 | 205.4 | 184.29 | 211.78 | ... | 76.00 | 53.83 | 26.45 | 5.63 | 0.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
X360 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | ... | 95.84 | 135.76 | 120.85 | 171.05 | 145.12 | 100.88 | 89.61 | 36.42 | 13.05 | 0.83 |
PS3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | ... | 73.81 | 119.69 | 132.34 | 144.42 | 159.37 | 109.49 | 117.39 | 50.96 | 18.22 | 2.59 |
Wii | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | ... | 154.97 | 174.16 | 210.44 | 131.80 | 62.41 | 22.77 | 9.36 | 4.44 | 1.55 | 0.00 |
DS | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.00 | 17.46 | ... | 149.36 | 147.89 | 121.99 | 87.98 | 27.80 | 11.64 | 1.96 | 0.02 | 0.00 | 0.00 |
5 rows × 22 columns
plt.figure(figsize=(30,10))
plt.title('Sales per year and platform')
plt.xlabel('Year')
plt.ylabel('Sales')
# platforms_to_show = ['Wii', 'DS', 'X360', 'PS3', 'PS2', 'PS', 'N64', '3DS']
# years_to_show = range(1995, 2017)
sns.heatmap(heatmap_data, annot=True)
<AxesSubplot:title={'center':'Sales per year and platform'}, xlabel='Year', ylabel='Platform'>
La query que hacemos para obtener estos datos no es nada "elegante" escribimos muchos or para las distintas plataformas. ¿Se puede mejorar de alguna forma?
plt.figure(figsize=(20,10))
plt.title('Sales per year and platform')
plt.xlabel('Year')
plt.ylabel('Sales')
sns.lineplot(data=reduced_sales_per_year_and_platform, x='Year', y='Global_Sales', hue='Platform')
<AxesSubplot:title={'center':'Sales per year and platform'}, xlabel='Year', ylabel='Sales'>
Grafico de puntos por genero y ventas totales, agrupados por color dependiendo de la plataforma
Queremos saber si hay una relacion entre el genero y la plataforma. Para esto tenemos que agrupar las ventas globales por plataforma y genero. De tal manera, que solamente tengamos una entrada de datos con una plataforma y un genero, no debe repetirse esta combinacion.
relevant_data = df.query('Year>1994 and Year<2018 and (Platform == "Wii" or Platform == "DS" or Platform == "X360" or Platform == "PS3" or Platform == "PS2" or Platform == "PS" or Platform == "N64" or Platform == "3DS")')
relevant_data.head()
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|
Rank | ||||||||||
1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
7 | New Super Mario Bros. | DS | 2006.0 | Platform | Nintendo | 11.38 | 9.23 | 6.50 | 2.90 | 30.01 |
8 | Wii Play | Wii | 2006.0 | Misc | Nintendo | 14.03 | 9.20 | 2.93 | 2.85 | 29.02 |
genre_platform_group = relevant_data.groupby(['Genre', 'Platform']).Global_Sales.sum().reset_index()
genre_platform_group.shape
(96, 3)
Con la operación groupby
juntamos por la unión de estos dos campos Genre
Platform
, y sumamos el resto de valores de la fila. En este caso, las ventas globales. Para evitar que cree un índice nuevo con estos dos campos por los que agrupamos usamos la función reset_index
. De esta forma los campos por los que hacemos la unión quedan accesibles mediante su columna correspondiente.
genre_platform_group.head()
Genre | Platform | Global_Sales | |
---|---|---|---|
0 | Action | 3DS | 56.61 |
1 | Action | DS | 114.17 |
2 | Action | N64 | 29.44 |
3 | Action | PS | 125.74 |
4 | Action | PS2 | 272.43 |
action_games = genre_platform_group[genre_platform_group['Genre'] == 'Action'].sort_values("Global_Sales", ascending=False)
plt.figure(figsize=(20,10))
plt.title("Global sales of action games per paltform")
plt.ylabel("Global total sales")
plt.xlabel("Platform")
sns.barplot(data=action_games, x='Platform', y='Global_Sales')
<AxesSubplot:title={'center':'Global sales of action games per paltform'}, xlabel='Platform', ylabel='Global_Sales'>
wii_games = genre_platform_group[genre_platform_group['Platform'] == 'Wii'].sort_values("Global_Sales", ascending=False)
plt.figure(figsize=(20,10))
plt.title("Global sales of action games per paltform")
plt.ylabel("Global total sales")
plt.xlabel("Genre")
sns.barplot(data=wii_games, x='Genre', y='Global_Sales')
<AxesSubplot:title={'center':'Global sales of action games per paltform'}, xlabel='Genre', ylabel='Global_Sales'>
En este caso el pico de ventas en juegos de deportes en la plataforma Wii se debe a que el juego "Wii sports" se "vendia" junto con la consola.
genre_platform_group.head()
Genre | Platform | Global_Sales | |
---|---|---|---|
0 | Action | 3DS | 56.61 |
1 | Action | DS | 114.17 |
2 | Action | N64 | 29.44 |
3 | Action | PS | 125.74 |
4 | Action | PS2 | 272.43 |
# funcion para obtener el genero que mas ventas tiene por plataforma
result = genre_platform_group.groupby('Platform').Global_Sales.max()
result.head()
Platform 3DS 75.71 DS 137.67 N64 40.09 PS 125.74 PS2 272.43 Name: Global_Sales, dtype: float64