In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

Aqui eu estou apenas definindo algumas questões estáticas na apresentação dos gráficos

In [2]:
sns.set_style('whitegrid')
sns.set_palette('flare_r')
sns.color_palette('flare_r')
Out[2]:

Afim de me aprofundar um pouco mais em data science, regressões e machine learning, baixei o conjunto de dados disponibilizado nesse link (por Nik Davis):
https://www.kaggle.com/nikdavis/steam-store-games

Dentro do arquivo de download, haviam 6 arquivos .csv. Dessa forma, vou primeiro dar uma olhadinha em cada 1 deles para ter mais noção do tipo e variação de dados que tenho a disposição.

In [3]:
df1 = pd.read_csv('Dados Brutos/steam.csv')
In [4]:
df1.head()
Out[4]:
appid name release_date english developer publisher platforms required_age categories genres steamspy_tags achievements positive_ratings negative_ratings average_playtime median_playtime owners price
0 10 Counter-Strike 2000-11-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 124534 3339 17612 317 10000000-20000000 7.19
1 20 Team Fortress Classic 1999-04-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 3318 633 277 62 5000000-10000000 3.99
2 30 Day of Defeat 2003-05-01 1 Valve Valve windows;mac;linux 0 Multi-player;Valve Anti-Cheat enabled Action FPS;World War II;Multiplayer 0 3416 398 187 34 5000000-10000000 3.99
3 40 Deathmatch Classic 2001-06-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 1273 267 258 184 5000000-10000000 3.99
4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software Valve windows;mac;linux 0 Single-player;Multi-player;Valve Anti-Cheat en... Action FPS;Action;Sci-fi 0 5250 288 624 415 5000000-10000000 3.99
In [5]:
df1.shape
Out[5]:
(27075, 18)
In [6]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27075 non-null  object 
 5   publisher         27075 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object 
 17  price             27075 non-null  float64
dtypes: float64(1), int64(8), object(9)
memory usage: 3.7+ MB
In [7]:
df1.describe()
Out[7]:
appid english required_age achievements positive_ratings negative_ratings average_playtime median_playtime price
count 2.707500e+04 27075.000000 27075.000000 27075.000000 2.707500e+04 27075.000000 27075.000000 27075.00000 27075.000000
mean 5.962035e+05 0.981127 0.354903 45.248864 1.000559e+03 211.027147 149.804949 146.05603 6.078193
std 2.508942e+05 0.136081 2.406044 352.670281 1.898872e+04 4284.938531 1827.038141 2353.88008 7.874922
min 1.000000e+01 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.00000 0.000000
25% 4.012300e+05 1.000000 0.000000 0.000000 6.000000e+00 2.000000 0.000000 0.00000 1.690000
50% 5.990700e+05 1.000000 0.000000 7.000000 2.400000e+01 9.000000 0.000000 0.00000 3.990000
75% 7.987600e+05 1.000000 0.000000 23.000000 1.260000e+02 42.000000 0.000000 0.00000 7.190000
max 1.069460e+06 1.000000 18.000000 9821.000000 2.644404e+06 487076.000000 190625.000000 190625.00000 421.990000
In [8]:
df2 = pd.read_csv('Dados Brutos/steam_description_data.csv')
In [9]:
df2.head(10)
Out[9]:
steam_appid detailed_description about_the_game short_description
0 10 Play the world's number 1 online action game. ... Play the world's number 1 online action game. ... Play the world's number 1 online action game. ...
1 20 One of the most popular online action games of... One of the most popular online action games of... One of the most popular online action games of...
2 30 Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ... Enlist in an intense brand of Axis vs. Allied ...
3 40 Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death... Enjoy fast-paced multiplayer gaming with Death...
4 50 Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ... Return to the Black Mesa Research Facility as ...
5 60 A futuristic action game that challenges your ... A futuristic action game that challenges your ... A futuristic action game that challenges your ...
6 70 Named Game of the Year by over 50 publications... Named Game of the Year by over 50 publications... Named Game of the Year by over 50 publications...
7 80 With its extensive Tour of Duty campaign, a ne... With its extensive Tour of Duty campaign, a ne... With its extensive Tour of Duty campaign, a ne...
8 130 Made by Gearbox Software and originally releas... Made by Gearbox Software and originally releas... Made by Gearbox Software and originally releas...
9 220 1998. HALF-LIFE sends a shock through the game... 1998. HALF-LIFE sends a shock through the game... 1998. HALF-LIFE sends a shock through the game...
In [10]:
df2.shape
Out[10]:
(27334, 4)
In [11]:
df2.describe()
Out[11]:
steam_appid
count 2.733400e+04
mean 5.982886e+05
std 2.512113e+05
min 1.000000e+01
25% 4.030925e+05
50% 6.019650e+05
75% 8.011750e+05
max 1.069460e+06
In [12]:
df3 = pd.read_csv('Dados Brutos/steam_media_data.csv')
In [13]:
df3.head()
Out[13]:
steam_appid header_image screenshots background movies
0 10 https://steamcdn-a.akamaihd.net/steam/apps/10/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/10/... NaN
1 20 https://steamcdn-a.akamaihd.net/steam/apps/20/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/20/... NaN
2 30 https://steamcdn-a.akamaihd.net/steam/apps/30/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/30/... NaN
3 40 https://steamcdn-a.akamaihd.net/steam/apps/40/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/40/... NaN
4 50 https://steamcdn-a.akamaihd.net/steam/apps/50/... [{'id': 0, 'path_thumbnail': 'https://steamcdn... https://steamcdn-a.akamaihd.net/steam/apps/50/... NaN
In [14]:
df4 = pd.read_csv('Dados Brutos/steam_requirements_data.csv')
In [15]:
df4.head()
Out[15]:
steam_appid pc_requirements mac_requirements linux_requirements minimum recommended
0 10 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
1 20 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
2 30 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
3 40 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
4 50 {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... {'minimum': 'Minimum: OS X Snow Leopard 10.6.... {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... 500 mhz processor, 96mb ram, 16mb video card, ... NaN
In [16]:
df5 = pd.read_csv('Dados Brutos/steam_support_info.csv')
In [17]:
df5.head()
Out[17]:
steam_appid website support_url support_email
0 10 NaN http://steamcommunity.com/app/10 NaN
1 30 http://www.dayofdefeat.com/ NaN NaN
2 50 NaN https://help.steampowered.com NaN
3 70 http://www.half-life.com/ http://steamcommunity.com/app/70 NaN
4 80 NaN http://steamcommunity.com/app/80 NaN
In [18]:
df6 = pd.read_csv('Dados Brutos/steamspy_tag_data.csv')
In [19]:
df6.head()
Out[19]:
appid 1980s 1990s 2.5d 2d 2d_fighter 360_video 3d 3d_platformer 3d_vision ... warhammer_40k web_publishing werewolves western word_game world_war_i world_war_ii wrestling zombies e_sports
0 10 144 564 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 550
1 20 0 71 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 30 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 5 122 0 0 0
3 40 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 50 0 77 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 372 columns

Deu pra dar uma boa olhada nas bases. Provavelmente a mais útil pro que tenho em mente será a primeira mesmo, porém, acho que existe bastante espaço pra fazer algum tipo de nuvem de palavras com a segunda, que contém descrições dos jogos, e talvez dê pra ter uma noção de qual a configuração ideal de computador pra rodar jogos da steam usando essas informações da quarta base.

Analisando o primeiro dataframe

É possível notar que, na primeira base, existe um campo de data. Ele está com a tipagem de 'object', porém, acredito que faça mais sentido avalia-lo como data.

In [20]:
df = df1.copy()
In [21]:
df.head()
Out[21]:
appid name release_date english developer publisher platforms required_age categories genres steamspy_tags achievements positive_ratings negative_ratings average_playtime median_playtime owners price
0 10 Counter-Strike 2000-11-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 124534 3339 17612 317 10000000-20000000 7.19
1 20 Team Fortress Classic 1999-04-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 3318 633 277 62 5000000-10000000 3.99
2 30 Day of Defeat 2003-05-01 1 Valve Valve windows;mac;linux 0 Multi-player;Valve Anti-Cheat enabled Action FPS;World War II;Multiplayer 0 3416 398 187 34 5000000-10000000 3.99
3 40 Deathmatch Classic 2001-06-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action Action;FPS;Multiplayer 0 1273 267 258 184 5000000-10000000 3.99
4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software Valve windows;mac;linux 0 Single-player;Multi-player;Valve Anti-Cheat en... Action FPS;Action;Sci-fi 0 5250 288 624 415 5000000-10000000 3.99

Aparentemente, a primeira coluna possui os dados de identificação, então, provavelmente só sera utilizada para eventuais joins.
Já a segunda coluna apresenta os nomes dos jogos. Muito provavelmente só usaremos para identificação também, dessa forma...
Podemos começar a avaliar as colunas a partir da terceira, que possui data.

In [22]:
df['release_date'] = pd.to_datetime(df['release_date'])
In [24]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   appid             27075 non-null  int64         
 1   name              27075 non-null  object        
 2   release_date      27075 non-null  datetime64[ns]
 3   english           27075 non-null  int64         
 4   developer         27075 non-null  object        
 5   publisher         27075 non-null  object        
 6   platforms         27075 non-null  object        
 7   required_age      27075 non-null  int64         
 8   categories        27075 non-null  object        
 9   genres            27075 non-null  object        
 10  steamspy_tags     27075 non-null  object        
 11  achievements      27075 non-null  int64         
 12  positive_ratings  27075 non-null  int64         
 13  negative_ratings  27075 non-null  int64         
 14  average_playtime  27075 non-null  int64         
 15  median_playtime   27075 non-null  int64         
 16  owners            27075 non-null  object        
 17  price             27075 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(8), object(8)
memory usage: 3.7+ MB

Agora que o campo 'release_date' está com a tipagem correta, vamos dar uma olhada nas datas e tentar descobrir algumas informações acerca da idade dos jogos.

In [25]:
df['release_date'].max()
Out[25]:
Timestamp('2019-05-01 00:00:00')
In [26]:
df['release_date'].min()
Out[26]:
Timestamp('1997-06-30 00:00:00')
In [27]:
df[df['release_date'] == df['release_date'].max()]
Out[27]:
appid name release_date english developer publisher platforms required_age categories genres steamspy_tags achievements positive_ratings negative_ratings average_playtime median_playtime owners price
23726 905370 Conqueror's Blade 2019-05-01 1 Booming Games My.com windows 0 Online Multi-Player;MMO;Online Co-op;In-App Pu... Action;Free to Play;Massively Multiplayer;RPG;... Action;Strategy;RPG 0 259 235 0 0 0-20000 11.39
In [28]:
df[df['release_date'] == df['release_date'].min()]
Out[28]:
appid name release_date english developer publisher platforms required_age categories genres steamspy_tags achievements positive_ratings negative_ratings average_playtime median_playtime owners price
2685 282010 Carmageddon Max Pack 1997-06-30 1 Stainless Games Ltd THQ Nordic windows 0 Single-player;Multi-player;Steam Trading Cards Action;Indie;Racing Racing;Action;Classic 0 352 35 13 13 50000-100000 5.99

Aqui é interessante ter uma ideia da distribuição dos jogos ao longo dos anos. Podemos supor, em um primeiro momento, que, uma vez que a Steam foi lançada por volta de 2003, e, começou a incluir jogos de plataformas como Linux e MacOS depois de 2009, provavelmente exista uma maior distribuição de jogos conforme os anos avançam.

alt alt
PT-BR EN
In [29]:
df['Year'] = df['release_date'].apply(lambda x: x.year)
In [30]:
ax = sns.histplot(data=df, x='Year', stat='probability', discrete=True)
ax.figure.set_size_inches(15,6)

Esse dataset foi extraido durante o mês de maio de 2019, por isso, é notável essa queda ao fim do gráfico (podemos imaginar que a coluna referente ao ano de 2019 seria maior se os dados tivessem sido extraidos em dezembro). Podemos dar uma olhada na quarta coluna

In [31]:
df['english'].unique()
Out[31]:
array([1, 0], dtype=int64)

Aparentemente, só possui 2 valores, indicando se o jogo tem suporte a lingua inglesa ou não. Vale a pena dar uma olhada nos jogos que não possuem esse suporte.

In [32]:
df[df['english'] == 0]
Out[32]:
appid name release_date english developer publisher platforms required_age categories genres steamspy_tags achievements positive_ratings negative_ratings average_playtime median_playtime owners price Year
1297 207370 eXceed - Gun Bullet Children 2012-08-02 0 Flat Software Nyu Media windows 0 Single-player;Steam Trading Cards Action;Indie Bullet Hell;Anime;Shoot 'Em Up 0 204 215 97 108 50000-100000 1.99 2012
2221 262300 Tsukumogami 2012-05-31 0 TORaIKI Fruitbat Factory windows 0 Single-player Adventure;Indie;RPG RPG;Indie;Adventure 49 33 17 0 0 50000-100000 22.99 2012
3001 292990 NOBUNAGA'S AMBITION: Souzou (Traditional Chine... 2014-08-01 0 KOEI TECMO GAMES CO., LTD. KOEI TECMO GAMES CO., LTD. windows 0 Single-player;Steam Cloud Strategy Strategy;Historical 0 413 74 0 0 20000-50000 17.99 2014
3085 295950 新流星搜劍錄 2017-12-20 0 Titans Titans windows 18 Multi-player;Online Multi-Player;MMO Violent;Action;Adventure;Indie;Massively Multi... Action;RPG;Fighting 0 664 323 0 0 0-20000 15.49 2017
3745 318300 Phineas and Ferb: New Inventions 2015-02-24 0 PIPE Studio Disney Interactive windows 0 Single-player Adventure Adventure;Psychological Horror;Sexual Content 0 5 7 0 0 0-20000 15.49 2015
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27021 1056010 Не падай! 2019-04-17 0 SharF Co. SharF Co. windows 0 Single-player;Steam Achievements Casual;Indie;Racing;Simulation Casual;Indie;Simulation 3 1 1 0 0 0-20000 1.69 2019
27024 1056500 VRLife 2019-04-11 0 Wuhan Mandrill VR Co., Ltd. Wuhan Mandrill VR Co., Ltd. windows 0 Online Multi-Player;MMO Casual;Free to Play;Massively Multiplayer;RPG;... Early Access;Free to Play;Massively Multiplayer 0 16 3 0 0 0-20000 0.00 2019
27026 1056710 VTB Basketball League VR 2019-04-12 0 Axyos Games LikeVR windows 0 Single-player Casual;Free to Play;Sports Free to Play;Casual;Sports 0 3 0 0 0 0-20000 0.00 2019
27036 1058000 Rain's love memory-雨的恋记 2019-04-19 0 the original air the original air windows 0 Single-player Free to Play Free to Play;Visual Novel 0 33 2 0 0 0-20000 0.00 2019
27060 1062240 地铁:恐怖末班车 2019-04-25 0 QZQ Studio H.P.Y.S,LLC windows 0 Single-player Action;Adventure;Indie;RPG Action;Adventure;RPG 0 0 2 0 0 0-20000 2.09 2019

511 rows × 19 columns

In [33]:
df[df['english'] == 0].count()
Out[33]:
appid               511
name                511
release_date        511
english             511
developer           511
publisher           511
platforms           511
required_age        511
categories          511
genres              511
steamspy_tags       511
achievements        511
positive_ratings    511
negative_ratings    511
average_playtime    511
median_playtime     511
owners              511
price               511
Year                511
dtype: int64

Boa parte deles parece ter vindo de países asiáticos. Interessante. Existe a possibilidade de, devido a esses jogos serem produzidos em países asiáticos, o valor dos mesmos seja menor, uma vez que, na média, o valor das moedas asiaticas é menor que da moeda americana ou européia. Além disso, existe também a possibilidade de a ausência de suporte a lingua inglesa indicar limitações no capital investido na confecção do jogo.
Podemos dar uma olhada melhor nesse seleto grupo em outro momento, por enquanto, vamos prosseguir.

In [34]:
df['developer'].unique()
Out[34]:
array(['Valve', 'Gearbox Software', 'Valve;Hidden Path Entertainment',
       ..., 'SHEN JIAWEI', 'Semyon Maximov', 'Adept Studios GD'],
      dtype=object)
In [35]:
df['developer'].mode()
Out[35]:
0    Choice of Games
dtype: object

Essa moda não reflete, de fato, a moda real do conjunto de desenvolvedores, já que vários deles estão concatenados na mesma coluna.

In [36]:
desenv_split = df['developer'].apply(lambda x: pd.Series(x.split(';')))
In [37]:
desenv_split
Out[37]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 Valve NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Valve NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Valve NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Valve NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Gearbox Software NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27070 SHEN JIAWEI NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27071 Semyon Maximov NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27072 EntwicklerX NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27073 Yustas Game Studio NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27074 Adept Studios GD NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

27075 rows × 15 columns

In [38]:
desenv = []
for index, row in desenv_split.iterrows():
    for i in desenv_split.columns:
        desenv.append(row[i])

(Essa foi a solução que eu achei para listar todas as desenvolvedoras. Não foi das mais elegantes, então, se você tiver alguma ideia pra substituir essa monstruosidade, eu estou aberto a sugestões)

In [39]:
desenv = pd.Series(desenv)
In [40]:
desenv.dropna(inplace=True)
In [41]:
desenv.mode()
Out[41]:
0    Choice of Games
dtype: object

Agora temos a verdadeira moda (que, ironicamente, era a mesma de antes).

Infelizmente, esses dados são muito variados, então, acho improvável transforma-los em variáveis binárias ou ordinais. Dessa forma, existe um maior proveito em simplesmente tentar identificar algumas desenvolvedoras mais proeminentes.

In [42]:
desenv.value_counts().head(10)
Out[42]:
Choice of Games               94
KOEI TECMO GAMES CO., LTD.    75
Ripknot Systems               62
Laush Dmitriy Sergeevich      51
Nikita "Ghost_RUS"            51
Dexion Games                  45
Feral Interactive (Mac)       44
RewindApp                     44
Hosted Games                  42
Blender Games                 40
dtype: int64
In [43]:
desenv.value_counts().tail(10)
Out[43]:
Giant Box Games        1
Nuclear Tales          1
Frost Dog Studio       1
Chicken in The Corn    1
Lee Fuller             1
MegawattsCo            1
Rockstar Leeds         1
Redro Games            1
Andy Campbell          1
Toothless T-Rex        1
dtype: int64

Provavelmente teremos o mesmo problema com a publisher.

In [44]:
df['publisher'].unique()
Out[44]:
array(['Valve', 'Mark Healey', 'Tripwire Interactive', ..., 'MonteCube',
       'Velvet Paradise Games', 'SHEN JIAWEI'], dtype=object)
In [45]:
df['publisher'].mode()
Out[45]:
0    Big Fish Games
dtype: object
In [46]:
publi_split = df['publisher'].apply(lambda x: pd.Series(x.split(';')))
In [47]:
publi_split
Out[47]:
0 1 2 3 4
0 Valve NaN NaN NaN NaN
1 Valve NaN NaN NaN NaN
2 Valve NaN NaN NaN NaN
3 Valve NaN NaN NaN NaN
4 Valve NaN NaN NaN NaN
... ... ... ... ... ...
27070 SHEN JIAWEI NaN NaN NaN NaN
27071 BekkerDev Studio NaN NaN NaN NaN
27072 EntwicklerX NaN NaN NaN NaN
27073 Alawar Entertainment NaN NaN NaN NaN
27074 Alawar Entertainment NaN NaN NaN NaN

27075 rows × 5 columns

In [48]:
publi = []
for index, row in publi_split.iterrows():
    for i in publi_split.columns:
        publi.append(row[i])
In [49]:
publi = pd.Series(publi)
In [50]:
publi.dropna(inplace=True)
In [51]:
publi.mode()
Out[51]:
0    Big Fish Games
dtype: object
In [52]:
publi.value_counts().head(10)
Out[52]:
Big Fish Games       212
Strategy First       136
THQ Nordic           119
Ubisoft              114
Plug In Digital      111
Square Enix          107
Nightdive Studios    101
1C Entertainment      96
Sekai Project         96
Choice of Games       94
dtype: int64
In [53]:
publi.value_counts().tail(10)
Out[53]:
Anatola Howard        1
Team Frizz            1
Pixel Wonder          1
Wispfire              1
Arif Games            1
Sergey Sevostyanov    1
Bullet Bears          1
FYRG                  1
Frogmouth             1
Mindwedge             1
dtype: int64

Algumas dessas informações são bem interessantes. Particularmente, eu não conhecia algumas dessas empresas.

Agora, seguiremos para avaliar as plataformas

In [54]:
df['platforms'].unique()
Out[54]:
array(['windows;mac;linux', 'windows;mac', 'windows', 'windows;linux',
       'mac', 'mac;linux', 'linux'], dtype=object)

Meu palpite é de que, jogos que estão disponíveis em mais plataformas provavelmente venderiam mais, o que pode acarretar no número maior de owners.

Pensando em como tratar esse campo, talvez seja interessante criar 3 novas colunas de valor binário, pra determinar se um dado jogo está disponível para essa plataforma ou não.

In [55]:
df['platforms_windows'] = df['platforms'].str.contains('windows', regex=False).astype(int)
df['platforms_mac'] = df['platforms'].str.contains('mac', regex=False).astype(int)
df['platforms_linux'] = df['platforms'].str.contains('linux', regex=False).astype(int)
In [56]:
df.tail(10)
Out[56]:
appid name release_date english developer publisher platforms required_age categories genres ... positive_ratings negative_ratings average_playtime median_playtime owners price Year platforms_windows platforms_mac platforms_linux
27065 1063560 12 HOURS 2019-04-24 1 Second Reality Second Reality windows 0 Single-player Action;Indie;Simulation;Early Access ... 9 1 0 0 0-20000 2.09 2019 1 0 0
27066 1064060 The Mystery of Bikini Island 2019-04-24 1 Velvet Paradise Games Velvet Paradise Games windows 0 Single-player Adventure;Casual;Indie;RPG;Early Access ... 3 0 0 0 0-20000 0.79 2019 1 0 0
27067 1064580 CaptainMarlene 2019-04-23 1 Dnovel Dnovel windows 0 Single-player Adventure;Indie;Early Access ... 13 0 0 0 0-20000 9.29 2019 1 0 0
27068 1064890 Old Edge II 2019-04-23 1 LTrust LTrust windows 0 Single-player Adventure;Casual;Indie;Simulation ... 4 0 0 0 0-20000 0.79 2019 1 0 0
27069 1065160 Deatherem 2019-04-23 1 Anima Team Syukino windows 0 Single-player;Steam Achievements Indie ... 2 2 0 0 0-20000 2.09 2019 1 0 0
27070 1065230 Room of Pandora 2019-04-24 1 SHEN JIAWEI SHEN JIAWEI windows 0 Single-player;Steam Achievements Adventure;Casual;Indie ... 3 0 0 0 0-20000 2.09 2019 1 0 0
27071 1065570 Cyber Gun 2019-04-23 1 Semyon Maximov BekkerDev Studio windows 0 Single-player Action;Adventure;Indie ... 8 1 0 0 0-20000 1.69 2019 1 0 0
27072 1065650 Super Star Blast 2019-04-24 1 EntwicklerX EntwicklerX windows 0 Single-player;Multi-player;Co-op;Shared/Split ... Action;Casual;Indie ... 0 1 0 0 0-20000 3.99 2019 1 0 0
27073 1066700 New Yankee 7: Deer Hunters 2019-04-17 1 Yustas Game Studio Alawar Entertainment windows;mac 0 Single-player;Steam Cloud Adventure;Casual;Indie ... 2 0 0 0 0-20000 5.19 2019 1 1 0
27074 1069460 Rune Lord 2019-04-24 1 Adept Studios GD Alawar Entertainment windows;mac 0 Single-player;Steam Cloud Adventure;Casual;Indie ... 4 0 0 0 0-20000 5.19 2019 1 1 0

10 rows × 22 columns

Essa transformação nos permite, também, ter uma noção de quanto jogos estão disponíveis em cada plataforma, simplesmente efetuando uma soma.

In [57]:
df['platforms_windows'].sum()
Out[57]:
27070
In [58]:
df['platforms_mac'].sum()
Out[58]:
8066
In [59]:
df['platforms_linux'].sum()
Out[59]:
5235

Aparentemente, existem mais jogos disponíveis para windows, e, por último, para linux. Acho que é interessante calcular a correlação entre essas 3 variáveis, pra evitar problemas de multicolinearidade futuros.

In [60]:
df[['platforms_windows','platforms_mac','platforms_linux']].corr()
Out[60]:
platforms_windows platforms_mac platforms_linux
platforms_windows 1.000000 -0.014920 -0.007111
platforms_mac -0.014920 1.000000 0.626648
platforms_linux -0.007111 0.626648 1.000000

Existe uma correlação razoável entre mac e linux. Minha hipótese é a de que, uma vez que windows possui mais entradas, provavelmente quando uma desenvolvedora decide produzir para alguma plataforma, ela opta pelo windows, e, quando quer desenvolver para mais de uma, acaba por optar por desenvolver para linux E mac, e não linux OU mac. Podemos dar uma olhada novamente na coluna platforms, pra ter uma ideia melhor.

In [61]:
df[df['platforms'] == 'windows;mac;linux'].shape[0]
Out[61]:
4623
In [62]:
df[df['platforms'] == 'mac;linux'].shape[0]
Out[62]:
1
In [63]:
df[df['platforms'] == 'windows;mac'].shape[0]
Out[63]:
3439
In [64]:
df[df['platforms'] == 'windows;linux'].shape[0]
Out[64]:
610

Como eu pensei. Apesar de poucos jogos serem lançados somente para mac e linux (1), boa parte deles é lançado juntamente com windows. Isso significa que, uma vez que o jogo saia para linux, é esperado também que saia para mac, e o inverso é, em menor grau, também verdade.

Agora, vamos ver a variável 'required_age'

In [65]:
df['required_age'].unique()
Out[65]:
array([ 0, 18, 16, 12,  7,  3], dtype=int64)

O campo 'required_age' está em formato numérico, o que pode facilitar a nossa eventual regressão.

In [66]:
ax = sns.histplot(data=df, x='required_age', discrete=True, stat='probability')
ax.figure.set_size_inches(15,6)
ax.set(xticks=range(0,20,2))
Out[66]:
[[<matplotlib.axis.XTick at 0x2c7e274f9a0>,
  <matplotlib.axis.XTick at 0x2c7e274fd30>,
  <matplotlib.axis.XTick at 0x2c7e276a070>,
  <matplotlib.axis.XTick at 0x2c7e277a4f0>,
  <matplotlib.axis.XTick at 0x2c7e277aa90>,
  <matplotlib.axis.XTick at 0x2c7e277f100>,
  <matplotlib.axis.XTick at 0x2c7e277f6a0>,
  <matplotlib.axis.XTick at 0x2c7e277fe80>,
  <matplotlib.axis.XTick at 0x2c7e2784130>,
  <matplotlib.axis.XTick at 0x2c7e277f2e0>]]

Parace que a grande maioria dos jogos na nossa amostra não requer idade mínima. Pode ser interessante, mais pra frente, dar uma olhada na quantidade de pessoas que detem jogos nas últimas duas faixas de idade, por curiosidade.

Agora, vamos prosseguir para a próxima variável.

In [67]:
df['categories'].unique()
Out[67]:
array(['Multi-player;Online Multi-Player;Local Multi-Player;Valve Anti-Cheat enabled',
       'Multi-player;Valve Anti-Cheat enabled',
       'Single-player;Multi-player;Valve Anti-Cheat enabled', ...,
       'Online Multi-Player;Steam Achievements;Full controller support;In-App Purchases;Steam Cloud',
       'Multi-player;Local Multi-Player;Co-op;Local Co-op;Shared/Split Screen',
       'Multi-player;Online Multi-Player;Cross-Platform Multiplayer;Stats'],
      dtype=object)
In [68]:
len(df['categories'].unique().tolist())
Out[68]:
3333

Parece que existe uma grande quantidade de variações de categorias. Vou tentar utilizar a mesma técnica que utilizamos para separar developers e publishers, e, dependendo de como ficar, talvez seja viável transforma-las em variáveis binárias, como no caso das platforms.

In [69]:
categ_split = df['categories'].apply(lambda x: pd.Series(x.split(';')))
In [70]:
categ_split
Out[70]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 Multi-player Online Multi-Player Local Multi-Player Valve Anti-Cheat enabled NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Multi-player Online Multi-Player Local Multi-Player Valve Anti-Cheat enabled NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Multi-player Valve Anti-Cheat enabled NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Multi-player Online Multi-Player Local Multi-Player Valve Anti-Cheat enabled NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Single-player Multi-player Valve Anti-Cheat enabled NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27070 Single-player Steam Achievements NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27071 Single-player NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27072 Single-player Multi-player Co-op Shared/Split Screen Steam Achievements Full controller support Steam Cloud NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27073 Single-player Steam Cloud NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27074 Single-player Steam Cloud NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

27075 rows × 18 columns

Parece que um mesmo jogo pode ter até 18 categorias diferentes. Apesar de serem muitas, eu acredito que isso possa influenciar tanto a quantidade de pessoas que detem o jogo, como o seu preço. Acho que vou ter que criar colunas binárias pra eles então...

In [71]:
categ = []
for index, row in categ_split.iterrows():
    for i in categ_split.columns:
        categ.append(row[i])
In [72]:
categ = pd.Series(categ)
In [73]:
categ.dropna(inplace=True)
In [74]:
categ.unique()
Out[74]:
array(['Multi-player', 'Online Multi-Player', 'Local Multi-Player',
       'Valve Anti-Cheat enabled', 'Single-player', 'Steam Cloud',
       'Steam Achievements', 'Steam Trading Cards', 'Captions available',
       'Partial Controller Support', 'Includes Source SDK',
       'Cross-Platform Multiplayer', 'Stats', 'Commentary available',
       'Includes level editor', 'Steam Workshop', 'In-App Purchases',
       'Co-op', 'Full controller support', 'Steam Leaderboards',
       'SteamVR Collectibles', 'Online Co-op', 'Shared/Split Screen',
       'Local Co-op', 'MMO', 'VR Support', 'Mods', 'Mods (require HL2)',
       'Steam Turn Notifications'], dtype=object)
In [75]:
len(categ.unique())
Out[75]:
29

Parece que me precipitei. Essencialmente, existem 29 categorias diferentes, só nesta amostra. Talvez seja sim possível que um jogo tenha mais de 18 categorias então. POrém, é perceptível que algumas dessas categorias parecem um pouco redundantes. Talvez seja interessante calcular correlação entre elas. Também pode ser interessante calcular a quantidade de categorias que cada jogo tem, já que o simples número de categorias pode ser um indicador também.

In [76]:
categ.mode()
Out[76]:
0    Single-player
dtype: object
In [77]:
categ.value_counts().head(10)
Out[77]:
Single-player                 25678
Steam Achievements            14130
Steam Trading Cards            7918
Steam Cloud                    7219
Full controller support        5695
Partial Controller Support     4234
Multi-player                   3974
Steam Leaderboards             3439
Online Multi-Player            2487
Shared/Split Screen            2152
dtype: int64
In [78]:
categ.value_counts().tail(10)
Out[78]:
In-App Purchases            690
MMO                         421
VR Support                  231
Commentary available        144
Valve Anti-Cheat enabled     94
Steam Turn Notifications     63
SteamVR Collectibles         40
Includes Source SDK          35
Mods                          2
Mods (require HL2)            1
dtype: int64
In [79]:
categ.unique().tolist()
Out[79]:
['Multi-player',
 'Online Multi-Player',
 'Local Multi-Player',
 'Valve Anti-Cheat enabled',
 'Single-player',
 'Steam Cloud',
 'Steam Achievements',
 'Steam Trading Cards',
 'Captions available',
 'Partial Controller Support',
 'Includes Source SDK',
 'Cross-Platform Multiplayer',
 'Stats',
 'Commentary available',
 'Includes level editor',
 'Steam Workshop',
 'In-App Purchases',
 'Co-op',
 'Full controller support',
 'Steam Leaderboards',
 'SteamVR Collectibles',
 'Online Co-op',
 'Shared/Split Screen',
 'Local Co-op',
 'MMO',
 'VR Support',
 'Mods',
 'Mods (require HL2)',
 'Steam Turn Notifications']
In [80]:
df_corr = pd.DataFrame()
df_corr
Out[80]:
In [81]:
for i in categ.unique().tolist():
    nome = 'categories_'+i
    df[nome] = df['categories'].str.contains(i, regex=False).astype(int)
    df_corr[nome] = df['categories'].str.contains(i, regex=False).astype(int)
In [82]:
df_corr.columns
Out[82]:
Index(['categories_Multi-player', 'categories_Online Multi-Player',
       'categories_Local Multi-Player', 'categories_Valve Anti-Cheat enabled',
       'categories_Single-player', 'categories_Steam Cloud',
       'categories_Steam Achievements', 'categories_Steam Trading Cards',
       'categories_Captions available',
       'categories_Partial Controller Support',
       'categories_Includes Source SDK',
       'categories_Cross-Platform Multiplayer', 'categories_Stats',
       'categories_Commentary available', 'categories_Includes level editor',
       'categories_Steam Workshop', 'categories_In-App Purchases',
       'categories_Co-op', 'categories_Full controller support',
       'categories_Steam Leaderboards', 'categories_SteamVR Collectibles',
       'categories_Online Co-op', 'categories_Shared/Split Screen',
       'categories_Local Co-op', 'categories_MMO', 'categories_VR Support',
       'categories_Mods', 'categories_Mods (require HL2)',
       'categories_Steam Turn Notifications'],
      dtype='object')
In [83]:
df['categories_QTD'] = 1+ df['categories'].str.contains(';', regex=False).astype(int)
df_corr['categories_QTD'] = 1+ df['categories'].str.contains(';', regex=False).astype(int)
In [84]:
df_corr.head(10)
Out[84]:
categories_Multi-player categories_Online Multi-Player categories_Local Multi-Player categories_Valve Anti-Cheat enabled categories_Single-player categories_Steam Cloud categories_Steam Achievements categories_Steam Trading Cards categories_Captions available categories_Partial Controller Support ... categories_SteamVR Collectibles categories_Online Co-op categories_Shared/Split Screen categories_Local Co-op categories_MMO categories_VR Support categories_Mods categories_Mods (require HL2) categories_Steam Turn Notifications categories_QTD
0 1 1 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
1 1 1 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
2 1 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
3 1 1 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
4 1 0 0 1 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
5 1 1 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
6 1 1 0 1 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
7 1 0 0 1 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
8 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
9 0 0 0 0 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 2

10 rows × 30 columns

In [85]:
df_corr.corr()
Out[85]:
categories_Multi-player categories_Online Multi-Player categories_Local Multi-Player categories_Valve Anti-Cheat enabled categories_Single-player categories_Steam Cloud categories_Steam Achievements categories_Steam Trading Cards categories_Captions available categories_Partial Controller Support ... categories_SteamVR Collectibles categories_Online Co-op categories_Shared/Split Screen categories_Local Co-op categories_MMO categories_VR Support categories_Mods categories_Mods (require HL2) categories_Steam Turn Notifications categories_QTD
categories_Multi-player 1.000000 0.503722 0.321236 0.122792 -0.268433 0.024880 0.008365 0.038045 -0.022577 0.049867 ... 0.011220 0.327593 0.273619 0.156427 0.173948 0.022803 0.020723 0.014653 0.081779 0.223955
categories_Online Multi-Player 0.503722 1.000000 0.244950 0.048625 -0.364006 -0.009864 -0.005357 0.000474 -0.022421 0.022914 ... 0.031050 0.541671 0.107472 0.088209 0.233930 0.006648 0.012146 -0.001933 0.056302 0.156664
categories_Local Multi-Player 0.321236 0.244950 1.000000 0.003693 -0.107629 0.042106 0.037201 0.000239 -0.019380 0.057296 ... -0.009688 0.125707 0.576333 0.532393 -0.020309 -0.008103 -0.002165 -0.001531 0.062281 0.136928
categories_Valve Anti-Cheat enabled 0.122792 0.048625 0.003693 1.000000 -0.085582 0.029730 0.027583 0.028312 0.017538 0.017807 ... 0.014078 0.026680 -0.005737 0.001047 0.038259 0.015006 0.145617 0.102965 -0.002851 0.032608
categories_Single-player -0.268433 -0.364006 -0.107629 -0.085582 1.000000 0.079473 0.083920 0.036539 0.021987 -0.015416 ... -0.008416 -0.157390 -0.066024 -0.023562 -0.340444 -0.011039 -0.036849 -0.026056 -0.009527 -0.041680
categories_Steam Cloud 0.024880 -0.009864 0.042106 0.029730 0.079473 1.000000 0.385004 0.295040 0.112453 0.023476 ... 0.007252 0.015186 0.092394 0.067056 -0.062278 -0.005986 0.014255 0.010079 0.017686 0.330935
categories_Steam Achievements 0.008365 -0.005357 0.037201 0.027583 0.083920 0.385004 1.000000 0.337887 0.091724 0.055034 ... 0.002165 0.018232 0.081981 0.066869 -0.049431 -0.003662 0.008227 0.005817 0.033948 0.574558
categories_Steam Trading Cards 0.038045 0.000474 0.000239 0.028312 0.036539 0.295040 0.337887 1.000000 0.063114 0.037730 ... 0.000639 0.026572 0.068636 0.034050 0.013703 0.007455 -0.005526 -0.003907 0.031304 0.353633
categories_Captions available -0.022577 -0.022421 -0.019380 0.017538 0.021987 0.112453 0.091724 0.063114 1.000000 0.002684 ... -0.000389 -0.010028 -0.018071 -0.010888 -0.015225 0.014588 -0.001422 -0.001005 0.001535 0.089208
categories_Partial Controller Support 0.049867 0.022914 0.057296 0.017807 -0.015416 0.023476 0.055034 0.037730 0.002684 1.000000 ... -0.016561 0.026876 0.073484 0.044791 -0.005619 0.016447 -0.003701 -0.002617 -0.018682 0.232807
categories_Includes Source SDK 0.051886 0.031266 0.016980 0.155159 -0.056663 0.015500 0.009742 -0.009571 0.064281 0.024131 ... -0.001384 0.019067 0.000829 0.003346 0.012095 0.019015 0.238900 0.168925 0.019597 0.017447
categories_Cross-Platform Multiplayer 0.374398 0.370801 0.100777 0.039278 -0.151986 0.010995 0.038840 0.058418 -0.012639 0.008286 ... 0.021626 0.244127 0.087233 0.035730 0.154290 0.032361 0.020197 -0.001239 0.123273 0.112659
categories_Stats 0.110646 0.093364 0.056457 0.055558 -0.037524 0.147696 0.212999 0.099639 0.044234 0.034147 ... 0.004638 0.063926 0.071334 0.051399 -0.004937 0.007867 0.014568 -0.001659 0.056210 0.150478
categories_Commentary available -0.005935 -0.009190 -0.007696 0.021583 0.003283 0.022514 0.015095 0.031132 0.098297 0.002071 ... -0.002813 -0.004419 -0.008346 -0.009514 0.011331 0.004259 -0.000628 -0.000444 0.028087 0.040397
categories_Includes level editor 0.113675 0.069886 0.040000 0.060239 -0.030067 0.090900 0.064109 0.063499 0.037564 0.040811 ... 0.002353 0.072127 0.056699 0.040200 -0.001726 0.038018 0.043090 0.030469 0.026331 0.105190
categories_Steam Workshop 0.103998 0.078313 0.025700 0.101343 -0.040783 0.129659 0.091247 0.106010 0.020653 0.029389 ... 0.003626 0.084188 0.037157 0.029715 -0.006585 0.043410 0.046432 0.032832 0.021042 0.094462
categories_In-App Purchases 0.143530 0.292627 -0.010052 0.042253 -0.247277 -0.042387 -0.017876 0.020718 -0.007824 -0.007035 ... -0.000118 0.153554 -0.026724 -0.016911 0.314963 -0.009905 -0.001390 -0.000983 0.031104 0.084908
categories_Co-op 0.426840 0.320879 0.323010 0.070195 -0.128345 0.066205 0.067465 0.067324 -0.003380 0.068216 ... 0.026591 0.622128 0.383019 0.618491 0.149355 0.009239 0.011772 0.018631 0.007646 0.179916
categories_Full controller support 0.060719 0.004983 0.188761 0.012677 0.024928 0.249732 0.262679 0.203712 0.042411 -0.218466 ... -0.008055 0.025444 0.277864 0.206736 -0.050947 0.043760 -0.004436 -0.003137 0.016455 0.283196
categories_Steam Leaderboards 0.090653 0.043821 0.058479 0.032171 0.044847 0.219241 0.292489 0.134663 -0.002466 0.037014 ... 0.019982 0.033554 0.129848 0.078662 -0.034492 0.021296 -0.003279 -0.002318 0.041432 0.209940
categories_SteamVR Collectibles 0.011220 0.031050 -0.009688 0.014078 -0.008416 0.007252 0.002165 0.000639 -0.000389 -0.016561 ... 1.000000 0.041529 -0.011303 -0.007761 0.002938 0.174176 -0.000331 -0.000234 -0.001858 0.021250
categories_Online Co-op 0.327593 0.541671 0.125707 0.026680 -0.157390 0.015186 0.018232 0.026572 -0.010028 0.026876 ... 0.041529 1.000000 0.062960 0.194615 0.187380 0.012078 -0.001744 -0.001233 0.009863 0.112115
categories_Shared/Split Screen 0.273619 0.107472 0.576333 -0.005737 -0.066024 0.092394 0.081981 0.068636 -0.018071 0.073484 ... -0.011303 0.062960 1.000000 0.508408 -0.031412 -0.012412 -0.002526 -0.001786 0.036821 0.162335
categories_Local Co-op 0.156427 0.088209 0.532393 0.001047 -0.023562 0.067056 0.066869 0.034050 -0.010888 0.044791 ... -0.007761 0.194615 0.508408 1.000000 -0.019197 -0.008359 -0.001734 -0.001226 0.010027 0.111010
categories_MMO 0.173948 0.233930 -0.020309 0.038259 -0.340444 -0.062278 -0.049431 0.013703 -0.015225 -0.005619 ... 0.002938 0.187380 -0.031412 -0.019197 1.000000 0.007816 -0.001080 -0.000764 0.012518 0.058146
categories_VR Support 0.022803 0.006648 -0.008103 0.015006 -0.011039 -0.005986 -0.003662 0.007455 0.014588 0.016447 ... 0.174176 0.012078 -0.012412 -0.008359 0.007816 1.000000 -0.000797 -0.000564 0.003855 0.045555
categories_Mods 0.020723 0.012146 -0.002165 0.145617 -0.036849 0.014255 0.008227 -0.005526 -0.001422 -0.003701 ... -0.000331 -0.001744 -0.002526 -0.001734 -0.001080 -0.000797 1.000000 0.707094 -0.000415 0.004748
categories_Mods (require HL2) 0.014653 -0.001933 -0.001531 0.102965 -0.026056 0.010079 0.005817 -0.003907 -0.001005 -0.002617 ... -0.000234 -0.001233 -0.001786 -0.001226 -0.000764 -0.000564 0.707094 1.000000 -0.000294 0.003357
categories_Steam Turn Notifications 0.081779 0.056302 0.062281 -0.002851 -0.009527 0.017686 0.033948 0.031304 0.001535 -0.018682 ... -0.001858 0.009863 0.036821 0.010027 0.012518 0.003855 -0.000415 -0.000294 1.000000 0.026680
categories_QTD 0.223955 0.156664 0.136928 0.032608 -0.041680 0.330935 0.574558 0.353633 0.089208 0.232807 ... 0.021250 0.112115 0.162335 0.111010 0.058146 0.045555 0.004748 0.003357 0.026680 1.000000

30 rows × 30 columns

Convenhamos, essa tabela de correlação ficou muito grande, meio inviável de se avaliar. Por isso, vou construir um heatmap abaixo pra ter uma noção um pouco melhor.

In [86]:
ax = sns.heatmap(df_corr.corr(), linewidths=.2, xticklabels=True, yticklabels=True)
ax.figure.set_size_inches(15,15)

Felizmente, a maioria das correlações não foi tão alta, porém, houveram algumas aqui e ali que incomodaram um pouco. Em especial, as colunas:

Variável A Variável B
'categories_Mods' 'categories_Mods (require HL2)'
'categories_Shared/Split Screen' 'categories_Local Co-op'
'categories_Online Co-op' 'categories_Online Multi-Player'
'categories_Co-op' 'categories_Local Co-op'
'categories_Co-op' 'categories_Online Co-op'

Houveram mais algumas correlações, mas, acredito que com essas que elenquei, já podemos ter uma ideia do que eventualmente jogar fora e o que manter. Já adicionamos as colunas binárias ao dataframe, então, podemos prosseguir para a próxima coluna.

In [87]:
df.columns
Out[87]:
Index(['appid', 'name', 'release_date', 'english', 'developer', 'publisher',
       'platforms', 'required_age', 'categories', 'genres', 'steamspy_tags',
       'achievements', 'positive_ratings', 'negative_ratings',
       'average_playtime', 'median_playtime', 'owners', 'price', 'Year',
       'platforms_windows', 'platforms_mac', 'platforms_linux',
       'categories_Multi-player', 'categories_Online Multi-Player',
       'categories_Local Multi-Player', 'categories_Valve Anti-Cheat enabled',
       'categories_Single-player', 'categories_Steam Cloud',
       'categories_Steam Achievements', 'categories_Steam Trading Cards',
       'categories_Captions available',
       'categories_Partial Controller Support',
       'categories_Includes Source SDK',
       'categories_Cross-Platform Multiplayer', 'categories_Stats',
       'categories_Commentary available', 'categories_Includes level editor',
       'categories_Steam Workshop', 'categories_In-App Purchases',
       'categories_Co-op', 'categories_Full controller support',
       'categories_Steam Leaderboards', 'categories_SteamVR Collectibles',
       'categories_Online Co-op', 'categories_Shared/Split Screen',
       'categories_Local Co-op', 'categories_MMO', 'categories_VR Support',
       'categories_Mods', 'categories_Mods (require HL2)',
       'categories_Steam Turn Notifications', 'categories_QTD'],
      dtype='object')
In [88]:
df['genres'].unique()
Out[88]:
array(['Action', 'Action;Free to Play', 'Action;Free to Play;Strategy',
       ...,
       'Action;Adventure;Indie;Massively Multiplayer;RPG;Strategy;Early Access',
       'Action;Adventure;Casual;Free to Play;Indie;RPG;Simulation;Sports;Strategy',
       'Casual;Free to Play;Massively Multiplayer;RPG;Early Access'],
      dtype=object)
In [89]:
len(df['genres'].unique())
Out[89]:
1552

Parece que teremos que fazer a mesma coisa que fizemos nas últimas variáveis qualitativas.

In [90]:
gen_split = df['genres'].apply(lambda x: pd.Series(x.split(';')))
In [91]:
gen_split
Out[91]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27070 Adventure Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27071 Action Adventure Indie NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27072 Action Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27073 Adventure Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27074 Adventure Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

27075 rows × 16 columns

In [92]:
gen = []
for index, row in gen_split.iterrows():
    for i in gen_split.columns:
        gen.append(row[i])
In [93]:
gen = pd.Series(gen)
In [94]:
gen.dropna(inplace=True)
In [95]:
len(gen.unique())
Out[95]:
29
In [96]:
gen.unique()
Out[96]:
array(['Action', 'Free to Play', 'Strategy', 'Adventure', 'Indie', 'RPG',
       'Animation & Modeling', 'Video Production', 'Casual', 'Simulation',
       'Racing', 'Violent', 'Massively Multiplayer', 'Nudity', 'Sports',
       'Early Access', 'Gore', 'Utilities', 'Design & Illustration',
       'Web Publishing', 'Education', 'Software Training',
       'Sexual Content', 'Audio Production', 'Game Development',
       'Photo Editing', 'Accounting', 'Documentary', 'Tutorial'],
      dtype=object)
In [97]:
gen.mode()
Out[97]:
0    Indie
dtype: object
In [98]:
gen.value_counts().head(10)
Out[98]:
Indie           19421
Action          11903
Casual          10210
Adventure       10032
Strategy         5247
Simulation       5194
RPG              4311
Early Access     2954
Free to Play     1704
Sports           1322
dtype: int64
In [99]:
gen.value_counts().tail(10)
Out[99]:
Education            51
Video Production     38
Software Training    31
Audio Production     29
Web Publishing       28
Game Development     17
Photo Editing        12
Accounting            6
Documentary           1
Tutorial              1
dtype: int64

Nik Davis, o cara que extraiu essa base, tinha comentado que poderiam haver alguns jogos que não eram realmente jogos, e sim software disponível na steam. Me parece que esses gêneros encontrados com a função tail() podem estar relacionados a esses softwares.

In [100]:
gen.value_counts().tail(50)
Out[100]:
Indie                    19421
Action                   11903
Casual                   10210
Adventure                10032
Strategy                  5247
Simulation                5194
RPG                       4311
Early Access              2954
Free to Play              1704
Sports                    1322
Racing                    1024
Violent                    843
Massively Multiplayer      723
Gore                       537
Nudity                     266
Sexual Content             245
Utilities                  146
Design & Illustration       87
Animation & Modeling        79
Education                   51
Video Production            38
Software Training           31
Audio Production            29
Web Publishing              28
Game Development            17
Photo Editing               12
Accounting                   6
Documentary                  1
Tutorial                     1
dtype: int64
In [101]:
gen.value_counts().tail(13).index.tolist()
Out[101]:
['Utilities',
 'Design & Illustration',
 'Animation & Modeling',
 'Education',
 'Video Production',
 'Software Training',
 'Audio Production',
 'Web Publishing',
 'Game Development',
 'Photo Editing',
 'Accounting',
 'Documentary',
 'Tutorial']

Esses parecem ser os gêneros a serem verificados. Vou filtrar o dataframe para dar uma olhada nos titulos desses "jogos".

In [102]:
nomes = []
for i in gen.value_counts().tail(13).index.tolist():
    
    nomes.extend(df[df['genres'].str.contains(i)]['name'].tolist())
In [103]:
set(nomes)
Out[103]:
{'001 Game Creator',
 '10k',
 '3D ParticleGen Visual FX',
 '3DCoat 4.8',
 '3DCoat Modding Tool',
 '3DF Zephyr Lite Steam Edition',
 '3DMark 11',
 '8 Queens',
 'ASMR Universe',
 'ASMR Universe 2',
 'Aartform Curvy 3D 3.0',
 'Amaze: Interactive VR Movies',
 'AnimVR',
 'AppGameKit: Easy Game Development',
 'Arcade Artist',
 'Arcane Mapper',
 'Atlantia Animation',
 'AutoTileGen',
 "Axis Game Factory's AGFPRO v3",
 'BASIC8',
 'BeanVR—The Social VR APP',
 'Belle II in Virtual Reality',
 'Blocks by Google',
 'Button Music',
 'Buzz Aldrin: Cycling Pathways to Mars',
 'CINEVR - The Movie Theater',
 'CPUCores :: Maximize Your FPS',
 "Cairo's Tale: The Big Egg",
 'Campfire Pro',
 'Catzilla 4K - Advanced',
 'Chernobyl VR Project',
 'ClearMem :: Free Up Your RAM',
 'Clickteam Fusion 2.5',
 'Club Lighting',
 'ControlMyJoystick',
 'Controller Companion',
 'CopperCube 5 Game Engine',
 'CyberLink PowerDVD 18 Ultra - Media player, video player, 4k media player, 360 video',
 'DJ Mix Pads',
 'DJ Streamer',
 'DashPanel',
 'DatavizVR Demo',
 'Dating Lessons',
 'DeoVR Video Player',
 'Desktop Audio Visualizer',
 'Desktop Mascot Engine',
 'Dimension Two',
 'Dimmdrive :: Gaming Ramdrive @ 10,000+ MB/s',
 'Discord Bot Maker',
 'Disney Movies VR',
 'DlodloVRPlayer',
 'DreamDesk VR',
 'Driver Fusion - The Best Driver & Device Solution',
 'Drum Simulator',
 'Dungeon Painter Studio',
 'EF EVE™ - Volumetric Video Platform (VR & Desktop)',
 'EF-12: Fighting Game Maker',
 'ENGAGE',
 'EVGA Precision XOC',
 'EXA: The Infinite Instrument',
 'Easy™ eSports',
 'Edmersiv',
 'Electronic Piano',
 'Emoto',
 'FaceRig',
 'Fantasynth: Chez Nous',
 'FireAlpaca SE',
 'FlatOut 4: Total Insanity Workshop Tool',
 'Float',
 'Félix VR',
 'GIPHY Museum of GIF Art',
 'GIPHY World VR',
 'GTGD S1: More Than a Gamer',
 'GTGD S3 How To Make A Game',
 'Game Character Hub',
 'Game Character Hub: Portfolio Edition',
 'GameGuru',
 'GameMaker Studio 2 Desktop',
 'GameMaker Studio 2 Mobile',
 'GameMaker Studio 2 UWP',
 'GameMaker Studio 2 Web',
 'Garage Drummer VR',
 'Geekbench 3',
 'GeoVox',
 'Ghost Mountain Roller Coaster',
 'GizmoVR Video Player',
 'GlobalMap Astro',
 'Godot Engine',
 'GraviSound',
 'Grove - VR Browsing Experience',
 'Handy Dice',
 'HomestarVR',
 'Houdini Indie',
 'Impromptu Vector Field Painter',
 'J.A.C.K.',
 'Janus VR',
 'Jaunt VR - Experience Cinematic Virtual Reality',
 "Jitsumi's Game Booster",
 'Juventus VR',
 'KOPICS',
 'Keyboard Engine',
 'KeysIn1',
 'Krita',
 'KumaKuma Manga Editor',
 'Kz NTools : Fix Your Network',
 'LIV',
 'Leadwerks Game Engine',
 'Leadwerks Game Launcher',
 'Liquid Rhythm',
 'Live Wallpaper Master',
 'Live2DViewerEX',
 'MINT VR',
 'MODO indie',
 'Magic Borderless',
 'Manga Maker Comipo',
 'MermaidVR Video Player',
 'Micro Cosmic Worlds',
 'Military Operations: Benchmark',
 'Mind Massaging Machine',
 'Mindshow',
 'MorphVOX Pro - Voice Changer',
 'Multiscreens',
 'My Desktop Alive',
 'MyRPG Master',
 'NOLO HOME',
 'Natural Locomotion',
 'Neos VR',
 'NexVR Video Player',
 'NextVR - Live Sports and Entertainment in Virtual Reality',
 'Nimble Writer',
 'Oats Studios - Volume 1 Assets',
 'PD Howler 9.6 Digital Painter and Visual FX box',
 'Platfinity™',
 'Pluto',
 'Printer Forge 3D',
 'Quantum Engine',
 'RAYBEEM - Live in Your Music',
 'RPG Maker MV',
 'RPG Maker VX Ace',
 'RPG Maker XP',
 'Raffle Blast',
 'RainDesktop',
 'RainWallpaper',
 'RainWidget',
 'Red Bull Doodle Art - Global VR Gallery',
 'Remembering Pearl Harbor',
 'Renegade Tabletop Tools System',
 'Room Designer VR',
 'Rumpus',
 'RutonyChat',
 'Rytmik Lite Chiptune Synthesizer',
 'Rytmik Player',
 'Rytmik Studio',
 'Rytmik Ultimate',
 'S2ENGINE HD',
 'SAO Utils: Beta',
 'SKYBOX VR Video Player',
 'SMILE GAME BUILDER',
 'SPATIAL SOUND CARD',
 'SRPG Studio',
 'STYLY',
 'SYMMETRY alpha',
 'Sankhara',
 'ScriptVOX Studio',
 'ShapeSim - Construction Set',
 "Shoot 'Em Up Kit",
 'Sim 4K VR MediaPlayer',
 'Sim Racing Telemetry',
 'Simmetri',
 'Source Filmmaker',
 'Spriter Pro',
 'Star Swarm Stress Test',
 'Strata inStudio VR',
 'Stream Avatars',
 'Stream Buddy',
 'Supermedium - Virtual Reality Browser',
 'TCPingInfoView',
 'TcNo TimeKeeper',
 'The Adliberum Engine (ADLENGINE)',
 'The Kremer Collection VR Museum',
 'The Music Room',
 'Tilt Brush',
 'TurnSignal',
 'Tvori',
 'Ultimate Desktop Character Engine',
 'Unbound',
 'VIRTUAnimator',
 'VR Fractals',
 'VR Meditation SkyRun',
 'VR Toolbox: 360 Desktop',
 'VRCapture',
 'VRTGO',
 'Vinyl Reality',
 'VirZOOM Gamepad Emulator',
 'Virtual Battlemap',
 'Virtual Desktop',
 'Virtual Idea Area',
 'Virtual telescope',
 'Visual Novel Engine',
 'VoiceAttack',
 'VoxVR',
 'Vrideo',
 'WITHIN',
 'Wallpaper Builder',
 'Wallpaper Engine',
 'Weelco VR',
 'Whirligig VR Media Player',
 'Wing IDE 5',
 'Witly - language tutoring in VR',
 'XDeskClock',
 'XLaunchpad',
 'XSprite',
 'XWidget',
 'articy:draft 3',
 'articy:draft 3 - Flex License',
 'connect - Virtual Home (3D or VR)',
 'ePic Character Generator',
 'iVRy Driver for SteamVR',
 'liteCam Game 5:  120 FPS Game Capture + Streamer',
 'liteCam Game: 100 FPS Game Capture',
 'liteCam HD 5.0: Screen/Stream Capture',
 'mdiapp+ SE',
 'openCanvas 6',
 'openCanvas 7',
 'rumii',
 'sheepChat',
 'vBuilder',
 '英语杀'}
In [104]:
len(set(nomes))
Out[104]:
228

Podemos ver que o número não é tão expressivo. E, de fato, varios dos "jogos" são, claramente, softwares de criação de jogos e afins. Vamos ver quais são as categorias ligadas a esses jogos.

In [105]:
df.columns
Out[105]:
Index(['appid', 'name', 'release_date', 'english', 'developer', 'publisher',
       'platforms', 'required_age', 'categories', 'genres', 'steamspy_tags',
       'achievements', 'positive_ratings', 'negative_ratings',
       'average_playtime', 'median_playtime', 'owners', 'price', 'Year',
       'platforms_windows', 'platforms_mac', 'platforms_linux',
       'categories_Multi-player', 'categories_Online Multi-Player',
       'categories_Local Multi-Player', 'categories_Valve Anti-Cheat enabled',
       'categories_Single-player', 'categories_Steam Cloud',
       'categories_Steam Achievements', 'categories_Steam Trading Cards',
       'categories_Captions available',
       'categories_Partial Controller Support',
       'categories_Includes Source SDK',
       'categories_Cross-Platform Multiplayer', 'categories_Stats',
       'categories_Commentary available', 'categories_Includes level editor',
       'categories_Steam Workshop', 'categories_In-App Purchases',
       'categories_Co-op', 'categories_Full controller support',
       'categories_Steam Leaderboards', 'categories_SteamVR Collectibles',
       'categories_Online Co-op', 'categories_Shared/Split Screen',
       'categories_Local Co-op', 'categories_MMO', 'categories_VR Support',
       'categories_Mods', 'categories_Mods (require HL2)',
       'categories_Steam Turn Notifications', 'categories_QTD'],
      dtype='object')
In [106]:
df[df['name'].isin(nomes)]['categories'].unique()
Out[106]:
array(['Steam Workshop', 'Steam Cloud', 'Steam Leaderboards',
       'Steam Trading Cards;Steam Workshop',
       'Steam Achievements;Steam Cloud;Stats', 'Single-player',
       'Single-player;Multi-player;MMO;Co-op;Shared/Split Screen;Cross-Platform Multiplayer;Steam Achievements;Steam Leaderboards',
       'Single-player;Steam Achievements;Steam Workshop;In-App Purchases;Partial Controller Support;Includes level editor',
       'Includes level editor',
       'Single-player;Multi-player;Partial Controller Support;Includes level editor',
       'Single-player;Steam Achievements;Steam Leaderboards',
       'Single-player;Multi-player;Includes level editor',
       'Single-player;Steam Achievements', 'Single-player;Steam Cloud',
       'Steam Achievements;Steam Trading Cards;Steam Workshop',
       'Single-player;Steam Trading Cards',
       'Multi-player;Cross-Platform Multiplayer',
       'VR Support;SteamVR Collectibles', 'Steam Trading Cards',
       'Includes Source SDK', 'Steam Achievements',
       'Steam Achievements;Partial Controller Support;Stats;Includes level editor',
       'Steam Achievements;Steam Workshop;Steam Cloud;Stats',
       'Full controller support;Steam Workshop;Steam Leaderboards',
       'Steam Trading Cards;Partial Controller Support',
       'Steam Achievements;Steam Trading Cards;Partial Controller Support;Includes level editor',
       'Full controller support;Steam Workshop',
       'Full controller support;Includes level editor',
       'Partial Controller Support;Includes level editor',
       'Steam Achievements;Steam Trading Cards',
       'VR Support;Steam Workshop', 'Steam Trading Cards;Steam Cloud',
       'VR Support', 'Captions available',
       'Steam Achievements;Steam Cloud',
       'Steam Achievements;Steam Trading Cards;Steam Workshop;Includes level editor',
       'Full controller support;VR Support',
       'VR Support;Partial Controller Support', 'Full controller support',
       'Single-player;Partial Controller Support',
       'Partial Controller Support', 'Single-player;Steam Workshop',
       'Steam Achievements;Steam Trading Cards;Steam Cloud;Stats;Includes level editor',
       'Steam Achievements;Includes level editor', 'Steam Cloud;Stats',
       'Steam Workshop;Partial Controller Support',
       'Single-player;Multi-player;Online Multi-Player',
       'Steam Workshop;Steam Cloud',
       'Partial Controller Support;Steam Cloud',
       'Single-player;Local Multi-Player;Local Co-op',
       'Steam Achievements;Steam Workshop;Steam Cloud',
       'Steam Achievements;Full controller support',
       'Single-player;Local Multi-Player;Local Co-op;Shared/Split Screen;Steam Achievements;Full controller support;Steam Trading Cards;Steam Cloud',
       'Steam Workshop;Partial Controller Support;Steam Cloud;Includes level editor',
       'Stats',
       'Steam Achievements;Steam Trading Cards;Steam Workshop;Steam Cloud;Stats',
       'In-App Purchases',
       'Single-player;Steam Trading Cards;Steam Cloud',
       'Steam Achievements;Steam Trading Cards;Steam Workshop;Steam Cloud',
       'Single-player;Multi-player;Online Multi-Player;Cross-Platform Multiplayer;Partial Controller Support',
       'Single-player;Multi-player;Steam Workshop',
       'Single-player;Multi-player;Online Multi-Player;Co-op;Online Co-op;Cross-Platform Multiplayer',
       'Single-player;Multi-player;Online Multi-Player;MMO;Co-op;Online Co-op;Includes level editor',
       'Captions available;Steam Cloud;Includes level editor',
       'Steam Workshop;Includes level editor',
       'Full controller support;In-App Purchases',
       'Steam Cloud;Includes level editor',
       'In-App Purchases;Includes level editor',
       'Single-player;Multi-player;Online Multi-Player;Cross-Platform Multiplayer;Includes level editor',
       'Steam Workshop;Partial Controller Support;Includes level editor',
       'Single-player;Full controller support;In-App Purchases',
       'Steam Workshop;Steam Cloud;Stats',
       'Single-player;Full controller support',
       'Online Multi-Player;Online Co-op;Cross-Platform Multiplayer',
       'Single-player;Captions available',
       'Single-player;Includes level editor',
       'Steam Workshop;Includes Source SDK',
       'Steam Achievements;Steam Workshop', 'In-App Purchases;Stats',
       'Single-player;Steam Cloud;Includes level editor',
       'Steam Workshop;Steam Cloud;Includes level editor'], dtype=object)
In [107]:
df[df['name'].isin(nomes)]['categories'].mode()
Out[107]:
0    Single-player
dtype: object
In [108]:
df[df['name'].isin(nomes)]['categories'].value_counts().head(10)
Out[108]:
Single-player                 34
Partial Controller Support    19
Steam Workshop                16
Steam Achievements            12
Steam Cloud                   11
Includes level editor         11
Full controller support        8
In-App Purchases               8
Steam Trading Cards            6
VR Support                     6
Name: categories, dtype: int64

Considerando tanto as categorias como os gêneros, é possível afirmar com um pouco mais de propriedade que essas entradas são, de fato, em grande maioria, ferramentas de desenvolvimento de jogos. Como o volume também é baixo, vou exclui-las do dataframe.

In [109]:
df = df[~df['name'].isin(nomes)]
In [110]:
df.head()
Out[110]:
appid name release_date english developer publisher platforms required_age categories genres ... categories_SteamVR Collectibles categories_Online Co-op categories_Shared/Split Screen categories_Local Co-op categories_MMO categories_VR Support categories_Mods categories_Mods (require HL2) categories_Steam Turn Notifications categories_QTD
0 10 Counter-Strike 2000-11-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action ... 0 0 0 0 0 0 0 0 0 2
1 20 Team Fortress Classic 1999-04-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action ... 0 0 0 0 0 0 0 0 0 2
2 30 Day of Defeat 2003-05-01 1 Valve Valve windows;mac;linux 0 Multi-player;Valve Anti-Cheat enabled Action ... 0 0 0 0 0 0 0 0 0 2
3 40 Deathmatch Classic 2001-06-01 1 Valve Valve windows;mac;linux 0 Multi-player;Online Multi-Player;Local Multi-P... Action ... 0 0 0 0 0 0 0 0 0 2
4 50 Half-Life: Opposing Force 1999-11-01 1 Gearbox Software Valve windows;mac;linux 0 Single-player;Multi-player;Valve Anti-Cheat en... Action ... 0 0 0 0 0 0 0 0 0 2

5 rows × 52 columns

Agora que limpamos a base de prováveis entradas de software, vale a pena reiniciar o processo de separação dos gêneros. Imagino que, como sobraram menos gêneros, talvez seja possível transforma-los em binário também.

In [111]:
df['genres'].unique()
Out[111]:
array(['Action', 'Action;Free to Play', 'Action;Free to Play;Strategy',
       ...,
       'Action;Adventure;Indie;Massively Multiplayer;RPG;Strategy;Early Access',
       'Action;Adventure;Casual;Free to Play;Indie;RPG;Simulation;Sports;Strategy',
       'Casual;Free to Play;Massively Multiplayer;RPG;Early Access'],
      dtype=object)
In [112]:
gen_split = df['genres'].apply(lambda x: pd.Series(x.split(';')))
In [113]:
gen_split
Out[113]:
0 1 2 3 4 5 6 7 8 9 10
0 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Action NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
27070 Adventure Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN
27071 Action Adventure Indie NaN NaN NaN NaN NaN NaN NaN NaN
27072 Action Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN
27073 Adventure Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN
27074 Adventure Casual Indie NaN NaN NaN NaN NaN NaN NaN NaN

26846 rows × 11 columns

In [114]:
gen = []
for index, row in gen_split.iterrows():
    for i in gen_split.columns:
        gen.append(row[i])
In [115]:
gen = pd.Series(gen)
In [116]:
gen.dropna(inplace=True)
In [117]:
gen.mode()
Out[117]:
0    Indie
dtype: object
In [118]:
gen.unique()
Out[118]:
array(['Action', 'Free to Play', 'Strategy', 'Adventure', 'Indie', 'RPG',
       'Casual', 'Simulation', 'Racing', 'Violent',
       'Massively Multiplayer', 'Nudity', 'Sports', 'Early Access',
       'Gore', 'Sexual Content'], dtype=object)
In [119]:
len(gen.unique())
Out[119]:
16
In [120]:
gen.value_counts().head(10)
Out[120]:
Indie           19399
Action          11901
Casual          10191
Adventure       10028
Strategy         5245
Simulation       5177
RPG              4309
Early Access     2908
Free to Play     1696
Sports           1322
dtype: int64
In [121]:
gen.value_counts().tail(10)
Out[121]:
RPG                      4309
Early Access             2908
Free to Play             1696
Sports                   1322
Racing                   1024
Violent                   842
Massively Multiplayer     721
Gore                      536
Nudity                    266
Sexual Content            245
dtype: int64
In [122]:
df_corr = pd.DataFrame()
df_corr
Out[122]:
In [123]:
for i in gen.unique().tolist():
    nome = 'genres_'+i
    df[nome] = df['genres'].str.contains(i, regex=False).astype(int)
    df_corr[nome] = df['genres'].str.contains(i, regex=False).astype(int)
In [124]:
df_corr.head(10)
Out[124]:
genres_Action genres_Free to Play genres_Strategy genres_Adventure genres_Indie genres_RPG genres_Casual genres_Simulation genres_Racing genres_Violent genres_Massively Multiplayer genres_Nudity genres_Sports genres_Early Access genres_Gore genres_Sexual Content
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [125]:
df_corr.corr()
Out[125]:
genres_Action genres_Free to Play genres_Strategy genres_Adventure genres_Indie genres_RPG genres_Casual genres_Simulation genres_Racing genres_Violent genres_Massively Multiplayer genres_Nudity genres_Sports genres_Early Access genres_Gore genres_Sexual Content
genres_Action 1.000000 -0.025227 -0.156237 0.012327 0.097356 -0.029661 -0.173166 -0.140261 0.022335 0.099691 0.053049 -0.023407 -0.010067 0.102506 0.077398 -0.040695
genres_Free to Play -0.025227 1.000000 0.030372 -0.043528 -0.051140 0.044122 0.003528 0.003859 -0.007747 -0.012468 0.286455 -0.004336 0.005295 0.026745 -0.012984 0.008891
genres_Strategy -0.156237 0.030372 1.000000 -0.142592 -0.025821 0.079887 -0.015110 0.144437 -0.064775 -0.029917 0.044826 -0.030324 -0.020096 0.043788 -0.031379 -0.028519
genres_Adventure 0.012327 -0.043528 -0.142592 1.000000 0.064279 0.167907 -0.039149 -0.100087 -0.075781 0.039973 -0.008250 0.024597 -0.115238 -0.019387 0.049975 0.012538
genres_Indie 0.097356 -0.051140 -0.025821 0.064279 1.000000 0.022734 0.103029 -0.043002 -0.032554 0.005045 -0.066387 -0.001858 -0.039328 0.075940 0.027172 0.004342
genres_RPG -0.029661 0.044122 0.079887 0.167907 0.022734 1.000000 -0.135654 -0.034968 -0.058467 0.003407 0.119434 0.019778 -0.068092 0.064402 0.002153 0.010325
genres_Casual -0.173166 0.003528 -0.015110 -0.039149 0.103029 -0.135654 1.000000 0.057542 -0.012311 -0.062811 -0.040214 0.024817 0.026661 -0.068142 -0.057875 0.030669
genres_Simulation -0.140261 0.003859 0.144437 -0.100087 -0.043002 -0.034968 0.057542 1.000000 0.085044 -0.018619 0.029179 0.005438 0.187221 0.090294 -0.023868 0.010677
genres_Racing 0.022335 -0.007747 -0.064775 -0.075781 -0.032554 -0.058467 -0.012311 0.085044 1.000000 -0.014635 0.024658 -0.015994 0.267448 0.025705 -0.011741 -0.015021
genres_Violent 0.099691 -0.012468 -0.029917 0.039973 0.005045 0.003407 -0.062811 -0.018619 -0.014635 1.000000 0.005799 0.193451 -0.020211 0.033552 0.655706 0.153530
genres_Massively Multiplayer 0.053049 0.286455 0.044826 -0.008250 -0.066387 0.119434 -0.040214 0.029179 0.024658 0.005799 1.000000 -0.009640 0.014371 0.100754 -0.002298 -0.013520
genres_Nudity -0.023407 -0.004336 -0.030324 0.024597 -0.001858 0.019778 0.024817 0.005438 -0.015994 0.193451 -0.009640 1.000000 -0.021029 -0.010665 0.157789 0.599445
genres_Sports -0.010067 0.005295 -0.020096 -0.115238 -0.039328 -0.068092 0.026661 0.187221 0.267448 -0.020211 0.014371 -0.021029 1.000000 0.044756 -0.020177 -0.018220
genres_Early Access 0.102506 0.026745 0.043788 -0.019387 0.075940 0.064402 -0.068142 0.090294 0.025705 0.033552 0.100754 -0.010665 0.044756 1.000000 0.024796 -0.022106
genres_Gore 0.077398 -0.012984 -0.031379 0.049975 0.027172 0.002153 -0.057875 -0.023868 -0.011741 0.655706 -0.002298 0.157789 -0.020177 0.024796 1.000000 0.106715
genres_Sexual Content -0.040695 0.008891 -0.028519 0.012538 0.004342 0.010325 0.030669 0.010677 -0.015021 0.153530 -0.013520 0.599445 -0.018220 -0.022106 0.106715 1.000000
In [126]:
ax = sns.heatmap(df_corr.corr(), linewidths=.2, xticklabels=True, yticklabels=True)
ax.figure.set_size_inches(15,15)

Nesse caso, houveram apenas 2 "grandes" correlações.

Variável A Variável B
genres_Nudity genres_Sexual Content
genres_Violent genres_Gore

Provavelmente teremos que escolher 1 variável de cada linha para utilizar na eventual regressão. Vamos para a próxima variável

In [127]:
df['achievements'].unique()
Out[127]:
array([   0,   33,  147,   54,   13,   15,   22,  520,   73,   70,   51,
         66,  167,   44,  285,   12,   69,   21,   14,   37,   29,   67,
         61,   50,   80,  286,   60,   11,  709,   20,   38,   30,   40,
         10,   17,   25,   52,   71,   57,    7,   36,    9,   16,    4,
        106,  247,   58,    8,   87,   45,   27,   18,   92,   19,   43,
         75, 1165,  169,   56,   23,  260,   28,   59,   24,   26,   47,
        140,   97,  145,   49,  322,   64,   35,   62,   81,   41,   68,
         88,    1,   63,   48,   83,  129,  110,  114,   90,   77,   65,
         32,   42,  100,   46,  115,  124,  220,  251,  104,   99,  650,
         34,   31,   53,   85,  119,   98,  161,   74,  412,   91,   93,
        113,  849,  111,    2,  150,  213,   82,  107,  188,  146,   95,
       1130,  178,  103,  294,    6,  123,   89,   39,  210,  109,  141,
        187,  452,  257,  134,  108,  295,  300,  120,    3,  183,  105,
        490,   72,  199,  237,  403,  121,  159,   86, 1448,  175,   76,
       1746,  125,  131,   55,  152,    5,  127,  130,  236,   78,  256,
         79,  139,   84,  241,  323,  419,  165,  102,   96,  170,  126,
        319,  163,  118,  250,  361,  172,  116,   94,  282, 1559,  185,
        136,  112,  160,  198,  128,  176,  265,  179,  143,  354,  180,
        226,  208, 1080,  572,  135,  132, 1002,  408,  240, 1095,  157,
       4034,  202,  164,  608,  133, 5000,  101,  394,  334, 1317, 2880,
       1224,  142,  217,  162,  480,  224,  193, 1090,  206,  601,  229,
       2448,  137,  302, 9821, 1708, 2995,  484,  590,  464,  173,  303,
        151,  727,  201,  555,  400,  117,  500, 1009,  720, 4094, 4765,
       1043,  365,  315,  153,  505,  888,  184,  222, 5394,  678, 2130,
        225,  166, 2008,  252,  604,  270,  526, 2516,  168,  191,  243,
        350,  324, 1510,  272, 2007,  750, 1500,  674, 4896,  192,  219,
       3039,  362, 1215, 1176, 4564, 1596, 2090,  200,  961,  156,  602,
        348,  316,  309, 1011,  658, 1018,  332,  177, 4732, 1512,  998,
       2020,  186,  281, 2126, 1487, 4251,  360, 2134,  158, 2659, 4979,
       2001,  317, 1602,  777, 3000, 1315, 1819,  543, 2042, 2638,  154,
       4497, 1070, 1505,  212, 4999, 4013,  482,  299,  345,  204, 2306,
       1998, 1364,  428,  333, 2976, 3600,  296, 4884,  258, 1800, 2230,
        330, 2021,  291,  298,  235, 1266, 1113, 4996, 2220, 3006, 4293,
       1000,  284,  451, 1327,  800,  215, 1010,  283,  999,  232,  754,
       4981, 3340,  574,  510, 4989, 4757,  995,  279, 4956,  144,  337,
       4987,  190, 1970, 1172, 4977, 4585,  138, 1100,  565, 1730,  196,
        600, 4805, 3346, 1337, 3452,  357,  952, 1203, 1581, 4997,  630,
       2043,  308,  233], dtype=int64)

Pelo que pode ser visto, esse campo já está em formato numérico, o que facilita muita coisa. Podemos dar uma olhada no seu histograma, apenas pra ter uma noção melhor da distribuição dos dados.

In [128]:
ax = sns.histplot(data=df['achievements'])
ax.figure.set_size_inches(15,6)

Parece que o histograma não foi muito útil como forma de visualização desses dados. Vamos tentar o boxplot.

In [129]:
ax = sns.boxplot(data=df['achievements'])
ax.figure.set_size_inches(15,6)
In [130]:
df['achievements'].mean()
Out[130]:
45.613536467257696
In [131]:
df['achievements'].std()
Out[131]:
354.1483340105123
In [132]:
df['achievements'].var()
Out[132]:
125421.04248242141
In [133]:
df['achievements'].max()
Out[133]:
9821
In [134]:
df['achievements'].min()
Out[134]:
0
In [135]:
df['achievements']
Out[135]:
0         0
1         0
2         0
3         0
4         0
         ..
27070     7
27071     0
27072    24
27073     0
27074     0
Name: achievements, Length: 26846, dtype: int64
In [136]:
df['achievements'].dropna()
Out[136]:
0         0
1         0
2         0
3         0
4         0
         ..
27070     7
27071     0
27072    24
27073     0
27074     0
Name: achievements, Length: 26846, dtype: int64

Bom, parece que esse campo é um pouco estranho mesmo. O desvio padrão é bem alto, o que indica que poucos valores de achievements se repetem entre os jogos, e que eles não estão exatamente próximos um dos outros.
Vamos pra próxima variável.

In [137]:
df['positive_ratings'].unique()
Out[137]:
array([124534,   3318,   3416, ...,   1720,   2966,   1490], dtype=int64)
In [138]:
len(df['positive_ratings'].unique())
Out[138]:
2791
In [139]:
df['positive_ratings'].max()
Out[139]:
2644404
In [140]:
df['positive_ratings'].min()
Out[140]:
0
In [141]:
df['positive_ratings'].mean()
Out[141]:
1004.7550845563585
In [142]:
df['positive_ratings'].std()
Out[142]:
19065.287190135652
In [143]:
df['positive_ratings'].var()
Out[143]:
363485175.64235055

Tentei plotar os gráficos de histograma e boxplot, mas, não deu muito certo, muito tempo de processamento, preferi retirar da análise.

Esse campo diz qual a quantidade de análises positivas que um dado jogo tem. Ele tem um par, que é o campo 'negative_ratings'. Podemos verificar se existe alguma correlação forte entre eles.

In [144]:
df['negative_ratings'].unique()
Out[144]:
array([3339,  633,  398, ..., 1780, 1332, 1035], dtype=int64)
In [145]:
len(df['negative_ratings'].unique())
Out[145]:
1488
In [146]:
df['negative_ratings'].max()
Out[146]:
487076
In [147]:
df['negative_ratings'].min()
Out[147]:
0
In [148]:
df['negative_ratings'].mean()
Out[148]:
212.34172688668704
In [149]:
df['negative_ratings'].std()
Out[149]:
4303.099997223416
In [150]:
df['negative_ratings'].var()
Out[150]:
18516669.58610416

Vamos a correlação

In [151]:
df[['negative_ratings','positive_ratings']].corr()
Out[151]:
negative_ratings positive_ratings
negative_ratings 1.000000 0.762909
positive_ratings 0.762909 1.000000

Como já era de se esperar, a correlação entre as avaliações realmente é consideravel, porém, é interessante notar que elas tem uma correlação positiva. Isso pode indicar que elas são, na verdade, proporcionais a quantidade de avaliações. Como elas tem uma correlação bem alta, vou criar uma terceira variável, que será a subtração dessas duas, e vou usar ela na eventual regressão.

In [152]:
df['delta_ratings'] = df['positive_ratings'] - df['negative_ratings']
In [153]:
df['delta_ratings'].unique()
Out[153]:
array([121195,   2685,   3018, ...,    472,    600,    611], dtype=int64)
In [154]:
len(df['delta_ratings'].unique())
Out[154]:
2712
In [155]:
df['delta_ratings'].max()
Out[155]:
2242091
In [156]:
df['delta_ratings'].min()
Out[156]:
-7738
In [157]:
df['delta_ratings'].mean()
Out[157]:
792.4133576696714
In [158]:
df['delta_ratings'].std()
Out[158]:
16025.730333753105
In [159]:
df['delta_ratings'].var()
Out[159]:
256824032.73017442

Vou calcular também uma nota média baseada no percentual de votos positivos.

In [160]:
df['%_ratings'] = df['positive_ratings']/(df['positive_ratings'] + df['negative_ratings'])
In [161]:
df['%_ratings'].unique()
Out[161]:
array([0.97388815, 0.8397874 , 0.89564761, ..., 0.64197531, 0.68209877,
       0.77692308])
In [162]:
len(df['%_ratings'].unique())
Out[162]:
8402
In [163]:
df['%_ratings'].max()
Out[163]:
1.0
In [164]:
df['%_ratings'].min()
Out[164]:
0.0
In [165]:
df['%_ratings'].mean()
Out[165]:
0.7144242187237336

Parece que encontramos um outro ponto de atenção. É bem provável que essas entradas com nota percentual 1 e 0 sejam, na verdade, casos com poucas avaliações. Será que o número de avaliações pode ter influência no seu preço? Na quantidade, eu imagino que tenha.

In [166]:
df['QTD_ratings'] = (df['positive_ratings'] + df['negative_ratings'])

Como comentado acima, vamos dar uma olhada nas entradas com '%_ratings' com valor 1 e 0

In [167]:
rating_1 = df[df['%_ratings'] == df['%_ratings'].max()][['name','positive_ratings','negative_ratings','owners','QTD_ratings']].copy()
rating_1
Out[167]:
name positive_ratings negative_ratings owners QTD_ratings
679 Kuros 7 0 0-20000 7
694 Wandering Willows™ 20 0 0-20000 20
700 Fitness Dash™ 8 0 0-20000 8
713 Build-A-Lot 2: Town of the Year 3 0 0-20000 3
714 Build-A-Lot 3: Passport to Europe 2 0 0-20000 2
... ... ... ... ... ...
27067 CaptainMarlene 13 0 0-20000 13
27068 Old Edge II 4 0 0-20000 4
27070 Room of Pandora 3 0 0-20000 3
27073 New Yankee 7: Deer Hunters 2 0 0-20000 2
27074 Rune Lord 4 0 0-20000 4

2882 rows × 5 columns

In [168]:
rating_0 = df[df['%_ratings'] == df['%_ratings'].min()][['name','positive_ratings','negative_ratings','owners','QTD_ratings']].copy()
rating_0
Out[168]:
name positive_ratings negative_ratings owners QTD_ratings
577 Raycatcher 0 13 0-20000 13
683 Angle of Attack 0 2 0-20000 2
4503 Pluck 0 3 0-20000 3
4742 ProtoMasons 0 2 0-20000 2
4757 Vol'Talkes - The AI War 0 2 0-20000 2
... ... ... ... ... ...
27008 DevilShaft: TheTower 0 1 0-20000 1
27051 Space Trade Fleet 1.5 0 1 0-20000 1
27056 Hello, World. 0 1 0-20000 1
27060 地铁:恐怖末班车 0 2 0-20000 2
27072 Super Star Blast 0 1 0-20000 1

661 rows × 5 columns

In [169]:
df[['delta_ratings','%_ratings','QTD_ratings']].corr()
Out[169]:
delta_ratings %_ratings QTD_ratings
delta_ratings 1.00000 0.035700 0.955830
%_ratings 0.03570 1.000000 0.025694
QTD_ratings 0.95583 0.025694 1.000000

Como existe uma correlação alta entre QTD_ratings e delta_ratings, vou usar só o QTD_ratings.

In [170]:
rating_1['owners'].unique()
Out[170]:
array(['0-20000', '20000-50000'], dtype=object)
In [171]:
rating_0['owners'].unique()
Out[171]:
array(['0-20000', '20000-50000'], dtype=object)
In [172]:
df['owners'].unique()
Out[172]:
array(['10000000-20000000', '5000000-10000000', '2000000-5000000',
       '20000000-50000000', '100000000-200000000', '50000000-100000000',
       '20000-50000', '500000-1000000', '100000-200000', '50000-100000',
       '1000000-2000000', '200000-500000', '0-20000'], dtype=object)

Como visto acima, a maioria dessas entradas apresenta valores comparativamente baixos no campo "owners", que são as pessoas que detem esses jogos em suas bibliotecas. Podemos analisar e fazer alguns tratamentos no campo 'owners', pra, depois, ter uma ideia de qual o percentual médio de pessoas detentoras do jogo que o avaliam.

In [173]:
len(df['owners'].unique())
Out[173]:
13
In [174]:
own_split = df['owners'].apply(lambda x: pd.Series(x.split('-')))
In [175]:
df['owners_mean'] = (own_split[0].astype(int)+own_split[1].astype(int))/2
In [176]:
df['owners_mean']
Out[176]:
0        15000000.0
1         7500000.0
2         7500000.0
3         7500000.0
4         7500000.0
            ...    
27070       10000.0
27071       10000.0
27072       10000.0
27073       10000.0
27074       10000.0
Name: owners_mean, Length: 26846, dtype: float64
In [177]:
rating_1 = df[df['%_ratings'] == df['%_ratings'].max()][['name','positive_ratings','negative_ratings','owners','QTD_ratings','owners_mean']]
In [178]:
rating_0 = df[df['%_ratings'] == df['%_ratings'].min()][['name','positive_ratings','negative_ratings','owners','QTD_ratings','owners_mean']]
In [179]:
rating_1['owners_mean'].unique()
Out[179]:
array([10000., 35000.])
In [180]:
rating_0['owners_mean'].unique()
Out[180]:
array([10000., 35000.])
In [181]:
temp = df[['QTD_ratings','positive_ratings','owners_mean']].groupby('owners_mean').mean().round(2)
temp
Out[181]:
QTD_ratings positive_ratings
owners_mean
10000.0 40.73 30.74
35000.0 235.27 180.27
75000.0 502.16 385.48
150000.0 1065.38 839.09
350000.0 2206.54 1753.69
750000.0 6325.22 5142.36
1500000.0 14237.02 11781.62
3500000.0 30292.18 26056.07
7500000.0 81027.04 69231.24
15000000.0 150476.43 126773.57
35000000.0 373593.33 344998.00
75000000.0 2014988.50 1570294.00
150000000.0 1005586.00 863507.00
In [182]:
temp.reset_index(inplace=True)
In [183]:
temp
Out[183]:
owners_mean QTD_ratings positive_ratings
0 10000.0 40.73 30.74
1 35000.0 235.27 180.27
2 75000.0 502.16 385.48
3 150000.0 1065.38 839.09
4 350000.0 2206.54 1753.69
5 750000.0 6325.22 5142.36
6 1500000.0 14237.02 11781.62
7 3500000.0 30292.18 26056.07
8 7500000.0 81027.04 69231.24
9 15000000.0 150476.43 126773.57
10 35000000.0 373593.33 344998.00
11 75000000.0 2014988.50 1570294.00
12 150000000.0 1005586.00 863507.00
In [184]:
temp['QTD_ratings'].astype(float)/temp['owners_mean'].astype(float)
Out[184]:
0     0.004073
1     0.006722
2     0.006695
3     0.007103
4     0.006304
5     0.008434
6     0.009491
7     0.008655
8     0.010804
9     0.010032
10    0.010674
11    0.026867
12    0.006704
dtype: float64
In [185]:
(temp['QTD_ratings'].astype(float)/temp['owners_mean'].astype(float)).mean()
Out[185]:
0.009427474183150182
In [186]:
temp['positive_ratings'].astype(float)/temp['owners_mean'].astype(float)
Out[186]:
0     0.003074
1     0.005151
2     0.005140
3     0.005594
4     0.005011
5     0.006856
6     0.007854
7     0.007445
8     0.009231
9     0.008452
10    0.009857
11    0.020937
12    0.005757
dtype: float64

Podemos perceber que o percentual de owners que avalia os jogos não é exatamente constante entre todas as médias.

In [187]:
len(df['owners_mean'])
Out[187]:
26846
In [188]:
df['owners_mean'].hist()
Out[188]:
<AxesSubplot:>
In [189]:
len(df[df['owners_mean'] == df['owners_mean'].min()]['owners_mean'])
Out[189]:
18433
In [190]:
df[df['owners_mean'] == df['owners_mean'].min()]['owners_mean']
Out[190]:
122      10000.0
124      10000.0
145      10000.0
188      10000.0
248      10000.0
          ...   
27070    10000.0
27071    10000.0
27072    10000.0
27073    10000.0
27074    10000.0
Name: owners_mean, Length: 18433, dtype: float64

Parece que a grande maioria dos jogos nesse dataset tem, em média, 10 mil owners. Acho que vale a pena separar os dados em 3 grupos então, 1 somente com jogos com mais de 10 mil owners de média, um com todos os jogos, e outro com somente os jogos com média de 10 mil owners. Pra não atrapalhar o fluxo até agora, vou deixar pra fazer essa separação depois de ter efetuado as análises nas outras variáveis.

In [191]:
df["average_playtime"].unique()
Out[191]:
array([17612,   277,   187, ...,  1100,  6332,   816], dtype=int64)
In [192]:
len(df["average_playtime"].unique())
Out[192]:
1338
In [193]:
df["average_playtime"].head()
Out[193]:
0    17612
1      277
2      187
3      258
4      624
Name: average_playtime, dtype: int64
In [194]:
df["average_playtime"].max()
Out[194]:
190625
In [195]:
df["average_playtime"].min()
Out[195]:
0

Parece que existem jogos que foram jogados, em média, 0 horas. Acho que é uma boa ideia dar uma olhadinha neles.

In [196]:
df[df["average_playtime"] == df["average_playtime"].min()][['name','average_playtime','median_playtime']]
Out[196]:
name average_playtime median_playtime
26 Rag Doll Kung Fu 0 0
29 SiN Episodes: Emergence 0 0
34 Dangerous Waters 0 0
35 Space Empires IV Deluxe 0 0
36 Disciples II: Rise of the Elves 0 0
... ... ... ...
27070 Room of Pandora 0 0
27071 Cyber Gun 0 0
27072 Super Star Blast 0 0
27073 New Yankee 7: Deer Hunters 0 0
27074 Rune Lord 0 0

20709 rows × 3 columns

In [197]:
df[df["average_playtime"] == df["average_playtime"].min()][['name','average_playtime','median_playtime']].shape
Out[197]:
(20709, 3)

A grande maioria dos jogos não possui horas significativas de jogatina média ou mediana. Provavelmente não vai ajudar muito considerar o tempo de jogatina de cada jogo. Por via das dúvidas, vou criar uma coluna que apenas computa se o average_playtime é maior que 0.

In [198]:
df["average_playtime_>0"] = df["average_playtime"].apply(lambda x: x>0).astype(int)
In [199]:
ax = sns.histplot(df["average_playtime"], stat='probability')
ax.figure.set_size_inches(15,6)

Bom, por fim, falta avaliarmos a variável 'price'. Lembrando que os preços encontrados aqui estão em libra.

In [200]:
df['price'].unique().round(2)
Out[200]:
array([7.1900e+00, 3.9900e+00, 0.0000e+00, 5.7900e+00, 5.9900e+00,
       1.4990e+01, 6.9900e+00, 2.2990e+01, 4.9900e+00, 1.0990e+01,
       2.9900e+00, 4.2900e+00, 1.2990e+01, 2.4900e+00, 9.9900e+00,
       1.9900e+00, 1.5990e+01, 3.4900e+00, 2.4990e+01, 8.5900e+00,
       7.9900e+00, 4.4900e+00, 4.2500e+00, 5.5900e+00, 2.5990e+01,
       8.9900e+00, 4.7900e+00, 1.3990e+01, 8.2900e+00, 9.2900e+00,
       1.9990e+01, 1.7900e+00, 1.6990e+01, 8.6900e+00, 7.4900e+00,
       7.9000e-01, 1.6900e+00, 1.5490e+01, 1.4900e+00, 1.7990e+01,
       1.1990e+01, 3.5960e+01, 6.4900e+00, 1.5900e+00, 3.4990e+01,
       9.9000e-01, 5.1900e+00, 2.0900e+00, 2.9990e+01, 1.9980e+01,
       2.7900e+00, 2.8900e+00, 3.4950e+01, 1.2490e+01, 1.1390e+01,
       4.3900e+00, 1.8990e+01, 1.0590e+01, 6.1900e+00, 3.9990e+01,
       2.6990e+01, 3.0990e+01, 6.0000e+00, 2.3990e+01, 3.3990e+01,
       1.2390e+01, 8.4900e+00, 2.3790e+01, 1.9490e+01, 3.8990e+01,
       2.3390e+01, 6.9500e+00, 1.1490e+01, 1.8740e+01, 2.0990e+01,
       5.4900e+00, 1.3490e+01, 1.7980e+01, 2.7790e+01, 2.7110e+01,
       4.4980e+01, 1.1690e+01, 2.2900e+00, 5.4990e+01, 4.4990e+01,
       2.7000e+01, 4.6500e+00, 1.4490e+01, 1.3500e+01, 1.3590e+01,
       1.6300e+01, 1.1000e+01, 4.9990e+01, 7.5000e+00, 3.2900e+00,
       1.0290e+01, 1.7490e+01, 1.3000e+01, 4.5000e+00, 2.7990e+01,
       8.8900e+00, 4.5900e+00, 6.0990e+01, 6.6900e+00, 3.1990e+01,
       3.2990e+01, 3.6990e+01, 9.0000e+00, 3.1900e+00, 5.6900e+00,
       1.5000e+00, 2.2490e+01, 1.0000e+00, 3.5000e+00, 4.1990e+01,
       2.5790e+01, 1.3370e+01, 7.0000e+00, 3.9950e+01, 6.3500e+00,
       6.1500e+00, 1.0000e+01, 3.2000e-01, 3.2500e+01, 2.5900e+00,
       7.5000e-01, 5.9000e-01, 1.9900e+01, 4.2400e+00, 3.2300e+00,
       1.1240e+01, 4.9000e-01, 5.8000e+00, 2.7100e+01, 2.1000e+01,
       2.2800e+00, 9.0000e-01, 6.2900e+00, 6.3900e+00, 4.1200e+00,
       2.1990e+01, 6.5900e+00, 5.0000e+00, 1.8390e+01, 3.9200e+00,
       1.5000e+01, 1.8100e+01, 7.2000e-01, 1.1950e+01, 4.9500e+00,
       1.5500e+00, 2.0510e+01, 1.1750e+01, 2.6900e+00, 1.2000e+01,
       1.5499e+02, 7.2000e+00, 5.5400e+00, 4.4800e+01, 1.4690e+01,
       3.1000e+00, 4.6900e+00, 2.5000e+00, 1.6500e+00, 9.9000e+00,
       1.8490e+01, 2.3000e+00, 7.5990e+01, 2.3900e+00, 5.9500e+00,
       7.9000e+00, 7.8000e-01, 1.5900e+01, 1.6660e+01, 7.7700e+00,
       1.0490e+01, 2.0000e+00, 4.6490e+01, 7.5900e+00, 3.5900e+00,
       2.8500e+00, 1.4950e+01, 3.7900e+00, 4.7000e+00, 7.7000e+00,
       6.7900e+00, 3.7990e+01, 7.0900e+00, 1.0790e+01, 1.6740e+01,
       4.0000e+00, 2.3600e+00, 2.2400e+00, 3.5300e+00, 8.9000e-01,
       2.4000e+01, 8.0000e+00, 8.0000e-01, 2.8000e+00, 1.9500e+00,
       2.4900e+01, 7.1200e+00, 2.0700e+00, 1.2580e+01, 3.0900e+00,
       1.4300e+00, 1.4210e+01, 3.0000e+00, 5.0500e+00, 3.0400e+00,
       7.4000e-01, 1.1890e+01, 2.1300e+00, 7.7400e+00, 2.4500e+01,
       3.8900e+00, 1.1000e+00, 6.2990e+01, 7.8990e+01, 1.6800e+00,
       2.1000e+00, 1.3000e+00, 1.9000e+00, 1.1900e+01, 7.1800e+00,
       8.9000e+00, 2.1900e+00, 1.9800e+00, 2.1500e+00, 4.9900e+01,
       1.0100e+00, 6.8900e+00, 4.2199e+02, 3.0300e+00, 6.2800e+00,
       1.2890e+01, 1.4000e+01, 3.1400e+00, 8.5000e-01, 1.1400e+01,
       1.0900e+01, 1.2000e+00, 1.5800e+01, 5.3400e+00, 1.1900e+00,
       1.7000e+00, 6.9990e+01, 4.2000e+00, 2.0000e+01, 6.2000e+00,
       1.7000e+01, 1.1290e+01, 1.1500e+01, 1.1790e+01, 8.0200e+00,
       2.2600e+00, 5.3900e+00, 9.9500e+00, 1.8000e+00, 3.9800e+00,
       1.5400e+01, 6.5400e+00, 3.1500e+00, 2.3500e+00, 1.8900e+01,
       3.0200e+00, 7.4200e+00, 7.7000e-01, 2.9500e+00, 5.2000e+00,
       8.5000e+00, 6.1000e+00])
In [201]:
len(df['price'].unique())
Out[201]:
272
In [202]:
df['price'].hist()
Out[202]:
<AxesSubplot:>
In [203]:
from pingouin import normality
In [204]:
normality(df['price'],method='normaltest')
Out[204]:
W pval normal
price 44767.046882 0.0 False
In [205]:
df['log_price'] = np.log(df['price']+1)
In [206]:
df['log_price'].hist()
Out[206]:
<AxesSubplot:>
In [207]:
df['log_log_price'] = np.log(df['log_price']+1)
In [208]:
df['log_log_price'].hist()
Out[208]:
<AxesSubplot:>

Agora que todas as variáveis foram avaliadas, vou exportar 3 bases diferentes, baseado na média de owners, como dito anteriormente. Farei as regressões em outro arquivo.

In [209]:
Colunas = ['appid','name','Year','english','platforms_windows','platforms_mac','platforms_linux','genres_Action',
           'genres_Free to Play', 'genres_Strategy','genres_Adventure', 'genres_Indie', 'genres_RPG',
           'genres_Casual','genres_Simulation', 'genres_Racing', 'genres_Violent','genres_Massively Multiplayer',
           'genres_Nudity', 'genres_Sports',       'genres_Early Access', 'genres_Gore', 'genres_Sexual Content',
           'categories_Multi-player', 'categories_Online Multi-Player','categories_Local Multi-Player',
           'categories_Valve Anti-Cheat enabled','categories_Single-player', 'categories_Steam Cloud',
           'categories_Steam Achievements', 'categories_Steam Trading Cards','categories_Captions available',
           'categories_Partial Controller Support','categories_Includes Source SDK','categories_Cross-Platform Multiplayer',
           'categories_Stats','categories_Commentary available', 'categories_Includes level editor',
           'categories_Steam Workshop', 'categories_In-App Purchases','categories_Co-op',
           'categories_Full controller support','categories_Steam Leaderboards', 'categories_SteamVR Collectibles',
           'categories_Online Co-op', 'categories_Shared/Split Screen','categories_Local Co-op', 'categories_MMO', 
           'categories_VR Support','categories_Mods', 'categories_Mods (require HL2)',
           'categories_Steam Turn Notifications','achievements','%_ratings','QTD_ratings','owners_mean',
           'average_playtime_>0','price','log_price','log_log_price']
In [210]:
output_owners_min = df[df['owners_mean'] == df['owners_mean'].min()][Colunas]
In [211]:
output_owners = df[df['owners_mean'] != df['owners_mean'].min()][Colunas]
In [212]:
output_geral = df[Colunas]
In [213]:
output_owners_min.to_csv('Dados Tratados/output_owners_min.csv', sep=';')
In [214]:
output_owners.to_csv('Dados Tratados/output_owners.csv', sep=';')
In [215]:
output_geral.to_csv('Dados Tratados/output_geral.csv', sep=';')

Dados tratados e discriminados, agora vamos para as regressões... no arquivo "Regressões.ipynb" dentro da mesma pasta, ou, acessando ao link: https://bhscjhdvds.github.io/Minor/Regress%C3%B5es.html