Pandas + Seaborn + Plotly: Exploring the Apple App Store

Author: Peter

February 26, 2022

WeChat official account: 尤而小屋

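Hi everyone, I'm Peter.

Today I'm sharing a new Kaggle walkthrough: a visual exploration of the Apple App Store dataset with Seaborn and Plotly. The whole case study relies on descriptive statistics and visualization only.

The original notebook uses only seaborn; I have re-implemented many of the charts with plotly. Original notebook: https://www.kaggle.com/adityapatil673/visual-analysis-of-apps-on-applestore/notebook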

Importing libraries

import pandas as pd
import numpy as np

# Visualization
from matplotlib import pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go

Basic information about the data

Read the data and inspect its basic properties:

# 1. Overall shape
data.shape
(7197, 16)

# 2. Missing values
data.isnull().sum()
id                  0
track_name          0
size_bytes          0
currency            0
price               0
rating_count_tot    0
rating_count_ver    0
user_rating         0
user_rating_ver     0
ver                 0
cont_rating         0
prime_genre         0
sup_devices.num     0
ipadSc_urls.num     0
lang.num            0
vpp_lic             0
dtype: int64

# 3. Column types
data.dtypes
id                    int64
track_name           object
size_bytes            int64
currency             object
price               float64
rating_count_tot      int64
rating_count_ver      int64
user_rating         float64
user_rating_ver     float64
ver                  object
cont_rating          object
prime_genre          object
sup_devices.num       int64
ipadSc_urls.num       int64
lang.num              int64
vpp_lic               int64
dtype: object
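The article does not show the loading step itself. A minimal sketch of how it might look is given here; the two-row CSV text is a made-up stand-in so the snippet runs on its own, and the file name in the comment is an assumption, not something the article states.

```python
import io
import pandas as pd

# Hypothetical two-row extract standing in for the real CSV file.
csv_text = """id,track_name,size_bytes,currency,price,prime_genre
1,Example Game,104857600,USD,0.0,Games
2,Example Editor,52428800,USD,4.99,Photo & Video
"""

# With the real dataset this would be something like:
# data = pd.read_csv("AppleStore.csv")  # file name is an assumption
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)           # (rows, columns)
print(data.isnull().sum())  # missing values per column
print(data.dtypes)          # inferred type of each column
```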

It is also common to look at descriptive statistics for the numeric columns, for example with data.describe().
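As a rough illustration of that step, the sketch below calls describe() on a small made-up frame; the column names follow the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical mini-sample; the real `data` has 7197 rows and 16 columns.
sample = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "price": [0.0, 0.99, 2.99, 4.99],
    "user_rating": [4.5, 4.0, 3.5, 5.0],
})

# By default describe() summarizes the numeric columns only,
# so the text column track_name is left out of the result.
stats = sample.describe()
print(stats)
```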

App statistics

Number of free apps

sum(data.price == 0)
4056

Number of apps priced above 50

Apps priced above 50 are treated as "super expensive apps" (the original notebook's wording).

sum(data.price >= 50)
7

Share of apps priced above 50 (in percent)

sum((data.price > 50) / len(data.price) * 100)
0.09726274836737528

# My own version
sum(data.price >= 50) / len(data) * 100
0.09726274836737529
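The same percentage can also be read off as the mean of a boolean mask, since True counts as 1 and False as 0. The sketch below uses made-up prices rather than the real dataset.

```python
import pandas as pd

# Hypothetical prices; only one of the eight exceeds 50.
prices = pd.Series([0.0, 0.0, 0.99, 1.99, 2.99, 4.99, 9.99, 59.99])

# Count of matches divided by the number of rows
share_by_sum = (prices > 50).sum() / len(prices) * 100

# Mean of a boolean Series is the fraction of True values
share_by_mean = (prices > 50).mean() * 100

print(share_by_sum, share_by_mean)
```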

Outliers

Details of the apps priced above 50

outlier = data[data.price > 50][['track_name', 'price', 'prime_genre', 'user_rating']]
outlier

Free apps

Select the records for free apps
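The code for this step is not shown in the article. One plausible way to do it is a boolean filter on price; the variable name freeapps is an assumption chosen to mirror paidapps, and the sample frame is made up.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the dataset.
data = pd.DataFrame({
    "track_name": ["A", "B", "C"],
    "price": [0.0, 1.99, 0.0],
    "prime_genre": ["Games", "Education", "Games"],
})

# `freeapps` is an assumed name; keep only rows whose price is exactly 0
freeapps = data[data.price == 0]
print(freeapps)
```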

Apps in the normal price range

Selecting the data

paidapps = data[(data["price"] > 0) & (data.price < 50)]

# Maximum and minimum price within the normal range
print("max_price:", max(paidapps.price))
print("min_price:", min(paidapps.price))
max_price: 49.99
min_price: 0.99

Price distribution

plt.style.use("fivethirtyeight")
plt.figure(figsize=(12,10))

# 1. Histogram
# subplot(2,1,1): first panel of a grid with two rows and one column
plt.subplot(2,1,1)
plt.hist(paidapps.price, log=True)  # histogram with a log-scaled count axis
# Title and axis labels
plt.title("Price distribution of apps (Log scale)")
plt.ylabel("Frequency Log scale")
plt.xlabel("Price Distributions in ($) ")

# 2. Strip plot (scatter of the distribution)
# second panel of the grid
plt.subplot(2,1,2)
plt.title("Visual Price distribution")
sns.stripplot(data=paidapps,  # full data
              x="price",  # column to plot, on the horizontal axis
              jitter=True,  # spread out overlapping points
              orient="h",  # h = horizontal, v = vertical
              size=6
             )
plt.show()

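Conclusion 1

The number of paid apps falls roughly exponentially as the price rises.
Very few apps cost more than 30 dollars, so it is advisable to keep the price below 30.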

How category affects the price distribution

data.columns  # column names

Index(['id', 'track_name', 'size_bytes', 'currency', 'price',
       'rating_count_tot', 'rating_count_ver', 'user_rating',
       'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
       'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'],
      dtype='object')

Categories and their counts

data["prime_genre"].value_counts()

Games                3862
Entertainment         535
Education             453
Photo & Video         349
Utilities             248
Health & Fitness      180
Productivity          178
Social Networking     167
Lifestyle             144
Music                 138
Shopping              122
Sports                114
Book                  112
Finance               104
Travel                 81
News                   75
Weather                72
Reference              64
Food & Drink           63
Business               57
Navigation             46
Medical                23
Catalogs               10
Name: prime_genre, dtype: int64

Showing the top 5 categories

# Price range shown on the horizontal axis
yrange = [0,25]
fsize = 15
plt.figure(figsize=(12,10))

# Draw the 5 subplots one by one

# Panel 1
plt.subplot(5,1,1)
plt.xlim(yrange)
# Select the first category
games = paidapps[paidapps["prime_genre"] == "Games"]
sns.stripplot(data=games, x="price", jitter=True, orient="h", size=6, color="#eb5e66")
plt.title("Games", fontsize=fsize)
plt.xlabel("")

# Panel 2
plt.subplot(5,1,2)
plt.xlim(yrange)
# Select the second category
ent = paidapps[paidapps["prime_genre"] == "Entertainment"]
sns.stripplot(data=ent, x="price", jitter=True, orient="h", size=6, color="#ff8300")
plt.title("Entertainment", fontsize=fsize)
plt.xlabel("")

# Panel 3
plt.subplot(5,1,3)
plt.xlim(yrange)
edu = paidapps[paidapps.prime_genre == 'Education']
sns.stripplot(data=edu, x='price', jitter=True, orient='h', size=6, color='#20B2AA')
plt.title('Education', fontsize=fsize)
plt.xlabel('')

# Panel 4
plt.subplot(5,1,4)
plt.xlim(yrange)
pv = paidapps[paidapps.prime_genre == 'Photo & Video']
sns.stripplot(data=pv, x='price', jitter=True, orient='h', size=6, color='#b84efd')
plt.title('Photo & Video', fontsize=fsize)
plt.xlabel('')

# Panel 5 (my addition)
plt.subplot(5,1,5)
plt.xlim(yrange)
ut = paidapps[paidapps.prime_genre == 'Utilities']
sns.stripplot(data=ut, x='price', jitter=True, orient='h', size=6, color='#084cfd')
plt.title('Utilities', fontsize=fsize)
plt.xlabel('')

plt.show()

Conclusion 2

Paid Games apps are priced comparatively high and spread over a wider range, up to 25 dollars.
Paid Entertainment apps are priced comparatively low.
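A visual impression like this can be cross-checked with a per-genre summary table. The sketch below is one way to do that with groupby; the prices are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical paid apps; in the article `paidapps` is the frame
# filtered to prices strictly between 0 and 50.
paidapps = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Games", "Entertainment", "Entertainment"],
    "price": [0.99, 4.99, 24.99, 0.99, 2.99],
})

# Count, median and maximum price per genre, highest maximum first
summary = (
    paidapps.groupby("prime_genre")["price"]
    .agg(["count", "median", "max"])
    .sort_values("max", ascending=False)
)
print(summary)
```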

Paid apps vs. free apps

A comparison between paid and free apps

App categories

# App categories
categories = data["prime_genre"].value_counts()
categories

Games                3862
Entertainment         535
Education             453
Photo & Video         349
Utilities             248
Health & Fitness      180
Productivity          178
Social Networking     167
Lifestyle             144
Music                 138
Shopping              122
Sports                114
Book                  112
Finance               104
Travel                 81
News                   75
Weather                72
Reference              64
Food & Drink           63
Business               57
Navigation             46
Medical                23
Catalogs               10
Name: prime_genre, dtype: int64

len(categories)
23

Keeping the top 4

Keep the top 4 categories and label every other app as Others.

s = categories.index[:4]
s
Index(['Games', 'Entertainment', 'Education', 'Photo & Video'], dtype='object')

def categ(x):
    if x in s:
        return x
    else:
        return "Others"

data["broad_genre"] = data["prime_genre"].apply(categ)
data.head()
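The same relabeling can be written without a helper function, using Series.where together with isin. This is an alternative sketch on a made-up Series, not the article's method.

```python
import pandas as pd

# Hypothetical genre values
genres = pd.Series(["Games", "Weather", "Education", "Finance", "Games"])
top = ["Games", "Entertainment", "Education", "Photo & Video"]

# Keep values found in `top`; replace everything else with "Others"
broad = genres.where(genres.isin(top), "Others")
print(broad.tolist())
```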

Category counts for free and paid apps

# Free
data[data.price == 0].broad_genre.value_counts()

Games            2257
Others           1166
Entertainment     334
Photo & Video     167
Education         132
Name: broad_genre, dtype: int64

Combine the two sets of counts into one table.

Comparing the statistics

Highlighting the maximum values (my addition)
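The code for these three steps is not shown, so what follows is only a sketch of how the dist table used later might be assembled. The free counts and per-genre totals come from the outputs printed in the article; the Others total (7197 minus the four named genres) and the paid counts (total minus free) are derived here, and the column names free, paid, free_per and paid_per are taken from how dist is used further on.

```python
import pandas as pd

# Free-app counts per broad genre, as printed in the article
free = pd.Series({"Education": 132, "Entertainment": 334, "Games": 2257,
                  "Others": 1166, "Photo & Video": 167}, name="free")

# Totals per broad genre; "Others" is derived as 7197 - (453 + 535 + 3862 + 349)
total = pd.Series({"Education": 453, "Entertainment": 535, "Games": 3862,
                   "Others": 1998, "Photo & Video": 349}, name="total")

dist = pd.concat([free, total], axis=1)
dist["paid"] = dist["total"] - dist["free"]               # derived, not printed in the article
dist["free_per"] = dist["free"] / dist["total"] * 100     # percent free
dist["paid_per"] = dist["paid"] / dist["total"] * 100     # percent paid

print(dist.round(2))
print(dist[["free_per", "paid_per"]].idxmax())  # genre with the largest share in each column
```

In a notebook, dist.style.highlight_max(axis=0) would render the table with the largest value in each column highlighted.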

Conclusion 3

The highlighted table shows that:

Games has the most apps, both paid and free.
Education has the largest share of paid apps.
Entertainment has the largest share of free apps.

Paid versus free shares

Generating the data

Paid and free shares by group

list_free = dist.free_per.tolist()
list_free

[29.13907284768212, 62.42990654205608, 58.44122216468152, 58.35835835835835, 47.85100286532951]

# Convert the list to a tuple
tuple_free = tuple(list_free)

# Same operation for the paid shares
tuple_paidapps = tuple(dist.paid_per.tolist())

Bar chart

plt.figure(figsize=(12,8))
N = 5
ind = np.arange(N)
width = 0.56  # width of each bar

# Free shares at the bottom, paid shares stacked on top
p1 = plt.bar(ind, tuple_free, width, color="#45cea2")
p2 = plt.bar(ind, tuple_paidapps, width, bottom=tuple_free, color="#fdd400")

plt.xticks(ind, tuple(dist.index.tolist()))
plt.legend((p1[0], p2[0]), ("free", "paid"))
plt.show()

Pie charts

# Data for the pie charts
pies = dist[['free_per', 'paid_per']].copy()
pies.columns = ['free %', 'paid %']
pies

pies.T.plot.pie(subplots=True,  # one pie per category
                figsize=(20,4),  # figure size
                colors=['#45cea2','#fad470']  # colors
               )
plt.show()

Conclusion 4

Among Education apps, the share of paid apps is very high.
Among Entertainment apps, by contrast, the share of free apps is very high.

Are paid apps really good enough?

Classifying by price

# Price label: 0 means Free, anything above 0 means Paid
data["category"] = data["price"].apply(lambda x: "Paid" if x > 0 else "Free")
data.head()
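For larger frames, the same labeling can be done in one vectorized call with numpy.where instead of a row-by-row lambda. A small sketch on made-up prices:

```python
import numpy as np
import pandas as pd

# Hypothetical prices
prices = pd.Series([0.0, 0.99, 0.0, 4.99])

# "Paid" where the price is above 0, "Free" otherwise
category = np.where(prices > 0, "Paid", "Free")
print(category)
```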

Violin plot

plt.figure(figsize=(15,8))
plt.style.use("fast")
plt.ylim([0,5])

plt.title("Distribution of User ratings")

sns.violinplot(data=data,  # data plus the two axes
               y="user_rating",
               x="broad_genre",
               hue="category",  # grouping variable
               split=True,  # draw Free and Paid as two halves of one violin
               linewidth=2,
               scale="count",  # violin width reflects the number of observations
               palette=['#fdd470','#45cea2']
              )

plt.xlabel(" ")
plt.ylabel("Rating(0-5)")

plt.show()

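Conclusion 5 (my addition)

Among Education apps, paid apps clearly outnumber free ones; Photo & Video comes next.
Among Entertainment apps, free apps outnumber paid ones, and the ratings are spread over a wider range overall.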

Note that the split parameter is changed in the code below:

plt.figure(figsize=(15,8))
plt.style.use("fast")
plt.ylim([0,5])

plt.title("Distribution of User ratings")

sns.violinplot(data=data,
               y="user_rating",
               x="broad_genre",
               hue="category",
               split=False,  # the parameter to watch: Free and Paid get separate violins
               linewidth=2,
               scale="count",
               palette=['#fdd470','#45cea2']
              )

plt.xlabel(" ")
plt.ylabel("Rating(0-5)")

plt.show()

Relationship between size and price

Question: do more expensive apps tend to be larger?

sns.color_palette("husl",8)
sns.set_style("whitegrid")

flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]

# Convert bytes to megabytes
data["MB"] = data.size_bytes.apply(lambda x: x/1048576)
# Select paid apps priced below 30
paidapps_regression = data[((data.price<30) & (data.price>0))]

sns.lmplot(data=paidapps_regression,
           x="MB",
           y="price",
           height=4,  # height of each facet
           aspect=2,
           col_wrap=2,
           hue="broad_genre",
           col="broad_genre",
           fit_reg=False,  # scatter only, no regression line
           palette=sns.color_palette("husl",5)
          )

plt.show()
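A scatter plot answers the question visually; a per-genre Pearson correlation gives a quick numeric companion. The sketch below is an addition, not part of the original notebook, and the sizes and prices are made up.

```python
import pandas as pd

# Hypothetical paid apps with size in MB and price in dollars
apps = pd.DataFrame({
    "broad_genre": ["Games"] * 4 + ["Education"] * 4,
    "MB": [50, 100, 200, 400, 20, 40, 60, 80],
    "price": [0.99, 1.99, 2.99, 4.99, 2.99, 2.99, 1.99, 0.99],
})

# Pearson correlation between size and price within each genre
corr_by_genre = {
    genre: group["MB"].corr(group["price"])
    for genre, group in apps.groupby("broad_genre")
}
print(corr_by_genre)
```

A value near 0 would suggest that size and price are largely unrelated within that genre.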

Plotly version (my addition)

The same chart built with plotly:

px.scatter(paidapps_regression,
           x="MB",
           y="price",
           color="broad_genre",
           facet_col="broad_genre",
           facet_col_wrap=2
          )

App categories: can they be split by paid and free?

Shares of the 5 categories

# 1. Colors and figure size
BlueOrangeWapang = ['#fc910d','#fcb13e','#239cd3','#1674b1','#ed6d50']
plt.figure(figsize=(10,10))

# 2. Data
label_names = data.broad_genre.value_counts().sort_index().index
size = data.broad_genre.value_counts().sort_index().tolist()

# 3. White inner circle that turns the pie into a donut
my_circle = plt.Circle((0,0), 0.5, color='white')
# 4. Pie
plt.pie(size, labels=label_names, colors=BlueOrangeWapang)
p = plt.gcf()
p.gca().add_artist(my_circle)
plt.show()

The same chart with plotly:

# Plotly version
fig = px.pie(values=size,
             names=label_names,
             hole=0.5)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

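The 5 categories split by paid and free

# Free counts go on even positions, paid counts on odd positions,
# so that sorting the index interleaves them category by category
f = pd.DataFrame(index=np.arange(0,10,2),
                 data=dist.free.values,  # free
                 columns=['num'])
p = pd.DataFrame(index=np.arange(1,11,2),
                 data=dist.paid.values,  # paid
                 columns=['num'])

final = pd.concat([f,p], names=['labels']).sort_index()
final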
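The interleaved free/paid order can also be obtained by flattening the two count columns row by row. The sketch below is an alternative, not the article's code; the free counts are those printed in the article and the paid counts are derived as total minus free.

```python
import pandas as pd

# Counts per broad genre (paid = total - free, derived from the article's figures)
dist = pd.DataFrame(
    {"free": [132, 334, 2257, 1166, 167], "paid": [321, 201, 1605, 832, 182]},
    index=["Education", "Entertainment", "Games", "Others", "Photo & Video"],
)

# ravel() flattens the 5x2 array row by row, so free and paid alternate per genre
subgroup_size = dist[["free", "paid"]].to_numpy().ravel().tolist()
final = pd.DataFrame({"num": subgroup_size})
print(final)
```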

plt.figure(figsize=(20,20))

group_names = data.broad_genre.value_counts().sort_index().index
group_size = data.broad_genre.value_counts().sort_index().tolist()
h = ['Free', 'Paid']

subgroup_names = 5*h
sub = ['#45cea2','#fdd470']
subcolors = 5*sub
subgroup_size = final.num.tolist()

# Outer ring: the 5 categories
fig, ax = plt.subplots()
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=2.5, labels=group_names, colors=BlueOrangeWapang)
plt.setp(mypie, width=1.2, edgecolor='white')

# Inner ring: Free and Paid within each category
mypie2, _ = ax.pie(subgroup_size, radius=1.6, labels=subgroup_names, labeldistance=0.7, colors=subcolors)
plt.setp(mypie2, width=0.8, edgecolor='white')
plt.margins(0,0)

plt.show()

The plotly version:

# Plotly version
# values="MB" sizes each sector by total app size in MB, not by app count
fig = px.sunburst(
  data,
  path=["broad_genre","category"],
  values="MB"
)
fig.show()


Original link: http://xie.infoq.cn/article/20c95f1355b6ae13ec69a9755. Please contact the author before reposting.
