Advanced Pandas Tutorial: The category Data Type

程序那些事

Published: 41 minutes ago

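Introduction

Pandas has a special data type called category. It represents a categorical variable and is typically used for values drawn from a small fixed set, such as gender, blood type, class, or grade. It is somewhat like an enum in Java.

This article walks through how to use category in detail. The examples assume pandas has been imported as pd and NumPy as np.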

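Creating a category

Creating from a Series

Passing dtype="category" when creating a Series produces categorical data. A categorical has two parts: the ordering (whether the categories are ordered) and the category literals themselves: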

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
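
The two parts mentioned above can also be read back from the dtype object. The following is a minimal sketch, assuming pandas is installed; the variable names are only illustrative.

```python
import pandas as pd

# Build a categorical Series; the categories are inferred from the data.
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The dtype object carries both parts: the category labels and the ordered flag.
dtype = s.dtype
labels = list(dtype.categories)  # the distinct labels found in the data
is_ordered = dtype.ordered       # falsy unless ordered=True was requested

print(labels, is_ordered)
```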

A Series inside a DataFrame (DF) can be converted to category:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype("category")

In [5]: df["B"]
Out[5]:
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

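You can also build a pandas.Categorical first and pass it to Series. Values that are not among the given categories, such as "a" here, become NaN: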

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
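
The NaN entries in that output can be checked programmatically. This short sketch reuses the same data and only adds two illustrative variables, missing_mask and declared.

```python
import pandas as pd

# "a" is not among the declared categories, so both "a" entries become NaN.
raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False)
s = pd.Series(raw_cat)

missing_mask = s.isna().tolist()   # positions that fell outside the categories
declared = list(s.cat.categories)  # "d" stays declared even though it never occurs

print(missing_mask, declared)
```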

Creating from a DF

dtype="category" can also be passed when creating a DataFrame:

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]:
A    category
B    category
dtype: object

Both A and B in the DF are categorical, each with its own inferred categories:

In [19]: df["A"]
Out[19]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Alternatively, df.astype("category") converts every Series in the DF to category:

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]:
A    category
B    category
dtype: object

Controlling creation

By default, a category created with dtype='category' behaves as follows:

Categories are inferred from the data.
Categories are unordered.

Creating a CategoricalDtype explicitly lets you change both defaults:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

A CategoricalDtype can be applied to a DF in the same way, giving every column the same categories:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["A"]
Out[35]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
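
To make the difference between inferred and explicit categories concrete, the sketch below compares the column dtypes in both cases. The names inferred, shared, and shared_type are illustrative and not part of the original example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Inferred per column: A gets ['a', 'b', 'c'], B gets ['b', 'c', 'd'].
inferred = df.astype("category")
same_when_inferred = inferred["A"].dtype == inferred["B"].dtype

# One explicit dtype applied to every column gives a common, ordered category set.
shared_type = CategoricalDtype(categories=list("abcd"), ordered=True)
shared = df.astype(shared_type)
same_when_shared = shared["A"].dtype == shared["B"].dtype

print(same_when_inferred, same_when_shared)
```

A shared dtype matters later on, because ordered comparisons between two categoricals require identical categories.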

Converting back to the original type

Series.astype(original_dtype) or np.asarray(categorical) converts a category back to its original type:

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]:
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]:
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
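
The same round trip works for non-string data. The following is a small sketch with integers, where original, as_cat, and restored are illustrative names.

```python
import numpy as np
import pandas as pd

original = pd.Series([1, 2, 3, 1])
as_cat = original.astype("category")

# astype with the original dtype restores a plain integer Series.
restored = as_cat.astype(original.dtype)

# np.asarray materializes the values as a NumPy array.
values = np.asarray(as_cat)

print(restored.dtype, values.tolist())
```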

Operations on categories

Getting the properties of a category

Categorical data has two properties, categories and ordered, available as s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

Categories keep the order in which they are passed, but a custom order alone does not make the data ordered:

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False
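
The position of each label in s.cat.categories is what pandas stores per value. The sketch below uses s.cat.codes, an accessor not shown in the original text, to make that visible.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

# Each value is stored as the position of its label in s.cat.categories.
codes = s.cat.codes.tolist()

print(list(s.cat.categories), codes, s.cat.ordered)
```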

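Renaming categories

In pandas versions before 2.0, categories can be renamed by assigning to s.cat.categories (pandas 2.0 removed this setter in favor of rename_categories):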

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [68]: s
Out[68]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [70]: s
Out[70]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories achieves the same result:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or pass a dict:

# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [74]: s
Out[74]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
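
rename_categories also accepts a callable, and a dict does not have to mention every label. The following sketch shows both on fresh data; grouped and partial are illustrative names.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# A callable is applied to every category label.
grouped = s.cat.rename_categories(lambda label: "Group %s" % label)

# A dict renames only the labels it mentions; "c" is left unchanged here.
partial = s.cat.rename_categories({"a": "x", "b": "y"})

print(list(grouped.cat.categories), list(partial.cat.categories))
```

Because rename_categories returns a new Series, the original s keeps its labels unless the result is assigned back.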

Adding categories with add_categories

add_categories appends new categories:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]:
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]
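
A common reason to add a category is to be able to assign it afterwards. The sketch below, with the hypothetical label "w", shows the assignment being rejected before the category exists and accepted after. The exception type is caught broadly because it has differed between pandas versions.

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# Assigning a label that is not a declared category is rejected.
try:
    s.iloc[0] = "w"
    rejected = False
except (TypeError, ValueError):  # the exception type has differed between pandas versions
    rejected = True

# After declaring the new category, the same assignment is accepted.
s = s.cat.add_categories(["w"])
s.iloc[0] = "w"

print(rejected, s.tolist())
```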

Removing categories with remove_categories

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
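
The category removed above was unused. When a category that is in use gets removed, the affected values become NaN, as this small sketch illustrates (trimmed is an illustrative name):

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# "x" is in use, so removing it turns both "x" entries into NaN.
trimmed = s.cat.remove_categories(["x"])

print(list(trimmed.cat.categories), trimmed.isna().tolist())
```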

Removing unused categories

In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [83]: s
Out[83]:
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]:
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
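
Unused categories typically appear after filtering rows. The following sketch, with illustrative names filtered, before, and after, shows that filtering alone keeps the full category set.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filtering rows does not shrink the category set.
filtered = s[s != "c"]
before = list(filtered.cat.categories)

# remove_unused_categories drops labels that no longer occur.
after = list(filtered.cat.remove_unused_categories().cat.categories)

print(before, after)
```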

Resetting categories

set_categories() adds and removes categories in one step; values whose category is removed become NaN:

In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [86]: s
Out[86]:
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [88]: s
Out[88]:
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

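Sorting a category

If a category is created with ordered=True, its category order is meaningful, so it can be sorted by that order and supports min() and max():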

In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')
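
The effect is easier to see when the logical order differs from alphabetical order. The sketch below uses hypothetical size labels that are not part of the original article.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical size labels whose logical order differs from alphabetical order.
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["large", "small", "medium", "small"]).astype(size_type)

sorted_sizes = sizes.sort_values().tolist()  # follows the declared order
smallest, largest = sizes.min(), sizes.max()

# Without ordered=True, min() is not defined and raises TypeError.
try:
    sizes.cat.as_unordered().min()
    unordered_min_failed = False
except TypeError:
    unordered_min_failed = True

print(sorted_sizes, smallest, largest, unordered_min_failed)
```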

as_ordered() and as_unordered() return the data marked as ordered or unordered:

In [95]: s.cat.as_ordered()
Out[95]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Reordering

Categorical.reorder_categories() changes the order of the existing categories:

In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
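
reorder_categories expects exactly the existing labels in a new order. The sketch below shows the effect on max() and what happens when a label is left out; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Reordering keeps the values and only changes the order of the labels.
reordered = s.cat.reorder_categories([2, 3, 1], ordered=True)

# Passing a different set of labels is an error; set_categories is the tool for that.
try:
    s.cat.reorder_categories([2, 3], ordered=True)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(list(reordered.cat.categories), reordered.max(), mismatch_rejected)
```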

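Sorting by multiple columns

sort_values can sort by several columns at once; a categorical column is sorted by its category order: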

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )

In [110]: dfs.sort_values(by=["A", "B"])
Out[110]:
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

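Comparison operations

If ordered=True is set at creation, categories can be compared with each other using ==, !=, >, >=, <, and <=. The equality operators == and != also work on unordered categories; the ordering operators require ordered=True.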

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]:
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]:
0     True
1    False
2    False
dtype: bool
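
cat_base2 is defined above but never compared. Its categories are inferred as [2] only, which differs from the categories of cat, and the sketch below shows how pandas reacts to an ordering comparison between the two. The helper names equal_to_two and refused are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Equality against a scalar works element-wise.
equal_to_two = (cat == 2).tolist()

# cat_base2 has categories [2] only, so an ordering comparison is refused.
try:
    cat > cat_base2
    refused = False
except TypeError:
    refused = True

print(equal_to_two, refused)
```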

Other operations

A category column is still a Series, so most Series operations are available, for example Series.min(), Series.max(), and Series.mode().

value_counts:

In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [132]: s.value_counts()
Out[132]:
c    2
a    1
b    1
d    0
dtype: int64

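DataFrame.sum() (the level argument used below was removed in pandas 2.0; on newer versions, group by the level and then sum):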

In [133]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: )

In [135]: df.sum(axis=1, level=1)
Out[135]:
   One  Two  Three
0    3    3      0
1    9    6      0
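
On pandas releases where sum() no longer accepts level, one possible equivalent is to transpose, group on the index level, sum, and transpose back. This is a sketch under that assumption rather than the article's own method; totals is an illustrative name, and whether the unobserved Three column appears may depend on the pandas version.

```python
import pandas as pd

columns = pd.Categorical(
    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
)

# Transpose so the column level becomes a row level, group on it, sum, transpose back.
totals = df.T.groupby(level=1, observed=False).sum().T

print(totals)
```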

Groupby:

In [136]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )

In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby("cats").mean()
Out[138]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )

In [141]: df2.groupby(["cats", "B"]).mean()
Out[141]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
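
The outputs above include categories that never occur in the data, which corresponds to grouping with observed=False. Since the default for observed has been changing across pandas releases, passing it explicitly is the safer choice. The following sketch shows both settings; all_groups and seen_groups are illustrative names.

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False keeps every declared category, so "d" appears with NaN.
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True keeps only categories that actually occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(all_groups, seen_groups, sep="\n")
```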

Pivot tables:

In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[144]:
     values
A B
a c       1
  d       2
b c       3
  d       4

This article is also available at http://www.flydean.com/08-python-pandas-category/

Original link: http://xie.infoq.cn/article/f0f967446707f96044528cbb6. Please contact the author before reposting.
