Advanced pandas tutorial: window operations

程序那些事

Published: 1 hour ago

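Introduction

Statistical work often involves computing over a range of observations, and such a range is called a window. pandas provides a rolling method that slides a window across the data and computes a statistic at each position.

This article looks at how windows are used with rolling.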

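Rolling windows

Suppose we have 5 numbers and want a rolling sum over every two consecutive values. We can write this (the examples assume pandas is imported as pd and NumPy as np):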

In [1]: s = pd.Series(range(5))

In [2]: s.rolling(window=2).sum()
Out[2]: 
0    NaN
1    1.0
2    3.0
3    5.0
4    7.0
dtype: float64
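
To make the mechanics concrete, here is a minimal sketch that reproduces the rolling sum by hand. The name manual is only illustrative; the idea is that a window of two is each value plus its predecessor.

```python
import pandas as pd

s = pd.Series(range(5))

# Built-in rolling sum over a window of two observations.
rolled = s.rolling(window=2).sum()

# Hand-rolled equivalent: add each value to the one before it.
# shift(1) leaves position 0 without a predecessor, hence the leading NaN.
manual = s + s.shift(1)

print(rolled.equals(manual))
```

The first position has no predecessor, which is why both versions start with NaN.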

A rolling object can be iterated with a for loop:

In [3]: for window in s.rolling(window=2):
   ...:     print(window)
   ...: 
0    0
dtype: int64
0    0
1    1
dtype: int64
1    1
2    2
dtype: int64
2    2
3    3
dtype: int64
3    3
4    4
dtype: int64
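
One detail worth noticing in the loop output is that the first window holds a single row. The sketch below, with the illustrative names sums and sizes, aggregates each iterated window manually. Because iteration hands back the raw slices, the first sum is 0 rather than the NaN that .sum() on the rolling object reports.

```python
import pandas as pd

s = pd.Series(range(5))

# Each iterated window is a plain Series; the first one is shorter than
# window=2 because iteration itself does not enforce min_periods.
sums = [int(window.sum()) for window in s.rolling(window=2)]
sizes = [len(window) for window in s.rolling(window=2)]

print(sums)
print(sizes)
```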

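pandas has four kinds of window operations: rolling windows (rolling), weighted windows (rolling with win_type), expanding windows (expanding) and exponentially weighted windows (ewm).

Here is an example of rolling over a time-based window: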

In [4]: s = pd.Series(range(5), index=pd.date_range('2020-01-01', periods=5, freq='1D'))

In [5]: s.rolling(window='2D').sum()
Out[5]: 
2020-01-01    0.0
2020-01-02    1.0
2020-01-03    3.0
2020-01-04    5.0
2020-01-05    7.0
Freq: D, dtype: float64
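
With evenly spaced daily data a '2D' window and window=2 cover the same rows, so the difference is easy to miss. The following sketch uses a made-up index with a gap (the dates are an assumption for illustration) to show that a time-based window looks at elapsed time, not row count.

```python
import pandas as pd

# Hypothetical series with a gap between 2020-01-03 and 2020-01-07.
idx = pd.DatetimeIndex(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-07", "2020-01-08"]
)
s = pd.Series(range(5), index=idx)

by_count = s.rolling(window=2).sum()     # always the last two rows
by_time = s.rolling(window="2D").sum()   # rows within the last two days

print(pd.DataFrame({"window=2": by_count, "window='2D'": by_time}))
```

On 2020-01-07 the time-based window contains only that day's value, because the previous row is four days older.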

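min_periods sets the minimum number of non-NaN observations a window must contain to produce a result; windows with fewer valid values yield NaN: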

In [8]: s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

In [9]: s.rolling(window=3, min_periods=1).sum()
Out[9]: 
0    NaN
1    1.0
2    3.0
3    3.0
4    2.0
5    3.0
dtype: float64

In [10]: s.rolling(window=3, min_periods=2).sum()
Out[10]: 
0    NaN
1    NaN
2    3.0
3    3.0
4    NaN
5    NaN
dtype: float64

# Equivalent to min_periods=3
In [11]: s.rolling(window=3, min_periods=None).sum()
Out[11]: 
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
dtype: float64
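
A short sketch of how min_periods decides which rows get a value: count the non-NaN entries in each window and compare the count against the threshold. The helper name valid is an illustrative choice, not part of pandas.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

# Number of non-NaN observations in each window of three rows.
valid = s.notna().astype(int).rolling(window=3, min_periods=1).sum()

# A result is emitted only where that count reaches min_periods.
result = s.rolling(window=3, min_periods=2).sum()

print(pd.DataFrame({"valid": valid, "result": result}))
print((result.notna() == (valid >= 2)).all())
```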

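Centered window

By default the result is labelled at the right edge of the window. With window=5, for example, positions 0, 1, 2 and 3 have fewer than 5 observations behind them, so their results are NaN.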

In [19]: s = pd.Series(range(10))

In [20]: s.rolling(window=5).mean()
Out[20]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

This can be changed: setting center=True labels each result at the center of its window instead:

In [21]: s.rolling(window=5, center=True).mean()
Out[21]: 
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    NaN
9    NaN
dtype: float64
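
One way to read center=True is as a relabelling of the default result. The sketch below checks that reading for this particular case (an odd window of 5 on a plain integer index); it is an illustration, not a statement about how pandas implements centering.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(10))

trailing = s.rolling(window=5).mean()
centered = s.rolling(window=5, center=True).mean()

# For window=5 the centered label sits two positions left of the right edge,
# so moving the trailing result up by two rows reproduces it.
print(np.allclose(centered, trailing.shift(-2), equal_nan=True))
```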

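Weighted window

win_type selects the kind of weighted window. Its value must be one of the window types in scipy.signal.

A few examples (the input here is linear and the weights are symmetric, so all three calls return the same values):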

In [47]: s = pd.Series(range(10))

In [48]: s.rolling(window=5).mean()
Out[48]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

In [49]: s.rolling(window=5, win_type="triang").mean()
Out[49]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

# Supplementary Scipy arguments passed in the aggregation function
In [50]: s.rolling(window=5, win_type="gaussian").mean(std=0.1)
Out[50]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64
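
To see weighting actually change a result, a non-linear input helps. The sketch below avoids the SciPy dependency by hard-coding triangular weights for a window of five; the weights array and the weighted_mean helper are assumptions for illustration, chosen to mirror what a triangular window of length 5 would be expected to look like, and the weighted mean is taken as sum(w * x) / sum(w).

```python
import numpy as np
import pandas as pd

# Non-linear input so that the weighting visibly changes the result.
s = pd.Series([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])

# Assumed triangular weights for a window of five (peak in the middle).
weights = np.array([1, 2, 3, 2, 1]) / 3.0


def weighted_mean(values):
    # Weighted average of one window: sum(w * x) / sum(w).
    return np.dot(values, weights) / weights.sum()


plain = s.rolling(window=5).mean()
weighted = s.rolling(window=5).apply(weighted_mean, raw=True)

print(pd.DataFrame({"plain": plain, "weighted": weighted}))
```

The weighted column is pulled toward the middle value of each window, so it comes out lower than the plain mean for this convex series.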

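Expanding windows

An expanding window yields the value of an aggregation statistic computed over all the data available up to that point in time.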

In [51]: df = pd.DataFrame(range(5))

In [52]: df.rolling(window=len(df), min_periods=1).mean()
Out[52]: 
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0

In [53]: df.expanding(min_periods=1).mean()
Out[53]: 
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0
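
The expanding mean can also be read as a running total divided by a running count. A small sketch of that reading, with counts and manual as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(range(5))

expanding_mean = df.expanding(min_periods=1).mean()

# Running total divided by the number of rows seen so far.
counts = np.arange(1, len(df) + 1)
manual = df.cumsum().div(counts, axis=0)

print(np.allclose(expanding_mean, manual))
```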

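Exponentially weighted windows

An exponentially weighted window is similar to an expanding window, except that each earlier point is weighted down exponentially relative to the current point.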

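The weighted average is computed as:

y_t = \frac{\sum_{i=0}^{t} w_i x_{t-i}}{\sum_{i=0}^{t} w_i}

where x_t is the input, y_t is the output and w_i are the weights.

EW has two modes. With adjust=True the weights are w_i = (1-\alpha)^i, which gives:

y_t = \frac{x_t + (1-\alpha) x_{t-1} + (1-\alpha)^2 x_{t-2} + \cdots + (1-\alpha)^t x_0}{1 + (1-\alpha) + (1-\alpha)^2 + \cdots + (1-\alpha)^t}

With adjust=False the average is computed recursively:

y_0 = x_0

y_t = (1-\alpha) y_{t-1} + \alpha x_t

where 0 < α ≤ 1. Depending on how the EW is specified, α takes different values:

\alpha = \begin{cases} \frac{2}{s+1}, & \text{for span } s \geq 1 \\ \frac{1}{1+c}, & \text{for center of mass } c \geq 0 \\ 1 - \exp\left(\frac{\log 0.5}{h}\right), & \text{for half-life } h > 0 \end{cases}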
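
The two modes can be checked against pandas with a few lines of plain Python. This is a sketch under assumptions: the input x and the choice alpha = 0.5 are made up, and the weights for adjust=True are taken to be w_i = (1 - alpha)**i. The last lines also show that span=3, com=1 and halflife=1 all map to the same alpha.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0, 8.0])  # hypothetical input
alpha = 0.5

# adjust=True: weights w_i = (1 - alpha)**i applied to x_t, x_{t-1}, ...
adjusted = []
for t in range(len(x)):
    w = (1 - alpha) ** np.arange(t + 1)
    past = x.to_numpy()[t::-1]  # x_t, x_{t-1}, ..., x_0
    adjusted.append(np.dot(w, past) / w.sum())

# adjust=False: y_0 = x_0, then y_t = (1 - alpha) * y_{t-1} + alpha * x_t
recursive = [x.iloc[0]]
for value in x.iloc[1:]:
    recursive.append((1 - alpha) * recursive[-1] + alpha * value)

print(np.allclose(adjusted, x.ewm(alpha=alpha, adjust=True).mean()))
print(np.allclose(recursive, x.ewm(alpha=alpha, adjust=False).mean()))

# span=3, com=1 and halflife=1 all correspond to alpha = 0.5
span, com, halflife = 3, 1, 1
alphas = [2 / (span + 1), 1 / (1 + com), 1 - np.exp(np.log(0.5) / halflife)]
print(alphas)
```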

An example:

In [54]: df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})

In [55]: df
Out[55]: 
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0

In [56]: times = ["2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17"]

In [57]: df.ewm(halflife="4 days", times=pd.DatetimeIndex(times)).mean()
Out[57]: 
          B
0  0.000000
1  0.585786
2  1.523889
3  1.523889
4  3.233686
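
The numbers in this output can be reproduced by hand if each observation's weight is assumed to halve for every half-life of elapsed time, with NaN rows contributing nothing. The sketch below works under that assumption; manual, age and seen are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})
times = pd.DatetimeIndex(
    ["2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17"]
)
halflife = pd.Timedelta("4 days")

values = df["B"].to_numpy()
manual = []
for t in range(len(values)):
    # Age of every observation up to row t, measured in half-lives.
    age = np.asarray((times[t] - times[: t + 1]) / halflife, dtype=float)
    w = 0.5 ** age
    seen = ~np.isnan(values[: t + 1])  # NaN rows carry no weight
    manual.append(np.dot(w[seen], values[: t + 1][seen]) / w[seen].sum())

builtin = df.ewm(halflife="4 days", times=times).mean()["B"]
print(np.allclose(manual, builtin))
```

Row 3 has no observation of its own, so its average is built only from earlier rows and matches the value at row 2.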

This article is also available at http://www.flydean.com/12-python-pandas-window/

Original link: http://xie.infoq.cn/article/c823bdbdbeccc8e129682f1ea. Please contact the author before republishing.
