NEP 8 — A proposal for adding groupby functionality to NumPy

Author

Travis Oliphant

Contact

oliphant@enthought.com

Date

2010-04-27

Status

Deferred

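Executive summary

NumPy provides tools for handling data and doing calculations in much the same way as relational algebra allows. However, the common group-by functionality is not easily handled. The reduce methods of NumPy's ufuncs are a natural place to put this groupby behavior. This NEP describes two additional methods for ufuncs (reduceby and reducein) and two additional functions (segment and edges) that can help add this functionality.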

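Example use case

Suppose you have a NumPy structured array containing information about the number of purchases at several stores over multiple days. To be clear, the structured array data-type is: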

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

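Suppose there is a 1-d NumPy array of this data-type and you would like to compute various statistics (max, min, mean, sum, etc.) on the number of products sold, by product, by month, by store, and so on.

Currently, this can be done by using reduce methods on the number field of the array, coupled with in-place sorting, unique with return_inverse=True, bincount, and so on. However, for such a common data-analysis need, it would be nice to have standard, more direct ways to get the results.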
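To make the current workaround concrete, here is a minimal sketch of grouping by store with existing NumPy calls (`np.unique` with `return_inverse=True`, `np.bincount`, and `reduceat` on sorted data). The sample records are made up for illustration and are not part of the proposal.

```python
import numpy as np

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

# Made-up sample records, for illustration only.
sales = np.array([
    (2010, 4, 1, 9.5, 1, b'A00001', 3),
    (2010, 4, 1, 10.0, 2, b'A00001', 5),
    (2010, 4, 2, 11.25, 1, b'B00002', 2),
    (2010, 5, 3, 14.0, 1, b'A00001', 4),
    (2010, 5, 3, 15.5, 2, b'B00002', 1),
], dtype=dt)

# Map every record to a group index, one group per distinct store.
stores, group = np.unique(sales['store'], return_inverse=True)

# Sum of 'number' per store: bincount accumulates the weights per group.
totals = np.bincount(group, weights=sales['number'])

# Max of 'number' per store: sort records by group, find where each
# contiguous run of equal group indices starts, then reduce each run.
order = np.argsort(group, kind='stable')
starts = np.flatnonzero(np.r_[True, np.diff(group[order]) != 0])
maxima = np.maximum.reduceat(sales['number'][order], starts)

print(stores, totals, maxima)
```

Each statistic needs its own combination of sorting, indexing, and reduction, which is the indirection the proposed methods aim to remove.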

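Ufunc methods proposed

It is proposed to add two new reduce-style methods to the ufuncs: reduceby and reducein. The reducein method is intended to be a simpler-to-use version of reduceat, while the reduceby method is intended to provide group-by capability on reductions.

reducein: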

<ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)

Perform a local reduce with slices specified by pairs of indices.

The reduction occurs along the provided axis, using the provided
data-type to calculate intermediate results, storing the result into
the array out (if provided).

The indices array provides the start and end indices for the
reduction.  If the length of the indices array is odd, then the
final index provides the beginning point for the final reduction
and the ending point is the end of arr.

Along the given axis, this generalizes the behavior:

[<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
        for i in range(len(indices)//2)]

This assumes indices is of even length.

Example:
   >>> a = [0,1,2,4,5,6,9,10]
   >>> add.reducein(a,[0,3,2,5,-2])
   [3, 11, 19]

   Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
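The documented behavior can be approximated today in pure Python. The following is only a sketch for 1-d input: `reducein_sketch` is a hypothetical helper name, not an existing NumPy function, and it ignores the proposed `axis`, `dtype`, and `out` arguments.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    # An odd-length list means the final slice runs to the end of arr.
    if len(indices) % 2:
        indices.append(None)
    # Reduce each slice arr[start:stop] given by consecutive index pairs.
    return [ufunc.reduce(arr[indices[2 * i]:indices[2 * i + 1]])
            for i in range(len(indices) // 2)]

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])

# The same sums with the existing reduceat: every other output is an
# unwanted in-between reduction that has to be discarded with [::2].
via_reduceat = np.add.reduceat(a, [0, 3, 2, 5, 6])[::2]

print(result, via_reduceat)
```

The comparison with `reduceat` illustrates why a pair-based method is simpler to use: with `reduceat` the caller has to interleave the indices and then throw away every second result.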

reduceby:

<ufunc>.reduceby(arr, by, dtype=None, out=None)

Perform a reduction in arr over unique non-negative integers in by.


Let N=arr.ndim and M=by.ndim.  Then, by.shape[:N] == arr.shape.
In addition, let I be an N-length index tuple, then by[I]
contains the location in the output array for the reduction to
be stored.  Notice that if N == M, then by[I] is a non-negative
integer, while if N < M, then by[I] is an array of indices into
the output array.

The reduction is computed on groups specified by unique indices
into the output array. The index is either the single
non-negative integer if N == M or if N < M, the entire
(M-N+1)-length index by[I] considered as a whole.
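For the simplest case, N == M, the grouping described above can be sketched with the existing `ufunc.at` method. `reduceby_sketch` is a hypothetical helper name, the sample data is made up, and the sketch assumes the ufunc has an identity value (for example `np.add` or `np.multiply`); it does not cover the N < M case or the `dtype` and `out` arguments.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Hypothetical emulation of the proposed <ufunc>.reduceby for N == M."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if by.shape != arr.shape:
        raise ValueError("this sketch requires by.shape == arr.shape")
    if ufunc.identity is None:
        raise ValueError("this sketch requires a ufunc with an identity")
    # One output slot per possible group index, pre-filled with the identity.
    out = np.full(by.max() + 1, ufunc.identity, dtype=arr.dtype)
    # Unbuffered in-place update: out[by[I]] = ufunc(out[by[I]], arr[I]).
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

number = np.array([5, 3, 7, 2, 4])   # made-up quantities sold
store = np.array([0, 2, 0, 1, 2])    # output slot for each element
totals = reduceby_sketch(np.add, number, store)

print(totals)
```

Here the elements with `by` value 0 (5 and 7) are reduced into output slot 0, the single element with value 1 into slot 1, and the elements with value 2 (3 and 4) into slot 2.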

Functions proposed

  • segment

  • edges