
pd.DataFrame.agg(np.var) vs np.var(pd.Series)

- 1 answer

I am applying np.var() to the same data (a pandas Series of SAT Math scores) in two ways, but the two calls give different results. I did not expect an n vs. n-1 issue, since it is the same NumPy function applied to the same data.

These are the two ways:

  1. Directly on a Series
  2. Through the agg() method on a column selected from the DataFrame

I have read elsewhere that a difference like this can come from how the variance is calculated, i.e. dividing by n vs. n-1.

I would appreciate some confirmation or clarification. I am puzzled because I am using the same function, np.var(), in both cases:

  1. np.var(sat_2017.Math), np.std(sat_2017.Math)
  2. sat_2017.iloc[:,3].agg([np.var, np.std])

Output:

  1. Called directly:
    • Variance: 7068.194540561321
    • Std.Deviation: 84.07255521608297
  2. Through agg():
    • Variance: 7209.558431
    • Std.Deviation: 84.909119

Answer

Based on the source code, this seems like a bug.

When pd.Series.agg gets a function object, it looks it up in its predefined list of cython functions:

# pandas.core.base line:555
f = self._is_cython_func(arg)

# pandas.core.base line:639
def _is_cython_func(self, arg):
    """ if we define an internal function for this argument, return it """
    return self._cython_table.get(arg)

which contains:

pd.Series._cython_table
OrderedDict([(<function sum(iterable, start=0, /)>, 'sum'),
         ...
         (<function numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)>,'var'),

which returns:

f == self._is_cython_func(arg) == 'var'

This then gets used at getattr:

# pandas.core.base line 556
if f and not args and not kwargs:
    return getattr(self, f)(), None

which resolves to:

getattr(pd.Series, 'var')
<function pandas.core.series.Series.var(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)>
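The lookup-then-getattr path traced above can be mimicked with a toy class. This is only a sketch of the mechanism, not pandas code: ToySeries and _func_table are hypothetical names made up for illustration, and the real table and dispatch logic live in pandas.core.base as quoted above.

```python
import numpy as np


class ToySeries:
    """Hypothetical stand-in for pd.Series; illustrates the dispatch only."""

    # Maps known function objects to the name of the method that replaces them.
    _func_table = {np.var: "var", np.std: "std"}

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def var(self, ddof=1):
        # Like pd.Series.var, this defaults to the sample variance (ddof=1).
        return float(np.var(self.values, ddof=ddof))

    def std(self, ddof=1):
        return float(np.std(self.values, ddof=ddof))

    def agg(self, func):
        name = self._func_table.get(func)
        if name is not None:
            # The caller's function is dropped; the method's own defaults apply.
            return getattr(self, name)()
        # Functions that are not in the table are called as given.
        return func(self.values)


toy = ToySeries([1.0, 2.0, 3.0, 4.0])
direct = float(np.var(toy.values))  # NumPy default, ddof=0
via_agg = toy.agg(np.var)           # dispatched to ToySeries.var, ddof=1
print(direct, via_agg)
```

The two printed numbers differ even though np.var was passed in both cases, because agg() never calls the function it was handed once the table lookup succeeds.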

And there is the culprit! np.var defaults to ddof=0, but the call has been swapped for Series.var, whose ddof defaults to 1.
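A quick way to check the ddof explanation is to compare the two defaults on a small Series and pass ddof explicitly. The scores below are made up for illustration and are not the SAT data from the question. The last check assumes the question's column has n = 51 rows; that count is inferred from the ratio of the two reported variances, not stated in the question.

```python
import numpy as np
import pandas as pd

# Made-up scores, not the SAT data from the question.
s = pd.Series([480.0, 520.0, 555.0, 610.0, 650.0])
n = len(s)

pop_var = np.var(s.to_numpy())  # NumPy default: ddof=0, divide by n
sample_var = s.var()            # pandas default: ddof=1, divide by n - 1

# The two defaults differ exactly by the factor n / (n - 1).
assert np.isclose(sample_var, pop_var * n / (n - 1))

# Passing ddof explicitly makes either library reproduce the other.
assert np.isclose(s.var(ddof=0), pop_var)
assert np.isclose(np.var(s.to_numpy(), ddof=1), sample_var)

# Asking agg() for the method by name also uses the Series.var default.
assert np.isclose(s.agg("var"), sample_var)

# The figures reported in the question differ by the same factor,
# assuming n = 51 (inferred from the ratio, not given in the question).
reported_np, reported_agg = 7068.194540561321, 7209.558431
assert np.isclose(reported_np * 51 / 50, reported_agg)
```

If the population variance is what you want from pandas, calling s.var(ddof=0) states that intent directly and does not depend on how agg() treats a NumPy function.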

source: stackoverflow.com