When should I (not) want to use pandas apply() in my code?
Question
I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
- If apply is so bad, then why is it in the API?
- How and when should I make my code apply-free?
- Are there ever any situations where apply is good (better than other possible solutions)?
apply, the Convenience Function you Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
Some of the things apply can do:
- Run any user-defined function on a DataFrame or Series
- Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
- Perform index alignment while applying the function
- Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
- Perform element-wise transformations
- Broadcast aggregated results to original rows (see the result_type argument)
- Accept positional/keyword arguments to pass to the user-defined functions
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
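A few of the capabilities listed above can be sketched on a toy frame. The column names, the values, and the scale helper are made up for illustration; only DataFrame.apply and its axis, args, and result_type parameters come from pandas.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Column-wise (axis=0, the default): the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives each row as a Series.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Extra positional arguments are forwarded to the function through `args`.
def scale(col, factor):  # hypothetical helper, not part of pandas
    return col * factor

scaled = df.apply(scale, args=(2,))

# result_type="broadcast" spreads each row's aggregated value back over the
# original shape, keeping the original index and columns.
broadcast = df.apply(lambda row: row.sum(), axis=1, result_type="broadcast")

print(col_ranges)
print(row_sums)
print(scaled)
print(broadcast)
```

Each call above hands the function one Series at a time, which is what makes apply flexible and also what makes it iterate.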
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.
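To get a feel for the per-row overhead described above, one can time a row-wise apply against the equivalent vectorized expression. This is a minimal sketch: the frame, its column names x and y, the row_product function, and the size n are arbitrary choices, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 10_000  # arbitrary size chosen for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(n), "y": rng.random(n)})

def row_product(row):
    # Invoked for every row by apply(axis=1); each row arrives as a Series.
    return row["x"] * row["y"]

via_apply = df.apply(row_product, axis=1)
vectorized = df["x"] * df["y"]  # a single operation over whole columns

# Both approaches should produce the same numbers.
print(np.allclose(via_apply, vectorized))

# Rough timings; only the relative difference is meaningful.
t_apply = timeit.timeit(lambda: df.apply(row_product, axis=1), number=3)
t_vec = timeit.timeit(lambda: df["x"] * df["y"], number=3)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s")
```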
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
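As an illustration of the kind of rewrite meant here, the sketch below expresses two apply calls without apply. The data and the column names score and name are invented; np.where and the .str accessor are standard NumPy/pandas tools, but whether they are the better choice (or faster) depends on the function being applied and on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [35, 80, 59, 60],
    "name": ["ann", "bob", "cy", "dee"],
})

# Conditional logic: element-wise apply versus np.where on the whole column.
label_apply = df["score"].apply(lambda v: "pass" if v >= 60 else "fail")
label_vec = pd.Series(np.where(df["score"] >= 60, "pass", "fail"), index=df.index)

# String manipulation: element-wise apply versus the .str accessor.
upper_apply = df["name"].apply(str.upper)
upper_vec = df["name"].str.upper()

print(label_vec.tolist())
print(upper_vec.tolist())
```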
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.
The graph was plotted using the perfplot library.
import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())
With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is integer type.
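If perfplot is not installed, a rough version of the same comparison can be made with the standard-library timeit module. This sketch assumes an integer Series of arbitrary size n and a repeat count picked for convenience; it makes no claim about which method wins, since that depends on the dtype, the size, and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # arbitrary size chosen for illustration
s = pd.Series(np.random.randint(0, n, n))

# Sanity check first: both conversions should give the same strings.
same = bool((s.astype(str) == s.apply(str)).all())
print("identical results:", same)

# Rough timings over a few repetitions; compare them only to each other.
t_astype = timeit.timeit(lambda: s.astype(str), number=3)
t_apply = timeit.timeit(lambda: s.apply(str), number=3)
print(f"astype: {t_astype:.4f}s  apply: {t_apply:.4f}s")
```

Swapping the setup line for a float Series (for example np.random.rand(n)) gives the float case mentioned above.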
GroupBy operations with chained transformations
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then two successive operations, such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
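The equivalence of the two forms can be checked directly. The sketch below reuses the frame from the example; passing group_keys=False is an assumption added here so that the apply result keeps the original row index on pandas versions that would otherwise prepend the group key, and the transform variant is shown only as a further option, not as something the answer recommends.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabcccddee"), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})

# Two successive groupby calls: cumulative sum per group, then a per-group shift.
two_calls = df.groupby("A").B.cumsum().groupby(df.A).shift()

# One groupby call, with both steps chained inside the applied function.
one_apply = df.groupby("A", group_keys=False).B.apply(lambda x: x.cumsum().shift())

# transform also accepts a function that returns a like-indexed result.
one_transform = df.groupby("A").B.transform(lambda x: x.cumsum().shift())

# Compare values position by position, treating NaN as equal to NaN.
print(two_calls.fillna(-1).tolist() == one_apply.fillna(-1).tolist())
print(two_calls.fillna(-1).tolist() == one_transform.fillna(-1).tolist())
```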
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice on pandas versions before 1.1 (from 1.1 onward, DataFrame.apply evaluates the first row or column only once). This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.
df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)
# 1
# 1
# 2
   A  B
0  1  x
1  2  y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25; see here for more information).
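Because this behaviour depends on the installed pandas version, it can be safer to measure it than to assume it. The sketch below records each invocation in a list (the calls list and the extra_calls name are illustrative) so that the number of calls can be compared with the number of rows.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
calls = []  # one entry per invocation of func

def func(row):
    calls.append(row["A"])  # record the call instead of printing
    return row

df.apply(func, axis=1)

# More calls than rows means some row was evaluated more than once.
extra_calls = len(calls) - len(df)
print("pandas", pd.__version__, "calls:", calls, "extra calls:", extra_calls)
```

This matters mostly when the applied function has side effects, such as appending to a list or writing to a file.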