Normalize a column of dataframe using min max normalization based on groupby of another column(使用基于另一列的GROUPBY的最小最大归一化来归一化一列数据帧)
问题描述
数据帧如图所示
Name Job Salary
john painter 40000
peter engineer 50000
sam plumber 30000
john doctor 500000
john driver 20000
sam carpenter 10000
peter scientist 100000
如何按列名分组并对每个组上的薪资列应用规范化?
预期结果:
Name Job Salary
john painter 0.041666
peter engineer 0
sam plumber 1
john doctor 1
john driver 0
sam carpenter 0
peter scientist 1
我已尝试以下操作
data = df.groupby('Name').transform(lambda x: (x - x.min()) / x.max()- x.min())
但是,这会产生
Salary
0 -19999.960000
1 -50000.000000
2 -9999.333333
3 -19999.040000
4 -20000.000000
5 -10000.000000
6 -49999.500000
推荐答案
您马上就到了。
>>> df
Name Job Salary
0 john painter 40000
1 peter engineer 50000
2 sam plumber 30000
3 john doctor 500000
4 john driver 20000
5 sam carpenter 10000
6 peter scientist 100000
>>>
>>> result = df.assign(Salary=df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min())))
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
所以基本上,您只是忘了用括号将x.max() - x.min()
括起来。
请注意,使用一系列矢量化操作可以更快地完成此操作。
>>> grouper = df.groupby('Name')['Salary']
>>> maxes = grouper.transform('max')
>>> mins = grouper.transform('min')
>>>
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
计时:
>>> # Setup
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df.Name = np.arange(len(df)//4).repeat(4) # 4 names per group
>>> df
Name Job Salary
0 0 painter 40000
1 0 engineer 50000
2 0 plumber 30000
3 0 doctor 500000
4 1 driver 20000
... ... ... ...
6995 1748 plumber 30000
6996 1749 doctor 500000
6997 1749 driver 20000
6998 1749 carpenter 10000
6999 1749 scientist 100000
[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
...:
...:
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
这篇关于使用基于另一列的GROUPBY的最小最大归一化来归一化一列数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!