Inter-rater reliability calculation for multi-rater data


Problem description


I have the following list of lists:

[[1, 1, 1, 1, 3, 0, 0, 1],
 [1, 1, 1, 1, 3, 0, 0, 1],
 [1, 1, 1, 1, 2, 0, 0, 1],
 [1, 1, 0, 2, 3, 1, 0, 1]]

I want to calculate an inter-rater reliability score; there are multiple raters (rows). I cannot use Fleiss' kappa, since the rows do not all sum to the same number. What is a good approach in this case?

Solution

Yes, data preparation is key here. Let's walk through it together.

While Krippendorff's alpha may be superior for a number of reasons, numpy and statsmodels provide everything you need to get Fleiss' kappa from the table above. Fleiss' kappa is more prevalent in medical research, even though Krippendorff's alpha delivers mostly the same result when used correctly. If they deliver substantially different results, this is usually down to user error, most importantly the format of the input data and the level of measurement (e.g. ordinal vs. nominal). Skip ahead for the solution (transpose & aggregate): Fleiss' kappa 0.845.

Now to the solution of the problem:
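A note on setup before the code: the outputs in this answer (-0.1281, 0.8452, and the 8x3 transposed array below) correspond to the first three rows of the question's table, so that is what the snippets assume, along with the imports behind the irr alias used throughout:

import numpy as np
import statsmodels.stats.inter_rater as irr

# First three raters from the question: raters as rows, subjects as columns.
# (The answer's numbers reproduce with this 3x8 table.)
orig = [[1, 1, 1, 1, 3, 0, 0, 1],
        [1, 1, 1, 1, 3, 0, 0, 1],
        [1, 1, 1, 1, 2, 0, 0, 1]]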

What happens if we plug the original data into Fleiss' kappa? (We just use the data 'dats', not the category list 'cats'.)

# aggregate_raters() returns a (counts, categories) tuple
dats, cats = irr.aggregate_raters(orig)
irr.fleiss_kappa(dats, method='fleiss')

-0.12811059907834096

But... why? Well, look at the orig data: aggregate_raters() assumes raters as columns! This means we have perfect disagreement, e.g. between the first column and the second-to-last column. Fleiss thinks: "the first rater always rated 1 and the second-to-last always rated 0" -> perfect disagreement on all three subjects (the three rows get counted as subjects).
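You can see the mis-orientation by inspecting the aggregated table itself: the three rows get tallied as three subjects, each rated by eight raters. (The counts below are written out by hand from orig, over the categories 0-3 that aggregate_raters() detects.)

dats
# array([[2, 5, 0, 1],
#        [2, 5, 0, 1],
#        [2, 5, 1, 0]])
cats
# array([0, 1, 2, 3])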

So what we need to do is transpose (sorry, I'm a noob, so this might not be the most elegant way):

# flip to subjects as rows, raters as columns
giro = np.array(orig).transpose()
giro

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [3, 3, 2],
       [0, 0, 0],
       [0, 0, 0],
       [1, 1, 1]]) 

Now we have subjects as rows and raters as columns (three raters, each assigning one of 4 categories). What happens if we plug this into the aggregate_raters() function and feed the resulting data into fleiss_kappa()? (Using index 0 to grab the first part of the returned tuple.)

irr.fleiss_kappa(irr.aggregate_raters(giro)[0], method='fleiss')

0.8451612903225807

Finally… this makes more sense, given that all three raters agreed perfectly except on subject 5 [3, 3, 2].
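For reference, the aggregated table that fleiss_kappa() receives here looks like this (subjects as rows, counts over the categories 0-3 as columns; written out by hand from giro above):

irr.aggregate_raters(giro)[0]
# array([[0, 3, 0, 0],
#        [0, 3, 0, 0],
#        [0, 3, 0, 0],
#        [0, 3, 0, 0],
#        [0, 0, 1, 2],
#        [3, 0, 0, 0],
#        [3, 0, 0, 0],
#        [0, 3, 0, 0]])

Row 5 is the [3, 3, 2] subject: two votes for category 3 and one for category 2; every other row is unanimous.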

Krippendorff's alpha

The current krippendorff implementation expects the data in the orig format, with raters as rows and subjects as columns – no aggregation function is needed to prepare the data, so I can see how this is the simpler solution. Fleiss is still very prevalent in medical research, so let's see how it compares:

import krippendorff as kd
kd.alpha(orig)

0.9359

Wow… that's a lot higher than Fleiss' kappa... Well, we need to tell Krippendorff the Stevens level of measurement of the variable: "It must be one of 'nominal', 'ordinal', 'interval', 'ratio' or a callable." This sets the 'difference function' of Krippendorff's alpha (see https://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=asc_papers). The implementation appears to default to 'interval', which treats the 3-vs-2 split on subject 5 as a near-miss rather than a full mismatch, hence the higher score.

kd.alpha(orig, level_of_measurement='nominal')

0.8516
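Since the codes 0-3 could arguably be read as ordered, a hypothetical variant worth trying (not shown in the original answer; the output will differ from the nominal case) is the ordinal difference function:

kd.alpha(orig, level_of_measurement='ordinal')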

Hope this helps, I learned a lot writing this.
