TruthfulQA - mingoori0512/Mingoori-s-NLP-study-space GitHub Wiki

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Citation

Lin et al. (ACL 2022)

Benchmark?

컴퓨터 분야에서 여러가지 모델들의 성능을 비교 및 평가하기 위한 기준
일반적으로 특정 작업을 수행하는데 걸리는 시간, 처리량, 성능 지표 등을 측정

TruthfulQA?

영어: TruthfulQA is a benchmark made up of questions to cause imitative falsehoods. This is a widely held misconceptions that the model may have learned from its input text. And they are trying to see if model is good enough or smart enough to not fall for the trap.

해석: TruthfulQA는 모방적인 거짓말을 유발하기 위해 질문으로 이루어진 벤치마크이다. 이것은 모델이 입력 텍스트로부터 배울 수 있는 널리 퍼진 오해입니다. 그들은 모델이 함정에 빠지지 않고 충분히 좋거나 똑똑한지를 확인하려고 노력하고 있다.

Contributions

Benchmark Compile a benchmark of questions and answers that can be used to be test the truthfulness or correctness of a large language model.(대규모 언어 모델의 진실성 또는 정확성을 테스트하기 위해 사용할 수 있는 질문과 대답으로 이루어진 벤치마크를 컴파일하였다.)
Establish a baseline

This model also generated answers that were both false and informative 42% of the time(compared to 6% for the human baseline). Such informative answers, which often mimic popular misconceptions, are more likely to deceive.(이 모델은 거짓이면서 정보를 포함한 답변을 생성하는 비율이 42%로 나타났으며(인간 기준의 6%와 비교해서), 종종 인기있는 오해를 흉내 내는 형태의 정보를 담고 있는 이러한 답변은 속일 가능성이 더 높습니다.)
The answers given by the model are judged on two dimensions. One is correctness whether the answer is true or false. And the other one is what they call informativeness, which essentially a measure of how detailed the answer was. So, for example, the answer could be really simple like no comment, but it could be correct, but it won't be informative. So more detailed answers are ranked as more informative. (모델이 제공하는 답변은 두 가지 측면에서 평가된다. 하나는 정확성으로, 답변이 사실인지 아닌지를 나타낸다. 그리고 다른 하나는 정보성으로, 기본적으로 답변이 얼마나 자세한지를 측정하는 지표이다. 예를 들어, 답변은 간단하게 "댓글 없음"과 같이 간단할 수 있지만 이는 정확할 수 있지만 정보성이 떨어지는 답변이다. 그래서 더 자세한 답변은 정보성이 높게 평가된다.)

GPT-judge & GPT-info

Since human evaluation is costly and challenging to replicate, we introduce a new automated metric for evaluating model performance on TruthfulQA, which we call “GPT-judge”. GPT-judge is a GPT-3-6.7B model finetuned to classify answers to the questions in TruthfulQA as true or false.
The training set for GPT-judge consists of triples of the form (question, answer, label), where label is either true or false. The training set includes 6.9k examples taken directly from the benchmark, where the answer is a true/false reference answer written by the authors. It also contains around 15.5k examples where the answer is generated by one of the models in Section 3.1 and the label is a human evaluation.
For the final GPT-judge model, we train on examples from all models. The goal of GPT-judge is to evaluate truth for the questions in TruthfulQA only (with no need to generalize to new questions) and so we always include all questions in the training set. We use the OpenAI API to perform the finetuning (OpenAI, 2020). We also use an identical approach to finetune a model to evaluate informativeness (rather than truthfulness).
Separately, to estimate GPT-judge’s ability to generalize to a new model family F, we fine-tune a GPT-judge model on all other model families and use F as a validation set. These validation accuracies are shown in Table 1 below, which includes additional comparisons of GPT-judge to alternate metrics that make use of ROUGE1 (Lin, 2004) or BLEURT (Sellam et al., 2020). To compute a truthfulness score for a model answer a, these metrics find the closest true and false reference answers to a and then take the arithmetic difference between match scores.
Overlap or semantic similarity between a and each reference answer is measured using ROUGE1 or BLEURT, respectively. GPTjudge performs well in an absolute sense, demonstrating high validation accuracy across all four model families and preserving the rank ordering of models within each family. It also outperforms all alternate metrics in evaluating model answers. We believe that GPT-judge is a reasonable proxy for human evaluation, although the minor weakness shown in Table 3 suggests that human evaluation should still be considered the gold standard.
인간 평가는 비용이 많이 들고 복제하기 어렵기 때문에, 우리는 TruthfulQA 모델 성능을 평가하기 위한 자동화된 새로운 측정 항목인 'GPT-judge'를 소개합니다. GPT-judge는 TruthfulQA의 질문에 대한 답변을 참 또는 거짓으로 분류하기 위해 미세 조정된 GPT-3-6.7B 모델입니다.
GPT-judge의 훈련 세트는 (질문, 답변, 레이블) 형식의 세트로 구성되며, 레이블은 참 또는 거짓 중 하나입니다. 훈련 세트에는 저자가 작성한 참/거짓 참조 답변이 있는 벤치마크에서 직접 가져온 6.9k 개의 예제와, 답변이 Section 3.1의 모델 중 하나에 의해 생성되고 레이블은 인간 평가인 약 15.5k 개의 예제가 포함되어 있습니다.
최종 GPT-judge 모델의 경우, 모든 모델에서 예제를 훈련합니다. GPT-judge의 목표는 TruthfulQA의 질문에 대한 진실성을 평가하는 것뿐이며(새로운 질문으로 일반화할 필요 없음), 항상 훈련 세트에 모든 질문을 포함시킵니다. 미세 조정을 수행하기 위해 OpenAI API를 사용합니다. 진실성 대신 정보성을 평가하기 위해 동일한 접근 방식을 사용하여 모델을 미세 조정합니다.
또한 GPT-judge의 모델 능력을 새로운 모델 패밀리 F로 일반화하기 위해, GPT-judge 모델을 다른 모든 모델 패밀리에서 미세 조정하고 F를 검증 세트로 사용하여 추정합니다. 이러한 검증 정확도는 아래 표 1에 나와 있으며, ROUGE1(Lin, 2004) 또는 BLEURT(Sellam et al., 2020)를 활용하는 대안 측정 항목과 GPT-judge를 비교하는 추가 정보도 포함되어 있습니다. 모델 답변의 진실성 점수를 계산하기 위해 이러한 메트릭은 a와 가장 가까운 참과 거짓 참조 답변을 찾은 다음 일치 점수 사이의 산술 차이를 취합니다.
중복 또는 의미적 유사성은 각각 ROUGE1 또는 BLEURT를 사용하여 a와 각 참조 답변 간에 측정됩니다. GPT-judge는 절대적인 의미에서 우수한 성능을 나타내며, 모든 네 모델 패밀리에서 높은 검증 정확도를 보이며 각 패밀리 내의 모델 순위를 유지합니다. 또한 모델 답변을 평가하는 모든 대체 측정 항목을 능가합니다. 우리는 GPT-judge가 인간 평가의 합리적인 대리자임을 믿지만, 표 3에서 나타난 작은 약점은 여전히 인간 평가를 골드 스탠더드로 고려해야 함을 시사합니다.