DecipherPref

Introduction

Human preference judgments are pivotal in guiding large language models (LLMs) to produce outputs that align with human values. Human evaluations are also used in summarization tasks to compare outputs from different systems, complementing existing automatic metrics. Despite their significance, these pairwise or k-wise comparisons have received little systematic scrutiny. The collective impact and relative importance of factors such as output length, informativeness, fluency, and factual consistency are still not well understood, and it remains unclear whether other hidden factors influence human judgments. In this paper, we conduct an in-depth examination of a collection of pairwise human judgments released by OpenAI. Using the Bradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in these human judgments.
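
For intuition, the BTL model assigns each compared system a latent strength p_i and models P(i preferred over j) = p_i / (p_i + p_j). The sketch below is a generic illustration of fitting these strengths from a pairwise win-count matrix with the standard minorize-maximize updates; it is not the paper's code, and the toy win matrix is invented for the example.

import numpy as np

def fit_btl(wins, n_iters=200):
    """Fit BTL strengths from a pairwise win-count matrix via MM updates.

    wins[i, j] = number of times item i was preferred over item j.
    Returns strengths p such that P(i beats j) = p[i] / (p[i] + p[j]).
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iters):
        p_new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum()  # comparisons that item i won
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new[i] = total_wins / denom if denom > 0 else p[i]
        p = p_new / p_new.sum()  # normalize: BTL strengths are scale-invariant
    return p

# Toy example: two systems compared 100 times, system 0 preferred 70 times.
wins = np.array([[0, 70], [30, 0]])
print(fit_btl(wins))  # approximately [0.7, 0.3]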

Acknowledgement

Please cite the following paper in your work:

DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Fei Liu
In the main conference of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore.

Dataset available on Hugging Face.
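
A minimal loading sketch with the Hugging Face datasets library is shown below. The repo id and split name are placeholders, not confirmed by this page; substitute the actual dataset id from the project's Hugging Face page.

from datasets import load_dataset

# NOTE: "ORG/DecipherPref" is a placeholder repo id; replace it with the
# actual dataset id listed on the project's Hugging Face page.
ds = load_dataset("ORG/DecipherPref", split="train")

sample = ds[0]
print(sample["doc_id"], sample["winner_sum"]["preference_factors"])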

Data Structure

{
  "doc_id": <str>,
  "title": <str>,
  "article": <str>,                  # source document
  "winner_sum": {
      "text": <str>,                 # summary preferred by human annotators
      "policy": <str>,               # policy (system) that generated the summary
      "annotation": <dict>,          # GPT-4 annotation on the proposed criteria
      "preference_factors": <list>   # final preference factors for this summary
  },
  "defeated_sum": {
      "text": <str>,
      "policy": <str>,
      "annotation": <dict>,
      "preference_factors": <list>
  }
}
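
As a usage sketch (assuming records is an iterable of dicts in the schema above, e.g., the loaded dataset), one can tally which preference factors most often accompany the winning summaries:

from collections import Counter

def tally_winner_factors(records):
    """Count how often each factor is cited for the preferred summary."""
    counts = Counter()
    for rec in records:
        counts.update(rec["winner_sum"]["preference_factors"])
    return counts

# e.g., tally_winner_factors(ds) after loading the dataset as sketched above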

Bibtex

@misc{hu2023decipherpref,
      title={DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4}, 
      author={Yebowen Hu and Kaiqiang Song and Sangwoo Cho and Xiaoyang Wang and Hassan Foroosh and Fei Liu},
      year={2023},
      eprint={2305.14702},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}