Taiwei Shi

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

arXiv preprint, 2024

Abstract

As large language models (LLMs) continue to advance, aligning these models with human preferences has emerged as a critical challenge. Traditional alignment methods, relying on human or LLM annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages real-time, in-situ user interactions to create preference datasets that more accurately reflect authentic human values. WildFeedback operates through a three-step process: feedback signal identification, preference data construction, and user-guided evaluation. We applied this framework to a large corpus of user-LLM conversations, resulting in a rich preference dataset that reflects genuine user preferences. This dataset captures the nuances of user preferences by identifying and classifying feedback signals within natural conversations, thereby enabling the construction of more representative and context-sensitive alignment data. Our extensive experiments demonstrate that LLMs fine-tuned on WildFeedback exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed user-guided evaluation. By incorporating real-time feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward developing LLMs that are more responsive to the diverse and evolving needs of their users. In summary, WildFeedback offers a robust, scalable solution for aligning LLMs with true human values, setting a new standard for the development and evaluation of user-centric language models.

Method Overview

WildFeedback operates through a three-step process:

Feedback Signal Identification

This step involves analyzing user-LLM interactions to identify feedback signals (satisfaction or dissatisfaction). Feedback signals are extracted from real dialogues using rubrics to classify user satisfaction (SAT) and dissatisfaction (DSAT).
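
As a concrete illustration, the sketch below shows one plausible way to implement this step as an LLM-based classifier over user follow-up turns, assuming an OpenAI-style chat completion API; the rubric text, prompt, and function names are illustrative placeholders rather than the paper's exact rubrics.

# Sketch: classify each user follow-up turn as SAT (satisfaction), DSAT
# (dissatisfaction), or NONE. The rubric is an illustrative stand-in.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Label the user's follow-up turn with exactly one of:
SAT  - the user expresses satisfaction with the assistant's previous response
       (e.g., thanks the assistant or confirms the problem is solved).
DSAT - the user expresses dissatisfaction (e.g., points out errors, asks for a
       rewrite, or complains that the response ignored the request).
NONE - no clear feedback signal.
Answer with the label only."""

def classify_feedback(prev_assistant_turn: str, user_followup: str,
                      model: str = "gpt-4o") -> str:
    """Return 'SAT', 'DSAT', or 'NONE' for a single user follow-up turn."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Assistant response:\n{prev_assistant_turn}\n\n"
                f"User follow-up:\n{user_followup}\n\nLabel:"
            )},
        ],
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in {"SAT", "DSAT", "NONE"} else "NONE"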

Preference Data Construction

Conversations containing satisfaction (SAT) or dissatisfaction (DSAT) signals are used to identify prompts and summarize user preferences, such as preferences for more detailed or precise responses. Dispreferred responses are directly taken from instances that triggered DSAT signals, while preferred responses are generated using GPT-4 or on-policy models guided by the summarized user preferences. To ensure safety, additional instructions prevent generating harmful content, and moderation filters are applied. This approach produces a dataset that better captures authentic user preferences, enhancing LLM alignment with real-world user expectations.
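
The sketch below shows how a single preference pair could be assembled from a DSAT-flagged turn under these assumptions: an OpenAI-style API, illustrative summarization and generation prompts (not the paper's exact instructions), and the moderation filtering step omitted for brevity.

# Sketch: build one {prompt, chosen, rejected, preference} record from a
# DSAT-flagged turn. The rejected response is the one that triggered DSAT;
# the chosen response is regenerated conditioned on the summarized preference.
from openai import OpenAI

client = OpenAI()

def summarize_preference(user_feedback_turn: str, model: str = "gpt-4o") -> str:
    """Summarize what the user actually wanted, based on their feedback turn."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Summarize, in one or two sentences, the user preference "
                        f"implied by this feedback:\n\n{user_feedback_turn}"),
        }],
    )
    return resp.choices[0].message.content.strip()

def build_preference_pair(prompt: str, dsat_response: str, user_feedback_turn: str,
                          model: str = "gpt-4o") -> dict:
    """Return a preference record for downstream preference fine-tuning."""
    preference = summarize_preference(user_feedback_turn, model=model)
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "Answer the user's request. Follow this user preference: "
                f"{preference}. Do not produce harmful or unsafe content."
            )},
            {"role": "user", "content": prompt},
        ],
    )
    return {
        "prompt": prompt,
        "chosen": resp.choices[0].message.content,
        "rejected": dsat_response,
        "preference": preference,
    }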

User-guided Evaluation

The user-guided evaluation in WildFeedback aligns model assessments with real user preferences by incorporating direct user feedback into the evaluation process. Instead of relying solely on automated or human annotator judgments, this method uses feedback signals from user-LLM interactions to guide evaluations, ensuring they reflect actual user expectations. Evaluators, including LLMs like GPT-4, are provided with a checklist of summarized user preferences from the dataset, which informs their assessment of model responses. This approach reduces biases common in traditional benchmarks and ensures that the evaluation process accurately measures how well models meet user needs, leading to more reliable and user-aligned performance metrics.
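
The sketch below illustrates checklist-guided pairwise judging, assuming GPT-4 (or a similar model) is accessed through an OpenAI-style API; the judge prompt is an illustrative stand-in for the paper's evaluation template.

# Sketch: pairwise judging guided by a checklist of summarized user preferences.
from openai import OpenAI

client = OpenAI()

def judge_with_checklist(prompt: str, response_a: str, response_b: str,
                         checklist: list[str], model: str = "gpt-4o") -> str:
    """Return 'A', 'B', or 'TIE' for which response better satisfies the user."""
    checklist_block = "\n".join(f"- {item}" for item in checklist)
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "You are judging two assistant responses. The user's stated "
                "preferences for this prompt are:\n" + checklist_block +
                "\nPick the response that better satisfies these preferences. "
                "Answer with exactly one of: A, B, TIE."
            )},
            {"role": "user", "content": (
                f"Prompt:\n{prompt}\n\nResponse A:\n{response_a}\n\n"
                f"Response B:\n{response_b}\n\nVerdict:"
            )},
        ],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"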

Key Takeaways

We applied WildFeedback to the WildChat dataset and constructed a preference dataset of more than 20k samples. To validate the effectiveness of WildFeedback, we fine-tune Mistral, Phi 3, and LLaMA 3 on it and compare their performance with the non-finetuned models on MT-Bench, AlpacaEval 2, Arena-Hard, and the held-out test set of WildFeedback (a minimal fine-tuning sketch follows the takeaway list below). For the WildFeedback evaluation, we report win/tie/lose percentages against the off-the-shelf instruct models, with GPT-4 as the judge. Results are shown in Table 3. Some key takeaways are:

  1. Training models on the GPT-4 version of WildFeedback can significantly and consistently boost model performance across all benchmarks. Models trained with the GPT-4 version of WildFeedback exhibit higher win rates across AlpacaEval 2, Arena-Hard, and MT-Bench, as well as improved performance in both settings of WildFeedback (with and without a checklist).
  2. WildFeedback significantly enhances model alignment with in-situ user feedback. As detailed in the previous section, WildFeedback has two versions, differing in whether the preferred responses are generated by GPT-4 or the policy models themselves. Compared to off-the-shelf instruction models, those trained on either version of WildFeedback demonstrate a stronger alignment with real user preferences, winning much more often on the WildFeedback test set.
  3. WildFeedback does not compromise model performance on other benchmarks. Training on either version of WildFeedback aligns models more closely with user preferences without degrading performance on other benchmarks; in most cases, it even leads to improvements.
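
As referenced above, here is a minimal sketch of preference fine-tuning on WildFeedback-style pairs using DPO via the trl library. DPO is a natural fit for this kind of pairwise data, but the paper's exact training recipe, hyperparameters, and checkpoints may differ; the model name and dataset records below are placeholders, and argument names vary slightly across trl versions.

# Sketch: DPO fine-tuning on preference records produced by the construction
# step above (prompt + chosen/rejected responses).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder policy model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder records; in practice, load the constructed preference dataset.
train_dataset = Dataset.from_list([
    {"prompt": "Summarize this article ...",
     "chosen": "A concise, on-topic summary ...",
     "rejected": "A rambling response that ignored the request ..."},
])

training_args = DPOConfig(
    output_dir="wildfeedback-dpo",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    beta=0.1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # called `tokenizer=` in older trl releases
)
trainer.train()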

BibTeX

@misc{shi2024wildfeedbackaligningllmsinsitu,
      title={WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback}, 
      author={Taiwei Shi and Zhuoer Wang and Longqi Yang and Ying-Chun Lin and Zexue He and Mengting Wan and Pei Zhou and Sujay Jauhar and Xiaofeng Xu and Xia Song and Jennifer Neville},
      year={2024},
      eprint={2408.15549},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.15549}, 
}