Illustrating Reinforcement Learning from Human Feedback (RLHF)

Hugging Face Blog · 2022-12-09

Related items

AI · arXiv cs.AI · 2026-05-26

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired beha...

AI · Hugging Face Blog · 2022-05-04