Original text is not available for public display.
Related items
AI | arXiv cs.AI
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired beha...
AI | Hugging Face Blog
An Introduction to Deep Reinforcement Learning
AI | Hugging Face Blog
Introducing ⚔️ AI vs. AI ⚔️ a deep reinforcement learning multi-agents competition system
AI | Hugging Face Blog
Putting RL back in RLHF
AI | Hugging Face Blog
Federated Learning using Hugging Face and Flower