WPO: Enhancing RLHF with Weighted Preference Optimization (arXiv:2406.11827)
Self-Improving Robust Preference Optimization (arXiv:2406.01660)
Bootstrapping Language Models with DPO Implicit Rewards (arXiv:2406.09760)
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM (arXiv:2406.12168)
Understanding and Diagnosing Deep Reinforcement Learning (arXiv:2406.16979)
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation (arXiv:2406.18676)
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (arXiv:2407.00782)
Direct Preference Knowledge Distillation for Large Language Models (arXiv:2406.19774)
Understanding Reference Policies in Direct Preference Optimization (arXiv:2407.13709)
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning (arXiv:2407.18248)
JudgeBench: A Benchmark for Evaluating LLM-based Judges (arXiv:2410.12784)