# From RLHF to DPO

Paper (arXiv): Direct Preference Optimization: Your Language Model is Secretly a Reward Model

RLHF vs DPO — TRL trainer docs:

- https://huggingface.co/docs/trl/main/en/online_dpo_trainer
- https://huggingface.co/docs/trl/en/dpo_trainer
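As a minimal illustration of the idea behind the linked paper (not code from the TRL docs), the DPO loss for a single preference pair can be sketched as below. The function name, arguments, and the default `beta=0.1` are illustrative assumptions; the inputs are sequence log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) preference pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    # Implicit rewards: log-ratios of policy vs. reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(x) == log(1 + exp(-x)), computed stably with log1p
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference model, both log-ratios are zero and the loss equals `log 2`; pushing the chosen response's implicit reward above the rejected one's drives the loss below that baseline.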