Original tweet: a question on reinforcement learning from human feedback
The ratio of the RL preference-model dataset size (50k datapoints) to model size (50B parameters)
is very different from that of the underlying language model (1,500B tokens to 70B parameters).
Why?
Am I comparing apples to oranges?
https://twitter.com/wintonARK/status/1623786706410541056
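To make the gap concrete, here is a quick back-of-the-envelope computation of the two ratios, using only the figures quoted in the tweet (note that preference datapoints and pretraining tokens are not directly comparable units, which is part of the question):

% Preference-model data-to-parameter ratio (figures as quoted in the tweet)
\[
\frac{5\times 10^{4}\ \text{datapoints}}{5\times 10^{10}\ \text{parameters}} = 10^{-6}\ \text{datapoints per parameter}
\]

% Pretraining data-to-parameter ratio
\[
\frac{1.5\times 10^{12}\ \text{tokens}}{7\times 10^{10}\ \text{parameters}} \approx 21\ \text{tokens per parameter}
\]

On these numbers the pretraining ratio is roughly seven orders of magnitude larger, which is the disparity the tweet is asking about.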