Original tweet: a question on reinforcement learning from human feedback
The ratio of the RL preference-model dataset size (50k datapoints) to model size (50B parameters)
is very different from that of the underlying language model (1,500B tokens to 70B parameters).
Why?
Am I comparing apples to oranges?
https://twitter.com/wintonARK/status/1623786706410541056
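To make the gap concrete, here is a quick back-of-the-envelope computation of the two ratios, using only the figures quoted in the tweet (note that preference datapoints and pretraining tokens are not directly comparable units, which is part of the question):

% Preference-model data-to-parameter ratio (figures as quoted in the tweet)
\[
\frac{5\times 10^{4}\ \text{datapoints}}{5\times 10^{10}\ \text{parameters}} = 10^{-6}\ \text{datapoints per parameter}
\]

% Pretraining data-to-parameter ratio
\[
\frac{1.5\times 10^{12}\ \text{tokens}}{7\times 10^{10}\ \text{parameters}} \approx 21\ \text{tokens per parameter}
\]

On these numbers the pretraining ratio is roughly seven orders of magnitude larger, which is the disparity the tweet is asking about.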