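Brett Winton: Facebook's LLaMa model only hit 52% on SIQA. It's the only benchmark where we wouldn't expect to see relatively near-term zero-shot saturation. If you are looking for an area where humans will remain superior, this benchmark is it. But I suspect RLHF meaningfully improves AI performance https://t.co/HxpcAhH8kO Quoting a tweet from @wintonARK: Could somebody please benchmark chatGPT (or another human feedback-tuned model) on the Social IQa (SIQA) benchmark? There is a clear (and potentially ambiguous) culture-specific ethical overlay to SIQA, and I'm guessing that RLHF-tuning would deliver outperformance https://t.co/elRyo3MNnp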

Posted on 2023-02-28

Original tweet: Facebook’s LLaMa model only hit 52% on SIQA

It’s the only benchmark where we wouldn’t expect to see relatively near-term zero-shot saturation

If you're looking for an area where humans will remain superior, this benchmark is it

But I suspect RLHF meaningfully improves AI performance https://t.co/HxpcAhH8kO

Quoted tweet from @wintonARK: Could somebody please benchmark chatGPT (or another human feedback-tuned model) on the Social IQa (SIQA) benchmark?

There is a clear (and potentially ambiguous) culture-specific ethical overlay to SIQA, and I’m guessing that RLHF-tuning would deliver outperformance https://t.co/elRyo3MNnp

https://twitter.com/wintonARK/status/1630265595001458688