Original tweet: Suggests that a language model running inference at 4-bit precision can deliver roughly the same quality as an 8-bit model, at half the memory and latency
Double your AI performance for ~free
I wonder if this is broadly applicable
Tesla FSD uses 8-bit precision, iirc
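For intuition on the "half the memory" part of the claim, here's a minimal back-of-the-envelope sketch (the 7B parameter count is an illustrative assumption, and real deployments also carry activation and KV-cache memory that weight quantization alone may not shrink):

```python
# Rough weight-memory math behind the 4-bit vs 8-bit claim.
# Assumes memory is dominated by weights; ignores activations/KV cache.

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Storage needed for n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model @ {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
# Halving the bit width halves the weight footprint; latency gains
# depend on whether inference is memory-bandwidth-bound.
```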