Original tweet: Would not surprise me if multi-modal data becomes the critical ingredient for securing a foundation model position. https://t.co/eulYOCcWnb
What happens when we train the largest vision-language model and add in robot experiences?
The result is PaLM-E, a 562-billion parameter, general-purpose, embodied visual-language generalist across robotics, vision, and language.
Website: palm-e.github.io https://t.co/5qfK23g52d