A vision-language-action model is an end-to-end neural network that takes sensor inputs—camera images, joint positions, ...
In the last decade, convolutional neural networks (CNNs) have been the go-to architecture in computer vision, owing to their powerful capability in learning representations from images/videos.