Abstract: Learning state representations enables robotic planning directly from raw observations such as images. Several methods learn state representations using losses based on reconstructing the raw observations from a lower-dimensional latent space. Similarity between observations in image space is often assumed and used as a proxy for similarity between the underlying states of the system. However, observations commonly contain task-irrelevant factors of variation that are nonetheless important for reconstruction, such as varying lighting and different camera viewpoints. In this work, we define relevant evaluation metrics and perform a thorough study of different loss functions for state representation learning. We show that models exploiting task priors, such as Siamese networks with a simple contrastive loss, outperform reconstruction-based representations in visual task planning in the presence of task-irrelevant factors of variation.
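To illustrate the kind of task-prior model the abstract refers to, below is a minimal sketch of a Siamese encoder trained with a classic margin-based contrastive loss (in the style of Hadsell et al., 2006). This is an assumed, generic PyTorch formulation for illustration only; the encoder architecture, latent dimension, margin value, and pair-labelling scheme are placeholders and do not necessarily match the loss or networks used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Illustrative encoder mapping images to a low-dimensional latent state."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.fc(h)

def contrastive_loss(z1, z2, same_state, margin=1.0):
    """Margin-based contrastive loss: pull together latents of observation
    pairs labelled as the same underlying task state, push apart latents of
    different states by at least `margin` (hypothetical formulation)."""
    d = F.pairwise_distance(z1, z2)
    pos = same_state * d.pow(2)
    neg = (1.0 - same_state) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

# Usage sketch: obs_a, obs_b are image batches; same_state is a 0/1 tensor
# indicating whether the pair shows the same task-relevant state.
encoder = SiameseEncoder()
obs_a = torch.rand(8, 3, 64, 64)
obs_b = torch.rand(8, 3, 64, 64)
same_state = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(encoder(obs_a), encoder(obs_b), same_state)
loss.backward()
```

The key design point, as argued in the abstract, is that the loss depends only on whether two observations share the same task-relevant state, so task-irrelevant factors such as lighting or viewpoint need not be encoded, unlike in reconstruction-based objectives.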
*Contributed equally and listed in alphabetical order
Paper preprint
Download box manipulation datasets
Download shelf arrangement datasets
Download box stacking datasets
[Figures: representation evaluation metrics and representation t-SNE plots for each dataset]