Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods sacrifice spatial detail through computationally intensive downsampling, creating a tradeoff between fidelity and processing efficiency. We introduce PointMapPolicy, a novel approach that treats point clouds as structured point maps without downsampling, preserving complete geometric information while enabling direct processing with standard image encoders. By treating point maps analogously to images, we enable the application of established computer vision techniques to 3D data without sacrificing structural integrity. Using xLSTM as a backbone, our model efficiently fuses these point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks.
This figure illustrates three paradigms for processing 3D point clouds. (a) Downsampling-based approaches reduce dense clouds to sparse ones using Furthest Point Sampling (FPS), which are then processed by models like PointNet. (b) 3D lifting-based methods first extract image features, lift them to 3D space, and generate semantically rich scene tokens. (c) Our PointMap approach restructures the point cloud as a 2D grid matching the resolution of the corresponding RGB image, allowing independent encoding and fusion with standard vision transformers.
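The point-map restructuring in (c) can be sketched as follows. This is a minimal illustration, assuming the point cloud comes from a depth image back-projected through a pinhole camera model (the function name, intrinsics values, and resolution are illustrative, not taken from the paper):

```python
import numpy as np

def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into a structured point map (H, W, 3).

    Each pixel (u, v) with depth z maps to camera-frame coordinates
    (x, y, z) via the pinhole model, so the 3D points keep the same
    2D grid layout (and resolution) as the corresponding RGB image.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3)

# The resulting grid can be fed to a standard image encoder like a
# 3-channel image -- no FPS downsampling is required.
point_map = depth_to_point_map(np.ones((480, 640)),
                               fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(point_map.shape)  # (480, 640, 3)
```

Because the grid is never subsampled, every pixel retains its exact 3D coordinate, which is what allows the point map to be encoded independently and fused with standard vision backbones.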
PMP integrates multiple modalities including language, RGB images, and structured point maps. Language instructions are encoded via a pretrained CLIP model, while separate visual encoders extract features from the image and structured point map modalities. The resulting tokens are jointly fed into a cross-modal X-Block and processed by the xLSTM backbone to produce denoised action predictions, enabling efficient multi-modal imitation learning.
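The two-stream encoding and token-level fusion described above can be sketched as follows. This is an illustrative stand-in only: the simple Conv2d patch embeddings and the placeholder Transformer layer (substituting for the cross-modal X-Block and xLSTM backbone) are assumptions for the sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Sketch of the fusion idea: RGB and point-map inputs are embedded by
    separate encoders, then their tokens are concatenated with language
    tokens and processed jointly as one sequence."""
    def __init__(self, dim=128, patch=16):
        super().__init__()
        # Point map is treated exactly like a 3-channel image (x, y, z).
        self.rgb_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pmap_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Placeholder sequence mixer standing in for the X-Block / xLSTM.
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                batch_first=True)

    def forward(self, rgb, point_map, lang_tokens):
        rgb_tok = self.rgb_embed(rgb).flatten(2).transpose(1, 2)        # (B, N, dim)
        pmap_tok = self.pmap_embed(point_map).flatten(2).transpose(1, 2)
        tokens = torch.cat([lang_tokens, rgb_tok, pmap_tok], dim=1)     # joint sequence
        return self.mixer(tokens)

model = TwoStreamFusion()
out = model(torch.randn(1, 3, 64, 64),   # RGB image
            torch.randn(1, 3, 64, 64),   # structured point map
            torch.randn(1, 8, 128))      # language tokens (e.g., from CLIP)
print(out.shape)  # torch.Size([1, 40, 128])
```

The design point the sketch captures is that, once the point cloud is a regular grid, both modalities can share the same patch-embedding machinery and enter a single joint token sequence.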
In the Drawer task, the scene contains a cabinet with two drawers and two different objects, a cube and a cylinder. The robot must follow a language-specified instruction to open the designated drawer, pick up the target object, place it inside the drawer, and then close the drawer. The key challenge is handling the random initialization of both the cabinet's position and the objects' locations.
In the Stack task, four cups of different colors and sizes are provided. The robot must stack the cups in a specific order based on their colors. The main challenges lie in accurately recalling the stacking sequence and executing precise placement, as the cups are similar in size and must nest together properly.
In the Pour task, three distinct cups and three different containers are placed in randomized initial positions. The robot must generalize to novel object configurations while maintaining the precision necessary to pour the contents from the cups into the containers without spilling. The primary challenge lies in adapting to varying spatial arrangements while executing controlled and accurate pouring motions.
Unlike standard pick-and-place tasks, the Sweep task requires the robot to acquire a novel sweeping skill. The positions of the broom, dustpan, and trash vary across trials, and even the number of trash items changes. The key challenge is manipulating deformable trash materials that differ from those encountered during training, requiring the policy to exhibit strong generalization and adaptability.
The Fold task requires precise manipulation skills. The goal is to neatly fold a towel that is randomly oriented at the start of each trial. The primary challenge lies in accurately handling the soft, deformable material to achieve a clean and consistent fold despite varying initial conditions.
In the Arrange task, the setup includes a mixing machine and a container. The robot must follow a specific sequence: first, open the mixing machine; next, place the container on the designated pad; and finally, close the machine. This task primarily emphasizes long-horizon planning, requiring the robot to execute a multi-step procedure in the correct order.