Get Started with the Dataset

Download the complete dataset and explore comprehensive documentation to start your research.

Overview of HDF5 File Structure
Group / Dataset        Description (Dimensions)
action                 Leader joint position data (14,)
observations           Sensor observations collected per timestep
  depth                Depth camera frames
    dcam_high          High-resolution depth image (480, 640)
    dcam_low           Wide-angle depth image (480, 640)
  images               RGB camera frames
    cam_high           Overhead RGB (480, 640, 3), JPEG-compressed with OpenCV
    cam_left_wrist     Left wrist RGB (480, 640, 3), JPEG-compressed with OpenCV
    cam_low            Wide-angle RGB (480, 640, 3), JPEG-compressed with OpenCV
    cam_right_wrist    Right wrist RGB (480, 640, 3), JPEG-compressed with OpenCV
  qpos                 Follower joint position data (14,)
  qvel                 Follower joint velocity data (14,)
text                   Textual metadata (prompts and embeddings)
  prompt               Text prompt (string)
  text_embedding       Text embedding vector (384,)
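
To verify the layout of a downloaded episode, h5py can walk the file tree directly. A minimal sketch (the file name episode_001.hdf5 is a placeholder):

import h5py

# Print every group and dataset in one episode file, with shapes and dtypes.
def print_node(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    else:
        print(f"{name}/ (group)")

with h5py.File('episode_001.hdf5', 'r') as f:  # placeholder file name
    f.visititems(print_node)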

Dataset Usage Example

Below is a simplified example of how to load and process the AIST Bimanual Manipulation Dataset for robot learning applications.

PyTorch Dataset Implementation

Core implementation for loading bimanual manipulation data


import cv2
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

class SampleLoader(Dataset):
    """Dataset loader for AIST Bimanual Manipulation data"""

    def __init__(self, episodes, camera_names=['cam_high', 'cam_low'],
                 obs_horizon=2, action_horizon=8):
        self.episodes = episodes
        self.camera_names = camera_names
        # Number of stacked observation frames / length of the predicted
        # action chunk (example defaults; tune for your policy)
        self.obs_horizon = obs_horizon
        self.action_horizon = action_horizon

        # RGB images are stored JPEG-compressed with OpenCV
        self.is_compressed = True

        # Load valid sample indices
        self._load_episodes()

    def _load_episodes(self):
        """Load episode information and calculate valid samples"""
        self.samples = []

        for episode_path in self.episodes:
            with h5py.File(episode_path, 'r') as f:
                episode_len = f['/observations/qpos'].shape[0]

            # Ensure we have enough data for observation and action sequences
            min_start = self.obs_horizon - 1
            max_start = episode_len - self.action_horizon

            for start_ts in range(min_start, max_start + 1):
                self.samples.append((episode_path, start_ts))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        episode_path, start_ts = self.samples[idx]

        with h5py.File(episode_path, 'r') as f:
            # Load observation sequence (multiple frames)
            obs_start = start_ts - self.obs_horizon + 1
            obs_end = start_ts + 1

            # Joint positions and velocities
            qpos = f['/observations/qpos'][obs_start:obs_end]
            qvel = f['/observations/qvel'][obs_start:obs_end]

            # Multi-camera images
            images = {}
            for cam in self.camera_names:
                images[cam] = f[f'/observations/images/{cam}'][obs_start:obs_end]

            # Decompress JPEG-encoded frames into (T, H, W, 3) arrays
            if self.is_compressed:
                for cam_name in images:
                    decompressed_images = []
                    for img_compressed in images[cam_name]:
                        decompressed_img = cv2.imdecode(img_compressed, cv2.IMREAD_COLOR)
                        decompressed_images.append(decompressed_img)
                    images[cam_name] = np.array(decompressed_images)

            # Action sequence
            actions = f['/action'][start_ts:start_ts + self.action_horizon]

            # Task description (if available)
            task_prompt = f['/text/prompt'][()].decode('utf-8')

        return {
            'qpos': torch.tensor(qpos, dtype=torch.float32),
            'qvel': torch.tensor(qvel, dtype=torch.float32),
            'images': {k: torch.tensor(v, dtype=torch.float32) / 255.0
                       for k, v in images.items()},
            'actions': torch.tensor(actions, dtype=torch.float32),
            'task_prompt': task_prompt,
        }

# Usage example
dataset = SampleLoader(
    episodes=['episode_001.hdf5', 'episode_002.hdf5'],
    camera_names=['cam_high', 'cam_low', 'cam_left_wrist', 'cam_right_wrist']
)

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
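
Iterating the DataLoader yields batched tensors; the exact shapes depend on the obs_horizon and action_horizon configured above. A quick sanity check, assuming the example defaults:

# Fetch one batch and inspect tensor shapes.
batch = next(iter(dataloader))
print(batch['qpos'].shape)                # (32, obs_horizon, 14)
print(batch['actions'].shape)             # (32, action_horizon, 14)
print(batch['images']['cam_high'].shape)  # (32, obs_horizon, 480, 640, 3)
print(batch['task_prompt'][:2])           # list of prompt strings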
    

Key Implementation Notes

Temporal Sequences:

Use obs_horizon to stack multiple observation frames for temporal understanding

Multi-Camera Data:

Process multiple camera views simultaneously for comprehensive spatial understanding

Action Sequences:

Load action chunks for trajectory prediction and policy learning

Task Context:

Include text prompts and embeddings for language-conditioned learning
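
For language conditioning, the precomputed 384-dimensional vector under /text/text_embedding can be read directly instead of re-embedding the prompt. A minimal sketch (the file name is a placeholder, and the concatenation is one common conditioning pattern, not a prescribed recipe):

import h5py
import torch

# Read the per-episode prompt and its precomputed 384-d embedding.
with h5py.File('episode_001.hdf5', 'r') as f:  # placeholder file name
    prompt = f['/text/prompt'][()].decode('utf-8')
    text_emb = torch.tensor(f['/text/text_embedding'][()], dtype=torch.float32)

# Example: concatenate the text embedding with the robot state
# before feeding a policy network.
qpos = torch.zeros(14)  # placeholder for a real qpos observation
conditioned_input = torch.cat([qpos, text_emb])  # shape: (14 + 384,) = (398,)
print(prompt, conditioned_input.shape)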

Key Dataset Features

Advanced Bimanual Tasks: 117 episodes with natural human-like manipulation strategies
Multi-View Visual Data: 4-camera synchronized recording (480×640, 30 FPS) for robust visual learning
Precise Motion Tracking: 14-DoF joint data at 50 Hz with synchronized gripper states
Research-Ready Format: HDF5 and RMB formats compatible with standard APIs, with RLDS and LeRobot support coming soon
Diverse Skill Levels: Tasks ranging from basic to advanced complexity, supporting a broad range of robot learning approaches

Dataset Growth Overview