Dataset Details
Comprehensive information about data structure, formats, and specifications
Get Started with the Dataset
Download the complete dataset and explore comprehensive documentation to start your research.
Group | Description (Dimensions) |
---|---|
action | Leader joint position data (14,) |
observations | Sensor observations collected per timestep (group contains depth and images subgroups) |
observations/depth | Depth camera frames |
observations/depth/dcam_high | High-resolution depth image (480, 640) |
observations/depth/dcam_low | Wide-angle depth image (480, 640) |
observations/images | RGB camera frames |
observations/images/cam_high | Overhead RGB (480, 640, 3) - JPEG compressed with OpenCV |
observations/images/cam_left_wrist | Left wrist RGB (480, 640, 3) - JPEG compressed with OpenCV |
observations/images/cam_low | Wide-angle RGB (480, 640, 3) - JPEG compressed with OpenCV |
observations/images/cam_right_wrist | Right wrist RGB (480, 640, 3) - JPEG compressed with OpenCV |
observations/qpos | Follower joint position data (14,) |
observations/qvel | Follower joint velocity data (14,) |
text | Textual metadata (prompts and embeddings) |
text/prompt | Text prompt (string) |
text/text_embedding | Text embedding vector (384,) |
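To check this layout against a downloaded episode, you can walk the HDF5 tree with h5py. The snippet below is a minimal sketch; the file name episode_001.hdf5 is a placeholder for any episode file.

import h5py

def print_structure(episode_path):
    """Print every dataset in an episode file with its shape and dtype."""
    with h5py.File(episode_path, 'r') as f:
        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
        f.visititems(show)

# File name is a placeholder for any downloaded episode
print_structure('episode_001.hdf5')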
Dataset Usage Example
Below is a simplified example of how to load and process the AIST Bimanual Manipulation Dataset for robot learning applications.
PyTorch Dataset Implementation
Core implementation for loading bimanual manipulation data
import cv2
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SampleLoader(Dataset):
    """Dataset loader for AIST Bimanual Manipulation data"""

    def __init__(self, episodes, camera_names=('cam_high', 'cam_low'),
                 obs_horizon=1, action_horizon=1, is_compressed=True):
        self.episodes = episodes
        self.camera_names = list(camera_names)
        self.obs_horizon = obs_horizon        # number of observation frames per sample
        self.action_horizon = action_horizon  # number of future actions per sample
        self.is_compressed = is_compressed    # RGB frames are stored JPEG-compressed
        # Load valid sample indices
        self._load_episodes()

    def _load_episodes(self):
        """Load episode information and calculate valid samples"""
        self.samples = []
        for episode_path in self.episodes:
            with h5py.File(episode_path, 'r') as f:
                episode_len = f['/observations/qpos'].shape[0]
            # Ensure we have enough data for observation and action sequences
            min_start = self.obs_horizon - 1
            max_start = episode_len - self.action_horizon
            for start_ts in range(min_start, max_start + 1):
                self.samples.append((episode_path, start_ts))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        episode_path, start_ts = self.samples[idx]
        with h5py.File(episode_path, 'r') as f:
            # Load observation sequence (multiple frames)
            obs_start = start_ts - self.obs_horizon + 1
            obs_end = start_ts + 1

            # Joint positions and velocities
            qpos = f['/observations/qpos'][obs_start:obs_end]
            qvel = f['/observations/qvel'][obs_start:obs_end]

            # Multi-camera images
            images = {}
            for cam in self.camera_names:
                images[cam] = f[f'/observations/images/{cam}'][obs_start:obs_end]

            # Decompress JPEG-encoded frames into (T, H, W, 3) uint8 arrays
            if self.is_compressed:
                for cam_name in images:
                    decompressed = [cv2.imdecode(frame, cv2.IMREAD_COLOR)
                                    for frame in images[cam_name]]
                    images[cam_name] = np.array(decompressed)

            # Action sequence
            actions = f['/action'][start_ts:start_ts + self.action_horizon]

            # Task description (if available)
            task_prompt = f['/text/prompt'][()].decode('utf-8')

        return {
            'qpos': torch.tensor(qpos, dtype=torch.float32),
            'qvel': torch.tensor(qvel, dtype=torch.float32),
            'images': {k: torch.tensor(v, dtype=torch.float32) / 255.0
                       for k, v in images.items()},
            'actions': torch.tensor(actions, dtype=torch.float32),
            'task_prompt': task_prompt
        }

# Usage example
dataset = SampleLoader(
    episodes=['episode_001.hdf5', 'episode_002.hdf5'],
    camera_names=['cam_high', 'cam_low', 'cam_left_wrist', 'cam_right_wrist']
)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
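Once the dataloader is constructed, batches arrive as dictionaries of tensors. The shapes below are a sketch assuming the default obs_horizon and action_horizon of 1 and the (480, 640) image resolution from the table above.

# Minimal sketch of consuming one batch (shapes assume obs_horizon = action_horizon = 1)
batch = next(iter(dataloader))
print(batch['qpos'].shape)                # (32, obs_horizon, 14)
print(batch['actions'].shape)             # (32, action_horizon, 14)
print(batch['images']['cam_high'].shape)  # (32, obs_horizon, 480, 640, 3)
print(batch['task_prompt'][:2])           # list of task prompt strings

# Frames are channels-last floats in [0, 1]; most vision backbones expect channels-first
cam_high = batch['images']['cam_high'].permute(0, 1, 4, 2, 3)  # (B, T, 3, H, W)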
Key Implementation Notes
Temporal Sequences: Use obs_horizon to stack multiple observation frames for temporal understanding
Multi-Camera Data: Process multiple camera views simultaneously for comprehensive spatial understanding
Action Sequences: Load action chunks for trajectory prediction and policy learning
Task Context: Include text prompts and embeddings for language-conditioned learning (see the sketch below)
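As one possible use of the /text group, the 384-dimensional embedding can be fed to a policy as a conditioning vector. The projection layer, its sizes, and the concatenation with a joint-position frame below are illustrative assumptions, not part of the dataset specification.

import h5py
import torch
import torch.nn as nn

# Illustrative sketch: condition a policy input on the per-episode text embedding
with h5py.File('episode_001.hdf5', 'r') as f:  # placeholder file name
    prompt = f['/text/prompt'][()].decode('utf-8')
    text_emb = torch.tensor(f['/text/text_embedding'][:], dtype=torch.float32)  # (384,)

proj = nn.Linear(384, 64)      # assumed projection to a smaller context vector
qpos_frame = torch.zeros(14)   # stand-in for one follower joint-position frame
policy_input = torch.cat([qpos_frame, proj(text_emb)])  # (78,) language-conditioned input
print(prompt, policy_input.shape)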
Key Dataset Features
Advanced Bimanual Tasks: 117 episodes with natural human-like manipulation strategies
Multi-View Visual Data: 4-camera synchronized recording (480×640, 30 FPS) for robust visual learning
Precise Motion Tracking: 14-DoF joint data at 50 Hz with synchronized gripper states
Research-Ready Format: HDF5 and RMB, with RLDS and LeRobot coming soon; compatible with standard APIs
Diverse Skill Levels: Tasks range from basic to advanced complexity, supporting a broad range of robot learning approaches