Implementation:Facebookresearch Habitat lab SimpleCNN
| Knowledge Sources | |
|---|---|
| Domains | Embodied_AI, Visual_Encoding |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
SimpleCNN is a lightweight convolutional neural network with three convolutional layers and one fully connected layer. It takes RGB and/or depth observations and produces a fixed-size embedding vector for use in policy networks.
Description
SimpleCNN extends nn.Module and inspects the observation space to determine which of the RGB and depth inputs are available. It builds a sequential CNN with three convolutional layers (kernel sizes 8x8, 4x4, and 3x3, with strides 4, 2, and 1) followed by a flatten operation and a fully connected layer that maps to the specified output size. The convolutional channel counts progress as (input_channels) -> 32 -> 64 -> 32, and the fully connected layer then maps the flattened features to output_size. ReLU activations follow the first two convolutions and the final linear layer; the third convolution has no activation. All convolutional and linear weights are initialized with Kaiming normal initialization tuned for ReLU. If neither RGB nor depth is present in the observation space (the is_blind property is True), an empty sequential module is used instead. During the forward pass, RGB observations are normalized to [0, 1], and both modalities are permuted to NCHW format and concatenated along the channel dimension.
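The layer stack and initialization described above can be sketched in plain PyTorch. This is an illustrative reconstruction from the description, not the code in simple_cnn.py: the helper name `build_simple_cnn_sketch` and its parameters are hypothetical, zero padding and unit dilation are assumed, and bias initialization is left at the PyTorch default.

```python
import torch
from torch import nn


def build_simple_cnn_sketch(
    in_channels: int, height: int, width: int, output_size: int
) -> nn.Sequential:
    """Hypothetical helper mirroring the described layer stack (not the Habitat source)."""
    convs = nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 32, kernel_size=3, stride=1),  # no ReLU after the third conv
        nn.Flatten(),
    )
    # Infer the flattened feature count with a dummy forward pass.
    with torch.no_grad():
        n_flat = convs(torch.zeros(1, in_channels, height, width)).shape[1]

    model = nn.Sequential(*convs, nn.Linear(n_flat, output_size), nn.ReLU())

    # Kaiming normal initialization (ReLU gain) for conv and linear weights.
    for layer in model:
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
    return model


# 4 input channels = 3 (RGB) + 1 (depth), already in NCHW layout.
model = build_simple_cnn_sketch(in_channels=4, height=256, width=256, output_size=512)
features = model(torch.rand(2, 4, 256, 256))
print(features.shape)  # torch.Size([2, 512])
```

Because the final layer is a ReLU, every entry of the resulting embedding is non-negative.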
Usage
Use SimpleCNN as a visual encoder in RL policies that require a compact feature representation from RGB and/or depth observations. It is designed for straightforward visual processing tasks where a lightweight architecture is sufficient.
Code Reference
Source Location
- Repository: Facebookresearch_Habitat_lab
- File: habitat-baselines/habitat_baselines/rl/models/simple_cnn.py
- Lines: 12-158
Signature
class SimpleCNN(nn.Module):
    def __init__(
        self,
        observation_space,
        output_size,
    ):
    def forward(self, observations: Dict[str, torch.Tensor]):
Import
from habitat_baselines.rl.models.simple_cnn import SimpleCNN
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| observation_space | gym.spaces.Dict | Yes | Observation space containing optional "rgb" and/or "depth" entries with shape (H, W, C) |
| output_size | int | Yes | Dimensionality of the output embedding vector |
| observations | Dict[str, torch.Tensor] | Yes | Dictionary of observation tensors passed to forward(); expects "rgb" as (B, H, W, 3) uint8 and/or "depth" as (B, H, W, 1) float |
Outputs
| Name | Type | Description |
|---|---|---|
| embedding | torch.Tensor | Embedding vector of shape (batch_size, output_size) |
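The input layout in this contract can be traced through the preprocessing that the Description attributes to forward(). The sketch below is a standalone illustration under those assumptions, not the library's code: `preprocess_sketch` is a hypothetical name, and dividing by 255 is one way to map uint8 RGB values to [0, 1].

```python
from typing import Dict

import torch


def preprocess_sketch(observations: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Hypothetical helper: build an NCHW tensor from NHWC "rgb" and/or "depth" inputs."""
    parts = []
    if "rgb" in observations:
        rgb = observations["rgb"].permute(0, 3, 1, 2)  # (B, H, W, 3) -> (B, 3, H, W)
        parts.append(rgb.float() / 255.0)  # uint8 [0, 255] -> float [0, 1]
    if "depth" in observations:
        depth = observations["depth"].permute(0, 3, 1, 2)  # (B, H, W, 1) -> (B, 1, H, W)
        parts.append(depth.float())
    # Concatenate the available modalities along the channel dimension.
    return torch.cat(parts, dim=1)


batch = {
    "rgb": torch.randint(0, 256, (8, 64, 64, 3), dtype=torch.uint8),
    "depth": torch.rand(8, 64, 64, 1),
}
cnn_input = preprocess_sketch(batch)
print(cnn_input.shape)  # torch.Size([8, 4, 64, 64])
```

With only one modality present, the channel dimension shrinks accordingly (3 for RGB only, 1 for depth only).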
Key Properties
is_blind
@property
def is_blind(self) -> bool:
Returns True if neither RGB nor depth channels are present in the observation space.
Architecture
| Layer | Type | Kernel | Stride | Output Channels |
|---|---|---|---|---|
| Conv1 | Conv2d | 8x8 | 4 | 32 |
| ReLU1 | ReLU | - | - | - |
| Conv2 | Conv2d | 4x4 | 2 | 64 |
| ReLU2 | ReLU | - | - | - |
| Conv3 | Conv2d | 3x3 | 1 | 32 |
| Flatten | Flatten | - | - | - |
| FC | Linear | - | - | output_size |
| ReLU3 | ReLU | - | - | - |
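The input width of the FC layer depends on the observation resolution. As a worked example, the standard convolution output-size formula gives the spatial size after each layer, assuming zero padding and unit dilation (the table does not state either). The function name below is illustrative.

```python
def conv_output_size(
    size: int, kernel: int, stride: int, padding: int = 0, dilation: int = 1
) -> int:
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


layers = [(8, 4), (4, 2), (3, 1)]  # (kernel, stride) for Conv1, Conv2, Conv3

size = 256  # height (and width) of a square 256x256 observation
for kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    print(size)  # 63, then 30, then 28

flattened = 32 * size * size  # Conv3 has 32 output channels
print(flattened)  # 25088
```

Under these assumptions, a 256x256 observation yields 25088 flattened features going into the FC layer.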
Usage Examples
Basic Usage
import torch
import gym.spaces as spaces
import numpy as np

from habitat_baselines.rl.models.simple_cnn import SimpleCNN

# Define an observation space with RGB and depth
obs_space = spaces.Dict({
    "rgb": spaces.Box(low=0, high=255, shape=(256, 256, 3), dtype=np.uint8),
    "depth": spaces.Box(low=0.0, high=1.0, shape=(256, 256, 1), dtype=np.float32),
})

cnn = SimpleCNN(observation_space=obs_space, output_size=512)

# Forward pass on a batch of 8 observations.
# torch.randint excludes the upper bound, so 256 covers the full uint8 range.
observations = {
    "rgb": torch.randint(0, 256, (8, 256, 256, 3), dtype=torch.uint8),
    "depth": torch.rand(8, 256, 256, 1),
}
embedding = cnn(observations)
# embedding shape: (8, 512)