Implementation:Facebookresearch Habitat lab SimpleCNN

Knowledge Sources	Facebookresearch_Habitat_lab
Domains	Embodied_AI, Visual_Encoding
Last Updated	2026-02-15 00:00 GMT

Overview

SimpleCNN is a lightweight 3-layer convolutional neural network that takes RGB and/or depth observations and produces a fixed-size embedding vector for use in policy networks.

Description

SimpleCNN extends nn.Module and inspects the observation space to determine whether RGB and/or depth inputs are available. It builds a sequential CNN with three convolutional layers (kernel sizes 8x8, 4x4, and 3x3, with strides 4, 2, and 1), followed by a flatten operation and a fully connected layer that maps to the specified output size. The channel counts progress as (input_channels) -> 32 -> 64 -> 32, and the flattened features are then projected to output_size. ReLU activations are applied after the first two convolutions and after the final linear layer; the third convolution feeds the flatten operation directly. All convolutional and linear weights are initialized with Kaiming normal initialization tuned for ReLU.

If neither RGB nor depth is present in the observation space (the is_blind property is True), an empty sequential module is used instead. During the forward pass, RGB observations are normalized to [0, 1], and both modalities are permuted from NHWC to NCHW format before being concatenated along the channel dimension.
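
The layer stack described above can be approximated in plain PyTorch. The following is a minimal sketch, not the library source: the function name build_simple_cnn_sketch is hypothetical, zero padding and default dilation are assumed, and the initialization call is one plausible reading of "Kaiming normal tuned for ReLU".

```python
import torch
from torch import nn


def build_simple_cnn_sketch(
    in_channels: int, height: int, width: int, output_size: int
) -> nn.Sequential:
    """Hypothetical re-creation of the described layer stack (assumes no padding)."""
    convs = nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 32, kernel_size=3, stride=1),
        nn.Flatten(),
    )
    # Infer the flattened feature size by pushing a dummy batch through the convs.
    with torch.no_grad():
        flat_dim = convs(torch.zeros(1, in_channels, height, width)).shape[1]

    model = nn.Sequential(*convs, nn.Linear(flat_dim, output_size), nn.ReLU())

    # Assumed initialization: Kaiming normal with the ReLU nonlinearity.
    for layer in model:
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
    return model


# 4 input channels = 3 RGB channels + 1 depth channel.
model = build_simple_cnn_sketch(in_channels=4, height=64, width=64, output_size=128)
features = model(torch.rand(2, 4, 64, 64))  # expects NCHW float input
```

Because the last layer is a ReLU, every value in the resulting embedding is non-negative.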

Usage

Use SimpleCNN as a visual encoder in RL policies that require a compact feature representation from RGB and/or depth observations. It is designed for straightforward visual processing tasks where a lightweight architecture is sufficient.

Code Reference

Source Location

Repository: Facebookresearch_Habitat_lab
File: habitat-baselines/habitat_baselines/rl/models/simple_cnn.py
Lines: 12-158

Signature

class SimpleCNN(nn.Module):
    def __init__(
        self,
        observation_space,
        output_size,
    ):
    def forward(self, observations: Dict[str, torch.Tensor]):

Import

from habitat_baselines.rl.models.simple_cnn import SimpleCNN

I/O Contract

Inputs

Name	Type	Required	Description
observation_space	gym.spaces.Dict	Yes	Observation space containing optional "rgb" and/or "depth" entries with shape (H, W, C)
output_size	int	Yes	Dimensionality of the output embedding vector
observations	Dict[str, torch.Tensor]	Yes	Dictionary of observation tensors passed to forward(); expects "rgb" as (B, H, W, 3) uint8 and/or "depth" as (B, H, W, 1) float

Outputs

Name	Type	Description
embedding	torch.Tensor	Embedding vector of shape (batch_size, output_size)
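
The input contract above implies a preprocessing step before the convolutions: channels-last tensors are moved to channels-first, and RGB is rescaled. A minimal sketch of that step follows, assuming the tensor layouts listed in the Inputs table; the helper name to_nchw_input is hypothetical and is not part of habitat_baselines.

```python
from typing import Dict

import torch


def to_nchw_input(observations: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Hypothetical helper: build the NCHW tensor a CNN would consume."""
    parts = []
    if "rgb" in observations:
        # (B, H, W, 3) uint8 -> (B, 3, H, W) float in [0, 1]
        rgb = observations["rgb"].permute(0, 3, 1, 2)
        parts.append(rgb.float() / 255.0)
    if "depth" in observations:
        # (B, H, W, 1) float -> (B, 1, H, W)
        parts.append(observations["depth"].permute(0, 3, 1, 2))
    # Concatenate the available modalities along the channel dimension.
    return torch.cat(parts, dim=1)


batch = {
    "rgb": torch.randint(0, 256, (8, 64, 64, 3), dtype=torch.uint8),
    "depth": torch.rand(8, 64, 64, 1),
}
cnn_input = to_nchw_input(batch)
print(cnn_input.shape)  # torch.Size([8, 4, 64, 64])
```

This sketch does not handle the blind case: with neither key present, torch.cat would receive an empty list and raise an error.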

Key Properties

is_blind

@property
def is_blind(self) -> bool

Returns True if neither RGB nor depth channels are present in the observation space.
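
The logic behind this property can be illustrated without any Habitat dependencies. The sketch below is an assumption-laden stand-in: it uses a plain dict of shapes in place of a gym.spaces.Dict, and the helper names count_channels and is_blind are hypothetical.

```python
from typing import Dict, Tuple


def count_channels(shapes: Dict[str, Tuple[int, ...]]) -> Tuple[int, int]:
    """Return (rgb_channels, depth_channels) from (H, W, C) shapes; 0 if absent."""
    n_rgb = shapes["rgb"][2] if "rgb" in shapes else 0
    n_depth = shapes["depth"][2] if "depth" in shapes else 0
    return n_rgb, n_depth


def is_blind(shapes: Dict[str, Tuple[int, ...]]) -> bool:
    """True when no visual channels are available at all."""
    n_rgb, n_depth = count_channels(shapes)
    return n_rgb + n_depth == 0


rgbd = {"rgb": (256, 256, 3), "depth": (256, 256, 1)}
gps_only = {"gps": (2,)}  # hypothetical non-visual sensor

print(count_channels(rgbd))  # (3, 1)
print(is_blind(rgbd))        # False
print(is_blind(gps_only))    # True
```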

Architecture

Layer	Type	Kernel	Stride	Output Channels
Conv1	Conv2d	8x8	4	32
ReLU1	ReLU	-	-	-
Conv2	Conv2d	4x4	2	64
ReLU2	ReLU	-	-	-
Conv3	Conv2d	3x3	1	32
Flatten	Flatten	-	-	-
FC	Linear	-	-	output_size
ReLU3	ReLU	-	-	-
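
The input width of the FC layer depends on the spatial size left after the three convolutions. The sketch below works through that arithmetic using the standard Conv2d output-size formula; it assumes zero padding and a dilation of 1, which the table does not state, and the function names are hypothetical.

```python
def conv_output_size(
    size: int, kernel: int, stride: int, padding: int = 0, dilation: int = 1
) -> int:
    """Standard Conv2d output-size formula for one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# (kernel, stride) pairs taken from the architecture table.
LAYERS = [(8, 4), (4, 2), (3, 1)]


def flattened_features(height: int, width: int, out_channels: int = 32) -> int:
    """Number of features entering the FC layer for an (H, W) input."""
    for kernel, stride in LAYERS:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
    return out_channels * height * width


# 256 -> 63 -> 30 -> 28 per spatial dimension, so 32 * 28 * 28 features.
print(flattened_features(256, 256))  # 25088
```

Under these assumptions, a 256x256 input yields a 25088-dimensional vector before the final projection to output_size.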

Usage Examples

Basic Usage

import torch
import gym.spaces as spaces
import numpy as np
from habitat_baselines.rl.models.simple_cnn import SimpleCNN

# Define observation space with RGB and depth
obs_space = spaces.Dict({
    "rgb": spaces.Box(low=0, high=255, shape=(256, 256, 3), dtype=np.uint8),
    "depth": spaces.Box(low=0.0, high=1.0, shape=(256, 256, 1), dtype=np.float32),
})

cnn = SimpleCNN(observation_space=obs_space, output_size=512)

# Forward pass
observations = {
    "rgb": torch.randint(0, 256, (8, 256, 256, 3), dtype=torch.uint8),  # upper bound is exclusive
    "depth": torch.rand(8, 256, 256, 1),
}
embedding = cnn(observations)
# embedding shape: (8, 512)
