Implementation:Facebookresearch Habitat lab EQA Models

Knowledge Sources	Facebookresearch_Habitat_lab
Domains	Embodied_AI, Embodied_Question_Answering, Deep_Learning
Last Updated	2026-02-15 00:00 GMT

Overview

This module implements the neural network models used for Embodied Question Answering (EQA): a multi-task CNN encoder-decoder, a question LSTM encoder, a VQA attention model, a navigation planner-controller model, and supporting components (an MLP builder, a masked loss criterion, and a configurable navigation RNN).

Description

The file contains several key classes and utility functions for the EQA pipeline:

build_mlp -- A utility function that constructs a multi-layer perceptron (MLP) as an nn.Sequential, with configurable hidden dimensions, optional batch normalization (use_batchnorm) and dropout, and a final sigmoid activation that is applied by default (add_sigmoid=True).
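
As a rough illustration of the kind of layer stack such a helper assembles, the sketch below chains Linear, optional BatchNorm, optional Dropout, and ReLU blocks, then an output layer and an optional sigmoid. The name `sketch_mlp` and the exact layer ordering are assumptions made for this example. It is not the library's implementation.

```python
from typing import Iterable, List

import torch
from torch import nn


def sketch_mlp(
    input_dim: int,
    hidden_dims: Iterable[int],
    output_dim: int,
    use_batchnorm: bool = False,
    dropout: float = 0.0,
    add_sigmoid: bool = True,
) -> nn.Sequential:
    """Hypothetical stand-in for build_mlp; the layer ordering is assumed."""
    layers: List[nn.Module] = []
    in_dim = input_dim
    for hidden_dim in hidden_dims:
        layers.append(nn.Linear(in_dim, hidden_dim))
        if use_batchnorm:
            layers.append(nn.BatchNorm1d(hidden_dim))
        if dropout > 0:
            layers.append(nn.Dropout(p=dropout))
        layers.append(nn.ReLU())
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, output_dim))
    if add_sigmoid:
        # Squash outputs into [0, 1], e.g. for a binary decision.
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)


# eval() disables dropout so the forward pass is deterministic.
mlp = sketch_mlp(64, (32, 16), 2, dropout=0.5).eval()
out = mlp(torch.randn(4, 64))  # shape: (4, 2)
```

Returning an nn.Sequential keeps the result composable: it can be used as a classifier head or nested inside a larger module without extra wrapper code.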

MultitaskCNN -- A convolutional neural network that jointly predicts semantic segmentation, depth maps, and autoencoder RGB reconstruction from RGB input images. It uses a four-block convolutional encoder with skip connections (similar to FCN) for multi-scale feature fusion. When used as an encoder only (only_encoder=True), it outputs a flattened 4608-dimensional feature vector.
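
The 4608 figure can be sanity-checked with the standard convolution output-size formula. The sketch below assumes a 256x256 input, four blocks that each apply an unpadded 5x5 convolution followed by 2x2 max-pooling, and 32 channels in the last block. These hyperparameters are assumptions chosen because they reproduce the number; they are not stated on this page.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 256
for _ in range(4):
    size = conv_out(size, kernel=5)            # assumed 5x5 conv, no padding
    size = conv_out(size, kernel=2, stride=2)  # assumed 2x2 max-pool

final_channels = 32  # assumed channel count of the last block
feature_dim = final_channels * size * size
print(size, feature_dim)  # 12 4608
```

Under these assumptions the spatial size shrinks 256 -> 126 -> 61 -> 28 -> 12, so the flattened encoder output has 32 * 12 * 12 = 4608 elements. A different input resolution would change this dimension.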

QuestionLstmEncoder -- An LSTM-based encoder that takes tokenized question sequences and returns the hidden state at the last non-padding token position. It handles variable-length sequences by gathering the output at the appropriate index.
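
The index lookup described above can be illustrated without any tensors. This is a minimal sketch that assumes right-padded sequences and a padding index of 0; the function name `last_token_indices` is hypothetical.

```python
from typing import List


def last_token_indices(batch: List[List[int]], pad_idx: int = 0) -> List[int]:
    """Index of the last non-padding token in each right-padded sequence."""
    indices = []
    for seq in batch:
        idx = len(seq) - 1  # default: the sequence has no padding
        for t in range(len(seq) - 1):
            if seq[t] != pad_idx and seq[t + 1] == pad_idx:
                idx = t
                break
        indices.append(idx)
    return indices


questions = [[1, 3, 4, 2, 0], [1, 3, 2, 0, 0], [1, 3, 4, 4, 2]]
print(last_token_indices(questions))  # [3, 2, 4]
```

In a tensor implementation, indices like these would typically be passed to a gather along the time dimension of the LSTM output, so that each sample contributes the hidden state of its own final token rather than a state computed over padding.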

VqaLstmCnnAttentionModel -- A full VQA model that combines the MultitaskCNN (as a pretrained encoder), the QuestionLstmEncoder, and an attention mechanism. It computes question-conditioned attention weights over a sequence of image frames, fuses the attention-weighted image features with the question encoding by element-wise multiplication, and classifies the fused features into an answer.
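
The attention-and-fusion step can be sketched on random tensors as follows. The tensor sizes and the single linear scoring layer are illustrative assumptions, not the model's actual configuration.

```python
import torch

torch.manual_seed(0)
N, T, D = 2, 5, 64  # batch size, frames, feature size (illustrative)

img_feats = torch.randn(N, T, D)  # per-frame image features
ques_feats = torch.randn(N, D)    # one question encoding per sample

# Hypothetical scoring layer: maps [question; frame] features to a scalar.
score_layer = torch.nn.Linear(2 * D, 1)

ques_tiled = ques_feats.unsqueeze(1).expand(N, T, D)
att_scores = score_layer(torch.cat([ques_tiled, img_feats], dim=2)).squeeze(2)
att_probs = torch.softmax(att_scores, dim=1)  # (N, T), each row sums to 1

# Attention-weighted sum over frames, then multiplicative fusion.
attended = (att_probs.unsqueeze(2) * img_feats).sum(dim=1)  # (N, D)
fused = ques_feats * attended                               # (N, D)
```

The softmax is taken over the frame axis, so att_probs can be read directly as how much each frame contributes to the answer for a given question.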

MaskedNLLCriterion -- A masked negative log-likelihood loss criterion that computes loss only over non-masked positions, used for variable-length action sequences.
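
A masked negative log-likelihood can be sketched in a few lines. The function name `masked_nll`, the tensor layout, and the averaging over the number of unmasked positions are assumptions made for this example.

```python
import torch


def masked_nll(log_probs: torch.Tensor, targets: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """Mean negative log-likelihood over positions where mask is True.

    log_probs: (M, C) log-probabilities, targets: (M, 1) class indices,
    mask: (M, 1) boolean, True for real (non-padded) steps.
    """
    picked = torch.gather(log_probs, 1, targets)  # log p(target) per row
    return -picked[mask].sum() / mask.sum()


log_probs = torch.log(torch.tensor([[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]))
targets = torch.tensor([[0], [1], [1]])
mask = torch.tensor([[True], [True], [False]])  # third step is padding
loss = masked_nll(log_probs, targets, mask)     # -(ln 0.5 + ln 0.75) / 2
```

Because the third row is masked out, its low-probability target does not contribute to the loss, which is the behaviour wanted when action sequences in a batch are padded to a common length.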

NavPlannerControllerModel -- A hierarchical navigation model with a planner RNN and a controller MLP. The planner selects the next navigation action from image and question features; the controller then makes a binary decision to either continue the current action or hand control back to the planner (replan).
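
The division of labour between planner and controller can be sketched as a plain control loop. Everything here is hypothetical: the `rollout` function, the action names, the two callables that stand in for the neural modules, and the cap on repeats are illustration only and are not taken from the source file.

```python
from typing import Callable, List


def rollout(
    plan: Callable[[int], str],
    keep_going: Callable[[str, int], bool],
    max_steps: int = 10,
    max_repeats: int = 5,
) -> List[str]:
    """Run a planner/controller loop and return the executed actions.

    plan(step) picks the next action; keep_going(action, repeats) says
    whether to repeat it. Both callables are hypothetical stand-ins.
    """
    executed: List[str] = []
    while len(executed) < max_steps:
        action = plan(len(executed))
        executed.append(action)
        if action == "stop":
            break
        repeats = 1
        # Controller: repeat the planner's action until told to replan.
        while (repeats < max_repeats and len(executed) < max_steps
               and keep_going(action, repeats)):
            executed.append(action)
            repeats += 1
    return executed


# Toy policies: a scripted planner, and a controller that repeats
# "forward" until it has been executed three times in a row.
script = iter(["forward", "left", "forward", "stop"])
actions = rollout(lambda step: next(script),
                  lambda action, repeats: action == "forward" and repeats < 3)
print(actions)
```

With these toy policies the planner is consulted only four times while eight actions are executed, which is the point of the hierarchy: the planner runs at a coarser time scale than the controller.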

NavRnn -- A configurable RNN module (LSTM or GRU) that accepts optional image, question, and action inputs. It supports both full-sequence forward passes (with packed sequences) and single-step forward passes for online inference.
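
The two calling modes can be illustrated with a bare nn.GRU. This is a sketch using standard PyTorch utilities, with sizes chosen arbitrarily; it does not reproduce NavRnn's input embedding or output layers.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
rnn = nn.GRU(input_size=8, hidden_size=16, num_layers=1, batch_first=True)

# Full-sequence pass: batch of 2 padded sequences with true lengths 4 and 2.
inputs = torch.randn(2, 4, 8)
lengths = torch.tensor([4, 2])
packed = pack_padded_sequence(inputs, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, h_full = rnn(packed)
outputs, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Single-step pass: feed one timestep at a time and carry the hidden state,
# as an agent would when acting online in the simulator.
hidden = None
for t in range(4):
    step_out, hidden = rnn(inputs[:1, t:t + 1], hidden)
```

Packing lets the RNN skip padded timesteps during training, while the step-wise loop yields the same final hidden state for the first sample without needing the whole sequence up front.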

Usage

These models are used in the EQA training pipeline. MultitaskCNN is pre-trained as an encoder-decoder for feature extraction, VqaLstmCnnAttentionModel is trained for visual question answering, and NavPlannerControllerModel is trained with the PACMAN algorithm for navigation. The models are instantiated and managed by the corresponding trainer classes (EQACNNPretrainTrainer, VQATrainer, PACMANTrainer).

Code Reference

Source Location

Repository: Facebookresearch_Habitat_lab
File: habitat-baselines/habitat_baselines/il/models/models.py
Lines: 1-727

Signature

def build_mlp(
    input_dim: int,
    hidden_dims: Iterable[int],
    output_dim: int,
    use_batchnorm: bool = False,
    dropout: float = 0,
    add_sigmoid: bool = True,
) -> nn.Sequential: ...

class MultitaskCNN(nn.Module):
    def __init__(
        self,
        num_classes: int = 41,
        only_encoder: bool = False,
        pretrained: bool = True,
        checkpoint_path: str = "data/eqa/eqa_cnn_pretrain/checkpoints/epoch_5.ckpt",
        freeze_encoder: bool = False,
    ) -> None: ...

class QuestionLstmEncoder(nn.Module):
    def __init__(
        self,
        token_to_idx: Dict,
        wordvec_dim: int = 64,
        rnn_dim: int = 64,
        rnn_num_layers: int = 2,
        rnn_dropout: float = 0,
    ) -> None: ...

class VqaLstmCnnAttentionModel(nn.Module):
    def __init__(
        self,
        q_vocab: Dict,
        ans_vocab: Dict,
        eqa_cnn_pretrain_ckpt_path: str,
        freeze_encoder: bool = False,
        image_feat_dim: int = 64,
        question_wordvec_dim: int = 64,
        question_hidden_dim: int = 64,
        question_num_layers: int = 2,
        question_dropout: float = 0.5,
        fc_use_batchnorm: bool = False,
        fc_dropout: float = 0.5,
        fc_dims: Iterable[int] = (64,),
    ) -> None: ...

class MaskedNLLCriterion(nn.Module):
    def __init__(self) -> None: ...

class NavPlannerControllerModel(nn.Module):
    def __init__(
        self,
        q_vocab: Dict,
        num_output: int = 4,
        question_wordvec_dim: int = 64,
        question_hidden_dim: int = 64,
        question_num_layers: int = 2,
        question_dropout: float = 0.5,
        planner_rnn_image_feat_dim: int = 128,
        planner_rnn_action_embed_dim: int = 32,
        planner_rnn_type: str = "GRU",
        planner_rnn_hidden_dim: int = 1024,
        planner_rnn_num_layers: int = 1,
        planner_rnn_dropout: float = 0,
        controller_fc_dims: Iterable[int] = (256,),
    ) -> None: ...

class NavRnn(nn.Module):
    def __init__(
        self,
        image_input: bool = False,
        image_feat_dim: int = 128,
        question_input: bool = False,
        question_embed_dim: int = 128,
        action_input: bool = False,
        action_embed_dim: int = 32,
        num_actions: int = 4,
        mode: str = "sl",
        rnn_type: str = "LSTM",
        rnn_hidden_dim: int = 128,
        rnn_num_layers: int = 2,
        rnn_dropout: float = 0,
        return_states: bool = False,
    ) -> None: ...

Import

from habitat_baselines.il.models.models import (
    build_mlp,
    MultitaskCNN,
    QuestionLstmEncoder,
    VqaLstmCnnAttentionModel,
    MaskedNLLCriterion,
    NavPlannerControllerModel,
    NavRnn,
)

I/O Contract

MultitaskCNN Inputs

Name	Type	Required	Description
x	torch.Tensor	Yes	Input RGB image tensor of shape (N, 3, H, W), typically (N, 3, 256, 256)

MultitaskCNN Outputs

Name	Type	Description
out_seg	torch.Tensor	Segmentation prediction of shape (N, num_classes, H, W). When only_encoder=True, the module instead returns only a flattened feature tensor of shape (N, 4608)
out_depth	torch.Tensor	Depth prediction of shape (N, 1, H, W) -- only when only_encoder=False
out_ae	torch.Tensor	Autoencoder RGB reconstruction of shape (N, 3, H, W) -- only when only_encoder=False

VqaLstmCnnAttentionModel Inputs

Name	Type	Required	Description
images	torch.Tensor	Yes	Image frames tensor of shape (N, T, 3, 256, 256) where T is the number of frames
questions	torch.Tensor	Yes	Tokenized question tensor of shape (N, max_question_length)

VqaLstmCnnAttentionModel Outputs

Name	Type	Description
scores	torch.Tensor	Answer classification scores of shape (N, num_answers)
att_probs	torch.Tensor	Attention probabilities over image frames of shape (N, T)

NavPlannerControllerModel Inputs

Name	Type	Required	Description
questions	torch.Tensor	Yes	Tokenized question tensor of shape (N, max_question_length)
planner_img_feats	torch.Tensor	Yes	Planner image features of shape (N, T_p, 4608)
planner_actions_in	torch.Tensor	Yes	Planner input actions of shape (N, T_p)
planner_action_lengths	torch.Tensor	Yes	Lengths of planner action sequences
planner_hidden_index	torch.Tensor	Yes	Indices into planner hidden states for controller
controller_img_feats	torch.Tensor	Yes	Controller image features of shape (N, T_c, 4608)
controller_actions_in	torch.Tensor	Yes	Controller input actions of shape (N, T_c)
controller_action_lengths	torch.Tensor	Yes	Lengths of controller action sequences

NavPlannerControllerModel Outputs

Name	Type	Description
planner_scores	torch.Tensor	Planner action scores
controller_scores	torch.Tensor	Controller binary decision scores (continue/replan)
planner_hidden	torch.Tensor	Final planner hidden states

Usage Examples

Basic Usage: MultitaskCNN as Encoder

import torch
from habitat_baselines.il.models.models import MultitaskCNN

# Initialize as encoder-only with pretrained weights.
# With pretrained=True, the checkpoint file is expected to exist at checkpoint_path.
cnn = MultitaskCNN(
    num_classes=41,
    only_encoder=True,
    pretrained=True,
    checkpoint_path="data/eqa/eqa_cnn_pretrain/checkpoints/epoch_5.ckpt",
    freeze_encoder=True,
)

# Extract features from a batch of images
images = torch.randn(4, 3, 256, 256)
features = cnn(images)  # shape: (4, 4608)

Basic Usage: VQA Model

import torch
from habitat_baselines.il.models.models import VqaLstmCnnAttentionModel

q_vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "what": 3, "color": 4}
ans_vocab = {"red": 0, "blue": 1, "green": 2}

model = VqaLstmCnnAttentionModel(
    q_vocab=q_vocab,
    ans_vocab=ans_vocab,
    eqa_cnn_pretrain_ckpt_path="data/eqa/eqa_cnn_pretrain/checkpoints/epoch_5.ckpt",
)

# frames: batch of 2 samples, 5 frames each, 3x256x256
frames = torch.randn(2, 5, 3, 256, 256)
# Token indices follow q_vocab; index 0 (<pad>) right-pads shorter questions
questions = torch.tensor([[1, 3, 4, 2, 0], [1, 3, 2, 0, 0]], dtype=torch.long)

scores, att_probs = model(frames, questions)
# scores shape: (2, 3)  -- one score per answer
# att_probs shape: (2, 5)  -- attention weights over 5 frames
