Implementation:Facebookresearch Habitat lab Dataset Utils

From Leeroopedia
Knowledge Sources
Domains Embodied_AI, Natural_Language_Processing
Last Updated 2026-02-15 00:00 GMT

Overview

This module provides tokenization, vocabulary management, and shortest-path computation utilities for Embodied Question Answering (EQA) and navigation tasks. The tokenization and vocabulary utilities were originally authored for the Pythia project.

Description

The module contains several utilities organized into text processing, vocabulary management, and navigation helpers:

Text Processing:

  • tokenize(sentence, regex, keep, remove) -- Lowercases a sentence, preserves specified tokens (e.g., "'s"), removes specified punctuation, and splits using a regex pattern. Returns a list of stripped, non-empty tokens.
  • load_str_list(fname) -- Loads a text file and returns a list of stripped lines.
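
The tokenize steps described above (lowercase, space out kept tokens, delete removed tokens, split, strip) can be sketched as follows. This is a minimal illustration, not the module's code: tokenize_sketch and SPLIT_REGEX_SKETCH are hypothetical names, and the split pattern is an assumption because the real SENTENCE_SPLIT_REGEX is not shown on this page.

```python
import re
from typing import Iterable, List

# Assumption: split on runs of characters that are neither word characters
# nor hyphens. The module's actual SENTENCE_SPLIT_REGEX may differ.
SPLIT_REGEX_SKETCH = re.compile(r"[^\w-]+")


def tokenize_sketch(
    sentence: str,
    regex: re.Pattern = SPLIT_REGEX_SKETCH,
    keep: Iterable[str] = ("'s",),
    remove: Iterable[str] = (",", "?"),
) -> List[str]:
    """Hypothetical re-implementation of the steps listed above."""
    sentence = sentence.lower()
    for token in keep:
        # Prepend a space to each kept token.
        sentence = sentence.replace(token, " " + token)
    for token in remove:
        # Delete each removed token outright.
        sentence = sentence.replace(token, "")
    # Split on the pattern, then strip and drop empty pieces.
    return [t.strip() for t in regex.split(sentence) if t.strip()]


print(tokenize_sketch("What color is the chair?"))
# ['what', 'color', 'is', 'the', 'chair']
```

Passing keep and remove as tuples of strings keeps each entry intact during the replacement loops.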

Vocabulary Management:

  • VocabDict -- A vocabulary dictionary mapping words to indices and vice versa. Features:
    • Special tokens: <unk>, <pad>, <s>, </s>
    • Can be initialized from a word list or a file path
    • Provides word2idx(w), idx2word(n_w), and tokenize_and_index(sentence) methods
    • token_idx_2_string(tokens) converts token indices back to a question string
    • stoi / itos properties for string-to-index and index-to-string mappings
  • VocabFromText -- Extends VocabDict to build a vocabulary from a collection of sentences, keeping only tokens that occur at least min_count times. Accepts custom tokenization parameters (regex, keep, remove). When only_unk_extra=True, <unk> is the only special token added to the vocabulary.
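
As a rough illustration of the two-way mapping and the min_count filter described above, the sketch below builds a toy vocabulary. VocabSketch and build_vocab are hypothetical names, whitespace splitting stands in for the module's tokenizer, and both the order of the special tokens and the fallback to <unk> for unseen words are assumptions.

```python
from collections import Counter
from typing import Dict, List

UNK, PAD, START, END = "<unk>", "<pad>", "<s>", "</s>"


class VocabSketch:
    """Toy two-way word/index mapping; not the module's VocabDict."""

    def __init__(self, word_list: List[str]) -> None:
        self.itos: List[str] = list(word_list)
        self.stoi: Dict[str, int] = {w: i for i, w in enumerate(self.itos)}
        # Assumption: unseen words map to the <unk> index when it exists.
        self.unk_index = self.stoi.get(UNK)

    def word2idx(self, w: str) -> int:
        if w in self.stoi:
            return self.stoi[w]
        if self.unk_index is None:
            raise ValueError(f"{w!r} is not in the vocabulary")
        return self.unk_index

    def idx2word(self, n_w: int) -> str:
        return self.itos[n_w]


def build_vocab(
    sentences: List[str], min_count: int = 1, only_unk_extra: bool = False
) -> VocabSketch:
    # Count every token, then keep those seen at least min_count times.
    counts = Counter(w for s in sentences for w in s.lower().split())
    kept = [w for w, c in counts.items() if c >= min_count]
    # Assumed ordering of the special tokens placed before the corpus words.
    extras = [UNK] if only_unk_extra else [PAD, UNK, START, END]
    return VocabSketch(extras + kept)


vocab = build_vocab(["where is the chair", "where is the table"], min_count=2)
print(vocab.itos)
# ['<pad>', '<unk>', '<s>', '</s>', 'where', 'is', 'the']
print(vocab.idx2word(vocab.word2idx("sofa")))
# <unk>
```

With min_count=2, "chair" and "table" each occur once and are dropped, so looking either of them up falls back to <unk> in this sketch.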

Navigation Utilities:

  • get_action_shortest_path(sim, source_position, source_rotation, goal_position, ...) -- Computes the shortest action path from source to goal using ShortestPathFollower. Returns a list of ShortestPathPoint objects. Warns if the path exceeds max_episode_steps.
  • check_and_gen_physics_config() -- Checks for and generates the default physics config JSON file if it does not exist, using Bullet physics with standard gravity and friction settings.
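
The general shape of a step-capped path rollout, as described for get_action_shortest_path, might look like the sketch below. It does not use Habitat: ToyFollower is a hypothetical stand-in for ShortestPathFollower on a 1-D integer line, rollout_sketch is an invented name, and recording (position, action) pairs is a simplification of ShortestPathPoint.

```python
import logging
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)


class ToyFollower:
    """Hypothetical path follower on a 1-D integer line."""

    def next_action(self, position: int, goal: int) -> Optional[str]:
        if position == goal:
            return None  # None plays the role of a "stop" action here
        return "forward" if goal > position else "backward"


def rollout_sketch(
    source: int, goal: int, max_episode_steps: int = 500
) -> List[Tuple[int, str]]:
    follower = ToyFollower()
    position = source
    path: List[Tuple[int, str]] = []
    action = follower.next_action(position, goal)
    while action is not None and len(path) < max_episode_steps:
        path.append((position, action))  # record the state before acting
        position += 1 if action == "forward" else -1
        action = follower.next_action(position, goal)
    if action is not None:
        # The step cap was hit before the follower asked to stop.
        logger.warning("Step cap reached before the goal.")
    return path


print(rollout_sketch(0, 3))
# [(0, 'forward'), (1, 'forward'), (2, 'forward')]
```

Capping the loop guarantees termination even when the follower never signals a stop, at the cost of returning a truncated path in that case.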

A module-level constant DEFAULT_PHYSICS_CONFIG_PATH is set to "data/default.physics_config.json".
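
A check-then-generate helper of this kind can be sketched as below. The function name, the JSON key names, and the numeric values are illustrative assumptions; this page only states that the generated file uses Bullet physics with standard gravity and friction settings.

```python
import json
import os
import tempfile


def check_and_gen_config_sketch(path: str) -> bool:
    """Write a default config if `path` is missing; return True if written."""
    if os.path.exists(path):
        return False
    # Placeholder keys and values: the real defaults are not listed here.
    config = {
        "physics_simulator": "bullet",
        "gravity": [0.0, -9.8, 0.0],
        "friction_coefficient": 0.4,
    }
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        json.dump(config, f)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "default.physics_config.json")
    print(check_and_gen_config_sketch(target))  # True: file was written
    print(check_and_gen_config_sketch(target))  # False: file already exists
```

Because the helper returns early when the file exists, a user-edited config is never overwritten in this sketch.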

Usage

Use VocabDict and VocabFromText for EQA (Embodied Question Answering) tasks that require text tokenization and vocabulary management. Use get_action_shortest_path for generating expert demonstrations or computing oracle paths in navigation tasks.

Code Reference

Source Location

Signature

def tokenize(
    sentence, regex=SENTENCE_SPLIT_REGEX, keep=("'s"), remove=(",", "?")
) -> List[str]:

class VocabDict:
    def __init__(self, word_list=None, filepath=None):

class VocabFromText(VocabDict):
    def __init__(
        self,
        sentences,
        min_count=1,
        regex=SENTENCE_SPLIT_REGEX,
        keep=(),
        remove=(),
        only_unk_extra=False,
    ):

def get_action_shortest_path(
    sim: "HabitatSim",
    source_position: List[float],
    source_rotation: List[float],
    goal_position: List[float],
    success_distance: float = 0.05,
    max_episode_steps: int = 500,
) -> List[ShortestPathPoint]:

def check_and_gen_physics_config():

Import

from habitat.datasets.utils import (
    tokenize,
    VocabDict,
    VocabFromText,
    get_action_shortest_path,
    check_and_gen_physics_config,
)

I/O Contract

Inputs (tokenize)

Name | Type | Required | Description
sentence | str | Yes | The sentence to tokenize
regex | re.Pattern | No (default=SENTENCE_SPLIT_REGEX) | Compiled regex pattern used to split the sentence
keep | tuple | No (default=("'s"), which Python evaluates to the string "'s", not a one-element tuple) | Tokens to preserve by prepending a space
remove | tuple | No (default=(",", "?")) | Tokens to remove entirely
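
The signature above writes the keep default as ("'s"), without a trailing comma. The short check below shows how Python treats that expression compared with a one-element tuple; the variable names are illustrative only.

```python
default_keep = ("'s")   # parentheses only: this is the str "'s"
tuple_keep = ("'s",)    # trailing comma: a one-element tuple

print(type(default_keep).__name__, list(default_keep))
# str ["'", 's']
print(type(tuple_keep).__name__, list(tuple_keep))
# tuple ["'s"]
```

Iterating over the string form yields single characters, so callers who want "'s" handled as one token may prefer to pass keep=("'s",) explicitly.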

Inputs (get_action_shortest_path)

Name | Type | Required | Description
sim | HabitatSim | Yes | The Habitat simulator instance
source_position | List[float] | Yes | Starting 3D position
source_rotation | List[float] | Yes | Starting rotation as quaternion
goal_position | List[float] | Yes | Target 3D position
success_distance | float | No (default=0.05) | Distance threshold for goal success
max_episode_steps | int | No (default=500) | Maximum number of steps before giving up

Outputs

Name | Type | Description
tokenize() | List[str] | List of cleaned, lowercase tokens
VocabDict.word2idx(w) | int | Integer index of the given word
VocabDict.tokenize_and_index(sentence) | List[int] | List of integer indices for each token in the sentence
get_action_shortest_path() | List[ShortestPathPoint] | Sequence of positions, rotations, and actions along the shortest path

Usage Examples

Basic Usage

from habitat.datasets.utils import tokenize, VocabDict, VocabFromText

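# Tokenize a sentence. keep is passed as a one-element tuple because the
# default ("'s") has no trailing comma and is therefore the plain string "'s".
tokens = tokenize("What color is the chair?", keep=("'s",))
# Expected result: ["what", "color", "is", "the", "chair"]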

# Build vocabulary from a list of words
vocab = VocabDict(word_list=["the", "chair", "is", "red", "blue"])
idx = vocab.word2idx("chair")
word = vocab.idx2word(idx)

# Build vocabulary from text corpus
sentences = ["What color is the chair?", "Where is the table?"]
vocab = VocabFromText(sentences, min_count=1)
indices = vocab.tokenize_and_index("What color is the chair?")

# Get size of vocabulary
size = vocab.get_size()
