Implementation:Facebookresearch Habitat lab Dataset Utils

Knowledge Sources	Facebookresearch_Habitat_lab
Domains	Embodied_AI, Natural_Language_Processing
Last Updated	2026-02-15 00:00 GMT

Overview

This module provides tokenization and vocabulary-management utilities originally authored for the Pythia project, together with shortest-path computation and physics-config helpers. It is used for Embodied Question Answering (EQA) and navigation tasks.

Description

The module groups its utilities into text processing, vocabulary management, and navigation helpers:

Text Processing:

tokenize(sentence, regex, keep, remove) -- Lowercases a sentence, prepends a space to each token in keep (e.g., "'s") so that it is split off from the preceding word, deletes each token in remove (by default "," and "?"), and splits the result using a regex pattern. Returns a list of stripped, non-empty tokens.
load_str_list(fname) -- Loads a text file and returns a list of stripped lines.
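
The tokenize steps described above can be sketched as follows. This is a minimal, standalone approximation and not the library source: the name tokenize_sketch and the split pattern SPLIT_REGEX_SKETCH are assumptions made for illustration, and the real SENTENCE_SPLIT_REGEX may differ.

```python
import re
from typing import Iterable, List, Pattern

# Assumption: split on runs of characters that are neither word characters
# nor hyphens. The capturing group makes re.split return the separators too.
SPLIT_REGEX_SKETCH = re.compile(r"([^\w-]+)")


def tokenize_sketch(
    sentence: str,
    regex: Pattern[str] = SPLIT_REGEX_SKETCH,
    keep: Iterable[str] = ("'s",),
    remove: Iterable[str] = (",", "?"),
) -> List[str]:
    """Hypothetical re-implementation of the steps described for tokenize."""
    sentence = sentence.lower()
    # Prepend a space to every "keep" token so it is detached from its word.
    for token in keep:
        sentence = sentence.replace(token, " " + token)
    # Delete every "remove" token outright.
    for token in remove:
        sentence = sentence.replace(token, "")
    # Split, then drop pieces that are empty once stripped (e.g. spaces).
    pieces = regex.split(sentence)
    return [piece.strip() for piece in pieces if piece.strip()]


tokens = tokenize_sketch("What color is the chair?")
# ['what', 'color', 'is', 'the', 'chair']
```

Because the pattern uses a capturing group, whitespace separators come back from the split and are discarded by the final filter, which is why the result contains only words.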

Vocabulary Management:

VocabDict -- A vocabulary dictionary mapping words to indices and vice versa. Features:
- Special tokens: <unk>, <pad>, <s>, </s>
- Can be initialized from a word list or a file path
- Provides word2idx(w), idx2word(n_w), and tokenize_and_index(sentence) methods
- token_idx_2_string(tokens) converts token indices back to a question string
- stoi / itos properties for string-to-index and index-to-string mappings

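VocabFromText -- Extends VocabDict to build a vocabulary from a collection of sentences, keeping only words that occur at least min_count times. Supports custom tokenization parameters (regex, keep, remove) and, with only_unk_extra=True, adds only the <unk> token as an extra token instead of the full set of special tokens.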
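
To make the word-to-index mapping and the min_count threshold concrete, here is a small standalone sketch. TinyVocab and build_vocab_from_sentences are hypothetical names, not the library classes. The sketch assumes whitespace tokenization, an <unk> fallback for unseen words, and an arbitrary ordering of the special tokens, none of which is confirmed by the description above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

UNK, PAD, START, END = "<unk>", "<pad>", "<s>", "</s>"


class TinyVocab:
    """Hypothetical stand-in for VocabDict, for illustration only."""

    def __init__(self, word_list: Iterable[str]) -> None:
        self.itos: List[str] = list(word_list)  # index -> string
        self.stoi: Dict[str, int] = {w: i for i, w in enumerate(self.itos)}
        # Assumption: unseen words map to the <unk> index when it exists.
        self.unk_index: Optional[int] = self.stoi.get(UNK)

    def word2idx(self, word: str) -> int:
        if word in self.stoi:
            return self.stoi[word]
        if self.unk_index is None:
            raise ValueError(f"{word!r} is not in the vocabulary")
        return self.unk_index

    def idx2word(self, index: int) -> str:
        return self.itos[index]


def build_vocab_from_sentences(
    sentences: Iterable[str], min_count: int = 1, only_unk_extra: bool = False
) -> TinyVocab:
    """Keep words seen at least min_count times, after the extra tokens."""
    # Assumption: plain lowercase whitespace splitting instead of tokenize().
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    kept = [word for word, count in counts.items() if count >= min_count]
    # The order of the special tokens here is an assumption.
    extras = [UNK] if only_unk_extra else [PAD, START, END, UNK]
    return TinyVocab(extras + kept)


vocab_sketch = build_vocab_from_sentences(
    ["the red chair", "the blue chair"], min_count=2
)
chair_index = vocab_sketch.word2idx("chair")  # a regular, frequent word
table_index = vocab_sketch.word2idx("table")  # unseen, falls back to <unk>
```

With min_count=2 only "the" and "chair" survive the threshold, so "red" and "blue" would also resolve to the <unk> index in this sketch.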

Navigation Utilities:

get_action_shortest_path(sim, source_position, source_rotation, goal_position, ...) -- Computes the shortest action path from source to goal using ShortestPathFollower. Returns a list of ShortestPathPoint objects. Logs a warning if the step limit max_episode_steps is reached before the goal.
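
The control flow described above (ask a follower for the next action, record the current point, step, and stop at the goal or at the step limit) can be illustrated with a toy example. The sketch below does not use the simulator or ShortestPathFollower: it swaps in a greedy rule on an integer grid, and PathPoint, greedy_next_action, and action_path_sketch are hypothetical names.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)

# Toy action set on a 2D integer grid (assumption, not Habitat actions).
MOVES = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


@dataclass
class PathPoint:
    """Hypothetical stand-in for ShortestPathPoint."""

    position: Tuple[int, int]
    action: Optional[str]


def greedy_next_action(
    pos: Tuple[int, int], goal: Tuple[int, int]
) -> Optional[str]:
    """Toy follower: align x first, then y. Returns None at the goal."""
    if pos == goal:
        return None
    if pos[0] != goal[0]:
        return "east" if goal[0] > pos[0] else "west"
    return "north" if goal[1] > pos[1] else "south"


def action_path_sketch(
    source: Tuple[int, int],
    goal: Tuple[int, int],
    max_episode_steps: int = 500,
) -> List[PathPoint]:
    pos = source
    path: List[PathPoint] = []
    action = greedy_next_action(pos, goal)
    while action is not None and len(path) < max_episode_steps:
        # Record where the agent is and what it is about to do.
        path.append(PathPoint(pos, action))
        dx, dy = MOVES[action]
        pos = (pos[0] + dx, pos[1] + dy)
        action = greedy_next_action(pos, goal)
    if action is not None:
        # The step limit was hit before the goal was reached.
        logger.warning("Goal not reached within %d steps.", max_episode_steps)
    return path


demo_path_points = action_path_sketch((0, 0), (2, 1))
demo_actions = [point.action for point in demo_path_points]
# ['east', 'east', 'north']
```

In the real function the follower and the stepping are backed by the simulator, so positions and rotations are read from the agent state rather than computed by hand as in this toy grid.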

check_and_gen_physics_config() -- Checks for and generates the default physics config JSON file if it does not exist, using Bullet physics with standard gravity and friction settings.

A module-level constant DEFAULT_PHYSICS_CONFIG_PATH is set to "data/default.physics_config.json".
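
The check-then-generate pattern behind check_and_gen_physics_config can be sketched as below. ensure_physics_config is a hypothetical helper that takes the path as a parameter, and the keys and values in the config dictionary are illustrative assumptions for "Bullet physics with standard gravity and friction settings", not a confirmed copy of what the library writes.

```python
import json
import os
import tempfile


def ensure_physics_config(path: str) -> bool:
    """Write a default physics config if none exists.

    Returns True if the file was created, False if it was already there.
    """
    if os.path.exists(path):
        return False
    # Assumption: illustrative keys and values only.
    config = {
        "physics_simulator": "bullet",
        "timestep": 0.008,
        "gravity": [0, -9.8, 0],
        "friction_coefficient": 0.4,
        "restitution_coefficient": 0.1,
    }
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return True


# Usage: run against a temporary directory so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_config_path = os.path.join(tmp, "data", "default.physics_config.json")
    created_first = ensure_physics_config(demo_config_path)  # file written
    created_again = ensure_physics_config(demo_config_path)  # already present
    with open(demo_config_path) as f:
        loaded_config = json.load(f)
```

Returning early when the file exists keeps the helper idempotent, so calling it on every startup leaves any existing, possibly customized, config untouched.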

Usage

Use VocabDict and VocabFromText for EQA (Embodied Question Answering) tasks that require text tokenization and vocabulary management. Use get_action_shortest_path for generating expert demonstrations or computing oracle paths in navigation tasks.

Code Reference

Source Location

Repository: Facebookresearch_Habitat_lab
File: habitat-lab/habitat/datasets/utils.py
Lines: 1-229

Signature

def tokenize(
    sentence, regex=SENTENCE_SPLIT_REGEX, keep=("'s"), remove=(",", "?")
) -> List[str]:

class VocabDict:
    def __init__(self, word_list=None, filepath=None):

class VocabFromText(VocabDict):
    def __init__(
        self,
        sentences,
        min_count=1,
        regex=SENTENCE_SPLIT_REGEX,
        keep=(),
        remove=(),
        only_unk_extra=False,
    ):

def get_action_shortest_path(
    sim: "HabitatSim",
    source_position: List[float],
    source_rotation: List[float],
    goal_position: List[float],
    success_distance: float = 0.05,
    max_episode_steps: int = 500,
) -> List[ShortestPathPoint]:

def check_and_gen_physics_config():

Import

from habitat.datasets.utils import (
    tokenize,
    VocabDict,
    VocabFromText,
    get_action_shortest_path,
    check_and_gen_physics_config,
)

I/O Contract

Inputs (tokenize)

Name	Type	Required	Description
sentence	str	Yes	The sentence to tokenize
regex	re.Pattern	No (default=SENTENCE_SPLIT_REGEX)	Regex pattern for splitting
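keep	tuple	No (default=("'s"))	Tokens to split off by prepending a space. As written in the signature, the default ("'s") is a plain string rather than a one-element tuple; pass ("'s",) to treat "'s" as a single token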
remove	tuple	No (default=(",", "?"))	Tokens to remove entirely

Inputs (get_action_shortest_path)

Name	Type	Required	Description
sim	HabitatSim	Yes	The Habitat simulator instance
source_position	List[float]	Yes	Starting 3D position
source_rotation	List[float]	Yes	Starting rotation as quaternion
goal_position	List[float]	Yes	Target 3D position
success_distance	float	No (default=0.05)	Distance threshold for goal success
max_episode_steps	int	No (default=500)	Maximum number of steps to take before giving up; a warning is logged when this limit is reached

Outputs

Name	Type	Description
tokenize()	List[str]	List of cleaned, lowercase tokens
VocabDict.word2idx(w)	int	Integer index of the given word
VocabDict.tokenize_and_index(sentence)	List[int]	List of integer indices for each token in the sentence
get_action_shortest_path()	List[ShortestPathPoint]	Sequence of positions, rotations, and actions along the shortest path

Usage Examples

Basic Usage

from habitat.datasets.utils import tokenize, VocabDict, VocabFromText

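# Tokenize a sentence. keep is passed explicitly as a one-element tuple,
# because the default in the signature, ("'s"), is a plain string.
tokens = tokenize("What color is the chair?", keep=("'s",))
# Result: ["what", "color", "is", "the", "chair"]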

# Build vocabulary from a list of words
vocab = VocabDict(word_list=["the", "chair", "is", "red", "blue"])
idx = vocab.word2idx("chair")
word = vocab.idx2word(idx)

# Build vocabulary from text corpus
sentences = ["What color is the chair?", "Where is the table?"]
vocab = VocabFromText(sentences, min_count=1)
indices = vocab.tokenize_and_index("What color is the chair?")

# Get size of vocabulary
size = vocab.get_size()
