Implementation:Facebookresearch Habitat lab Dataset Utils
| Knowledge Sources | |
|---|---|
| Domains | Embodied_AI, Natural_Language_Processing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
This module provides tokenization, vocabulary management, and shortest-path computation utilities, originally authored for the Pythia project, that support Embodied Question Answering and navigation tasks.
Description
The module contains several utilities organized into text processing, vocabulary management, and navigation helpers:
Text Processing:
- tokenize(sentence, regex, keep, remove) -- Lowercases a sentence, preserves specified tokens (e.g., "'s"), removes specified punctuation, and splits using a regex pattern. Returns a list of stripped, non-empty tokens.
- load_str_list(fname) -- Loads a text file and returns a list of stripped lines.
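The tokenization steps described above can be sketched as follows. This is a minimal, self-contained re-implementation for illustration, not the module's code: the name tokenize_sketch and the whitespace split pattern are assumptions, and the module's SENTENCE_SPLIT_REGEX may split differently.

```python
import re
from typing import List, Pattern, Sequence

# Assumed stand-in pattern: split on runs of whitespace.
# The module's SENTENCE_SPLIT_REGEX is not reproduced here and may differ.
WHITESPACE_SPLIT = re.compile(r"\s+")


def tokenize_sketch(
    sentence: str,
    regex: Pattern = WHITESPACE_SPLIT,
    keep: Sequence[str] = ("'s",),
    remove: Sequence[str] = (",", "?"),
) -> List[str]:
    """Hypothetical illustration of the lowercase / keep / remove / split steps."""
    sentence = sentence.lower()
    # Detach each kept substring from the preceding word by prepending a space.
    for token in keep:
        sentence = sentence.replace(token, " " + token)
    # Drop unwanted punctuation entirely.
    for token in remove:
        sentence = sentence.replace(token, "")
    # Split, then discard empty or whitespace-only pieces.
    return [t.strip() for t in regex.split(sentence) if t.strip()]


print(tokenize_sketch("Is the chair's color red, or blue?"))
# ['is', 'the', 'chair', "'s", 'color', 'red', 'or', 'blue']
```

Passing keep as a real tuple matters in this sketch: iterating over a bare string would visit its characters one at a time.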
Vocabulary Management:
- VocabDict -- A vocabulary dictionary mapping words to indices and vice versa. Features:
  - Special tokens: <unk>, <pad>, <s>, </s>
  - Can be initialized from a word list or a file path
  - Provides word2idx(w), idx2word(n_w), and tokenize_and_index(sentence) methods
  - token_idx_2_string(tokens) converts token indices back to a question string
  - stoi/itos properties for string-to-index and index-to-string mappings
- VocabFromText -- Extends VocabDict to build a vocabulary from a collection of sentences, keeping only words that occur at least min_count times. It accepts custom tokenization parameters (regex, keep, remove) and, when only_unk_extra is set, adds only the <unk> token as an extra entry.
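To make the vocabulary behavior above concrete, here is a small sketch of a word/index mapping with an unknown-token fallback and a minimum-count filter. MiniVocab and build_vocab_from_text are hypothetical names, and the ordering of special tokens, the alphabetical word order, and the whitespace tokenization are all assumptions rather than the module's actual choices.

```python
from collections import Counter
from typing import Iterable, List

UNK, PAD, START, END = "<unk>", "<pad>", "<s>", "</s>"


class MiniVocab:
    """Hypothetical stand-in for a word <-> index dictionary."""

    def __init__(self, word_list: Iterable[str]) -> None:
        self.word_list = list(word_list)
        self.word2idx_dict = {w: i for i, w in enumerate(self.word_list)}
        # Index used for out-of-vocabulary words, if <unk> is present.
        self.unk_index = self.word2idx_dict.get(UNK)

    def idx2word(self, n_w: int) -> str:
        return self.word_list[n_w]

    def word2idx(self, w: str) -> int:
        if w in self.word2idx_dict:
            return self.word2idx_dict[w]
        if self.unk_index is not None:
            return self.unk_index
        raise ValueError(f"word {w!r} not in vocabulary and no {UNK} token")

    def tokenize_and_index(self, sentence: str) -> List[int]:
        # Assumption: plain lowercase whitespace tokenization.
        return [self.word2idx(w) for w in sentence.lower().split()]

    def get_size(self) -> int:
        return len(self.word_list)


def build_vocab_from_text(
    sentences: Iterable[str], min_count: int = 1, only_unk_extra: bool = False
) -> MiniVocab:
    """Count words, keep those seen at least min_count times, prepend extras."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    extras = [UNK] if only_unk_extra else [PAD, START, END, UNK]
    return MiniVocab(extras + kept)


vocab = build_vocab_from_text(["the chair is red", "the table is blue"], min_count=2)
print(vocab.word_list)                          # ['<pad>', '<s>', '</s>', '<unk>', 'is', 'the']
print(vocab.tokenize_and_index("the sofa is"))  # [5, 3, 4]
```

Words below the threshold ("chair", "sofa", and so on) map to the <unk> index instead of raising, which is the usual reason to reserve that token.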
Navigation Utilities:
- get_action_shortest_path(sim, source_position, source_rotation, goal_position, ...) -- Computes the shortest action path from source to goal using ShortestPathFollower. Returns a list of ShortestPathPoint objects. Warns if the path exceeds max_episode_steps.
- check_and_gen_physics_config() -- Checks for and generates the default physics config JSON file if it does not exist, using Bullet physics with standard gravity and friction settings.
A module-level constant DEFAULT_PHYSICS_CONFIG_PATH is set to "data/default.physics_config.json".
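The check-then-generate pattern behind check_and_gen_physics_config can be sketched as below. The helper ensure_json_config is hypothetical, and the keys and values in EXAMPLE_PHYSICS_CONFIG are illustrative assumptions; the module's actual default settings are not reproduced here.

```python
import json
import os
from typing import Any, Dict

# Illustrative values only (assumed); the real defaults may differ.
EXAMPLE_PHYSICS_CONFIG: Dict[str, Any] = {
    "physics_simulator": "bullet",
    "gravity": [0, -9.8, 0],
    "friction_coefficient": 0.4,
}


def ensure_json_config(path: str, defaults: Dict[str, Any]) -> bool:
    """Write defaults to path as JSON if no file exists there.

    Returns True when a new file was created, False when one already existed.
    """
    if os.path.exists(path):
        return False
    parent = os.path.dirname(path)
    if parent:
        # Create intermediate directories such as "data/" when missing.
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        json.dump(defaults, f, indent=2)
    return True
```

Because an existing file is never overwritten, a user-edited config survives repeated calls.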
Usage
Use VocabDict and VocabFromText for EQA (Embodied Question Answering) tasks that require text tokenization and vocabulary management. Use get_action_shortest_path for generating expert demonstrations or computing oracle paths in navigation tasks.
Code Reference
Source Location
- Repository: Facebookresearch_Habitat_lab
- File: habitat-lab/habitat/datasets/utils.py
- Lines: 1-229
Signature
```python
def tokenize(
    sentence, regex=SENTENCE_SPLIT_REGEX, keep=("'s"), remove=(",", "?")
) -> List[str]:

class VocabDict:
    def __init__(self, word_list=None, filepath=None):

class VocabFromText(VocabDict):
    def __init__(
        self,
        sentences,
        min_count=1,
        regex=SENTENCE_SPLIT_REGEX,
        keep=(),
        remove=(),
        only_unk_extra=False,
    ):

def get_action_shortest_path(
    sim: "HabitatSim",
    source_position: List[float],
    source_rotation: List[float],
    goal_position: List[float],
    success_distance: float = 0.05,
    max_episode_steps: int = 500,
) -> List[ShortestPathPoint]:

def check_and_gen_physics_config():
```
Import
```python
from habitat.datasets.utils import (
    tokenize,
    VocabDict,
    VocabFromText,
    get_action_shortest_path,
    check_and_gen_physics_config,
)
```
I/O Contract
Inputs (tokenize)
| Name | Type | Required | Description |
|---|---|---|---|
| sentence | str | Yes | The sentence to tokenize |
| regex | re.Pattern | No (default=SENTENCE_SPLIT_REGEX) | Regex pattern for splitting |
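| keep | tuple | No (default=("'s",)) | Substrings to preserve by prepending a space. The signature spells the default as ("'s"), which Python evaluates as the string "'s", not a one-element tuple |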
| remove | tuple | No (default=(",", "?")) | Tokens to remove entirely |
Inputs (get_action_shortest_path)
| Name | Type | Required | Description |
|---|---|---|---|
| sim | HabitatSim | Yes | The Habitat simulator instance |
| source_position | List[float] | Yes | Starting 3D position |
| source_rotation | List[float] | Yes | Starting rotation as quaternion |
| goal_position | List[float] | Yes | Target 3D position |
| success_distance | float | No (default=0.05) | Distance threshold for goal success |
| max_episode_steps | int | No (default=500) | Maximum number of steps before giving up |
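The role of max_episode_steps can be illustrated with a toy rollout loop: ask a follower for the next action, record a path point, step, and stop at the goal or at the step limit. Everything here is a hypothetical stand-in on an obstacle-free integer grid (PathPoint, toy_next_action, rollout_shortest_path); it omits rotations and success_distance and does not use the simulator or ShortestPathFollower.

```python
import warnings
from dataclasses import dataclass
from typing import List, Optional, Tuple

GridPos = Tuple[int, int]
MOVES = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


@dataclass
class PathPoint:
    """Hypothetical stand-in for ShortestPathPoint (rotation omitted)."""
    position: GridPos
    action: str


def toy_next_action(pos: GridPos, goal: GridPos) -> Optional[str]:
    """Greedy follower for an obstacle-free grid; returns None at the goal."""
    if pos[0] != goal[0]:
        return "east" if goal[0] > pos[0] else "west"
    if pos[1] != goal[1]:
        return "north" if goal[1] > pos[1] else "south"
    return None


def rollout_shortest_path(
    source: GridPos, goal: GridPos, max_episode_steps: int = 500
) -> List[PathPoint]:
    pos, path = source, []
    action = toy_next_action(pos, goal)
    while action is not None and len(path) < max_episode_steps:
        # Record the state together with the action taken from it.
        path.append(PathPoint(pos, action))
        dx, dy = MOVES[action]
        pos = (pos[0] + dx, pos[1] + dy)
        action = toy_next_action(pos, goal)
    if action is not None:
        # The loop ended because of the step limit, not because the goal was reached.
        warnings.warn("Shortest path exceeds max_episode_steps; returning a truncated path")
    return path


print([p.action for p in rollout_shortest_path((0, 0), (2, 1))])
# ['east', 'east', 'north']
```

A truncated path is still returned when the limit is hit, so callers that need complete demonstrations should treat the warning as a signal to discard or regenerate the episode.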
Outputs
| Name | Type | Description |
|---|---|---|
| tokenize() | List[str] | List of cleaned, lowercase tokens |
| VocabDict.word2idx(w) | int | Integer index of the given word |
| VocabDict.tokenize_and_index(sentence) | List[int] | List of integer indices for each token in the sentence |
| get_action_shortest_path() | List[ShortestPathPoint] | Sequence of positions, rotations, and actions along the shortest path |
Usage Examples
Basic Usage
```python
from habitat.datasets.utils import tokenize, VocabDict, VocabFromText

# Tokenize a sentence
tokens = tokenize("What color is the chair?")
# Result: ["what", "color", "is", "the", "chair"]

# Build vocabulary from a list of words
vocab = VocabDict(word_list=["the", "chair", "is", "red", "blue"])
idx = vocab.word2idx("chair")
word = vocab.idx2word(idx)

# Build vocabulary from text corpus
sentences = ["What color is the chair?", "Where is the table?"]
vocab = VocabFromText(sentences, min_count=1)
indices = vocab.tokenize_and_index("What color is the chair?")

# Get size of vocabulary
size = vocab.get_size()
```