Environment:Huggingface Datasets Search Dependencies

From Leeroopedia
Knowledge Sources
Domains Search, Infrastructure
Last Updated 2026-02-14 19:00 GMT

Overview

Optional search index environment enabling FAISS nearest-neighbor search and ElasticSearch text search on HuggingFace Datasets. These dependencies are not installed by default and must be added separately when search indexing functionality is needed.

Description

The search environment provides two distinct indexing backends for the Dataset class:

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It supports both CPU and GPU execution. The integration is implemented via the FaissIndex class in search.py, which wraps a faiss.Index object. FAISS GPU support is controlled through the device parameter:

  • A non-negative integer selects a specific GPU by index, with 0 as the first GPU (uses faiss.StandardGpuResources() and faiss.index_cpu_to_gpu())
  • A negative integer distributes across all available GPUs (uses faiss.index_cpu_to_all_gpus())
  • A list or tuple of GPU indices selects specific GPUs (uses faiss.index_cpu_to_gpus_list())
  • None (default) runs on CPU

ElasticSearch provides BM25-based sparse text search. The integration is implemented via the ElasticSearchIndex class in search.py. It requires an external ElasticSearch server (default: localhost:9200) and uses the Python elasticsearch client library to communicate with it. Batch search is parallelized using concurrent.futures.ThreadPoolExecutor.
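The thread-pool batching described above can be pictured with a small sketch. This is not the library's code: `fake_search` is a hypothetical stand-in for a single ElasticSearch query, and `search_batch` here only illustrates fanning queries out over a `ThreadPoolExecutor` while keeping results in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_search(query, k=10):
    """Hypothetical stand-in for one ElasticSearch query; returns (scores, ids)."""
    # A real implementation would send a request to the server here.
    hits = [(float(len(query)), i) for i in range(k)]
    scores = [s for s, _ in hits]
    ids = [i for _, i in hits]
    return scores, ids


def search_batch(queries, k=10, max_workers=10):
    """Run one search per query in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `queries`.
        results = list(executor.map(lambda q: fake_search(q, k), queries))
    all_scores = [scores for scores, _ in results]
    all_ids = [ids for _, ids in results]
    return all_scores, all_ids


scores, ids = search_batch(["cat", "a longer query"], k=3)
```

Threads suit this workload because each query spends most of its time waiting on the network rather than on the CPU.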

Usage

These dependencies are needed when calling the following Dataset methods (mixed into Dataset via IndexableMixin):

  • dataset.add_faiss_index(column, device=None, string_factory=None, ...) -- builds a FAISS index from a vector column
  • dataset.add_faiss_index_from_external_arrays(external_arrays, index_name, ...) -- builds a FAISS index from external numpy arrays
  • dataset.save_faiss_index(index_name, file) / dataset.load_faiss_index(index_name, file) -- persist/restore FAISS indices
  • dataset.add_elasticsearch_index(column, host=None, port=None, ...) -- builds an ElasticSearch index from a text column
  • dataset.load_elasticsearch_index(index_name, es_index_name, ...) -- loads an existing ElasticSearch index
  • dataset.search(index_name, query, k=10) / dataset.search_batch(index_name, queries, k=10) -- query either index type
  • dataset.get_nearest_examples(index_name, query, k=10) / dataset.get_nearest_examples_batch(index_name, queries, k=10) -- retrieve actual examples
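As a rough illustration of how the FAISS methods above fit together, the sketch below builds a CPU index on a tiny in-memory dataset and retrieves the two nearest rows. The column names, the sample vectors, and the `nearest_examples_demo` wrapper are assumptions made for the example; the function returns None when the optional packages are not installed.

```python
import importlib.util


def nearest_examples_demo():
    """Build a tiny FAISS index and query it; returns None if deps are missing."""
    required = ("datasets", "faiss", "numpy")
    if any(importlib.util.find_spec(name) is None for name in required):
        return None

    import numpy as np
    from datasets import Dataset

    ds = Dataset.from_dict(
        {
            "text": ["alpha", "beta", "gamma"],
            "embeddings": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        }
    )
    # device=None (the default) keeps the index on the CPU.
    ds.add_faiss_index(column="embeddings")

    query = np.array([0.9, 0.1], dtype=np.float32)
    # The index name defaults to the column name when none is given.
    scores, examples = ds.get_nearest_examples("embeddings", query, k=2)
    return scores, examples["text"]


result = nearest_examples_demo()
```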

System Requirements

Category | Requirement | Notes
Hardware | CPU (FAISS-CPU) or NVIDIA GPU (FAISS-GPU) | GPU is optional; CPU is the default
External Service | ElasticSearch server (default: localhost:9200) | Only required for ElasticSearch index; not needed for FAISS
Python | numpy | Already a core dependency of the datasets library

Dependencies

Python Packages

  • faiss-cpu >= 1.8.0.post1 (or faiss-gpu for GPU support)
  • elasticsearch >= 7.17.12, < 8.0.0

Both are detected at runtime via importlib.util.find_spec():

# src/datasets/search.py
_has_faiss = importlib.util.find_spec("faiss") is not None
_has_elasticsearch = importlib.util.find_spec("elasticsearch") is not None
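The same `find_spec` check can be reused in application code to fail early with a helpful message. The sketch below is illustrative only; `available_search_backends` and `require_backend` are hypothetical helper names, not part of the datasets library.

```python
import importlib.util


def available_search_backends():
    """Return which optional search backends can be imported."""
    return {
        "faiss": importlib.util.find_spec("faiss") is not None,
        "elasticsearch": importlib.util.find_spec("elasticsearch") is not None,
    }


def require_backend(name):
    """Raise ImportError with an install hint if a backend is missing."""
    hints = {
        "faiss": 'pip install "faiss-cpu>=1.8.0.post1"',
        "elasticsearch": 'pip install "elasticsearch>=7.17.12,<8.0.0"',
    }
    if not available_search_backends()[name]:
        raise ImportError(f"{name} is not installed; try: {hints[name]}")


backends = available_search_backends()
```

Because `find_spec` only locates the module without importing it, the check stays cheap even when FAISS is installed.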

Credentials

No credentials are required for FAISS. ElasticSearch may require authentication depending on server configuration. The ElasticSearchIndex constructor accepts an optional es_client parameter, allowing a pre-configured elasticsearch.Elasticsearch instance with custom authentication to be passed in.
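The rule that a connection is given either as a ready-made client or as a host and port, never both, can be sketched as follows. `resolve_es_connection` is a hypothetical helper written for this page; it mirrors the error message and defaults documented here and is not the library's implementation.

```python
def resolve_es_connection(host=None, port=None, es_client=None):
    """Hypothetical helper mirroring the 'client or (host, port)' rule."""
    if es_client is not None and not (host is None and port is None):
        raise ValueError(
            "Please specify either `es_client` or `(host, port)`, but not both."
        )
    if es_client is not None:
        # A pre-configured client carries its own authentication settings.
        return {"es_client": es_client}
    # Fall back to the documented defaults when nothing is given.
    return {"host": host or "localhost", "port": port or 9200}


default_conn = resolve_es_connection()
custom_conn = resolve_es_connection(es_client=object())  # placeholder client
```

In practice the placeholder would be an elasticsearch.Elasticsearch instance configured with whatever authentication the server requires.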

Quick Install

# For FAISS (CPU)
pip install "faiss-cpu>=1.8.0.post1"

# For FAISS (GPU via conda -- recommended for GPU support)
conda install -c pytorch faiss-gpu

# For FAISS (GPU via pip -- community package, may not have latest version)
pip install faiss-gpu

# For ElasticSearch
pip install "elasticsearch>=7.17.12,<8.0.0"

Code Evidence

FAISS Detection and Error

# src/datasets/search.py, line 30
_has_faiss = importlib.util.find_spec("faiss") is not None

# src/datasets/search.py, lines 248-253 (FaissIndex.__init__)
if not _has_faiss:
    raise ImportError(
        "You must install Faiss to use FaissIndex. To do so you can run "
        "`conda install -c pytorch faiss-cpu` or `conda install -c pytorch faiss-gpu`. "
        "A community supported package is also available on pypi: "
        "`pip install faiss-cpu` or `pip install faiss-gpu`. "
        "Note that pip may not have the latest version of FAISS, and thus, "
        "some of the latest features and bug fixes may not be available."
    )

FAISS GPU Device Handling

# src/datasets/search.py, lines 316-347 (FaissIndex._faiss_index_to_device)
@staticmethod
def _faiss_index_to_device(index, device=None):
    if device is None:
        return index
    import faiss
    if isinstance(device, int):
        if device > -1:
            faiss_res = faiss.StandardGpuResources()
            index = faiss.index_cpu_to_gpu(faiss_res, device, index)
        else:
            index = faiss.index_cpu_to_all_gpus(index)
    elif isinstance(device, (list, tuple)):
        index = faiss.index_cpu_to_gpus_list(index, gpus=list(device))
    return index

ElasticSearch Detection and Error

# src/datasets/search.py, line 29
_has_elasticsearch = importlib.util.find_spec("elasticsearch") is not None

# src/datasets/search.py, lines 116-119 (ElasticSearchIndex.__init__)
if not _has_elasticsearch:
    raise ImportError(
        "You must install ElasticSearch to use ElasticSearchIndex. "
        "To do so you can run `pip install elasticsearch==7.7.1 for example`"
    )

setup.py Version Constraints

# setup.py, lines 167-168
"elasticsearch>=7.17.12,<8.0.0",  # 8.0 asks users to provide hosts or cloud_id
"faiss-cpu>=1.8.0.post1",         # Pins numpy < 2

# setup.py, lines 193-194
NUMPY2_INCOMPATIBLE_LIBRARIES = [
    "faiss-cpu",
]

Common Errors

Error | Trigger | Resolution
ImportError: You must install Faiss to use FaissIndex. To do so you can run `conda install -c pytorch faiss-cpu` or `conda install -c pytorch faiss-gpu`. A community supported package is also available on pypi: `pip install faiss-cpu` or `pip install faiss-gpu`. Note that pip may not have the latest version of FAISS, and thus, some of the latest features and bug fixes may not be available. | Calling FaissIndex() or dataset.add_faiss_index() without faiss installed | Install faiss-cpu or faiss-gpu
ImportError: You must install ElasticSearch to use ElasticSearchIndex. To do so you can run `pip install elasticsearch==7.7.1 for example` | Calling ElasticSearchIndex() or dataset.add_elasticsearch_index() without elasticsearch installed | Install elasticsearch>=7.17.12,<8.0.0
ValueError: Please specify either `es_client` or `(host, port)`, but not both. | Passing both es_client and host/port to ElasticSearchIndex | Use one connection method only
TypeError: The argument type: ... is not expected. Please pass in either nothing, a positive int, a negative int, or a list of positive ints. | Passing an invalid device type to FaissIndex | Use None, an int, or a list of ints for device
ValueError: Index size should match Dataset size | Loading a saved FAISS index whose size does not match the current dataset length | Ensure the FAISS index was built from the same dataset

Compatibility Notes

  • faiss-cpu is listed in NUMPY2_INCOMPATIBLE_LIBRARIES in setup.py -- it pins numpy < 2 and is excluded from NumPy 2 test runs
  • FAISS GPU requires an NVIDIA GPU with appropriate CUDA drivers; the faiss-gpu conda package from the pytorch channel is recommended over pip
  • The elasticsearch Python client 8.x is not supported because of the < 8.0.0 version constraint (the 8.0 client requires hosts or cloud_id when constructing the client)
  • elasticsearch client versions below 7.17.12 are excluded by the lower bound; older 7.x releases such as 7.9.1 used the legacy numpy.float_ alias, which NumPy 2 removed
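Given the NumPy 2 constraint noted above, an environment can be checked before FAISS is enabled. The following is a minimal sketch under the assumption that a simple major-version comparison is sufficient; `major_version` and `numpy_is_pre_2` are hypothetical helpers.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version.split(".")[0])


def numpy_is_pre_2():
    """True or False for the installed NumPy, or None if NumPy is absent."""
    try:
        return major_version(metadata.version("numpy")) < 2
    except metadata.PackageNotFoundError:
        return None


status = numpy_is_pre_2()
```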
