Environment:Huggingface Datasets Search Dependencies
| Knowledge Sources | |
|---|---|
| Domains | Search, Infrastructure |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Optional search index environment enabling FAISS nearest-neighbor search and ElasticSearch text search on HuggingFace Datasets. These dependencies are not installed by default and must be added separately when search indexing functionality is needed.
Description
The search environment provides two distinct indexing backends for the Dataset class:
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It supports both CPU and GPU execution. The integration is implemented via the FaissIndex class in search.py, which wraps a faiss.Index object. FAISS GPU support is controlled through the device parameter:
- A non-negative integer selects a specific GPU by index (uses faiss.StandardGpuResources() and faiss.index_cpu_to_gpu())
- A negative integer distributes the index across all available GPUs (uses faiss.index_cpu_to_all_gpus())
- A list of GPU indices selects specific GPUs (uses faiss.index_cpu_to_gpus_list())
- None (default) runs on CPU
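The device rules above can be sketched without FAISS installed. The helper below is hypothetical (describe_device_placement is not part of datasets or faiss); it only reports which FAISS function would be selected for a given device value, following the branching shown in the Code Evidence section. The error text is modeled on the message in the Common Errors table.

```python
from typing import List, Optional, Tuple, Union

Device = Optional[Union[int, List[int], Tuple[int, ...]]]


def describe_device_placement(device: Device = None) -> str:
    """Hypothetical helper: name the FAISS call implied by a `device` value."""
    if device is None:
        # Default: the index stays on CPU and no transfer function is called.
        return "cpu"
    if isinstance(device, int):
        if device > -1:
            # 0, 1, 2, ... select a single GPU by index.
            return "faiss.index_cpu_to_gpu"
        # Any negative integer spreads the index over all visible GPUs.
        return "faiss.index_cpu_to_all_gpus"
    if isinstance(device, (list, tuple)):
        # An explicit collection of GPU indices.
        return "faiss.index_cpu_to_gpus_list"
    raise TypeError(
        f"The argument type: {type(device)} is not expected. "
        "Please pass in either nothing, an int, or a list of ints."
    )


for value in (None, 0, -1, [0, 1]):
    print(repr(value), "->", describe_device_placement(value))
```

Note that 0 is a valid single-GPU index, which is why the check is device > -1 rather than device > 0.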
ElasticSearch provides BM25-based sparse text search. The integration is implemented via the ElasticSearchIndex class in search.py. It requires an external ElasticSearch server (default: localhost:9200) and uses the Python elasticsearch client library to communicate with it. Batch search is parallelized using concurrent.futures.ThreadPoolExecutor.
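As a rough illustration of the thread-pool pattern mentioned above, the following sketch fans a list of queries out over concurrent.futures.ThreadPoolExecutor. It is not the library's implementation: search_batch, fake_search, and max_workers are names assumed for this example, and fake_search stands in for a real request to an ElasticSearch server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

SearchResult = Tuple[List[float], List[int]]


def search_batch(
    search_one: Callable[[str, int], SearchResult],
    queries: List[str],
    k: int = 10,
    max_workers: int = 10,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Run `search_one` for every query in parallel threads, preserving query order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `queries`.
        results = list(executor.map(lambda q: search_one(q, k), queries))
    all_scores = [scores for scores, _ in results]
    all_indices = [indices for _, indices in results]
    return all_scores, all_indices


def fake_search(query: str, k: int) -> SearchResult:
    """Stand-in for a network call: returns at most k made-up hits per query."""
    indices = list(range(min(k, len(query))))
    scores = [float(len(query) - i) for i in indices]
    return scores, indices


scores, indices = search_batch(fake_search, ["ab", "abcd"], k=3)
print(scores, indices)
```

Threads suit this workload because each search spends most of its time waiting on the server rather than computing.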
Usage
These dependencies are needed when calling the following Dataset methods (mixed into Dataset via IndexableMixin):
- dataset.add_faiss_index(column, device=None, string_factory=None, ...) -- builds a FAISS index from a vector column
- dataset.add_faiss_index_from_external_arrays(external_arrays, index_name, ...) -- builds a FAISS index from external numpy arrays
- dataset.save_faiss_index(index_name, file) / dataset.load_faiss_index(index_name, file) -- persist/restore FAISS indices
- dataset.add_elasticsearch_index(column, host=None, port=None, ...) -- builds an ElasticSearch index from a text column
- dataset.load_elasticsearch_index(index_name, es_index_name, ...) -- loads an existing ElasticSearch index
- dataset.search(index_name, query, k=10) / dataset.search_batch(index_name, queries, k=10) -- query either index type
- dataset.get_nearest_examples(index_name, query, k=10) / dataset.get_nearest_examples_batch(index_name, queries, k=10) -- retrieve actual examples
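A minimal sketch of the FAISS path through these methods, assuming datasets, faiss, and numpy are all installed. The function name faiss_demo, the toy data, and the column names are made up for illustration; the function returns None instead of failing when a dependency is missing.

```python
import importlib.util


def faiss_demo():
    """Build a tiny FAISS index on a Dataset and query it (illustrative only)."""
    needed = ("datasets", "faiss", "numpy")
    if any(importlib.util.find_spec(name) is None for name in needed):
        return None  # optional dependencies are not available

    import numpy as np
    from datasets import Dataset

    # Toy dataset: three texts with 2-dimensional "embeddings".
    ds = Dataset.from_dict(
        {
            "text": ["first", "second", "third"],
            "embeddings": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        }
    )

    # With no index_name given, the index is registered under the column name.
    ds.add_faiss_index(column="embeddings")

    # Queries are float32 vectors with the same dimension as the column.
    query = np.array([1.0, 0.1], dtype=np.float32)
    scores, examples = ds.get_nearest_examples("embeddings", query, k=2)
    return examples["text"]


if __name__ == "__main__":
    print(faiss_demo())
```

dataset.search() takes the same arguments but returns scores and row indices rather than the examples themselves.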
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | CPU (FAISS-CPU) or NVIDIA GPU (FAISS-GPU) | GPU is optional; CPU is the default |
| External Service | ElasticSearch server (default: localhost:9200) | Only required for ElasticSearch index; not needed for FAISS |
| Python | numpy | Already a core dependency of the datasets library |
Dependencies
Python Packages
- faiss-cpu >= 1.8.0.post1 (or faiss-gpu for GPU support)
- elasticsearch >= 7.17.12, < 8.0.0
Both are detected at runtime via importlib.util.find_spec():
# src/datasets/search.py
_has_faiss = importlib.util.find_spec("faiss") is not None
_has_elasticsearch = importlib.util.find_spec("elasticsearch") is not None
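The same detection pattern can be reused in application code to fail early with a clear message. This is a sketch under assumed names (is_installed and require are not part of datasets); it relies only on importlib.util.find_spec(), which returns None for a top-level module that cannot be found and does not import the module.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require(module_name: str, install_hint: str) -> None:
    """Raise ImportError with an installation hint if the module is missing."""
    if not is_installed(module_name):
        raise ImportError(
            f"'{module_name}' is required for this feature. Try: {install_hint}"
        )


print("faiss available:", is_installed("faiss"))
print("elasticsearch available:", is_installed("elasticsearch"))
```

A call such as require("faiss", "pip install faiss-cpu") before building an index surfaces the missing dependency before any data is processed.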
Credentials
No credentials are required for FAISS. ElasticSearch may require authentication depending on server configuration. The ElasticSearchIndex constructor accepts an optional es_client parameter, allowing a pre-configured elasticsearch.Elasticsearch instance with custom authentication to be passed in.
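A sketch of the es_client route for a password-protected server, assuming the 7.x elasticsearch client (where basic authentication is passed as http_auth). The URL, username, password, index name, and the function name add_authenticated_es_index are placeholders for illustration, not values taken from the library.

```python
def add_authenticated_es_index(dataset, column="text"):
    """Attach an ElasticSearch index to `dataset` using a pre-configured client."""
    # Imported lazily so this module loads even without the optional dependency.
    from elasticsearch import Elasticsearch

    # Placeholder connection details; replace with real server settings.
    es_client = Elasticsearch(
        hosts=["https://localhost:9200"],
        http_auth=("my_user", "my_password"),
    )

    # Pass the client only; host and port must not be given alongside es_client.
    dataset.add_elasticsearch_index(
        column,
        es_client=es_client,
        es_index_name="my_text_index",
    )
    return dataset
```

Supplying es_client together with host or port triggers the ValueError listed under Common Errors, so only one connection method should be used.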
Quick Install
# For FAISS (CPU)
pip install "faiss-cpu>=1.8.0.post1"
# For FAISS (GPU via conda -- recommended for GPU support)
conda install -c pytorch faiss-gpu
# For FAISS (GPU via pip -- community package, may not have latest version)
pip install faiss-gpu
# For ElasticSearch
pip install "elasticsearch>=7.17.12,<8.0.0"
Code Evidence
FAISS Detection and Error
# src/datasets/search.py, line 30
_has_faiss = importlib.util.find_spec("faiss") is not None
# src/datasets/search.py, lines 248-253 (FaissIndex.__init__)
if not _has_faiss:
    raise ImportError(
        "You must install Faiss to use FaissIndex. To do so you can run "
        "`conda install -c pytorch faiss-cpu` or `conda install -c pytorch faiss-gpu`. "
        "A community supported package is also available on pypi: "
        "`pip install faiss-cpu` or `pip install faiss-gpu`. "
        "Note that pip may not have the latest version of FAISS, and thus, "
        "some of the latest features and bug fixes may not be available."
    )
FAISS GPU Device Handling
# src/datasets/search.py, lines 316-347 (FaissIndex._faiss_index_to_device)
@staticmethod
def _faiss_index_to_device(index, device=None):
    if device is None:
        return index
    import faiss
    if isinstance(device, int):
        if device > -1:
            faiss_res = faiss.StandardGpuResources()
            index = faiss.index_cpu_to_gpu(faiss_res, device, index)
        else:
            index = faiss.index_cpu_to_all_gpus(index)
    elif isinstance(device, (list, tuple)):
        index = faiss.index_cpu_to_gpus_list(index, gpus=list(device))
    return index
ElasticSearch Detection and Error
# src/datasets/search.py, line 29
_has_elasticsearch = importlib.util.find_spec("elasticsearch") is not None
# src/datasets/search.py, lines 116-119 (ElasticSearchIndex.__init__)
if not _has_elasticsearch:
    raise ImportError(
        "You must install ElasticSearch to use ElasticSearchIndex. "
        "To do so you can run `pip install elasticsearch==7.7.1 for example`"
    )
setup.py Version Constraints
# setup.py, lines 167-168
"elasticsearch>=7.17.12,<8.0.0", # 8.0 asks users to provide hosts or cloud_id
"faiss-cpu>=1.8.0.post1", # Pins numpy < 2
# setup.py, lines 193-194
NUMPY2_INCOMPATIBLE_LIBRARIES = [
    "faiss-cpu",
]
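To make the elasticsearch range concrete, the following sketch checks a version string against >=7.17.12,<8.0.0 using only the standard library. It is a simplification: release_tuple and elasticsearch_version_ok are hypothetical names, and pre-release or post-release suffixes are ignored, so a full specifier parser is preferable for anything beyond a quick sanity check.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version, padded to three fields."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = tuple(int(piece) for piece in match.group().split("."))
    # Pad so that "8" compares like "8.0.0".
    return parts + (0,) * max(0, 3 - len(parts))


def elasticsearch_version_ok(version: str) -> bool:
    """True if the version falls inside the supported range >=7.17.12,<8.0.0."""
    return (7, 17, 12) <= release_tuple(version) < (8, 0, 0)


for candidate in ("7.9.1", "7.17.12", "7.17.29", "8.0.0"):
    print(candidate, elasticsearch_version_ok(candidate))
```

The installed version string can be obtained with importlib.metadata.version("elasticsearch") when the package is present.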
Common Errors
| Error | Trigger | Resolution |
|---|---|---|
| ImportError: You must install Faiss to use FaissIndex. To do so you can run `conda install -c pytorch faiss-cpu` or `conda install -c pytorch faiss-gpu`. A community supported package is also available on pypi: `pip install faiss-cpu` or `pip install faiss-gpu`. Note that pip may not have the latest version of FAISS, and thus, some of the latest features and bug fixes may not be available. | Calling FaissIndex() or dataset.add_faiss_index() without faiss installed | Install faiss-cpu or faiss-gpu |
| ImportError: You must install ElasticSearch to use ElasticSearchIndex. To do so you can run `pip install elasticsearch==7.7.1 for example` | Calling ElasticSearchIndex() or dataset.add_elasticsearch_index() without elasticsearch installed | Install elasticsearch>=7.17.12,<8.0.0 |
| ValueError: Please specify either `es_client` or `(host, port)`, but not both. | Passing both es_client and host/port to ElasticSearchIndex | Use one connection method only |
| TypeError: The argument type: ... is not expected. Please pass in either nothing, a positive int, a negative int, or a list of positive ints. | Passing an invalid device type to FaissIndex | Use None, an int, or a list of ints for device |
| ValueError: Index size should match Dataset size | Loading a saved FAISS index whose size does not match the current dataset length | Ensure the FAISS index was built from the same dataset |
Compatibility Notes
- faiss-cpu is listed in NUMPY2_INCOMPATIBLE_LIBRARIES in setup.py -- it pins numpy < 2 and is excluded from NumPy 2 test runs
- FAISS GPU requires an NVIDIA GPU with appropriate CUDA drivers; the faiss-gpu conda package from the pytorch channel is recommended over pip
- ElasticSearch 8.x is not supported due to the < 8.0.0 version constraint (ES 8.0 changed the client constructor to require hosts or cloud_id)
- ElasticSearch < 7.17.12 is excluded because versions before 7.9.1 contained legacy numpy.float_ usage