Implementation:Microsoft DeepSpeedExamples BingBert Turing FileUtils
| Knowledge Sources | |
|---|---|
| Domains | Caching, File Management |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A Python 2/3-compatible caching and download utility that resolves HTTP, HTTPS, and S3 URLs to local file paths and uses ETags to key its cache entries.
Description
file_utils.py (in the turing package) is a variant of the pytorch_pretrained_bert file utilities adapted to run under both Python 2 and Python 3. It wraps the urlparse and Path imports in try/except blocks so that the module loads in either environment. When pathlib.Path is unavailable or Path.home() raises an AttributeError, it falls back to os.path.expanduser to determine the cache directory location.
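The fallback logic described above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the helper name `default_cache_dir` is hypothetical, and catching both `ImportError` and `AttributeError` in one place is an assumption about which failures need covering.

```python
import os

# Python 3 keeps urlparse in urllib.parse; Python 2 ships a top-level urlparse module.
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse


def default_cache_dir(env_var="PYTORCH_PRETRAINED_BERT_CACHE",
                      dirname=".pytorch_pretrained_bert"):
    """Hypothetical helper: choose the cache directory, preferring the env var."""
    try:
        from pathlib import Path
        fallback = str(Path.home() / dirname)
    except (ImportError, AttributeError):
        # pathlib is missing, or Path.home() is unavailable on this interpreter.
        fallback = os.path.join(os.path.expanduser("~"), dirname)
    return os.getenv(env_var, fallback)


print(default_cache_dir())
print(urlparse("s3://bucket/key").scheme)  # s3
```

Returning a plain string from the helper sidesteps the Path-versus-str difference between interpreters, at the cost of losing Path conveniences for callers.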
The module provides the same core caching functionality as its pytorch_pretrained_bert counterpart: cached_path resolves URLs or file paths to local cached files, url_to_filename generates SHA-256-based cache keys from URLs and ETags, and get_from_cache handles the download-and-cache workflow. However, this version uses EnvironmentError (compatible with Python 2) instead of FileNotFoundError for error handling, and explicitly passes encoding="utf-8" to file operations for cross-platform reliability.
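As a rough sketch of the two ideas in this paragraph, the block below hashes a URL and ETag into a cache key and dispatches on the URL scheme. The function names ending in `_sketch`, the `fetch` callback (standing in for get_from_cache), the `.` separator between the two hashes, and the exact branch order are assumptions made for illustration.

```python
import os
from hashlib import sha256

try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse


def url_to_filename_sketch(url, etag=None):
    """Hash the URL, and append a hash of the ETag when one is given."""
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    return filename


def cached_path_sketch(url_or_filename, fetch):
    """Dispatch on the URL scheme; `fetch` is a hypothetical stand-in for get_from_cache."""
    parsed = urlparse(url_or_filename)
    if parsed.scheme in ("http", "https", "s3"):
        return fetch(url_or_filename)
    if os.path.exists(url_or_filename):
        return url_or_filename
    if parsed.scheme == "":
        # EnvironmentError exists on Python 2 and is an alias of OSError on Python 3.
        raise EnvironmentError("file {} not found".format(url_or_filename))
    raise ValueError("unable to parse {} as a URL or local path".format(url_or_filename))


name = url_to_filename_sketch("https://example.com/model.bin", etag='"abc123"')
print(len(name))  # 129: two 64-character hex digests joined by "."
```

Because the ETag is part of the key, a changed remote file hashes to a different filename, so a stale cache entry is never mistaken for the current one.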
S3 support is implemented through boto3, with the s3_request decorator handling errors; HTTP downloads use requests in streaming mode and report progress with tqdm. Two utility functions are also included: read_set_from_file loads the lines of a text file into a set, and get_file_extension extracts the extension from a path. Checks against sys.version_info gate the conversion of Path arguments to strings.
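The decorator pattern used for S3 error handling might look roughly like the sketch below. To keep it runnable without boto3, it uses a hypothetical `StandInClientError` class that mimics the `response["Error"]["Code"]` shape of botocore's `ClientError`; translating a 404 into `EnvironmentError` is an assumption about what the real decorator does, and `fake_s3_etag` is purely illustrative.

```python
from functools import wraps


class StandInClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code):
        super(StandInClientError, self).__init__("client error {}".format(code))
        self.response = {"Error": {"Code": str(code)}}


def s3_request_sketch(func):
    """Wrap an S3 call so that a 404 surfaces as a file-not-found style error."""

    @wraps(func)
    def wrapper(url, *args, **kwargs):
        try:
            return func(url, *args, **kwargs)
        except StandInClientError as exc:
            if exc.response["Error"]["Code"] == "404":
                raise EnvironmentError("file {} not found".format(url))
            raise  # any other client error propagates unchanged

    return wrapper


@s3_request_sketch
def fake_s3_etag(url, code=404):
    """Illustrative S3 call that always fails with the given status code."""
    raise StandInClientError(code)


try:
    fake_s3_etag("s3://bucket/missing-key")
except EnvironmentError as err:
    print(err)  # file s3://bucket/missing-key not found
```

Centralizing the translation in a decorator means callers of every S3 helper see the same exception type for a missing object as they do for a missing local file.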
Usage
Use this module in the Turing BERT pipeline when Python 2 compatibility is required or when models are loaded through the Turing-specific code path. It serves the same purpose as the pytorch_pretrained_bert file_utils but is written to run under both Python 2 and Python 3.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/bing_bert/turing/file_utils.py
- Lines: 1-256
Signature
PYTORCH_PRETRAINED_BERT_CACHE = Path(
    os.getenv('PYTORCH_PRETRAINED_BERT_CACHE',
              Path.home() / '.pytorch_pretrained_bert'))
def url_to_filename(url, etag=None):
def filename_to_url(filename, cache_dir=None):
def cached_path(url_or_filename, cache_dir=None):
def split_s3_path(url):
def s3_request(func):
def s3_etag(url):
def s3_get(url, temp_file):
def http_get(url, temp_file):
def get_from_cache(url, cache_dir=None):
def read_set_from_file(filename):
def get_file_extension(path, dot=True, lower=True):
Import
from turing.file_utils import cached_path, PYTORCH_PRETRAINED_BERT_CACHE
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| url_or_filename | str | Yes | A URL (http/https/s3) or local file path to resolve |
| cache_dir | str or Path | No | Override cache directory; defaults to PYTORCH_PRETRAINED_BERT_CACHE |
| url | str | Yes | URL for download functions (s3_get, http_get, s3_etag) |
| etag | str | No | HTTP ETag for cache key generation in url_to_filename |
| filename | str | Yes | Cache filename for reverse URL lookup in filename_to_url |
Outputs
| Name | Type | Description |
|---|---|---|
| cached_path result | str | Local filesystem path to the cached or verified local file |
| url_to_filename result | str | SHA-256 hash-based filename for cache storage |
| filename_to_url result | tuple(str, str or None) | Original URL and ETag (the ETag may be None) recovered from the JSON metadata sidecar |
| read_set_from_file result | set[str] | Deduplicated set of lines read from a text file |
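The JSON metadata sidecar behind the filename_to_url result can be illustrated with a small round trip. This sketch targets Python 3 only; the `.json` suffix, the `url` and `etag` keys, and both helper names are assumptions chosen for the example.

```python
import json
import os
import tempfile


def write_cache_entry_sketch(cache_dir, filename, payload, url, etag=None):
    """Store the payload plus a sidecar recording where it came from."""
    cache_path = os.path.join(cache_dir, filename)
    with open(cache_path, "wb") as cache_file:
        cache_file.write(payload)
    with open(cache_path + ".json", "w", encoding="utf-8") as meta_file:
        json.dump({"url": url, "etag": etag}, meta_file)
    return cache_path


def filename_to_url_sketch(filename, cache_dir):
    """Recover (url, etag) for a cached file, or raise if either file is missing."""
    cache_path = os.path.join(cache_dir, filename)
    if not os.path.exists(cache_path):
        raise EnvironmentError("file {} not found".format(cache_path))
    meta_path = cache_path + ".json"
    if not os.path.exists(meta_path):
        raise EnvironmentError("file {} not found".format(meta_path))
    with open(meta_path, encoding="utf-8") as meta_file:
        metadata = json.load(meta_file)
    return metadata["url"], metadata["etag"]


cache_dir = tempfile.mkdtemp()
write_cache_entry_sketch(cache_dir, "abc", b"weights",
                         "https://example.com/model.bin", '"abc123"')
print(filename_to_url_sketch("abc", cache_dir))
```

Keeping the metadata in a separate file leaves the cached payload byte-for-byte identical to the download, so it can be handed directly to any loader.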
Usage Examples
from turing.file_utils import cached_path, url_to_filename, read_set_from_file
# Resolve a model URL to a local cached file
local_path = cached_path("https://example.com/bert-base-uncased.tar.gz")
# Generate a deterministic cache filename
filename = url_to_filename("https://example.com/model.bin", etag='"abc123"')
# Load a set of items from a text file
vocab_set = read_set_from_file("/path/to/vocab.txt")
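For readers without the repository at hand, the two small helpers can be approximated as below. These are sketches of the behavior described on this page (one item per line, trailing whitespace stripped, extension taken from the final dot), not copies of the module's code; the `_sketch` names and the temporary file are illustrative.

```python
import os
import tempfile
from io import open  # gives Python 2 the encoding-aware open()


def read_set_from_file_sketch(filename):
    """Collect the distinct lines of a text file, one item per line."""
    items = set()
    with open(filename, "r", encoding="utf-8") as handle:
        for line in handle:
            items.add(line.rstrip())
    return items


def get_file_extension_sketch(path, dot=True, lower=True):
    """Return the final extension, optionally without the dot or lowercasing."""
    ext = os.path.splitext(path)[1]
    ext = ext if dot else ext[1:]
    return ext.lower() if lower else ext


# Write a tiny file with a duplicate line to show the deduplication.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("the\ncat\nthe\n")

print(sorted(read_set_from_file_sketch(tmp.name)))  # ['cat', 'the']
print(get_file_extension_sketch("Model.TAR.GZ"))  # .gz
print(get_file_extension_sketch("Model.TAR.GZ", dot=False, lower=False))  # GZ
```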