
Implementation:Microsoft DeepSpeedExamples BingBert Turing FileUtils

From Leeroopedia


Knowledge Sources
Domains: Caching, File Management
Last Updated: 2026-02-07 12:00 GMT

Overview

A caching and download utility with Python 2/3 compatibility that provides URL-to-local-path resolution for HTTP, HTTPS, and S3 resources with ETag-based cache management.

Description

file_utils.py (in the turing package) is a variant of the pretrained BERT file utilities adapted for broader Python compatibility. It wraps urlparse and Path imports in try/except blocks to support both Python 2 and Python 3 environments. When pathlib.Path is unavailable or Path.home() raises an AttributeError, it falls back to os.path.expanduser for determining the cache directory location.
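The compatibility pattern described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name `default_cache_dir` and its parameters are hypothetical, and the exception types caught are an assumption based on the description (a missing `pathlib` gives `ImportError`, a missing `Path.home()` gives `AttributeError`).

```python
import os

# Python 3 exposes urlparse under urllib.parse; Python 2 has a top-level
# urlparse module instead, so the import is attempted in that order.
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse


def default_cache_dir(env_var="PYTORCH_PRETRAINED_BERT_CACHE",
                      dirname=".pytorch_pretrained_bert"):
    """Hypothetical helper: pick the cache directory.

    Prefers pathlib when it is importable and Path.home() exists,
    otherwise falls back to os.path.expanduser.
    """
    try:
        from pathlib import Path
        return Path(os.getenv(env_var, Path.home() / dirname))
    except (AttributeError, ImportError):
        return os.getenv(env_var,
                         os.path.join(os.path.expanduser("~"), dirname))
```

In both branches the environment variable, when set, takes precedence over the home-directory default.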

The module provides the same core caching functionality as its pytorch_pretrained_bert counterpart: cached_path resolves URLs or file paths to local cached files, url_to_filename generates SHA-256-based cache keys from URLs and ETags, and get_from_cache handles the download-and-cache workflow. Unlike that counterpart, this version raises EnvironmentError, which exists in Python 2 (and is an alias of OSError in Python 3), instead of the Python 3-only FileNotFoundError. It also passes encoding="utf-8" explicitly to file operations, so that reads and writes do not depend on the platform's default encoding.
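To make the cache-key idea concrete, here is a rough sketch of a SHA-256-based filename scheme. The exact layout (hex digest of the URL, then a dot and the hex digest of the ETag when one is given) is an assumption for illustration, and `sketch_url_to_filename` is a hypothetical name rather than the module's function.

```python
from hashlib import sha256


def sketch_url_to_filename(url, etag=None):
    """Hypothetical sketch: derive a deterministic cache filename.

    Assumed layout: sha256(url) in hex, optionally followed by
    '.' + sha256(etag) in hex when an ETag is supplied.
    """
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    return filename
```

Under this scheme the same URL with a new ETag maps to a different filename, so a changed remote resource does not collide with a stale cached copy.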

S3 support is implemented through boto3, with the s3_request decorator handling errors; HTTP downloads use requests in streaming mode and report progress through tqdm. Two small utilities are also included: read_set_from_file loads the lines of a text file into a set, and get_file_extension extracts the extension from a path. Checks on sys.version_info gate the Path-to-string conversions.
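The pure-Python helpers mentioned above might look roughly like the following. All three `sketch_*` functions are hypothetical stand-ins written from the description and the signatures listed on this page; details such as stripping trailing whitespace from each line and the wording of the error message are assumptions. The sketch targets Python 3 only and leaves out the boto3 and requests parts, which need network access.

```python
import os
import tempfile
from urllib.parse import urlparse


def sketch_split_s3_path(url):
    """Split 's3://bucket/key/parts' into (bucket, 'key/parts')."""
    parsed = urlparse(url)
    if not parsed.netloc or not parsed.path:
        raise ValueError("bad s3 path {}".format(url))
    # urlparse keeps the leading slash on the path; the S3 key has none.
    return parsed.netloc, parsed.path.lstrip("/")


def sketch_get_file_extension(path, dot=True, lower=True):
    """Return the last extension of path, optionally without the dot."""
    ext = os.path.splitext(path)[1]
    ext = ext if dot else ext[1:]
    return ext.lower() if lower else ext


def sketch_read_set_from_file(filename):
    """Read a text file into a set, one entry per line."""
    collection = set()
    with open(filename, "r", encoding="utf-8") as handle:
        for line in handle:
            collection.add(line.rstrip())
    return collection


# Small demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "items.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("alpha\nbeta\nalpha\n")
demo_set = sketch_read_set_from_file(demo_path)
```

Because the result is a set, the duplicated `alpha` line in the demonstration file appears only once in `demo_set`.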

Usage

Use this module in the Turing BERT pipeline when Python 2 compatibility is required or when loading models in the Turing-specific context. It serves the same purpose as the pytorch_pretrained_bert file_utils but is also written to run under Python 2.

Code Reference

Source Location

Signature

PYTORCH_PRETRAINED_BERT_CACHE = Path(
    os.getenv('PYTORCH_PRETRAINED_BERT_CACHE',
              Path.home() / '.pytorch_pretrained_bert'))

def url_to_filename(url, etag=None):
def filename_to_url(filename, cache_dir=None):
def cached_path(url_or_filename, cache_dir=None):
def split_s3_path(url):
def s3_request(func):
def s3_etag(url):
def s3_get(url, temp_file):
def http_get(url, temp_file):
def get_from_cache(url, cache_dir=None):
def read_set_from_file(filename):
def get_file_extension(path, dot=True, lower=True):

Import

from turing.file_utils import cached_path, PYTORCH_PRETRAINED_BERT_CACHE

I/O Contract

Inputs

Name | Type | Required | Description
url_or_filename | str | Yes | A URL (http/https/s3) or local file path to resolve
cache_dir | str or Path | No | Override cache directory; defaults to PYTORCH_PRETRAINED_BERT_CACHE
url | str | Yes | URL for download functions (s3_get, http_get, s3_etag)
etag | str | No | HTTP ETag for cache key generation in url_to_filename
filename | str | Yes | Cache filename for reverse URL lookup in filename_to_url
Outputs

Name | Type | Description
cached_path result | str | Local filesystem path to the cached or verified local file
url_to_filename result | str | SHA-256 hash-based filename for cache storage
filename_to_url result | tuple(str, str) | Original URL and ETag recovered from JSON metadata sidecar
read_set_from_file result | set[str] | Deduplicated set of lines read from a text file
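The reverse lookup through a JSON metadata sidecar can be illustrated with a small round trip. This sketch assumes the sidecar lives next to the cached file as `<filename>.json` and stores `url` and `etag` keys; both the layout and the helper names `write_sidecar` and `sketch_filename_to_url` are hypothetical. It targets Python 3 only.

```python
import json
import os
import tempfile


def write_sidecar(cache_path, url, etag):
    """Hypothetical helper: record where a cached file came from."""
    with open(cache_path + ".json", "w", encoding="utf-8") as meta_file:
        json.dump({"url": url, "etag": etag}, meta_file)


def sketch_filename_to_url(filename, cache_dir):
    """Recover (url, etag) for a cached file from its JSON sidecar."""
    cache_path = os.path.join(str(cache_dir), filename)
    if not os.path.exists(cache_path):
        raise EnvironmentError("file {} not found".format(cache_path))
    meta_path = cache_path + ".json"
    if not os.path.exists(meta_path):
        raise EnvironmentError("file {} not found".format(meta_path))
    with open(meta_path, encoding="utf-8") as meta_file:
        metadata = json.load(meta_file)
    return metadata["url"], metadata["etag"]


# Round trip in a throwaway cache directory.
demo_cache = tempfile.mkdtemp()
demo_entry = os.path.join(demo_cache, "abc123")
with open(demo_entry, "wb") as cached:
    cached.write(b"payload")
write_sidecar(demo_entry, "https://example.com/model.bin", '"v1"')
recovered = sketch_filename_to_url("abc123", demo_cache)
```

Keeping the URL and ETag in a sidecar is what makes the hash-based filename reversible, since the hash alone cannot be turned back into a URL.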

Usage Examples

from turing.file_utils import cached_path, url_to_filename, read_set_from_file

# Resolve a model URL to a local cached file
local_path = cached_path("https://example.com/bert-base-uncased.tar.gz")

# Generate a deterministic cache filename
filename = url_to_filename("https://example.com/model.bin", etag='"abc123"')

# Load a set of items from a text file
vocab_set = read_set_from_file("/path/to/vocab.txt")
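For readers who want to see how a single entry point can accept both URLs and local paths, the sketch below shows one plausible dispatch on the URL scheme. It is an assumption-laden illustration, not the module's code: `resolve` is a hypothetical stand-in for cached_path, and the `fetch_remote` callable stands in for get_from_cache so the example runs without network access. It targets Python 3 only.

```python
import os
from urllib.parse import urlparse


def resolve(url_or_filename, fetch_remote):
    """Hypothetical stand-in for cached_path.

    fetch_remote(url) plays the role of the download-and-cache step and
    must return a local path.
    """
    parsed = urlparse(url_or_filename)
    if parsed.scheme in ("http", "https", "s3"):
        # Remote resource: delegate to the download-and-cache step.
        return fetch_remote(url_or_filename)
    if os.path.exists(url_or_filename):
        # Existing local file or directory: return it unchanged.
        return url_or_filename
    if parsed.scheme == "":
        # Looks like a local path, but nothing is there.
        raise EnvironmentError("file {} not found".format(url_or_filename))
    # Some other scheme that this sketch does not handle.
    raise ValueError(
        "unable to parse {} as a URL or as a local path".format(url_or_filename))
```

Raising EnvironmentError for a missing local path matches the error type the Description attributes to this module; the ValueError for unknown schemes is an assumption of the sketch.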
