Shorten cache filenames to fit eCryptfs 143-byte NAME_MAX #566

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/cache-filename-too-long

@Chessing234

Summary

  • url_to_filename() was appending the full trailing URL path component (e.g. tfidf_vectors_sparse.npz) to the hash-based filename, producing names up to 154 characters
  • This exceeds the 143-byte NAME_MAX on eCryptfs-encrypted filesystems (common on Ubuntu encrypted home directories), causing OSError: [Errno 36] File name too long
  • Now only the file extension is preserved (e.g. .npz), keeping the worst-case filename (including .json metadata sidecar) under 143 bytes
  • _find_existing_cache_file() matches both old-format and new-format filenames for backward compatibility, so existing caches continue to work
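
The length arithmetic behind these numbers can be sketched as follows. This is an illustration, not the code in this PR: it assumes the cache name is built as `<sha256(url)>.<sha256(etag)>` plus a suffix, a layout chosen here because it reproduces the 154-character figure above. The helper names `old_style_name` and `new_style_name` and the example URL are hypothetical.

```python
from hashlib import sha256
from os.path import splitext
from urllib.parse import urlparse

ECRYPTFS_NAME_MAX = 143  # limit cited in this PR


def _hashed_prefix(url: str, etag: str) -> str:
    # Assumed layout: <sha256(url)>.<sha256(etag)>, i.e. 64 + 1 + 64 = 129 chars.
    url_hash = sha256(url.encode("utf-8")).hexdigest()
    etag_hash = sha256(etag.encode("utf-8")).hexdigest()
    return url_hash + "." + etag_hash


def old_style_name(url: str, etag: str) -> str:
    # Old behaviour: append the whole trailing path component.
    return _hashed_prefix(url, etag) + "." + url.split("/")[-1]


def new_style_name(url: str, etag: str) -> str:
    # New behaviour: append only the extension (splitext keeps the leading dot).
    return _hashed_prefix(url, etag) + splitext(urlparse(url).path)[1]


url = "https://example.org/linkers/tfidf_vectors_sparse.npz"  # placeholder URL
old = old_style_name(url, "etag")
new = new_style_name(url, "etag")
print(len(old), len(old + ".json"))  # 154 159
print(len(new), len(new + ".json"))  # 133 138
```

Under these assumptions the old name already exceeds the limit on its own, while the new name stays below 143 bytes even with the `.json` sidecar suffix.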

Fixes #539, related to #447

Changes

  • scispacy/file_cache.py: url_to_filename() now appends only the file extension instead of the full trailing path component; added _find_existing_cache_file() helper that supports both old and new filename formats
  • tests/test_file_cache.py: Added test verifying all actual scispacy linker URLs produce filenames under the 143-byte limit
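
One way a lookup could accept both filename formats is sketched below. The function name `find_cached`, its signature, and the `hashed_prefix` argument are assumptions made for illustration; the actual `_find_existing_cache_file()` helper in this PR may be structured differently.

```python
import os
import tempfile
from typing import Optional


def find_cached(cache_dir: str, hashed_prefix: str, url: str) -> Optional[str]:
    """Return the path of an existing cache file in either format, or None."""
    last_part = url.split("/")[-1]
    extension = os.path.splitext(last_part)[1]
    candidates = [
        hashed_prefix + extension,        # new format: hash + extension only
        hashed_prefix + "." + last_part,  # old format: hash + full trailing component
    ]
    for name in candidates:
        path = os.path.join(cache_dir, name)
        if os.path.exists(path):
            return path
    return None


with tempfile.TemporaryDirectory() as cache_dir:
    url = "https://example.org/linkers/tfidf_vectors_sparse.npz"  # placeholder URL
    # Simulate a file left behind by the old naming scheme.
    legacy = os.path.join(cache_dir, "abc123.tfidf_vectors_sparse.npz")
    open(legacy, "w").close()

    found = find_cached(cache_dir, "abc123", url)
    found_name = os.path.basename(found) if found else None
    missing = find_cached(cache_dir, "ffffff", url)
```

Checking the new format first means fresh downloads are preferred, while an old-format file is still picked up without triggering a re-download.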

Test plan

  • python -m pytest tests/test_file_cache.py -v passes
  • Existing cached files (old format) are still found without re-download
  • New downloads produce shorter filenames that work on eCryptfs
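
For manual verification on a given machine, the filename limit of the cache directory can be queried directly. The sketch below uses `os.pathconf`, which is available on POSIX systems only; the helper names `name_max` and `fits` are hypothetical, and whether a particular filesystem reports its effective limit this way is an assumption to check locally.

```python
import os
import tempfile


def name_max(directory: str) -> int:
    # POSIX-only: asks the filesystem holding `directory` for its NAME_MAX.
    return os.pathconf(directory, "PC_NAME_MAX")


def fits(directory: str, filename: str) -> bool:
    # The limit is expressed in bytes, so measure the encoded name.
    return len(os.fsencode(filename)) <= name_max(directory)


cache_dir = tempfile.gettempdir()  # stand-in for the real cache directory
limit = name_max(cache_dir)
print(limit, fits(cache_dir, "a" * 64 + ".npz"))
```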

🤖 Generated with Claude Code

url_to_filename() was appending the full trailing URL path component
(e.g. tfidf_vectors_sparse.npz) to the hash-based filename, producing
names up to 154 characters. This exceeds the 143-byte NAME_MAX on
eCryptfs-encrypted filesystems, causing OSError: File name too long.

Now only the file extension is preserved (e.g. .npz), keeping the
worst-case filename (including .json sidecar) under 143 bytes.
_find_existing_cache_file() matches both old and new filename formats
for backward compatibility.

Fixes allenai#539

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

File name too long