
Conversation

@ia0 (Owner) commented Jul 27, 2025

Fixes #145

Timeline:

  • 2025-08-18: The PR will be merged if no issues are raised (and as soon as I get a chance to do it).
  • 2025-09-08: This PR is released as part of the publication of data-encoding-v2.10.0.

@zacknewman commented Jul 30, 2025

Out of curiosity, where are these "magic" numbers coming from? Is there any concern that adding a new format could suffer from overflow for values smaller than what is asserted (i.e., is there some mechanism by which these limits are ensured to be a lower bound for any new format)?

@ia0 (Owner, Author) commented Jul 31, 2025

Out of curiosity, where are these "magic" numbers coming from?

I'm assuming you mean usize::MAX / 512 for encode_len() and usize::MAX / 8 for decode_len(). Those are just arbitrary lower bounds on the actual maximum input length (as defined by the implementation). I'm using them for 3 reasons:

  • They are much simpler to understand than the actual formula. For example, for encode_len() it would be something like: if self uses wrapping, the maximum input length is usize::MAX / (col + sep) * col / 8 * bit (not exact, just an example), otherwise if self uses padding, the maximum input length is usize::MAX / dec * enc, otherwise the maximum input length is usize::MAX / 8 * bit.
  • They are much faster to check. Otherwise, calling encode_len() would spend as much time figuring out the maximum input length as computing the output length.
  • They are not leaking implementation details. The specification is essentially "You should not call this function with unreasonably big numbers" (otherwise the input should be chunked). Those arbitrary values are just a way to say what unreasonable means. Ideally I would have used 1_000_000. But the problem is that it needs to depend on usize::MAX to be portable (unless I ignore architectures with usize::BITS < 32 which could be an option). Another alternative would be min(1_000_000, usize::MAX / K) with a good K.

I'll have to ponder those alternatives, possibly using a mix like specifying 1_000_000 as the max, with a footnote that it only applies when 32 <= usize::BITS, otherwise it's usize::MAX / K.
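
For concreteness, a minimal sketch of that min(1_000_000, usize::MAX / K) alternative (K = 512 and the name MAX_INPUT_LEN are placeholders, not decisions):

const K: usize = 512; // placeholder divisor, not a decided constant
const MAX_INPUT_LEN: usize =
    if usize::MAX / K < 1_000_000 { usize::MAX / K } else { 1_000_000 };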

Is there any concern that adding a new format could suffer from overflow for values smaller than what is asserted (i.e., is there some mechanism by which these limits are ensured to be a lower bound for any new format)?

There's no concern. Those bounds are correct for all possible Encodings. This PR adds 2 fuzzing targets to ensure this property (assuming monotonicity of encode_len() and decode_len()). But I've also written some tests (not in the PR) to compute the exact maximum value for all Encodings. That's when I decided that the actual exact maximum value is not user-friendly, and also not something I want to guarantee. So I ended up using a lower bound independent of the Encoding.
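
As a rough illustration of the property being checked (this is not one of the PR's fuzz targets, and it assumes the usize::MAX / 512 bound is inclusive), a spot check could look like:

#[test]
fn encode_len_is_defined_up_to_the_bound() {
    use data_encoding::{BASE32, BASE64, BASE64URL_NOPAD, HEXLOWER};
    // The documented bound must not make encode_len() overflow or panic,
    // whatever the Encoding.
    for encoding in [&HEXLOWER, &BASE32, &BASE64, &BASE64URL_NOPAD] {
        let _ = encoding.encode_len(usize::MAX / 512);
    }
}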

@zacknewman commented Jul 31, 2025

I'm assuming you mean usize::MAX / 512 for encode_len() and usize::MAX / 8 for decode_len().

Yes.

I think it's fine as is. While a number like a million is nice since it's a power of 10, it's still just as arbitrary. Additionally, I don't think it makes sense to be more limiting than necessary. If a person wants to work with a large buffer (e.g., 100 MiB on an x64 machine), then they should be allowed to so long as the code is still correct.

Something like usize::MAX / 512 doesn't really leak implementation details, and honestly it might be fine to be vague in the panic section. Perhaps: panics when the input is unrealistically large. Here you're at least mentioning that panics are possible, and users can always read the source code if they want to know more precisely what values lead to a panic. Personally I would prefer the value be mentioned; this way I can explicitly test for the length to avoid a panic and be guaranteed that my check is correct, since it's part of the public API. But I understand not wanting to "leak" this limit or be beholden to it in the future.
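
A minimal caller-side sketch of that check, assuming the documented usize::MAX / 512 bound for encode_len() (encode_checked is a hypothetical helper, not part of the crate):

use data_encoding::BASE64;

fn encode_checked(input: &[u8]) -> Option<String> {
    if input.len() > usize::MAX / 512 {
        return None; // would otherwise hit the documented panic
    }
    Some(BASE64.encode(input))
}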

They are much faster to check. Otherwise, calling encode_len() would spend as much time figuring out the maximum input length as computing the output length.

I guess I'm not used to this level of performance concern. If you are worried about this level of performance, wouldn't it be better to define const variables that define these limits ensuring that the calculations are evaluated at compilation time? I'm sure the optimizer evaluates something like usize::MAX / 512 (or equivalently usize::MAX >> 9) at compilation time, but it's technically not guaranteed unlike the use of a const variable.

Anyway, I was trying to confirm my "guess" that these magic values were just "convenient" and not unnecessarily limiting.

Alternatively, you can perform checked arithmetic where overflow is possible (e.g., usize::checked_mul), and unwrap/expect the result; but I'm guessing you're worried about the performance of that.
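
A sketch of that checked-arithmetic alternative, using unpadded base64's ceil(4 * n / 3) as a stand-in for the real per-Encoding formula (the function name is illustrative):

fn encode_len_checked(n: usize) -> usize {
    // Panics (via expect) if the output length would not fit in a usize.
    n.checked_mul(4).expect("input too long to encode").div_ceil(3)
}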

@ia0 (Owner, Author) commented Jul 31, 2025

If a person wants to work with a large buffer (e.g., 100 MiB on an x64 machine), then they should be allowed to so long as the code is still correct.

Fair point. So I'll keep the PR as is. The current limits are close enough to the actual ones (except for encode_len() due to how wrapping permits unreasonably long separators, but I'll try to fix that in v3), but still permit some flexibility in implementations. This actually reminds me that I should document that those limits may be increased without being a major change (i.e. the panic is not guaranteed behavior, code previously panicking may start returning the correct value).

wouldn't it be better to define const variables that define these limits ensuring that the calculations are evaluated at compilation time?

I'm not sure if you're talking about the precise limits or those of the PR. I was talking about the precise limits when talking about complexity. And for those you can't compute the limits at compile time, because they depend on the Encoding, const fn in Rust today is not expressive enough for this kind of thing, and it would only work if the function were called in a const context.

Anyway, I was trying to confirm my "guess" that these magic values were just "convenient" and not unnecessarily limiting.

Exactly. Those limits strike a balance between "close enough to the implementation limit" and "simple enough for users and low enough to permit future implementations". I'm planning to improve the encode_len() one in v3, but in v2 it has to be below usize::MAX / 263, because there exist unrealistic Encodings for which that is the maximum input length.

Alternatively, you can perform checked arithmetic where overflow is possible (e.g., usize::checked_mul), and unwrap/expect the result; but I'm guessing you're worried about the performance of that.

That's something I considered. But this still has 2 of the 3 drawbacks listed above. Namely, the limit would be hard to understand (a very long formula) and would leak implementation details (possibly preventing future implementations).

@zacknewman commented Jul 31, 2025

Fair point. So I'll keep the PR as is. The current limits are close enough to the actual ones (except for encode_len() due to how wrapping permits unreasonably long separators, but I'll try to fix that in v3), but still permit some flexibility in implementations. This actually reminds me that I should document that those limits may be increased without being a major change (i.e. the panic is not guaranteed behavior, code previously panicking may start returning the correct value).

My experience with SemVer stuff is that going from panic -> not panic is not a major change in a similar vein as going from compilation failure -> compilation success (with all else being equal), but it doesn't hurt to be precise in stating this explicitly. As long as code that used to be panic-free doesn't suddenly become panicable, then I'm happy. I can always increase the limit if I decide to re-read updated documentation.

I'm not sure if you're talking about the precise limits or those of the PR. I was talking about the precise limits when talking about complexity. And for those you can't compute the limits at compile time, because they depend on the Encoding, const fn in Rust today is not expressive enough for this kind of thing, and it would only work if the function were called in a const context.

I was being somewhat hyperbolic with my analogy. I was stating that if calculating the precise limit is "too expensive", then why not go further and think usize::MAX / 512 is "too expensive" as well? I realize the precise limit is more "expensive", but it's all relative. To some, the precise limit is likely fast enough. Point being, if a goal is to satisfy everyone's notion of "fast enough", then I'm sure there is a (tiny, but non-empty) subset of people that think usize::MAX / 512 is "too slow" when a literal could be used instead.

That's something I considered. But this still has 2 of the 3 drawbacks listed above. Namely, the limit would be hard to understand (a very long formula) and would leak implementation details (possibly preventing future implementations).

Well if we are already conceding that the cited limit is subject to change to a higher value without SemVer implications, then I'm not sure if those drawbacks really apply. In v2 things are a little tricky, since it's reasonable, even necessary for people like me who intend to avoid panics, for downstream code to first verify inputs are not too long, so having an "easy" limit to check is nice. For v3, if a Result is returned instead of a panic, then downstream code doesn't need to (and I'd argue shouldn't) check for these limits. One could then state something vague like:

Errors

Errors iff the calculation of the output length causes overflow.

You can always provide a separate function that does panic, but it would simply state:

Panics

Panics iff <equivalent fallible function> errors.

This would "free" you from citing any implementation-based limits, since it'd just be a convenience function that allows one to avoid calling unwrap.

The last choice is to re-define the length calculations such that overflow is not possible. Not sure how easy that is to do generally, but it's quite easy for specific encodings, as the cited issue shows for base64url without padding. The algorithms for encoded length and decoded length there are the ones I was using in my own code, but they don't actually match data-encoding's behavior, since data-encoding overflows for smaller values. I even wrote a mathematical proof of their correctness and why they're free from overflow (in the case when one leverages Rust's guarantee that a memory allocation never exceeds isize::MAX). Seeing how base16 is the "smallest" base that is supported and 2 * isize::MAX < usize::MAX[1], it may be possible to implement a general algorithm whose intermediate calculations are free from overflow.

Footnotes

  1. We assume usize::MAX is of the form 2^i - 1 and isize::MAX is of the form 2^(i - 1) - 1 for i ∈ {16, 17, …} ⊂ ℕ, where 16 is the minimum i due to usize implementing From<u16>.
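
For the decode direction of the same base64url example, here is a hedged sketch (not data-encoding's implementation) of a decoded-length calculation whose intermediate values cannot overflow:

const fn base64url_nopad_decode_len(n: usize) -> Option<usize> {
    let whole = (n / 4) * 3; // 3 * (n / 4) <= n, so this cannot overflow
    match n % 4 {
        0 => Some(whole),
        2 => Some(whole + 1),
        3 => Some(whole + 2),
        _ => None, // a single leftover character cannot encode a full byte
    }
}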

@ia0 (Owner, Author) commented Jul 31, 2025

My experience with SemVer stuff is that going from panic -> not panic is not a major change

I was also of this opinion, but due to #126 I asked the community, and sadly that's not the general opinion. If a panic is documented, then it's a "positive" behavior.

I was stating that if calculating the precise limit is "too expensive", then why not go further and think usize::MAX / 512 is "too expensive" as well?

I see. Then I don't think this holds. usize::MAX / 512 is a constant. The precise limit is conditional on multiple dynamic values (wrapping and padding), besides doing much more arithmetic. There is a very clear distinction between their computational behavior.

For v3, if a Result is returned instead of a panic, then downstream code doesn't need to (and I'd argue shouldn't) check for these limits.

That's a very good observation. In v3 I don't even need to document the limit. Returning an error is enough. But I can still provide a limit below which it will never error due to overflow (essentially some usize::MAX / K where K is large enough to support future implementations but small enough to be useful).

The last choice is to re-define the length calculations such that overflow is not possible.

That's simply not possible for encode_len(), since the output length is longer than the input length by a factor of at least 8 / bit. For decode_len(), though, it should be possible.

@zacknewman commented Jul 31, 2025

I was also of this opinion, but due to #126 I asked the community, and sadly that's not the general opinion. If a panic is documented, then it's a "positive" behavior.

Hm, interesting. Like most things, SemVer has a fair amount of subjectivity. The fact that you already have a generic "invalid length" variant for DecodeKind makes me disagree that changing the panic to return an error based on DecodeKind::Length is a "breaking change", as long as the panic documentation is changed. I can understand why some people think otherwise, but I don't. If you had to add a new variant to DecodeKind, then I'd think otherwise, since it's not marked #[non_exhaustive]. I'll also add that that example is a little different, since you're going from panicking to not panicking at all, while I was remarking on going from a panic for one set of inputs to a panic for a proper subset of those inputs. Granted, I still think your example is exempt from SemVer, but I respect your decision not to treat it that way.

I see. Then I don't think this holds. usize::MAX / 512 is a constant. The precise limit is conditional on multiple dynamic values (wrapping and padding), besides doing much more arithmetic. There is a very clear distinction between their computational behavior.

While usize::MAX / 512 may end up compiling to a constant (especially in release mode), there is no guarantee that arithmetic on constants always compiles to a constant; hence one must ensure the expression is in a const context to be guaranteed of that. I obviously don't think there is a performance issue even if the compiler didn't compile it to a constant; I was merely remarking that performing a division may be considered "too slow" by some. You can read more in The Reference about what is and is not guaranteed to be evaluated at compilation time.

That's simply not possible for encode_len(), since the output length is longer than the input length by a factor of at least 8 / bit. For decode_len(), though, it should be possible.

I don't follow. When encode_len is called by a function like encode, its input is guaranteed to be no more than isize::MAX. The maximum output length is thus isize::MAX * 2 since base16 is the "smallest" base, and isize::MAX * 2 < usize::MAX. If you expect to generalize to even smaller formats, then yes, overflow is possible and you need to error (or panic). One can calculate the length based on octets instead of bits, at least for the currently supported formats. For something like base64url without padding, the below is mathematically guaranteed to be free from overflow and a panic (so long as the argument passed is less than or equal to isize::MAX, which is true for functions that are based on "collections", since the resulting length never exceeds isize::MAX).

const fn base64url_nopad_encode_len(n: usize) -> usize {
    const MAX_LEN: usize = isize::MAX as usize;
    assert!(n <= MAX_LEN);
    // 4 output characters per full 3-byte group, plus ceil(4 * r / 3) for the
    // remaining r = n % 3 bytes; with n <= isize::MAX, nothing here can overflow.
    ((n / 3) << 2) + ((n % 3) << 2).div_ceil(3)
}

@ia0 (Owner, Author) commented Jul 31, 2025

since base16 is the "smallest" base

It's not. data-encoding supports base2 (binary), base4, base8 (octal), base16 (hexadecimal), base32, and base64. That's all streamable bases that work with printable ASCII (base128 would use control characters, base256 is essentially no encoding, and bases that are not a power of two are not streamable).
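
For example, base2 can be built with the crate's Specification builder (the snippet below is just an illustration):

use data_encoding::Specification;

fn base2_example() {
    let mut spec = Specification::new();
    spec.symbols.push_str("01");
    let base2 = spec.encoding().unwrap();
    assert_eq!(base2.encode(b"A"), "01000001");
    // Every input byte becomes 8 output bytes, the worst case for encode_len().
    assert_eq!(base2.encode_len(3), 24);
}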

isize::MAX * 2 < usize::MAX

But is this useful? If you assume the input fits in an isize, then it would make sense to consider that the output should also fit in an isize, since both are meant to measure a number of bytes. So for encoding, there will always be input lengths whose output length would be invalid, and those input lengths are unusable.

@zacknewman commented Jul 31, 2025

It's not. data-encoding supports base2 (binary), base4, base8 (octal), base16 (hexadecimal), base32, and base64. That's all streamable bases that work with printable ASCII (base128 would use control characters, base256 is essentially no encoding, and bases that are not a power of two are not streamable).

Ah, I was lazily going by the constants that are defined (e.g., BASE32); and the smallest formats that have an associated constant are the HEX* ones.

But is this useful? If you assume the input fits in an isize, then it would make sense to consider that the output should also fit in an isize, since both are meant to measure a number of bytes. So for encoding, there will always be input lengths whose output length would be invalid, and those input lengths are unusable.

Well the point is moot since you correctly pointed out you support smaller formats than base16; nonetheless I'll reply. You currently don't remark anywhere that panics can occur due to memory allocations. This is very common in crates, and it's something I also don't do. After all, even if you can ensure that you're allocating no more than isize::MAX bytes, that doesn't mean there will be enough runtime memory to support it. Currently, associated functions like Vec::try_with_capacity are unstable. This means one can take advantage of that and still correctly calculate the encoded length even if the resulting allocation is guaranteed to panic. Now if you documented how panics can occur due to memory allocations, then you are correct that there is no point in calculating output lengths that exceed isize::MAX. If you don't document such behavior (which is possible as soon as you rely on pretty much anything in alloc, thus making many functions panicable), then you can get away with adding no panics/errors beyond the omnipresent memory-allocation-based panics that many crate authors ignore for good reason.

@ia0 (Owner, Author) commented Jul 31, 2025

Well the point is moot

That's exactly why I'm only giving a simple limit. I don't think anything close to those limits is realistic. It's just technically correct to have a limit.
