Merge pull request #1944 from LukasKalbertodt/cr-lf-fixes

ehuss · web-flow · commit 5b3ca00b50a5 · 2025-07-25T13:51:40.000Z
Fix and clarify CR LF normalization and CR in string literals
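
As a rough illustration of the "once, not repeatedly" wording added to `input-format.md`, here is a minimal Rust sketch of a single left-to-right normalization pass. The function name `normalize_crlf` is hypothetical and is not taken from the Reference or from the compiler; it only models the behavior described in the text.

```rust
/// Hypothetical helper (name and signature are assumptions for illustration).
/// Replaces each CR immediately followed by LF with a single LF,
/// scanning the input exactly once from left to right.
fn normalize_crlf(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            // Drop this CR; the following LF is pushed on the next iteration.
            continue;
        }
        // Any other character, including a lone CR, is kept as-is.
        out.push(c);
    }
    out
}
```

Under this sketch, `normalize_crlf("\r\r\n\n")` yields `"\r\n\n"`: the inner CR LF pair collapses to LF, and the result still contains a CR immediately followed by LF because the pass is not repeated.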
diff --git a/src/input-format.md b/src/input-format.md
@@ -24,6 +24,7 @@ r[input.crlf]
 ## CRLF normalization
 
 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+This happens once, not repeatedly, so after normalization the input can still contain `U+000D` (CR) immediately followed by `U+000A` (LF) (e.g. if the raw input contained "CR CR LF LF").
 
 Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
 
diff --git a/src/tokens.md b/src/tokens.md
@@ -60,8 +60,6 @@ Literals are tokens used in [literal expressions].
 
 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.
 
-> [!NOTE]
-> Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
 
 #### ASCII escapes
 
@@ -198,9 +196,9 @@ which must be _escaped_ by a preceding `U+005C` character (`\`).
 
 r[lex.token.literal.str.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
+The character `U+000D` (CR) may not appear in a string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
 
 r[lex.token.literal.char-escape]
 #### Character escapes
@@ -323,9 +321,9 @@ below.
 
 r[lex.token.str-byte.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+The character `U+000D` (CR) may not appear in a byte string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
 
 r[lex.token.str-byte.escape]
 Some additional _escapes_ are available in either byte or non-raw byte string
@@ -429,9 +427,9 @@ permitted within a C string.
 
 r[lex.token.str-c.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+The character `U+000D` (CR) may not appear in a C string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
 
 r[lex.token.str-c.escape]
 Some additional _escapes_ are available in non-raw C string literals. An escape