toddr · FGasper · Dec 12, 2020
diff --git a/CDB_File.pm b/CDB_File.pm
@@ -16,18 +16,20 @@ CDB_File - Perl extension for access to cdb databases
 =head1 SYNOPSIS
 
     use CDB_File;
-    $c = tie(%h, 'CDB_File', 'file.cdb') or die "tie failed: $!\n";
 
-    # If accessing a utf8 stored CDB_File
-    $c = tie(%h, 'CDB_File', 'file.cdb', utf8 => 1) or die "tie failed: $!\n";
+    # If accessing a bytes/Latin-1 CDB file:
+    $c = tie(%h, 'CDB_File', 'file.cdb', string_mode => 'latin1') or die "tie failed: $!\n";
+
+    # If accessing a utf8 stored CDB file:
+    $c = tie(%h, 'CDB_File', 'file.cdb', string_mode => 'utf8') or die "tie failed: $!\n";
 
     $fh = $c->handle;
     sysseek $fh, $c->datapos, 0 or die ...;
     sysread $fh, $x, $c->datalen;
     undef $c;
     untie %h;
 
-    $t = CDB_File->new('t.cdb', "t.$$") or die ...;
+    $t = CDB_File->new('t.cdb', "t.$$", string_mode => 'latin1') or die ...;
     $t->insert('key', 'value');
     $t->finish;
 
@@ -36,10 +38,11 @@ CDB_File - Perl extension for access to cdb databases
 or
 
     use CDB_File 'create';
-    create %t, $file, "$file.$$";
+    create %t, $file, "$file.$$", string_mode => 'latin1';
+
+    # If you want to store the data UTF-8 encoded:
+    create %t, $file, "$file.$$", string_mode => 'utf8';
 
-    # If you want to store the data in utf8 mode.
-    create %t, $file, "$file.$$", utf8 => 1;
 =head1 DESCRIPTION
 
 B<CDB_File> is a module which provides a Perl interface to Dan
@@ -86,25 +89,54 @@ C<$final> containing the contents of C<%t>.  As before,  C<$tmp> must
 name a temporary file which can be atomically renamed to C<$final>.
 C<CDB_File::create> may be imported.
 
-=head2 UTF8 support.
-
-When CDB_File was created in 1997 (prior even to Perl 5.6), Perl SVs
-didn't really deal with UTF8. In order to properly store mixed
-bytes and utf8 data in the file, we would normally need to store a bit
-for each string which clarifies the encoding of the key / values.
-This would be useful since Perl hash keys are downgraded to bytes when
-possible so as to normalize the hash key access regardless of encoding.
-
-The CDB_File format is used outside of Perl and so must maintain file
-format compatibility with those systems. As a result this module provides
-a utf8 mode which must be enabled at database generation and then later
-at read. Keys will always be stored as UTF8 strings which is the opposite
-of how Perl stores the strings. This approach had to be taken to assure no
-data corruption happened due to accidentally downgraded SVs before they
-are stored or on retrieval.
-
-You can enable utf8 mode by passing C<utf8 =E<gt> 1> to B<new>, B<tie>,
-or B<create>. All returned SVs while in this mode will be encoded in utf8.
+=head2 String Modes
+
+When CDB_File was created in 1997 (prior even to Perl 5.6), Perl strings
+were simple byte strings. It thus made sense, when exporting strings,
+simply to save the Perl interpreter’s internal string representation.
+
+In modern perls, though, strings are ordered arrays of code points.
+Perl doesn’t store those code points in a predictable internal encoding;
+thus, if we use the old behavior of exporting Perl’s internal
+representation, we’ll have unpredictable results.
+
+Sadly, this status quo must remain our default behavior; however, newer
+code should fix the situation by passing a C<string_mode> parameter
+to C<new()>, C<tie()>, or C<create()> with one of the following values:
+
+=over
+
+=item * C<latin1> - Similar to legacy behavior, but all strings are
+saved and imported as Latin-1. Any attempt to save a string that contains
+a code point that Latin-1 can’t accommodate—i.e., a code point that exceeds
+255—will trigger an exception.
+
+Likewise, any lookup on a string that includes a >255 code point will
+trigger an exception.
+
+This is suitable for “byte strings”, i.e., strings whose code points
+represent raw octets. This is the default state for Perl strings, and
+it’s also what you’ll have if you’ve encoded your strings for output
+prior to sending them to CDB_File.
+
+=item * C<utf8> - All strings are stored as UTF-8. Additionally, when
+reading a CDB file, all strings are also I<decoded> as UTF-8. Any strings
+in the CDB file that may not be valid UTF-8 will trigger an exception.
+This is suitable for decoded strings that you have I<not> encoded prior to
+sending them to CDB_File.
+
+=item * C<utf8_naive> - Just like C<utf8>, but this skips the UTF-8
+validity check. This can be marginally faster than C<utf8>, but if any
+strings are invalid UTF-8 then Perl’s internals may be corrupted. Avoid
+this mode unless you trust what you’re loading.
+
+=back
+
+You can also pass a C<string_mode> of C<sv> to indicate the legacy
+behavior explicitly.
+
+Note that the above applies to hash keys as well as to values.
+
 This feature is not available below 5.14 due to lack of Perl macro support.
 
 B<NOTE:> read/write of databases not stored in utf8 mode will often be