FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478
FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478subrata-ms wants to merge 7 commits intomainfrom
Conversation
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 3127-3138 3127 // the streaming LOB path to retrieve the full
3128 // value.
3129 LOG("SQLGetData: Cannot determine data length "
3130 "(SQL_NO_TOTAL) for column %d (SQL_CHAR), "
! 3131 "falling back to LOB streaming",
3132 i);
! 3133 row.append(FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false,
! 3134 charEncoding));
3135 // LCOV_EXCL_STOP
3136 } else if (dataLen < 0) {
3137 LOG("SQLGetData: Unexpected negative data length "
3138 "for column %d - dataType=%d, dataLen=%ld",Lines 3274-3285 3274 // the streaming LOB path to retrieve the full
3275 // value.
3276 LOG("SQLGetData: Cannot determine NVARCHAR data "
3277 "length (SQL_NO_TOTAL) for column %d, "
! 3278 "falling back to LOB streaming",
3279 i);
! 3280 row.append(
! 3281 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
3282 // LCOV_EXCL_STOP
3283 } else if (dataLen < 0) {
3284 LOG("SQLGetData: Unexpected negative data length "
3285 "for column %d (NVARCHAR) - dataLen=%ld",Lines 4008-4029 4008 //
4009 // Coverage note: This branch is defensive — the ODBC driver
4010 // controls indicator values in bound-column mode, so
4011 // SQL_NO_TOTAL cannot be triggered from Python-level tests.
! 4012 const ColumnInfoExt& noTotalColInfo = columnInfosExt[col - 1];
! 4013 LOG("SQLGetData: SQL_NO_TOTAL for column %d (dataType=%d), "
! 4014 "falling back to LOB streaming",
! 4015 col, noTotalColInfo.dataType);
! 4016 switch (noTotalColInfo.dataType) {
! 4017 case SQL_CHAR:
! 4018 case SQL_VARCHAR:
! 4019 case SQL_LONGVARCHAR:
! 4020 #if defined(__APPLE__) || defined(__linux__)
! 4021 PyList_SET_ITEM(row, col - 1,
! 4022 FetchLobColumnData(hStmt, col, SQL_C_CHAR, false, false,
! 4023 noTotalColInfo.charEncoding)
! 4024 .release()
! 4025 .ptr());
4026 #else
4027 PyList_SET_ITEM(
4028 row, col - 1,
4029 FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")Lines 4029-4066 4029 FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
4030 .release()
4031 .ptr());
4032 #endif
! 4033 break;
! 4034 case SQL_WCHAR:
! 4035 case SQL_WVARCHAR:
! 4036 case SQL_WLONGVARCHAR:
! 4037 case SQL_SS_XML:
! 4038 PyList_SET_ITEM(
! 4039 row, col - 1,
! 4040 FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 4041 .release()
! 4042 .ptr());
! 4043 break;
! 4044 case SQL_BINARY:
! 4045 case SQL_VARBINARY:
! 4046 case SQL_LONGVARBINARY:
! 4047 case SQL_SS_UDT:
! 4048 PyList_SET_ITEM(row, col - 1,
! 4049 FetchLobColumnData(hStmt, col, SQL_C_BINARY, false, true)
! 4050 .release()
! 4051 .ptr());
! 4052 break;
! 4053 default:
4054 // For fixed-length types SQL_NO_TOTAL should not
4055 // occur; treat as NULL to avoid undefined behaviour.
! 4056 LOG("SQL_NO_TOTAL for unexpected fixed-type column %d "
! 4057 "(dataType=%d), returning NULL",
! 4058 col, noTotalColInfo.dataType);
! 4059 Py_INCREF(Py_None);
! 4060 PyList_SET_ITEM(row, col - 1, Py_None);
! 4061 break;
! 4062 }
4063 // LCOV_EXCL_STOP
4064 continue;
4065 }📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%🔗 Quick Links
|
There was a problem hiding this comment.
Pull request overview
This PR addresses cross-platform inconsistencies when fetching CP1252-collated VARCHAR data by changing the ODBC C-type used for VARCHAR retrieval on Windows, aiming to ensure consistent Unicode (str) results across Windows and Linux/macOS.
Changes:
- On Windows, fetch/bind
VARCHAR(and relatedSQL_CHARtypes) asSQL_C_WCHARand process via the wide-char path to avoid code page guessing and bytes fallback. - On Linux/macOS, clarify/standardize
SQL_C_CHARbuffer sizing assumptions for UTF-8 expansion. - Add a new CP1252 regression test section covering problematic bytes and multiple fetch paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
mssql_python/pybind/ddbc_bindings.cpp |
Updates SQLGetData_wrap and batch fetch binding/processing to use SQL_C_WCHAR for VARCHAR on Windows; adjusts buffer sizing logic and LOB detection formatting. |
tests/test_013_encoding_decoding.py |
Adds a new CP1252 VARCHAR consistency test suite intended to ensure VARCHAR returns str across platforms and fetch methods. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
ffelixg
left a comment
There was a problem hiding this comment.
Hey, love to see this!
The encoding used by the odbc driver is actually not directly dependent on the operating system, instead it checks which locale is being used and encodes according to that. The difference between the operating systems is that windows uses cp1252 by default and linux uses utf-8. However, you can change the locale and make the driver return data in some other encoding. For example the current test passes on Linux with mssql_python 1.4.0, which shows that you can provoke the same issue on Linux:
def test_locale_varchar_decode_iso885915():
assert locale.getlocale() == ('en_US', 'UTF-8')
# important: change locale BEFORE connecting
locale.setlocale(locale.LC_ALL, 'en_US.iso885915')
connection = connect()
cursor = connection.cursor()
(val,) = cursor.execute("SELECT '€'").fetchone()
# without changing locale, we would get val == '€'
assert val == b'\xa4', val
cursor.close()
connection.close()If we really want to continue to use SQL_C_CHAR for fetching varchar data, even if just on Linux, we would need to note down the current locale when connecting and only fetch SQL_C_CHAR if it is UTF-8 and SQL_C_WCHAR otherwise. This is made more complicated by connection pooling though. Apparently some pooling is already active by default, as followup connections still use the same locale, even if I change it beforehand.
On the flip side, if you do this, you can get rid of the platform specific code, because windows also allows changing the locale to UTF-8, which would enable SQL_C_CHAR there as well.
Work Item / Issue Reference
Summary
This pull request makes significant improvements to how VARCHAR (and related CHAR) columns are fetched and decoded, especially for Windows platforms, to ensure consistent Unicode string handling across all operating systems. The changes unify the decoding behavior for problematic CP1252 byte values, preventing platform-specific inconsistencies and ensuring that VARCHAR columns always return Python
strobjects, notbytes, regardless of the server collation or platform. The update also adds comprehensive tests to verify this behavior.Cross-platform VARCHAR decoding and buffer management:
SQL_C_WCHAR, ensuring the ODBC driver converts from the server's native encoding (such as CP1252) to UTF-16, matching the behavior of NVARCHAR columns and eliminating the need to guess the server code page. This change guarantees that Python always receives Unicode strings (str) for VARCHAR columns, even when they contain bytes that are invalid in UTF-8. [1] [2] [3] [4]SQL_C_CHARand decode as UTF-8, but the code is refactored for clarity and consistency with the Windows path. [1] [2] [3] [4] [5]Testing improvements:
str(notbytes) on all platforms and via all fetch methods (fetchone,fetchall,fetchmany).These changes ensure robust, predictable Unicode handling for VARCHAR columns across all supported platforms, and the expanded test suite helps prevent regressions in encoding/decoding behavior.