Skip to content

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478

Open
subrata-ms wants to merge 7 commits intomainfrom
subrata-ms/cp1252_encoding
Open

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478
subrata-ms wants to merge 7 commits intomainfrom
subrata-ms/cp1252_encoding

Conversation

@subrata-ms
Copy link
Copy Markdown
Contributor

@subrata-ms subrata-ms commented Mar 19, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request makes significant improvements to how VARCHAR (and related CHAR) columns are fetched and decoded, especially for Windows platforms, to ensure consistent Unicode string handling across all operating systems. The changes unify the decoding behavior for problematic CP1252 byte values, preventing platform-specific inconsistencies and ensuring that VARCHAR columns always return Python str objects, not bytes, regardless of the server collation or platform. The update also adds comprehensive tests to verify this behavior.

Cross-platform VARCHAR decoding and buffer management:

  • On Windows, VARCHAR columns are now always fetched as SQL_C_WCHAR, ensuring the ODBC driver converts from the server's native encoding (such as CP1252) to UTF-16, matching the behavior of NVARCHAR columns and eliminating the need to guess the server code page. This change guarantees that Python always receives Unicode strings (str) for VARCHAR columns, even when they contain bytes that are invalid in UTF-8. [1] [2] [3] [4]
  • On Linux/macOS, the logic for buffer allocation and decoding remains unchanged, continuing to fetch as SQL_C_CHAR and decode as UTF-8, but the code is refactored for clarity and consistency with the Windows path. [1] [2] [3] [4] [5]

Testing improvements:

  • Adds 23 new parameterized tests to verify that all problematic CP1252 byte values in VARCHAR columns are always decoded to str (not bytes) on all platforms and via all fetch methods (fetchone, fetchall, fetchmany).
  • Adds tests to ensure that embedded problematic bytes and explicit CP1252 decoding are handled correctly, further validating the new cross-platform consistency.
  • Updates the test count in the encoding/decoding test module to reflect the increased coverage.

These changes ensure robust, predictable Unicode handling for VARCHAR columns across all supported platforms, and the expanded test suite helps prevent regressions in encoding/decoding behavior.

@github-actions github-actions bot added the pr-size: medium Moderate update size label Mar 19, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 19, 2026

📊 Code Coverage Report

🔥 Diff Coverage

11%


🎯 Overall Coverage

77%


📈 Total Lines Covered: 5759 out of 7465
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (11.1%): Missing lines 3131,3133-3134,3278,3280-3281,4012-4025,4033-4053,4056-4062

Summary

  • Total: 54 lines
  • Missing: 48 lines
  • Coverage: 11%

mssql_python/pybind/ddbc_bindings.cpp

Lines 3127-3138

  3127                             // the streaming LOB path to retrieve the full
  3128                             // value.
  3129                             LOG("SQLGetData: Cannot determine data length "
  3130                                 "(SQL_NO_TOTAL) for column %d (SQL_CHAR), "
! 3131                                 "falling back to LOB streaming",
  3132                                 i);
! 3133                             row.append(FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false,
! 3134                                                           charEncoding));
  3135                             // LCOV_EXCL_STOP
  3136                         } else if (dataLen < 0) {
  3137                             LOG("SQLGetData: Unexpected negative data length "
  3138                                 "for column %d - dataType=%d, dataLen=%ld",

Lines 3274-3285

  3274                             // the streaming LOB path to retrieve the full
  3275                             // value.
  3276                             LOG("SQLGetData: Cannot determine NVARCHAR data "
  3277                                 "length (SQL_NO_TOTAL) for column %d, "
! 3278                                 "falling back to LOB streaming",
  3279                                 i);
! 3280                             row.append(
! 3281                                 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
  3282                             // LCOV_EXCL_STOP
  3283                         } else if (dataLen < 0) {
  3284                             LOG("SQLGetData: Unexpected negative data length "
  3285                                 "for column %d (NVARCHAR) - dataLen=%ld",

Lines 4008-4029

  4008                 //
  4009                 // Coverage note: This branch is defensive — the ODBC driver
  4010                 // controls indicator values in bound-column mode, so
  4011                 // SQL_NO_TOTAL cannot be triggered from Python-level tests.
! 4012                 const ColumnInfoExt& noTotalColInfo = columnInfosExt[col - 1];
! 4013                 LOG("SQLGetData: SQL_NO_TOTAL for column %d (dataType=%d), "
! 4014                     "falling back to LOB streaming",
! 4015                     col, noTotalColInfo.dataType);
! 4016                 switch (noTotalColInfo.dataType) {
! 4017                     case SQL_CHAR:
! 4018                     case SQL_VARCHAR:
! 4019                     case SQL_LONGVARCHAR:
! 4020 #if defined(__APPLE__) || defined(__linux__)
! 4021                         PyList_SET_ITEM(row, col - 1,
! 4022                                         FetchLobColumnData(hStmt, col, SQL_C_CHAR, false, false,
! 4023                                                            noTotalColInfo.charEncoding)
! 4024                                             .release()
! 4025                                             .ptr());
  4026 #else
  4027                         PyList_SET_ITEM(
  4028                             row, col - 1,
  4029                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")

Lines 4029-4066

  4029                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
  4030                                 .release()
  4031                                 .ptr());
  4032 #endif
! 4033                         break;
! 4034                     case SQL_WCHAR:
! 4035                     case SQL_WVARCHAR:
! 4036                     case SQL_WLONGVARCHAR:
! 4037                     case SQL_SS_XML:
! 4038                         PyList_SET_ITEM(
! 4039                             row, col - 1,
! 4040                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 4041                                 .release()
! 4042                                 .ptr());
! 4043                         break;
! 4044                     case SQL_BINARY:
! 4045                     case SQL_VARBINARY:
! 4046                     case SQL_LONGVARBINARY:
! 4047                     case SQL_SS_UDT:
! 4048                         PyList_SET_ITEM(row, col - 1,
! 4049                                         FetchLobColumnData(hStmt, col, SQL_C_BINARY, false, true)
! 4050                                             .release()
! 4051                                             .ptr());
! 4052                         break;
! 4053                     default:
  4054                         // For fixed-length types SQL_NO_TOTAL should not
  4055                         // occur; treat as NULL to avoid undefined behaviour.
! 4056                         LOG("SQL_NO_TOTAL for unexpected fixed-type column %d "
! 4057                             "(dataType=%d), returning NULL",
! 4058                             col, noTotalColInfo.dataType);
! 4059                         Py_INCREF(Py_None);
! 4060                         PyList_SET_ITEM(row, col - 1, Py_None);
! 4061                         break;
! 4062                 }
  4063                 // LCOV_EXCL_STOP
  4064                 continue;
  4065             }


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@subrata-ms subrata-ms marked this pull request as ready for review March 19, 2026 09:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses cross-platform inconsistencies when fetching CP1252-collated VARCHAR data by changing the ODBC C-type used for VARCHAR retrieval on Windows, aiming to ensure consistent Unicode (str) results across Windows and Linux/macOS.

Changes:

  • On Windows, fetch/bind VARCHAR (and related SQL_CHAR types) as SQL_C_WCHAR and process via the wide-char path to avoid code page guessing and bytes fallback.
  • On Linux/macOS, clarify/standardize SQL_C_CHAR buffer sizing assumptions for UTF-8 expansion.
  • Add a new CP1252 regression test section covering problematic bytes and multiple fetch paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
mssql_python/pybind/ddbc_bindings.cpp Updates SQLGetData_wrap and batch fetch binding/processing to use SQL_C_WCHAR for VARCHAR on Windows; adjusts buffer sizing logic and LOB detection formatting.
tests/test_013_encoding_decoding.py Adds a new CP1252 VARCHAR consistency test suite intended to ensure VARCHAR returns str across platforms and fetch methods.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Copy Markdown
Contributor

@ffelixg ffelixg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, love to see this!

The encoding used by the odbc driver is actually not directly dependent on the operating system, instead it checks which locale is being used and encodes according to that. The difference between the operating systems is that windows uses cp1252 by default and linux uses utf-8. However, you can change the locale and make the driver return data in some other encoding. For example the current test passes on Linux with mssql_python 1.4.0, which shows that you can provoke the same issue on Linux:

def test_locale_varchar_decode_iso885915():
    assert locale.getlocale() == ('en_US', 'UTF-8')

    # important: change locale BEFORE connecting
    locale.setlocale(locale.LC_ALL, 'en_US.iso885915')
    connection = connect()
    cursor = connection.cursor()

    (val,) = cursor.execute("SELECT '€'").fetchone()
    # without changing locale, we would get val == '€'
    assert val == b'\xa4', val

    cursor.close()
    connection.close()

If we really want to continue to use SQL_C_CHAR for fetching varchar data, even if just on Linux, we would need to note down the current locale when connecting and only fetch SQL_C_CHAR if it is UTF-8 and SQL_C_WCHAR otherwise. This is made more complicated by connection pooling though. Apparently some pooling is already active by default, as followup connections still use the same locale, even if I change it beforehand.

On the flip side, if you do this, you can get rid of the platform specific code, because windows also allows changing the locale to UTF-8, which would enable SQL_C_CHAR there as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants