FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 by subrata-ms · Pull Request #495 · microsoft/mssql-python

subrata-ms · 2026-04-03T06:43:52Z

Work Item / Issue Reference

AB#43177

GitHub Issue: #468

Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, mssql_python/connection.pyR264-R271)
All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py, [1] [2] [3]

C++ Binding and Processing Updates:

The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h, [1] [2] [3]

Test Adjustments:

Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py, [1] [2] [3]

github-actions · 2026-04-03T07:43:00Z

📊 Code Coverage Report

🔥 Diff Coverage 84%	🎯 Overall Coverage 79%	📈 Total Lines Covered: `6902` out of `8719` 📁 Project: `mssql-python`

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

mssql_python/connection.py (100%)
mssql_python/pybind/ddbc_bindings.cpp (87.9%): Missing lines 2060,3410-3415,3422-3433,3435-3439,4968-4970,5228-5229,5279-5281,5582-5583
mssql_python/pybind/ddbc_bindings.h (50.0%): Missing lines 844-853,859-863

Summary

Total: 311 lines
Missing: 49 lines
Coverage: 84%

mssql_python/pybind/ddbc_bindings.cpp

Lines 2056-2064

  2056                         size_t chunkBytes = DAE_CHUNK_SIZE;
  2057                         while (offset < totalBytes) {
  2058                             size_t len = std::min(chunkBytes, totalBytes - offset);
  2059 
! 2060                             rc = putData((SQLPOINTER)(dataPtr + offset), static_cast<SQLLEN>(len));
  2061                             if (!SQL_SUCCEEDED(rc)) {
  2062                                 LOG("SQLExecute: SQLPutData failed for "
  2063                                     "SQL_C_CHAR chunk - offset=%zu",
  2064                                     offset, totalBytes, len, rc);

Lines 3406-3419

  3406                                     "length=%lu",
  3407                                     i, (unsigned long)numCharsInData);
  3408                             } else {
  3409                                 // Buffer too small, fallback to streaming
! 3410                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3411                                     "truncated, using streaming LOB",
! 3412                                     i);
! 3413                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3414                                                               "utf-16le"));
! 3415                             }
  3416                         } else if (dataLen == SQL_NULL_DATA) {
  3417                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3418                             row.append(py::none());
  3419                         } else if (dataLen == 0) {

Lines 3418-3443

  3418                             row.append(py::none());
  3419                         } else if (dataLen == 0) {
  3420                             row.append(py::str(""));
  3421                         } else if (dataLen == SQL_NO_TOTAL) {
! 3422                             LOG("SQLGetData: Cannot determine data length "
! 3423                                 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3424                                 "returning NULL",
! 3425                                 i);
! 3426                             row.append(py::none());
! 3427                         } else if (dataLen < 0) {
! 3428                             LOG("SQLGetData: Unexpected negative data length "
! 3429                                 "for column %d - dataType=%d, dataLen=%ld",
! 3430                                 i, dataType, (long)dataLen);
! 3431                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3432                                               "data length");
! 3433                         }
  3434                     } else {
! 3435                         LOG("SQLGetData: Error retrieving data for column %d "
! 3436                             "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3437                             i, ret);
! 3438                         row.append(py::none());
! 3439                     }
  3440                 } else {
  3441                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3442                     //
  3443                     // Why this differs from SQLBindColums / FetchBatchData:

Lines 4964-4974

  4964                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4965                 break;
  4966             default:
  4967                 std::ostringstream errorString;
! 4968                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4969                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4970                             << (i + 1);
  4971                 LOG(errorString.str().c_str());
  4972                 ThrowStdException(errorString.str());
  4973                 break;
  4974         }

Lines 5224-5233

  5224                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5225                                                  sizeof(DateTimeOffset),
  5226                                                  buffers.indicators[idxCol].data());
  5227                             if (!SQL_SUCCEEDED(ret)) {
! 5228                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5229                                     idxCol + 1);
  5230                                 return ret;
  5231                             }
  5232                             break;
  5233                         }

Lines 5275-5285

  5275                     nullCounts[idxCol] += 1;
  5276                     continue;
  5277                 } else if (indicator < 0) {
  5278                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5279                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5280                         "Length - %lld",
! 5281                         idxCol + 1, dataType, (long long)indicator);
  5282                     ThrowStdException("Unexpected negative data length.");
  5283                 }
  5284                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5578-5587

  5578         arrowSchemaBatchCapsule =
  5579             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5580                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5581                 if (arrowSchema->release) {
! 5582                     arrowSchema->release(arrowSchema);
! 5583                 }
  5584                 delete arrowSchema;
  5585             });
  5586     } catch (...) {
  5587         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  840                 // Decode failed — fall back to returning the raw UTF-16LE bytes
  841                 // (consistent with the narrow-char path below and with
  842                 // FetchLobColumnData / SQLGetData_wrap which return raw bytes on
  843                 // decode failure instead of silently dropping data as None).
! 844                 PyErr_Clear();
! 845                 PyObject* pyBytes = PyBytes_FromStringAndSize(
! 846                     reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
! 847                 if (pyBytes) {
! 848                     PyList_SET_ITEM(row, col - 1, pyBytes);
! 849                 } else {
! 850                     PyErr_Clear();
! 851                     Py_INCREF(Py_None);
! 852                     PyList_SET_ITEM(row, col - 1, Py_None);
! 853                 }
  854             } else {
  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {

  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {
  858             // LOB / truncated: stream with SQL_C_WCHAR
! 859             PyList_SET_ITEM(row, col - 1,
! 860                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 861                                 .release()
! 862                                 .ptr());
! 863         }
  864         return;
  865     }
  866 
  867     // Original narrow-char path (charCtype == SQL_C_CHAR)

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)

mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 66.5%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.4%
mssql_python.pybind.connection.connection.cpp: 76.2%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary	📋 Coverage Details
View Azure DevOps Build	Browse Full Coverage Report

Copilot

Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`mssql_python/connection.py`	Changes default decoding settings for `SQL_CHAR` to UTF-16LE/`SQL_WCHAR`.
`mssql_python/cursor.py`	Passes updated decoding settings (encoding + ctype) into the native fetch functions.
`mssql_python/pybind/ddbc_bindings.cpp`	Adds `charCtype` plumb-through and implements `SQL_C_WCHAR` paths for `SQLGetData` and bound-column fetching.
`mssql_python/pybind/ddbc_bindings.h`	Extends per-column metadata and adds a wide-char branch in `ProcessChar`.
`tests/test_013_encoding_decoding.py`	Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sumitmsft · 2026-05-08T09:36:27Z

+                            row.append(py::none());
+                        } else if (dataLen == 0) {
+                            row.append(py::str(""));
+                        } else if (dataLen == SQL_NO_TOTAL) {


In the new wide-char SQLGetData path, when dataLen == SQL_NO_TOTAL , the code logs a message and appends py::none(). This is indistinguishable from a genuine SQL NULL to the caller.

The existing narrow-char path (and the LOB check at the top of the case block) correctly handles SQL_NO_TOTAL by falling through to FetchLobColumnData for streaming. Shouldn't the wide-char path should do the same?

PS: SQL_NO_TOTAL means the driver can't determine the length upfront - it does not mean the data is absent. Returning None here could silently loses data.

sumitmsft · 2026-05-08T13:51:20Z

+                                              "data length");
+                        }
+                    } else {
+                        LOG("SQLGetData: Error retrieving data for column %d "


Here also, the user silently gets None with no way to know data retrieval failed.

In the narrow-char path (which correctly returns the error SQLRETURN and lets the caller handle it), the wide-char path should be consistent with it.

Or, if there's a specific reason to be lenient here, at minimum can we throw an exception so the error is observable? Silently converting errors to NULL values is a data integrity risk.

sumitmsft · 2026-05-08T14:04:59Z

            ConstantsDDBC.SQL_CHAR.value: {
-                "encoding": "utf-8",
-                "ctype": ConstantsDDBC.SQL_CHAR.value,
+                "encoding": "utf-16le",


This changes the default SQL_CHAR decoding from utf-8 + SQL_C_CHAR to utf-16le + SQL_C_WCHAR for all users on all platforms. While the fix is correct for the CP1252 issue, this is a silent behavioral change that affects every existing consumer.

Add a changelog / release notes entry explicitly calling out this default change and the motivation (CP1252 mismatch).

sumitmsft

Left some minor comments. Otherwise, I am ok with this change

base code changes

14d1f1d

github-actions Bot added the pr-size: large Substantial code update label Apr 3, 2026

github-advanced-security AI found potential problems Apr 3, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed

Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed

subrata-ms and others added 2 commits April 3, 2026 12:32

Merge branch 'main' into subrata-ms/CP1252Encoding

8817007

code fix for linux based system

597ee4c

subrata-ms marked this pull request as ready for review April 3, 2026 08:10

subrata-ms requested review from Copilot, gargsaumya and sumitmsft April 3, 2026 08:10

Copilot started reviewing on behalf of subrata-ms April 3, 2026 08:10 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

Comment thread mssql_python/connection.py

Comment thread mssql_python/pybind/ddbc_bindings.cpp Outdated

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated

Comment thread tests/test_013_encoding_decoding.py Outdated

subrata-ms mentioned this pull request Apr 3, 2026

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 #478

Closed

gargsaumya reviewed Apr 8, 2026

View reviewed changes

Comment thread tests/test_013_encoding_decoding.py

Comment thread mssql_python/pybind/ddbc_bindings.cpp

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated

subrata-ms and others added 4 commits May 6, 2026 20:21

Merge branch 'main' into subrata-ms/CP1252Encoding

fcf88c0

Merge branch 'main' into subrata-ms/CP1252Encoding

7b225bc

fixing PR comments

5e6dc19

fixing external set scenario

6dec9bf

jahnvi480 mentioned this pull request May 8, 2026

FIX: Fetch VARCHAR UTF-8 collation columns as SQL_C_WCHAR on Windows to prevent lossy ACP conversion #543

Closed

test fix

6818f7e

sumitmsft reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495
subrata-ms wants to merge 8 commits intomainfrom
subrata-ms/CP1252Encoding

subrata-ms commented Apr 3, 2026 •

edited by azure-boards Bot

Loading

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 3, 2026 •

edited

Loading

🔥 Diff Coverage

84%

🎯 Overall Coverage

79%

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sumitmsft May 8, 2026

Uh oh!

sumitmsft May 8, 2026

Uh oh!

sumitmsft May 8, 2026 •

edited

Loading

Uh oh!

sumitmsft left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

subrata-ms commented Apr 3, 2026 • edited by azure-boards Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Work Item / Issue Reference

Summary

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Code Coverage Report

🔥 Diff Coverage

84%

🎯 Overall Coverage

79%

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

Summary

mssql_python/pybind/ddbc_bindings.cpp

mssql_python/pybind/ddbc_bindings.h

📋 Files Needing Attention

🔗 Quick Links

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sumitmsft May 8, 2026

Choose a reason for hiding this comment

Uh oh!

sumitmsft May 8, 2026

Choose a reason for hiding this comment

Uh oh!

sumitmsft May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sumitmsft left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

subrata-ms commented Apr 3, 2026 •

edited by azure-boards Bot

Loading

github-actions Bot commented Apr 3, 2026 •

edited

Loading

sumitmsft May 8, 2026 •

edited

Loading