Skip to content

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495

Open
subrata-ms wants to merge 8 commits intomainfrom
subrata-ms/CP1252Encoding
Open

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495
subrata-ms wants to merge 8 commits intomainfrom
subrata-ms/CP1252Encoding

Conversation

@subrata-ms
Copy link
Copy Markdown
Contributor

@subrata-ms subrata-ms commented Apr 3, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

  • The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, mssql_python/connection.pyR264-R271)

  • All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py, [1] [2] [3]

C++ Binding and Processing Updates:

  • The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h, [1] [2] [3]

Test Adjustments:

  • Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py, [1] [2] [3]

@github-actions github-actions Bot added the pr-size: large Substantial code update label Apr 3, 2026
Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed
Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 3, 2026

📊 Code Coverage Report

🔥 Diff Coverage

84%


🎯 Overall Coverage

79%


📈 Total Lines Covered: 6902 out of 8719
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/connection.py (100%)
  • mssql_python/pybind/ddbc_bindings.cpp (87.9%): Missing lines 2060,3410-3415,3422-3433,3435-3439,4968-4970,5228-5229,5279-5281,5582-5583
  • mssql_python/pybind/ddbc_bindings.h (50.0%): Missing lines 844-853,859-863

Summary

  • Total: 311 lines
  • Missing: 49 lines
  • Coverage: 84%

mssql_python/pybind/ddbc_bindings.cpp

Lines 2056-2064

  2056                         size_t chunkBytes = DAE_CHUNK_SIZE;
  2057                         while (offset < totalBytes) {
  2058                             size_t len = std::min(chunkBytes, totalBytes - offset);
  2059 
! 2060                             rc = putData((SQLPOINTER)(dataPtr + offset), static_cast<SQLLEN>(len));
  2061                             if (!SQL_SUCCEEDED(rc)) {
  2062                                 LOG("SQLExecute: SQLPutData failed for "
  2063                                     "SQL_C_CHAR chunk - offset=%zu",
  2064                                     offset, totalBytes, len, rc);

Lines 3406-3419

  3406                                     "length=%lu",
  3407                                     i, (unsigned long)numCharsInData);
  3408                             } else {
  3409                                 // Buffer too small, fallback to streaming
! 3410                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3411                                     "truncated, using streaming LOB",
! 3412                                     i);
! 3413                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3414                                                               "utf-16le"));
! 3415                             }
  3416                         } else if (dataLen == SQL_NULL_DATA) {
  3417                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3418                             row.append(py::none());
  3419                         } else if (dataLen == 0) {

Lines 3418-3443

  3418                             row.append(py::none());
  3419                         } else if (dataLen == 0) {
  3420                             row.append(py::str(""));
  3421                         } else if (dataLen == SQL_NO_TOTAL) {
! 3422                             LOG("SQLGetData: Cannot determine data length "
! 3423                                 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3424                                 "returning NULL",
! 3425                                 i);
! 3426                             row.append(py::none());
! 3427                         } else if (dataLen < 0) {
! 3428                             LOG("SQLGetData: Unexpected negative data length "
! 3429                                 "for column %d - dataType=%d, dataLen=%ld",
! 3430                                 i, dataType, (long)dataLen);
! 3431                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3432                                               "data length");
! 3433                         }
  3434                     } else {
! 3435                         LOG("SQLGetData: Error retrieving data for column %d "
! 3436                             "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3437                             i, ret);
! 3438                         row.append(py::none());
! 3439                     }
  3440                 } else {
  3441                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3442                     //
  3443                     // Why this differs from SQLBindColums / FetchBatchData:

Lines 4964-4974

  4964                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4965                 break;
  4966             default:
  4967                 std::ostringstream errorString;
! 4968                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4969                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4970                             << (i + 1);
  4971                 LOG(errorString.str().c_str());
  4972                 ThrowStdException(errorString.str());
  4973                 break;
  4974         }

Lines 5224-5233

  5224                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5225                                                  sizeof(DateTimeOffset),
  5226                                                  buffers.indicators[idxCol].data());
  5227                             if (!SQL_SUCCEEDED(ret)) {
! 5228                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5229                                     idxCol + 1);
  5230                                 return ret;
  5231                             }
  5232                             break;
  5233                         }

Lines 5275-5285

  5275                     nullCounts[idxCol] += 1;
  5276                     continue;
  5277                 } else if (indicator < 0) {
  5278                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5279                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5280                         "Length - %lld",
! 5281                         idxCol + 1, dataType, (long long)indicator);
  5282                     ThrowStdException("Unexpected negative data length.");
  5283                 }
  5284                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5578-5587

  5578         arrowSchemaBatchCapsule =
  5579             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5580                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5581                 if (arrowSchema->release) {
! 5582                     arrowSchema->release(arrowSchema);
! 5583                 }
  5584                 delete arrowSchema;
  5585             });
  5586     } catch (...) {
  5587         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  840                 // Decode failed — fall back to returning the raw UTF-16LE bytes
  841                 // (consistent with the narrow-char path below and with
  842                 // FetchLobColumnData / SQLGetData_wrap which return raw bytes on
  843                 // decode failure instead of silently dropping data as None).
! 844                 PyErr_Clear();
! 845                 PyObject* pyBytes = PyBytes_FromStringAndSize(
! 846                     reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
! 847                 if (pyBytes) {
! 848                     PyList_SET_ITEM(row, col - 1, pyBytes);
! 849                 } else {
! 850                     PyErr_Clear();
! 851                     Py_INCREF(Py_None);
! 852                     PyList_SET_ITEM(row, col - 1, Py_None);
! 853                 }
  854             } else {
  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {

  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {
  858             // LOB / truncated: stream with SQL_C_WCHAR
! 859             PyList_SET_ITEM(row, col - 1,
! 860                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 861                                 .release()
! 862                                 .ptr());
! 863         }
  864         return;
  865     }
  866 
  867     // Original narrow-char path (charCtype == SQL_C_CHAR)


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 66.5%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.4%
mssql_python.pybind.connection.connection.cpp: 76.2%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@subrata-ms subrata-ms marked this pull request as ready for review April 3, 2026 08:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

  • Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
  • Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
  • Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
mssql_python/connection.py Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR.
mssql_python/cursor.py Passes updated decoding settings (encoding + ctype) into the native fetch functions.
mssql_python/pybind/ddbc_bindings.cpp Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching.
mssql_python/pybind/ddbc_bindings.h Extends per-column metadata and adds a wide-char branch in ProcessChar.
tests/test_013_encoding_decoding.py Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/connection.py
Comment thread mssql_python/pybind/ddbc_bindings.cpp Outdated
Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread tests/test_013_encoding_decoding.py Outdated
Comment thread tests/test_013_encoding_decoding.py
Comment thread mssql_python/pybind/ddbc_bindings.cpp
Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
row.append(py::none());
} else if (dataLen == 0) {
row.append(py::str(""));
} else if (dataLen == SQL_NO_TOTAL) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new wide-char SQLGetData path, when dataLen == SQL_NO_TOTAL , the code logs a message and appends py::none(). This is indistinguishable from a genuine SQL NULL to the caller.

The existing narrow-char path (and the LOB check at the top of the case block) correctly handles SQL_NO_TOTAL by falling through to FetchLobColumnData for streaming. Shouldn't the wide-char path should do the same?

PS: SQL_NO_TOTAL means the driver can't determine the length upfront - it does not mean the data is absent. Returning None here could silently loses data.

"data length");
}
} else {
LOG("SQLGetData: Error retrieving data for column %d "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here also, the user silently gets None with no way to know data retrieval failed.

In the narrow-char path (which correctly returns the error SQLRETURN and lets the caller handle it), the wide-char path should be consistent with it.

Or, if there's a specific reason to be lenient here, at minimum can we throw an exception so the error is observable? Silently converting errors to NULL values is a data integrity risk.

ConstantsDDBC.SQL_CHAR.value: {
"encoding": "utf-8",
"ctype": ConstantsDDBC.SQL_CHAR.value,
"encoding": "utf-16le",
Copy link
Copy Markdown
Contributor

@sumitmsft sumitmsft May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the default SQL_CHAR decoding from utf-8 + SQL_C_CHAR to utf-16le + SQL_C_WCHAR for all users on all platforms. While the fix is correct for the CP1252 issue, this is a silent behavioral change that affects every existing consumer.

Add a changelog / release notes entry explicitly calling out this default change and the motivation (CP1252 mismatch).

Copy link
Copy Markdown
Contributor

@sumitmsft sumitmsft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments. Otherwise, I am ok with this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: large Substantial code update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants