FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495
FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495subrata-ms wants to merge 8 commits intomainfrom
Conversation
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 2056-2064 2056 size_t chunkBytes = DAE_CHUNK_SIZE;
2057 while (offset < totalBytes) {
2058 size_t len = std::min(chunkBytes, totalBytes - offset);
2059
! 2060 rc = putData((SQLPOINTER)(dataPtr + offset), static_cast<SQLLEN>(len));
2061 if (!SQL_SUCCEEDED(rc)) {
2062 LOG("SQLExecute: SQLPutData failed for "
2063 "SQL_C_CHAR chunk - offset=%zu",
2064 offset, totalBytes, len, rc);Lines 3406-3419 3406 "length=%lu",
3407 i, (unsigned long)numCharsInData);
3408 } else {
3409 // Buffer too small, fallback to streaming
! 3410 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3411 "truncated, using streaming LOB",
! 3412 i);
! 3413 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3414 "utf-16le"));
! 3415 }
3416 } else if (dataLen == SQL_NULL_DATA) {
3417 LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
3418 row.append(py::none());
3419 } else if (dataLen == 0) {Lines 3418-3443 3418 row.append(py::none());
3419 } else if (dataLen == 0) {
3420 row.append(py::str(""));
3421 } else if (dataLen == SQL_NO_TOTAL) {
! 3422 LOG("SQLGetData: Cannot determine data length "
! 3423 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3424 "returning NULL",
! 3425 i);
! 3426 row.append(py::none());
! 3427 } else if (dataLen < 0) {
! 3428 LOG("SQLGetData: Unexpected negative data length "
! 3429 "for column %d - dataType=%d, dataLen=%ld",
! 3430 i, dataType, (long)dataLen);
! 3431 ThrowStdException("SQLGetData returned an unexpected negative "
! 3432 "data length");
! 3433 }
3434 } else {
! 3435 LOG("SQLGetData: Error retrieving data for column %d "
! 3436 "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3437 i, ret);
! 3438 row.append(py::none());
! 3439 }
3440 } else {
3441 // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
3442 //
3443 // Why this differs from SQLBindColums / FetchBatchData:Lines 4964-4974 4964 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
4965 break;
4966 default:
4967 std::ostringstream errorString;
! 4968 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4969 << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4970 << (i + 1);
4971 LOG(errorString.str().c_str());
4972 ThrowStdException(errorString.str());
4973 break;
4974 }Lines 5224-5233 5224 buffers.datetimeoffsetBuffers[idxCol].data(),
5225 sizeof(DateTimeOffset),
5226 buffers.indicators[idxCol].data());
5227 if (!SQL_SUCCEEDED(ret)) {
! 5228 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5229 idxCol + 1);
5230 return ret;
5231 }
5232 break;
5233 }Lines 5275-5285 5275 nullCounts[idxCol] += 1;
5276 continue;
5277 } else if (indicator < 0) {
5278 // Negative value is unexpected, log column index, SQL type & raise exception
! 5279 LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5280 "Length - %lld",
! 5281 idxCol + 1, dataType, (long long)indicator);
5282 ThrowStdException("Unexpected negative data length.");
5283 }
5284 auto dataLen = static_cast<uint64_t>(indicator);Lines 5578-5587 5578 arrowSchemaBatchCapsule =
5579 py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
5580 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
5581 if (arrowSchema->release) {
! 5582 arrowSchema->release(arrowSchema);
! 5583 }
5584 delete arrowSchema;
5585 });
5586 } catch (...) {
5587 arrowSchemaBatch->release(arrowSchemaBatch.get());mssql_python/pybind/ddbc_bindings.h📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 66.5%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.4%
mssql_python.pybind.connection.connection.cpp: 76.2%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%🔗 Quick Links
|
There was a problem hiding this comment.
Pull request overview
This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.
Changes:
- Update connection defaults so
SQL_CHARdecoding usesencoding="utf-16le"withctype=SQL_WCHAR. - Plumb
charCtypethrough cursor fetch APIs into the C++ bindings, enabling wide-char fetch forSQL_CHARcolumns. - Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
mssql_python/connection.py |
Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR. |
mssql_python/cursor.py |
Passes updated decoding settings (encoding + ctype) into the native fetch functions. |
mssql_python/pybind/ddbc_bindings.cpp |
Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching. |
mssql_python/pybind/ddbc_bindings.h |
Extends per-column metadata and adds a wide-char branch in ProcessChar. |
tests/test_013_encoding_decoding.py |
Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| row.append(py::none()); | ||
| } else if (dataLen == 0) { | ||
| row.append(py::str("")); | ||
| } else if (dataLen == SQL_NO_TOTAL) { |
There was a problem hiding this comment.
In the new wide-char SQLGetData path, when dataLen == SQL_NO_TOTAL , the code logs a message and appends py::none(). This is indistinguishable from a genuine SQL NULL to the caller.
The existing narrow-char path (and the LOB check at the top of the case block) correctly handles SQL_NO_TOTAL by falling through to FetchLobColumnData for streaming. Shouldn't the wide-char path should do the same?
PS: SQL_NO_TOTAL means the driver can't determine the length upfront - it does not mean the data is absent. Returning None here could silently loses data.
| "data length"); | ||
| } | ||
| } else { | ||
| LOG("SQLGetData: Error retrieving data for column %d " |
There was a problem hiding this comment.
Here also, the user silently gets None with no way to know data retrieval failed.
In the narrow-char path (which correctly returns the error SQLRETURN and lets the caller handle it), the wide-char path should be consistent with it.
Or, if there's a specific reason to be lenient here, at minimum can we throw an exception so the error is observable? Silently converting errors to NULL values is a data integrity risk.
| ConstantsDDBC.SQL_CHAR.value: { | ||
| "encoding": "utf-8", | ||
| "ctype": ConstantsDDBC.SQL_CHAR.value, | ||
| "encoding": "utf-16le", |
There was a problem hiding this comment.
This changes the default SQL_CHAR decoding from utf-8 + SQL_C_CHAR to utf-16le + SQL_C_WCHAR for all users on all platforms. While the fix is correct for the CP1252 issue, this is a silent behavioral change that affects every existing consumer.
Add a changelog / release notes entry explicitly calling out this default change and the motivation (CP1252 mismatch).
sumitmsft
left a comment
There was a problem hiding this comment.
Left some minor comments. Otherwise, I am ok with this change
Work Item / Issue Reference
Summary
This pull request updates the default handling of SQL
CHAR/VARCHARcolumns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.Default Encoding and Decoding Changes:
The default decoding for SQL
CHARcolumns is now set to use"utf-16le"encoding and theSQL_WCHARctype, replacing the previous"utf-8"/SQL_CHARdefaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, mssql_python/connection.pyR264-R271)All cursor fetch methods (
fetchone,fetchmany,fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetchingCHARdata, ensuring consistent behavior across platforms. (mssql_python/cursor.py, [1] [2] [3]C++ Binding and Processing Updates:
ColumnInfoExtstruct now tracks whether wide character (UTF-16) fetching is used for a column, and theProcessCharfunction is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h, [1] [2] [3]Test Adjustments:
"utf-16le"andSQL_WCHARas the default decoding settings forSQL_CHARcolumns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py, [1] [2] [3]