FEAT: Add ActiveDirectoryMSI support for bulk copy#573
Conversation
Adds Authentication=ActiveDirectoryMSI to the auth pipeline: - Zero-arg ManagedIdentityCredential() for system-assigned MSI. - ManagedIdentityCredential(client_id=UID) for user-assigned MSI, matching ODBC's convention where UID carries the identity's client_id under MSI. - Threads optional credential_kwargs through get_auth_token / get_raw_token / _acquire_token so future auth methods that need constructor args (e.g. ClientSecretCredential) can plug in via the same channel. - Cache key remains a plain string for zero-arg auth types and becomes a tuple when kwargs are present, so different client_ids get separate cached credentials. Partial fix for #534. ServicePrincipal and Password to follow as separate PRs.
There was a problem hiding this comment.
Pull request overview
Adds support for Authentication=ActiveDirectoryMSI (managed identity) token acquisition for the bulk copy (mssql-py-core) path, including user-assigned MSI via UID=<client_id>.
Changes:
- Add
AuthType.MSI(activedirectorymsi) and map it toManagedIdentityCredential. - Thread optional
credential_kwargsthrough token acquisition and update the credential-instance cache key to incorporate kwargs. - Attempt to extract MSI
client_idfrom the connection string and pass it into bulk copy token acquisition; add tests and a changelog entry.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
mssql_python/constants.py |
Adds AuthType.MSI enum value. |
mssql_python/auth.py |
Adds MSI credential support, credential kwargs plumbing, and cache key changes; adds extract_credential_kwargs. |
mssql_python/cursor.py |
Bulk copy now tries to extract MSI credential kwargs and pass them to get_raw_token. |
tests/test_008_auth.py |
Adds tests for MSI auth type parsing, credential construction, cache isolation, and connection-string processing. |
CHANGELOG.md |
Documents MSI support for bulk copy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Connection.__init__ overwrites self.connection_str with the sanitized (UID-stripped) string returned by process_connection_string. The original implementation re-parsed self.connection_str at bulkcopy time via extract_credential_kwargs, which silently dropped the user-assigned MSI client_id and degraded to system-assigned a wrong-identity bug.MSI Changes: - process_connection_string now returns a 4-tuple including the captured credential_kwargs so callers can persist them. - Connection.__init__ stores _credential_kwargs alongside _auth_type. - cursor.bulkcopy() reads self.connection._credential_kwargs instead of re-parsing self.connection_str. - The public extract_credential_kwargs helper is removed (it only existed to support the broken re-parse path; nothing else needs it). - black --line-length=100 reformats (CI was red). Tests: - test_bulkcopy_path_preserves_user_assigned_msi_client_id: invokes cursor.bulkcopy() with a mocked mssql_py_core, patches AADAuth.get_raw_token to capture the args it receives, and asserts the captured credential_kwargs match Connection._credential_kwargs. Fails if cursor reverts to re-parsing self.connection.connection_str. - test_credential_kwargs_persisted_for_user_assigned_msi: asserts Connection.__init__ stores _credential_kwargs from the 4-tuple. - test_credential_kwargs_none_for_system_assigned_msi. - test_credential_kwargs_none_for_non_msi_auth. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bbda2a2 to
3ca2165
Compare
| identity's client_id. Returns None for system-assigned MSI. | ||
| """ | ||
| for param in parameters: | ||
| key, _, value = param.strip().partition("=") |
There was a problem hiding this comment.
Is it possible to get this from the conn str parser map? Assuming we need the conn str UID
There was a problem hiding this comment.
I am not sure what will happen if someone provides a UID={hello=world}
There was a problem hiding this comment.
thanks - missed this, parser should be used as a standard across
that handles {hello=world} correctly
Per @saurabh500's review: braced ODBC values like UID={hello=world} need the canonical parser, not naive partition('='). Without this, the helper returns '{hello=world}' verbatim and ManagedIdentityCredential rejects it. Worse, a UID containing a literal ';' would be truncated. _extract_msi_client_id now delegates to _ConnectionStringParser, which handles braces, escaped '}}' inside braces, and '=' inside braced values correctly. validate_keywords=False so the helper never raises on keys the auth flow doesn't care about. Tests: - test_msi_braced_uid_value_is_unwrapped: UID={hello=world} -> 'hello=world' - test_msi_braced_uid_with_semicolon_is_preserved: UID={abc;def;ghi} Note: process_connection_string and extract_auth_type still use naive split(';') for Authentication= detection across all Entra ID auth types. That's pre-existing and tracked separately for a parser-wide refactor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…microsoft/mssql-python into bewithgaurav/gh534-bulkcopy-msi
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Debug log distinguishing user-assigned vs system-assigned MSI when the user passes Authentication=ActiveDirectoryMSI. Helps diagnose which branch was taken when token acquisition fails. Logs client_id length, not value (still identity material). - Comment above _credential_cache explains the cache key shape so the unbounded growth is understood as a deliberate choice rather than an oversight. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%🔗 Quick Links
|
Connection.__init__ already parses the same connection string through _ConnectionStringParser via _construct_connection_string (connection.py line 253) before process_connection_string is ever called. By the time _extract_msi_client_id runs, the input is guaranteed parseable. The try/except was dead code. A real parse failure here would indicate an upstream bug and should propagate, not silently degrade user-assigned MSI to system-assigned (which is the wrong-identity failure mode this PR exists to prevent). Brings diff coverage to 100%. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| # time we get here the input is guaranteed parseable. No defensive | ||
| # try/except: a parse failure now means a real bug upstream and should | ||
| # propagate, not silently degrade user-assigned MSI to system-assigned. | ||
| parsed = _ConnectionStringParser(validate_keywords=False)._parse(connection_string) |
There was a problem hiding this comment.
Functionally correct but a tiny perf hit due to double parsing of connections string.
Perfect solution would require parsing done earlier in the connect api and passing around the map to this function. That's a bigger change. I recommend a follow up PR and tracking with a GH issue.
An adhoc fix is to maintain a hashmap of connection string and uid. But that's prone to other problems esp concurrency.
…tructure Brings in the ActiveDirectoryMSI feature (PR #573, mostly approved) so SP can use the shared module-level credential cache that PR #573 introduces, instead of the closure-scoped cache the SP commit shipped with. Conflicts resolved (kept both contributions): - mssql_python/constants.py: AuthType.MSI + AuthType.SERVICE_PRINCIPAL - mssql_python/auth.py: _parse_tenant_id + ServicePrincipalAuth + _extract_msi_client_id; both elif branches in process_auth_parameters; both keys in extract_auth_type's auth_map - mssql_python/cursor.py: SP factory branch on _auth_type=='serviceprincipal', Model A branch threads credential_kwargs through AADAuth.get_raw_token (MSI's contract change) - tests/test_008_auth.py: MockClientSecretCredential + MockManagedIdentityCredential in setup_azure_identity, both sets of TestProcessAuthParameters / TestExtractAuthType cases ServicePrincipalAuth.make_token_factory now uses the shared _credential_cache + _credential_cache_key (introduced for MSI), keyed by ('serviceprincipal', tenant_id, client_id). Closure-scoped cache and the TODO(#573-followup) note removed. client_secret is intentionally NOT in the cache key because credentials are looked up by identity, not secret; a rotated secret will surface as ClientAuthenticationError, not a silent hit on the wrong credential. Tests: 86 passed in tests/test_008_auth.py (was 75 SP-only). Black clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| # On Windows Interactive, process_connection_string returns None | ||
| # (DDBC handles auth natively), so fall back to the connection string. | ||
| self._auth_type = connection_result[2] or extract_auth_type(self.connection_str) | ||
| self._credential_kwargs = connection_result[3] |
There was a problem hiding this comment.
process_connection_string changed from 3-tuple to 4-tuple return. This is internal so it's fine, but worth noting in the docstring that callers must unpack all 4. If a third-party or test is doing a, b, c = process_connection_string(...), they'll get a ValueError with no obvious cause. The existing tests were all updated - just flagging for awareness.
There was a problem hiding this comment.
good call. the 4-tuple is interim. I have filed #580 to refactor process_connection_string to take the parsed connection-string map instead of the raw string (also avoids the 3x parse on the connect path). once that lands, the 4-tuple goes away and callers move to dict access.
for now, added a short docstring note about the 4-tuple to cover anyone reaching in before #580 lands.
Tracking refactor (parse-once, thread the parsed map through the auth path) is a separate follow-up; this docstring helps anyone reaching into the function before that lands. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Work Item / Issue Reference
Summary
Adds
Authentication=ActiveDirectoryMSIsupport to bulk copy. Partial fix for #534.ManagedIdentityCredential()for system-assigned MSI.ManagedIdentityCredential(client_id=UID)for user-assigned MSI. Matches ODBC convention whereUIDcarries the identity'sclient_idunder MSI.credential_kwargsthroughget_auth_token/get_raw_token/_acquire_tokenso future auth methods that need constructor args plug in via the same channel.(auth_type, sorted_kwargs)when kwargs are present, so differentclient_ids get separate cached credentials.mssql-py-core(the Rust path used by bulk copy) doesn't itself acquire Entra tokens. Python pre-acquires a JWT and passes it asaccess_token. Today this works forDefault,DeviceCode, andInteractive. MSI was missing, the most common Azure-hosted-service auth method.ServicePrincipal and Password are explicitly out of scope for this PR. They need different design work and ship separately.
How user-assigned MSI survives the bulkcopy fresh-token path
Connection.__init__callsprocess_connection_string, which sanitizesself.connection_str(stripsUID=,Authentication=,PWD=). Bulkcopy needs a fresh token at copy time. Re-parsingself.connection_strat that point would lose UID and silently degrade user-assigned MSI to system-assigned (wrong-identity bug).Fix shape:
process_connection_stringreturns a 4-tuple including the capturedcredential_kwargs.Connection.__init__persists_credential_kwargsalongside_auth_type.cursor.bulkcopy()readsself.connection._credential_kwargsdirectly. No re-parse, no chance for the sanitized string to drop theclient_id.Connection string examples