!PrintException -lines drops source-line bracket for unwound stack-trace methods in heap minidumps #5909

@max-charlamb

Summary

!PrintException -lines intermittently fails to resolve the [file @ N] source-line bracket for individual frames in an exception's StackTrace. The failure is deterministic for any given dump but appears flaky in CI because it depends on whether the affected method is currently on a live thread's stack and on the JIT's native-code offset for the exception throw point.

The sample repro below shows that the bug fires with high reliability when the throwing method has been unwound off the live stack and is large enough that its return IP lies in a different NibbleMap byte than its entry IP. This affects all CoreCLR runtimes when consuming heap-type (not heap2) minidumps.

Found after investigating CI failure on SOSExceptionTests.TaskNestedException(config: projectk.sdk.prebuilt.9.0.14) (Windows x86 Release leg, internal pipeline).

Reproducer

Debuggee (HeavyExn.csproj, net9.0/win-x86/Release)

using System;
using System.Runtime.CompilerServices;

internal static class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static long HeavyMethod(long start)
    {
        long x = start;
        // 400 lines of trivial arithmetic, expanding to ~30 KB of x86 codegen
        x = x * 31 + 0; if (x == long.MinValue) Environment.Exit(99);
        x = x * 31 + 1; if (x == long.MinValue) Environment.Exit(99);
        // ...
        x = x * 31 + 399; if (x == long.MinValue) Environment.Exit(99);
        throw new InvalidOperationException("thrown from end of HeavyMethod");
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Exception Capture(long seed)
    {
        try { HeavyMethod(seed); return null; }
        catch (Exception e) { return e; }
    }

    static void Main(string[] args)
    {
        long seed = args.Length > 0 ? long.Parse(args[0]) : 0;
        Exception inner = Capture(seed);
        throw new AggregateException("outer holds the inner", inner);
    }
}

Capture script

$env:DOTNET_DbgEnableMiniDump = '1'
$env:DOTNET_DbgMiniDumpType   = '2'  # Heap (MiniDumpWithPrivateReadWriteMemory)
$env:DOTNET_DbgMiniDumpName   = "C:\dumps\heavy.dmp"
.\HeavyExn.exe

Investigation

0:000> !PrintException -lines <innerExceptionAddr>
StackTrace (generated):
    SP       IP       Function
    02F7E580 08D8A4FB HeavyExn!HeavyExn.Program.HeavyMethod(Int64)+0x76e3                                  ← NO bracket
    02F7F218 08D82DEE HeavyExn!HeavyExn.Program.Capture(Int64)+0x26
                      [C:\...\Program.cs @ 418]                                                            ← bracket OK

Repro rate (Windows x86 Release)

Ran HeavyExn 30 times, capturing a heap dump and running !PrintException -lines on each via cdb + locally-built SOS:

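| Runtime   | DOTNET_EnableFastHeapDumps | HeavyMethod bracket present | Failure rate |
|-----------|----------------------------|-----------------------------|--------------|
| net9.0.17 | unset                      | 0 / 30                      | 100%         |
| net9.0.17 | 1 (env var ignored on 9.0) | 0 / 30                      | 100%         |
| net8.0.28 | unset                      | 0 / 30                      | 100%         |
| net8.0.28 | 1                          | 30 / 30                     | 0%           |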

The bug is deterministic for the same target binary. CI flakiness comes from method-size proximity to the NibbleMap-granularity boundary — TaskNestedException's RandomUserTask.InnerException() at +0x54 happens to straddle the boundary, while a larger method like HeavyMethod at +0x76e3 fails every time.

Root cause

On Windows, createdump is spawned and calls MiniDumpWriteDump(..., NULL, NULL, NULL). The OS dbgcore.dll recognizes the loaded coreclr.dll and invokes the CLR DAC's ICLRDataEnumMemoryRegions::EnumMemoryRegions to collect runtime-aware memory regions. On net9 with the heap dump type, this goes through ClrDataAccess::EnumMemoryRegions → EnumMemoryRegionsWorkerHeap(CLRDATA_ENUM_MEM_HEAP) (enummem.cpp).

EnumMemoryRegionsWorkerHeap then calls EnumMemDumpAllThreadsStack(flags), which for each managed thread:

  1. Walks the current stack and for live frames calls EECodeInfo(actualReturnIP) — this touches the NibbleMap byte for the return-IP and pulls it into the dump.
  2. Walks each exception's _stackTrace via DumpManagedExcepObject (source). For each StackTraceElement, it does:
for (size_t i = 0; i < stackTrace.Size(); i++)
{
    MethodDesc* pMD = stackTrace[i].pFunc;
    if (!DacHasMethodDescBeenEnumerated(pMD) && DacValidateMD(pMD))
    {
        pMD->EnumMemoryRegions(flags);
        FindLoadedMethodRefOrDef(pMD->GetMethodTable()->GetModule(), pMD->GetMemberDef());
        DebugInfoManager::EnumMemoryRegionsForMethodDebugInfo(flags, pMD);
        PCODE addr = pMD->GetNativeCode();                                  // ← METHOD ENTRY, not stackTrace[i].ip
        if (addr != (PCODE)NULL)
        {
            EECodeInfo codeInfo(addr);                                       // ← touches NibbleMap for ENTRY only
            if (codeInfo.IsValid())
            {
                IJitManager::MethodRegionInfo methodRegionInfo = { 0 };
                codeInfo.GetMethodRegionInfo(&methodRegionInfo);
            }
        }
    }
    DacEnumCodeForStackwalk(stackTrace[i].ip);                              // small window around IP
}

The EECodeInfo is constructed from pMD->GetNativeCode() (the method entry), not from stackTrace[i].ip (the actual return IP). EECodeInfo::Init walks the RangeSection and reads the NibbleMap byte covering the entry. For methods larger than NIBBLE_GRANULARITY (~16 bytes per byte / 256 bytes per cache line on x86), the byte for the entry and the byte for entry+0x54 (or entry+0x76e3) live at different addresses, so only the entry-IP byte is pulled in.
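The address arithmetic can be sketched with a toy model. This is not the runtime's NibbleMap code: it assumes a hypothetical layout in which every map cell covers a fixed `kBytesPerCell` bytes of code, and both that constant and the names `CellIndex` and `SameCell` are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: assumes each map cell covers a fixed number of code bytes.
// kBytesPerCell is an illustrative assumption, not the runtime's constant.
constexpr std::uint32_t kBytesPerCell = 256;

// Index of the map cell that covers `ip`, relative to the code heap base.
std::uint32_t CellIndex(std::uint32_t heapBase, std::uint32_t ip)
{
    assert(ip >= heapBase);
    return (ip - heapBase) / kBytesPerCell;
}

// True when the entry and the return IP are covered by the same cell, i.e.
// touching the entry's cell would also capture the return IP's cell.
bool SameCell(std::uint32_t heapBase, std::uint32_t entry, std::uint32_t returnIp)
{
    return CellIndex(heapBase, entry) == CellIndex(heapBase, returnIp);
}
```

With a made-up heap base of 0x09040000, the entry 0x090419C0 and the return IP 0x09041A14 (entry+0x54) from the cdb session land in different cells under this model, which is the situation in which only the entry's cell would be captured.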

DacEnumCodeForStackwalk(stackTrace[i].ip) does not touch the NibbleMap for the IP; it grabs a small window of executable bytes.

Why live frames are unaffected

EnumMemWalkStackHelper calls EECodeInfo(addr) where addr = GetControlPC(&regDisp) — the actual return address. The NibbleMap byte for that exact return IP gets captured.

Why small methods are unaffected

If method size ≤ NibbleMap granularity, the entry NibbleMap byte covers all return IPs within the method too.

What !PrintException -lines does

Per-frame, it calls GetLineByOffset(ste.ip) → ConvertNativeToIlOffset(ip) → GetClrMethodInstance(ip) → DAC IXCLRDataProcess::StartEnumMethodInstancesByAddress(ip) → ExecutionManager::GetCodeMethodDesc(ip) → FindCodeRange(ip) → EEJitManager::FindMethodCode, which reads the NibbleMap byte for ip. If that byte isn't in the dump, this returns FALSE → GetClrMethodInstance fails → no IL-offset mapping → no [file @ N] bracket.
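The effect of a missing map byte on the output can be illustrated with a small model of dump memory. This is a sketch under stated assumptions: `SparseDump` and `FormatFrame` are hypothetical names, the addresses are made up, and the real lookup chain involves far more than a single read.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Toy model of dump memory: only words that were captured are readable.
class SparseDump
{
public:
    void Capture(std::uint32_t address, std::uint32_t value) { words_[address] = value; }

    // Returns false for memory absent from the dump (shown by cdb as "????????").
    bool TryRead(std::uint32_t address, std::uint32_t* out) const
    {
        assert(out != nullptr);
        auto it = words_.find(address);
        if (it == words_.end()) return false;
        *out = it->second;
        return true;
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> words_;
};

// Formats one frame: the source bracket is appended only when the map word
// needed to resolve the IP is readable. `mapWordForIp` is a hypothetical
// stand-in for the address the real lookup would read.
std::string FormatFrame(const SparseDump& dump, const std::string& function,
                        std::uint32_t mapWordForIp, int line)
{
    std::uint32_t word = 0;
    if (!dump.TryRead(mapWordForIp, &word))
        return function;  // lookup failed: frame is printed without a bracket
    return function + " [Program.cs @ " + std::to_string(line) + "]";
}
```

A frame whose map word was captured gets its bracket; a frame whose map word is missing is printed bare, matching the shape of the !PrintException -lines output above.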

Confirming the mechanism with cdb

0:000> dd 0907E910 L8       # RealCodeHeader for InnerException
0907e910  0907e938 00000000 0907e920 090828d4
0907e920  ???????? ???????? ???????? ????????   ← GCInfo blob, NOT captured

0:000> dd 08A3AB1C L1       # NibbleMap byte for return IP 0x09041A14
08a3ab1c  ????????                              ← NibbleMap entry, NOT captured

0:000> !IP2MD 090419C0      # method entry
MethodDesc: 090828d4
Source file: ...RandomUserTask.cs @ 37          ← entry-NibbleMap-cell IS captured

0:000> !IP2MD 09041A14      # return IP
Failed to request MethodData, not in JIT code range

The DAC has only the method entry's NibbleMap byte — not the one for the return IP.

Fix options

A. SOS-side fallback (preferred)

Thread an optional MethodDesc hint through ConvertNativeToIlOffset. When GetClrMethodInstance(ip) (the IP→RangeSection lookup) fails and the caller supplied a hint, derive an IXCLRDataMethodInstance from the MD instead:

HRESULT
ConvertNativeToIlOffset(ULONG64 nativeOffset, BOOL bAdjust,
                        ULONG64 methodDescHint,           // NEW
                        IXCLRDataModule** ppModule,
                        mdMethodDef* methodToken, PULONG32 methodOffs)
{
    ToRelease<IXCLRDataMethodInstance> pMethodInst(NULL);
    HRESULT Status = GetClrMethodInstance(nativeOffset, &pMethodInst);
    if (Status != S_OK && methodDescHint != 0)
    {
        // Bypass the RangeSection walk via MD → Module → MethodDef → EnumInstance
        Status = GetClrMethodInstanceFromMethodDesc(methodDescHint, &pMethodInst);
    }
    if (FAILED(Status)) return Status;
    // ... rest unchanged: GetILOffsetsByAddress / GetTokenAndScope / etc.
}

FormatException passes (ULONG64)ste.pFunc as the hint.
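The control flow of the fallback can be exercised in isolation with a toy version. Everything here is a hypothetical stand-in: `Result` replaces HRESULT, and the two `Lookup` callbacks replace GetClrMethodInstance and the proposed GetClrMethodInstanceFromMethodDesc; none of these are the SOS types.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Toy result codes standing in for HRESULT values.
enum class Result { Ok, Fail };

// A lookup takes a key (an IP or a MethodDesc address) and yields a method id.
using Lookup = std::function<Result(std::uint64_t key, int* methodId)>;

// Try the IP-keyed lookup first; fall back to the MethodDesc-keyed lookup only
// when the first one fails and the caller supplied a non-zero hint.
Result ResolveMethod(std::uint64_t ip, std::uint64_t methodDescHint,
                     const Lookup& byIp, const Lookup& byMethodDesc,
                     int* methodId)
{
    assert(methodId != nullptr);
    Result status = byIp(ip, methodId);
    if (status != Result::Ok && methodDescHint != 0)
        status = byMethodDesc(methodDescHint, methodId);
    return status;
}
```

Keeping the IP-keyed path first means the hint is consulted only when the primary lookup fails, and callers that pass a hint of 0 see the same behavior as before.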

Pros:

  • ~30 LOC in SOS only, no DAC/runtime changes
  • Works on every existing dump in the wild (no runtime/createdump update needed)
  • Backward compatible (existing callers default the param to 0)
  • Preserves IP-keyed path as the primary (it's more precise for tiered methods when memory is available)

Cons:

  • Resolves @ 37 instead of @ 38 in the failing TaskNestedException case (off by one due to sequence-point boundary differences when computing IL offset on the MD-derived instance — needs investigation, or accept it and loosen the test regex)
  • Only applies when caller has a MD (i.e. exception stack-trace path); doesn't help generic GetLineByOffset(ip) callers

Validated locally: 100% → 0% failure on HeavyExn repro; CI dump now resolves [file @ 37].

B. Runtime-side: extra EECodeInfo enumeration in DAC for stack-trace IPs

Modify the DAC's DumpManagedExcepObject loop to also call EECodeInfo(stackTrace[i].ip) (using the actual return IP, not the entry). This touches the correct NibbleMap byte and pulls it into the dump.

for (size_t i = 0; i < stackTrace.Size(); i++)
{
    // ... existing ...
    if (pMD->GetNativeCode() != (PCODE)NULL)
    {
        EECodeInfo entryCodeInfo(pMD->GetNativeCode());     // existing
        if (entryCodeInfo.IsValid()) entryCodeInfo.GetMethodRegionInfo(...);

        // NEW: touch the NibbleMap byte for the actual return IP
        EECodeInfo ipCodeInfo(stackTrace[i].ip);
        if (ipCodeInfo.IsValid()) ipCodeInfo.GetMethodRegionInfo(...);
    }
    DacEnumCodeForStackwalk(stackTrace[i].ip);
}
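As a rough illustration of what the extra enumeration buys, the following toy enumerator records which map cells would be captured with and without the return-IP touch. It assumes a fixed cell size (`kCell`, an illustrative value) and hypothetical `Frame` and `CapturedCells` names; it does not model the DAC.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// One stack-trace element in the toy model: method entry and return IP.
struct Frame
{
    std::uint32_t entry;
    std::uint32_t ip;
};

// Illustrative cell size; not the runtime's constant.
constexpr std::uint32_t kCell = 256;

// Returns the set of map cells a toy enumerator would capture. With
// touchReturnIp == false only the entry's cell is recorded; with true,
// the cell covering the return IP is recorded as well.
std::set<std::uint32_t> CapturedCells(const std::vector<Frame>& trace, bool touchReturnIp)
{
    std::set<std::uint32_t> cells;
    for (const Frame& f : trace)
    {
        assert(f.ip >= f.entry);
        cells.insert(f.entry / kCell);
        if (touchReturnIp)
            cells.insert(f.ip / kCell);
    }
    return cells;
}
```

For the frame with entry 0x090419C0 and return IP 0x09041A14, the cell covering the return IP appears in the captured set only when `touchReturnIp` is true.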

Pros:

  • Fixes the root cause for all consumers, not just !PrintException -lines
  • Trivial code change

Cons:

  • Runtime fix → requires backport to release/9.0 (likely won't happen) and won't help dumps already produced
  • Doesn't help dumps from older runtimes

C. Test/runtime opt-in: use HEAP2 via DOTNET_EnableFastHeapDumps

Set DOTNET_EnableFastHeapDumps=1 on the debuggee for net8 / net10 test configurations. EEJitManager::EnumMemoryRegions under CLRDATA_ENUM_MEM_HEAP2 dumps every code heap's entire NibbleMap (heap->pHdrMap) wholesale, eliminating the gap.

- DOTNET_DbgEnableMiniDump=1, DOTNET_DbgMiniDumpType=2
+ DOTNET_DbgEnableMiniDump=1, DOTNET_DbgMiniDumpType=2, DOTNET_EnableFastHeapDumps=1

Pros:

  • Validated: 0% failure on net8 with this env var
  • No SOS or DAC change

Cons:

  • Doesn't fix net9: the g_EnableFastHeapDumps global was added in 8.0 and 10.0 but never backported to 9.0 (net9 enummem.cpp has no reference to it)
  • Doesn't fix dumps already produced by users in the field
  • Adds environment-dependent test behavior
  • Doesn't help users debugging their own production dumps
