
feat: log environment info and IO throughput summary for debugging #490

Open
aftersnow wants to merge 5 commits into main from show-env

Conversation

Contributor

aftersnow commented Apr 7, 2026

Summary

Add diagnostic logging to help identify performance bottlenecks in push/pull/fetch operations:

  1. Log system environment info (CPU, memory, disk, cgroup) at startup
  2. Show IO throughput summary with bottleneck indicator after each transfer

Example Output

Push (100MB to registry, tested on Linux x86_64):

IO summary: 100.00 MB in 34.2s, 2.93 MB/s | disk read: 1723.95 MB/s | network write: 2.93 MB/s ← bottleneck

Pull (50MB from registry):

IO summary: 50.00 MB in 1.2s, 40.21 MB/s | network read: 360.24 MB/s | disk write: 36.60 MB/s ← bottleneck

Log file (structured, for automated analysis):

level=info msg="io throughput summary" bottleneck="network write" disk\ read="1723.95 MB/s" effectiveThroughput="2.93 MB/s" network\ write="2.93 MB/s" operation=push totalBytes="100.00 MB" wallClock=34.163s

How It Works

Wraps the source io.Reader with a countingReader using sync/atomic counters. Each Read() call records bytes and duration. TrackTransfer() measures per-goroutine wall-clock time. After all goroutines complete:

  • Source throughput = totalBytes / cumulative source Read() time
  • Sink throughput = totalBytes / (cumulative transfer time − source read time)
  • The slower side is marked as ← bottleneck

Labels are operation-aware: push shows disk read / network write, pull/fetch shows network read / disk write.

Changes

New packages:

  • pkg/envinfo/ — CPU, memory, disk usage, cgroup v1/v2 limits (Linux), virtual FS detection
  • pkg/iometrics/ — Tracker, countingReader, throughput formatting

Modified files:

  • cmd/root.go — call envinfo.LogEnvironment() at startup
  • cmd/build.go, cmd/pull.go — log disk info for work directories
  • pkg/backend/push.go, pull.go, fetch.go — integrate Tracker

Test Results

Unit tests (macOS, 17 tests):

ok  github.com/modelpack/modctl/pkg/iometrics  1.394s  (12 tests)
ok  github.com/modelpack/modctl/pkg/envinfo     1.660s  (5 tests)

Covers: concurrent aggregation, timing accuracy, bottleneck detection, error propagation, zero-value safety, data integrity, virtual FS detection.

Remote machine test (Linux x86_64, Intel Xeon E5-2640 v3):

# Push 100MB random data → registry
IO summary: 100.00 MB in 34.2s, 2.93 MB/s | disk read: 1723.95 MB/s | network write: 2.93 MB/s ← bottleneck

# Pull 50MB from registry → local disk
IO summary: 50.00 MB in 1.2s, 40.21 MB/s | network read: 360.24 MB/s | disk write: 36.60 MB/s ← bottleneck

Both correctly identified the bottleneck side.


gemini-code-assist bot left a comment


Code Review

This pull request introduces a new envinfo package that logs system environment details (CPU, memory, disk usage, and cgroup limits) and an iometrics package that tracks and summarizes IO throughput during push, pull, and fetch operations. The review identified several areas for improvement:

  • tryV2 and tryV1 in cgroup_linux.go should be refactored so memory limits are detected even when CPU limits are inaccessible
  • cpu.Percent usage in envinfo.go may return unhelpful data at startup
  • the isVirtualFS function appears to be dead code
  • the formatBytes utility is duplicated and should be consolidated or replaced with the existing go-humanize dependency

Comment on lines +52 to +81
cpuMax, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
if err != nil {
	return false
}

parts := strings.Fields(strings.TrimSpace(string(cpuMax)))
if len(parts) == 2 && parts[0] != "max" {
	quota, err1 := strconv.ParseFloat(parts[0], 64)
	period, err2 := strconv.ParseFloat(parts[1], 64)
	if err1 == nil && err2 == nil && period > 0 {
		limits.CPUQuota = quota / period
		limits.InCgroup = true
	}
}

memMax, err := os.ReadFile("/sys/fs/cgroup/memory.max")
if err != nil {
	return limits.InCgroup
}

memStr := strings.TrimSpace(string(memMax))
if memStr != "max" {
	memLimit, err := strconv.ParseUint(memStr, 10, 64)
	if err == nil {
		limits.MemLimit = memLimit
		limits.InCgroup = true
	}
}

return limits.InCgroup


medium

The tryV2 function returns early if it fails to read the CPU quota file. This prevents the function from checking for memory limits, which might still be present even if CPU limits are not configured or accessible. It's better to attempt reading both independently to ensure comprehensive limit detection.

if cpuMax, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
	parts := strings.Fields(strings.TrimSpace(string(cpuMax)))
	if len(parts) == 2 && parts[0] != "max" {
		quota, err1 := strconv.ParseFloat(parts[0], 64)
		period, err2 := strconv.ParseFloat(parts[1], 64)
		if err1 == nil && err2 == nil && period > 0 {
			limits.CPUQuota = quota / period
			limits.InCgroup = true
		}
	}
}

if memMax, err := os.ReadFile("/sys/fs/cgroup/memory.max"); err == nil {
	memStr := strings.TrimSpace(string(memMax))
	if memStr != "max" {
		if memLimit, err := strconv.ParseUint(memStr, 10, 64); err == nil {
			limits.MemLimit = memLimit
			limits.InCgroup = true
		}
	}
}

Comment on lines +86 to +125
quotaBytes, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
if err != nil {
	return
}

quota, err := strconv.ParseFloat(strings.TrimSpace(string(quotaBytes)), 64)
if err != nil || quota <= 0 {
	// -1 means no limit.
	return
}

periodBytes, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
if err != nil {
	return
}

period, err := strconv.ParseFloat(strings.TrimSpace(string(periodBytes)), 64)
if err != nil || period <= 0 {
	return
}

limits.CPUQuota = quota / period
limits.InCgroup = true

// cgroup v1: Memory limit.
memBytes, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
if err != nil {
	return
}

memLimit, err := strconv.ParseUint(strings.TrimSpace(string(memBytes)), 10, 64)
if err != nil {
	return
}

// Very large values (like 2^63) indicate no limit.
const noLimitThreshold = 1 << 62
if memLimit < noLimitThreshold {
	limits.MemLimit = memLimit
}


medium

In tryV1, the function returns early if any CPU-related file is missing or if the quota is not set, which skips the memory limit detection. Additionally, limits.InCgroup should be set to true if a memory limit is successfully detected, even if no CPU limit is found.

if quotaBytes, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"); err == nil {
	if quota, err := strconv.ParseFloat(strings.TrimSpace(string(quotaBytes)), 64); err == nil && quota > 0 {
		if periodBytes, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.cfs_period_us"); err == nil {
			if period, err := strconv.ParseFloat(strings.TrimSpace(string(periodBytes)), 64); err == nil && period > 0 {
				limits.CPUQuota = quota / period
				limits.InCgroup = true
			}
		}
	}
}

// cgroup v1: Memory limit.
if memBytes, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes"); err == nil {
	if memLimit, err := strconv.ParseUint(strings.TrimSpace(string(memBytes)), 10, 64); err == nil {
		// Very large values (like 2^63) indicate no limit.
		const noLimitThreshold = 1 << 62
		if memLimit < noLimitThreshold {
			limits.MemLimit = memLimit
			limits.InCgroup = true
		}
	}
}

	fields["model"] = infos[0].ModelName
}

percents, err := cpu.Percent(0, false)


medium

cpu.Percent(0, false) returns the CPU usage since the last call. When called for the first time at process startup with a 0 interval, it typically returns 0.0% or an error, which doesn't provide useful information about the system's current load. Consider using a small interval (e.g., 100ms) to get a meaningful snapshot, or omit this field if startup delay is a concern.

Comment on lines +144 to +168
func isVirtualFS(fstype, device string) bool {
	fstypeLower := strings.ToLower(fstype)

	virtualFSTypes := []string{
		"fuse", "nfs", "cifs", "smb", "smbfs",
		"tmpfs", "ramfs", "devtmpfs",
		"overlay", "aufs",
		"sshfs", "s3fs", "gcsfuse", "ossfs",
		"9p", "virtiofs",
	}

	for _, vfs := range virtualFSTypes {
		if fstypeLower == vfs || strings.HasPrefix(fstypeLower, vfs+".") {
			return true
		}
	}

	// Devices not under /dev/ are generally not block devices.
	// e.g., "s3fs", "sshfs#user@host:", "server:/export"
	if device != "" && !strings.HasPrefix(device, "/dev/") {
		return true
	}

	return false
}


medium

The isVirtualFS function is defined and tested but appears to be unused in the current implementation. If it was intended to filter disk statistics in LogDiskInfo or elsewhere, please integrate it; otherwise, it should be removed to avoid dead code.

Comment on lines +170 to +190
func formatBytes(b uint64) string {
	const (
		KB = 1024
		MB = KB * 1024
		GB = MB * 1024
		TB = GB * 1024
	)

	switch {
	case b >= TB:
		return fmt.Sprintf("%.2f TB", float64(b)/float64(TB))
	case b >= GB:
		return fmt.Sprintf("%.2f GB", float64(b)/float64(GB))
	case b >= MB:
		return fmt.Sprintf("%.2f MB", float64(b)/float64(MB))
	case b >= KB:
		return fmt.Sprintf("%.2f KB", float64(b)/float64(KB))
	default:
		return fmt.Sprintf("%d B", b)
	}
}


medium

The formatBytes function is implemented twice (here and in pkg/iometrics/format.go). Furthermore, the project already depends on github.com/dustin/go-humanize, which provides robust and well-tested functions for formatting byte sizes (e.g., humanize.Bytes). Consider consolidating this logic or using the existing dependency.

aftersnow changed the title from "feat: log environment info and IO throughput summary" to "feat: log environment info and IO throughput summary for debugging" on Apr 7, 2026
Add envinfo package that logs build version, runtime, CPU, memory, and
disk usage information at startup. On Linux, cgroup CPU/memory limits
are also logged to help diagnose container resource constraints.

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Add pkg/iometrics package with a Tracker that wraps source io.Reader
using atomic counters to measure bytes and cumulative Read() call
duration. TrackTransfer() measures per-goroutine wall-clock time.

After each operation completes, a summary is output to both terminal
(stderr) and log file showing effective throughput, source read
throughput, and readFraction (sourceReadTime / transferTime) to help
identify whether the bottleneck is in source reading or sink writing.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Replace generic "source/effective" labels with operation-aware labels
(push: disk read / network write, pull: network read / disk write).
Derive sink throughput from transferTime - sourceReadTime. Mark the
slower side as bottleneck. Add overall effective throughput to output.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
- Fix cgroup v1/v2: detect CPU and memory limits independently so
  one missing limit doesn't prevent detecting the other
- Remove cpu.Percent(0) from startup log — unreliable at process start
- Remove unused isVirtualFS function (dead code after IO removal)
- Replace duplicated formatBytes with humanize.IBytes from go-humanize

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
- Read /proc/self/cgroup to find the process's actual cgroup path
  before reading limit files, so containers in k8s/Docker sub-cgroups
  get the correct CPU/memory limits instead of root cgroup values
- Wrap manifest bytes.NewReader with tracker.WrapReader so push
  summaries include manifest bytes in totalBytes

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
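The /proc/self/cgroup lookup described in the last commit could be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the function name ownCgroupPath is hypothetical, and the sketch parses file contents passed in as a string rather than reading /proc directly, so it is portable:

```go
package main

import (
	"fmt"
	"strings"
)

// ownCgroupPath extracts the process's cgroup path from the contents
// of /proc/self/cgroup. A cgroup v2 entry looks like "0::/kubepods/..."
// (hierarchy ID 0, empty controller list); v1 entries look like
// "4:memory:/docker/<id>". controller selects the v1 subsystem of
// interest and is irrelevant for v2.
func ownCgroupPath(procSelfCgroup, controller string) string {
	for _, line := range strings.Split(procSelfCgroup, "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) != 3 {
			continue
		}
		// v2 unified hierarchy.
		if parts[0] == "0" && parts[1] == "" {
			return parts[2]
		}
		// v1: the controller field may list several subsystems,
		// e.g. "cpu,cpuacct".
		for _, c := range strings.Split(parts[1], ",") {
			if c == controller {
				return parts[2]
			}
		}
	}
	return ""
}

func main() {
	v2 := "0::/kubepods/burstable/pod123/container456"
	fmt.Println(ownCgroupPath(v2, "")) // /kubepods/burstable/pod123/container456

	v1 := "4:memory:/docker/abc\n3:cpu,cpuacct:/docker/abc"
	fmt.Println(ownCgroupPath(v1, "cpu")) // /docker/abc
}
```

The returned path would then be joined onto /sys/fs/cgroup/... before reading cpu.max or memory.max, so a process in a Kubernetes or Docker sub-cgroup reads its own limits instead of the root cgroup's.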
