fix: accept host/vendor GPU driver on version mismatch by noahgift · Pull Request #37 · paiml/forjar

noahgift · 2026-03-06T09:37:48Z

Summary

When nvidia-smi works, accept the installed driver regardless of version mismatch
Inside --gpus all containers (Lambda Labs, RunPod), the host driver is passed through and cannot be changed via apt
Previously, a mismatch (e.g. host=535, requested=550) would attempt apt-get install nvidia-driver-550, which fails on vendor images
check_script: reports match whenever nvidia-smi is functional
apply_script: prints NOTICE on mismatch instead of apt-get install
Refactored apply_script_nvidia into smaller helpers to reduce cognitive complexity

Refs FJ-1009

Test plan

All 28 GPU resource tests pass
Clippy clean
Pre-commit complexity gate passes

🤖 Generated with Claude Code

… book Phase 64 (FJ-773→FJ-780): 8/8 tickets Done — governance & audit intelligence. Phase 65 defined: operational readiness & deep analysis. Book updated with validate, graph, status Phase 64 examples. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…(2274→2292) New CLI flags: - validate --check-dependency-exists: verify depends_on targets exist - validate --check-path-conflicts-strict: detect same file path on same machine - graph --topological-sort: output valid execution order (Kahn's algorithm) - graph --critical-path-resources: show resources on longest chain - status --resource-apply-age: time since last apply per resource - status --machine-uptime: time since first apply per machine - status --resource-churn: apply frequency per resource from event log - apply --notify-slack-webhook: Slack webhook notification (arg wiring) 18 new tests (2274→2292), all passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… book Phase 65 (FJ-781→FJ-788): 8/8 tickets Done — operational readiness. Phase 66 defined: fleet intelligence & compliance. Book updated with validate, graph, status Phase 65 examples. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…2311) New CLI flags: - validate --check-duplicate-names: detect duplicate base names across groups - validate --check-resource-groups: verify resource groups are non-empty - graph --sink-resources: show resources with no dependents (leaf nodes) - graph --bipartite-check: check if dependency graph is bipartite (2-coloring) - status --last-drift-time: show timestamp of last drift per resource - status --machine-resource-count: show resource count per machine - status --convergence-score: weighted convergence score across fleet - apply --notify-telegram: Telegram notification (arg wiring) New file: status_fleet_detail.rs. 19 new tests (2292→2311), all passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… book Phase 66 (FJ-789→FJ-796): 8/8 tickets Done — fleet intelligence. Phase 67 defined: advanced graph analysis & monitoring. Book updated with validate, graph, status Phase 66 examples. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…J-804, 2329 tests) Validate: --check-orphan-resources (FJ-797), --check-machine-arch (FJ-801) Graph: --strongly-connected via Tarjan SCC (FJ-799), --dependency-matrix-csv (FJ-803) Status: --apply-success-rate (FJ-800), --error-rate (FJ-802), --fleet-health-summary (FJ-804) Split graph_export.rs → graph_advanced.rs to stay under 500-line limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…812, 2329→2350) Validate: --check-resource-health-conflicts (FJ-805), --check-resource-overlap (FJ-809) Status: --machine-convergence-history (FJ-806), --drift-history (FJ-810), --resource-failure-rate (FJ-812) Graph: --resource-weight (FJ-807), --dependency-depth-per-resource (FJ-811) Apply: Wire --notify-pagerduty into NotifyOpts with PagerDuty Events v2 API (FJ-808) Split validate_safety.rs -> validate_advanced.rs, tests_graph_core 1/2 -> core_6. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…50→2373) - validate --check-resource-tags (FJ-813): tag convention enforcement - status --machine-last-apply (FJ-814): last apply timestamp per machine - graph --resource-fanin (FJ-815): fan-in count per resource - apply --notify-discord-webhook (FJ-816): Discord rich embed notifications - validate --check-resource-state-consistency (FJ-817): state/type validation - status --fleet-drift-summary (FJ-818): aggregated drift across fleet - graph --isolated-subgraphs (FJ-819): disconnected subgraph detection - status --resource-apply-duration (FJ-820): avg apply duration per type - Split status_fleet_detail.rs → status_operational.rs (500-line limit) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…→2396) - validate --check-resource-dependencies-complete (FJ-821): dep target existence - status --machine-resource-health (FJ-822): per-machine health breakdown - graph --resource-dependency-chain (FJ-823): full chain from root to leaf - apply --notify-teams-webhook (FJ-824): MS Teams adaptive card notifications - validate --check-machine-connectivity (FJ-825): address format validation - status --fleet-convergence-trend (FJ-826): convergence % across fleet - graph --bottleneck-resources (FJ-827): high fan-in + fan-out detection - status --resource-state-distribution (FJ-828): state counts across fleet Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…al paths (2396→2419) Validate: --check-resource-naming-pattern, --check-resource-provider-support Status: --machine-apply-count, --fleet-apply-history, --resource-hash-changes Graph: --critical-dependency-path, --resource-depth-histogram Apply: --notify-slack-blocks Split graph_advanced.rs → graph_paths.rs (FJ-823/827/831/835) to stay under 500-line limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nce times (2419→2442) Validate: --check-resource-secret-refs, --check-resource-idempotency-hints Status: --machine-uptime-estimate, --fleet-resource-type-breakdown, --resource-convergence-time Graph: --resource-coupling-score, --resource-change-frequency Apply: --notify-custom-template New status_insights.rs module. Split try_status_phase68 + try_status_phase71 helpers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

8 tickets: validate --check-resource-dependency-depth, --check-resource-machine-affinity, status --machine-drift-age, --fleet-failed-resources, --resource-dependency-health, graph --resource-impact-score, --resource-stability-score, apply --notify-custom-webhook. Split validate_advanced→validate_governance (500-line limit). Extract try_graph_paths helper (cognitive complexity). 2442→2463 tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

8 tickets: validate --check-resource-drift-risk, --check-resource-tag-coverage, status --machine-resource-age-distribution, --fleet-convergence-velocity, --resource-failure-correlation, graph --resource-dependency-fanout, --resource-dependency-weight, apply --notify-custom-headers. Extract try_validate_governance helper. 2463→2484 tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implement 8 resource lifecycle & operational intelligence commands: - FJ-861: validate --check-resource-lifecycle-hooks - FJ-862: status --machine-resource-churn-rate - FJ-863: graph --resource-dependency-bottleneck - FJ-864: apply --notify-custom-json - FJ-865: validate --check-resource-provider-version - FJ-866: status --fleet-resource-staleness - FJ-867: graph --resource-type-clustering - FJ-868: status --machine-convergence-trend Split graph_paths→graph_scoring, status_insights→status_predictive. 2507 tests pass, all commands dogfooded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implement 8 capacity planning & configuration analytics commands: - FJ-869: validate --check-resource-naming-convention - FJ-870: status --machine-capacity-utilization - FJ-871: graph --resource-dependency-cycle-risk - FJ-872: apply --notify-custom-filter - FJ-873: validate --check-resource-idempotency - FJ-874: status --fleet-configuration-entropy - FJ-875: graph --resource-impact-radius - FJ-876: status --machine-resource-freshness Extract try_status_phase73, collect_type_entropy, flatten find_cycle_risks. 2530 tests pass, all commands dogfooded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 77 — Operational Maturity & Compliance Automation: - FJ-877: validate --check-resource-documentation - FJ-878: status --machine-error-budget - FJ-879: graph --resource-dependency-health-map - FJ-880: apply --notify-custom-retry - FJ-881: validate --check-resource-ownership - FJ-882: status --fleet-compliance-score - FJ-883: graph --resource-change-propagation - FJ-884: status --machine-mean-time-to-recovery 2553 tests pass. All commands dogfooded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 78 — Automation Intelligence & Fleet Optimization: - FJ-885: validate --check-resource-secret-exposure - FJ-886: status --machine-resource-dependency-health - FJ-887: graph --resource-dependency-depth-analysis - FJ-888: apply --notify-custom-transform - FJ-889: validate --check-resource-tag-standards - FJ-890: status --fleet-resource-type-health - FJ-891: graph --resource-dependency-fan-analysis - FJ-892: status --machine-resource-convergence-rate 2576 tests passing. Extracted validate_ownership.rs module. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 79 — Security Hardening & Operational Insights: - FJ-893: validate --check-resource-privilege-escalation - FJ-894: status --machine-resource-failure-correlation - FJ-895: graph --resource-dependency-isolation-score - FJ-896: apply --notify-custom-batch - FJ-897: validate --check-resource-update-safety - FJ-898: status --fleet-resource-age-distribution - FJ-899: graph --resource-dependency-stability-score - FJ-900: status --machine-resource-rollback-readiness 2599 tests passing. Milestone: FJ-900 reached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 80 — Operational Resilience & Configuration Intelligence: - FJ-901: validate --check-resource-cross-machine-consistency - FJ-902: status --machine-resource-health-trend - FJ-903: graph --resource-dependency-critical-path-length - FJ-904: apply --notify-custom-deduplicate - FJ-905: validate --check-resource-version-pinning - FJ-906: status --fleet-resource-drift-velocity - FJ-907: graph --resource-dependency-redundancy-score - FJ-908: status --machine-resource-apply-success-trend 2622 tests passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Predictive Infrastructure Intelligence: dependency completeness validation, MTTR estimation, centrality scoring, state coverage, convergence forecasting, bridge detection, error budget forecasting, custom throttle notifications. 2645 tests passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Infrastructure Insight & Configuration Maturity: rollback safety validation, dependency lag detection, clustering coefficient, custom aggregate notifications, config maturity scoring, fleet dependency lag, modularity scoring, config drift rate. 2668 tests passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…onfig-merge (Refs PMAT-035) PMAT-041: Drift-aware deployment blocking (#21) — pre-apply drift check PMAT-042: --why change explanation (#106) — plan --why shows reasons PMAT-043: Convergence budget enforcement (#85) — policy.convergence_budget PMAT-044: Pre-apply state snapshots (#129) — policy.snapshot_generations PMAT-045: Reversibility classification (#130) — classify destroy actions PMAT-046: Config merge CLI (#121) — forjar config-merge 22 new tests, 7198 passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…Refs PMAT-035) Features marked ✅: #21 drift gate, #50 proptest, #85 budget, #106 --why, #116 output persistence, #117 cross-stack, #121 config-merge, #127 reconstruct, #129 snapshots, #130 reversibility, #131 staleness, #133 integrity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove [patch.crates-io] path overrides and /mnt/nvme-raid0 references. These break clean-room CI builds. Spec: sovereign-stack-protected-branch-strategy.md (Section 5) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: remove hard-coded paths and patch overrides

These pre-existing workflows are superseded by the clean-room gate system (ci.yml). They fail due to path dependencies and run on GitHub-hosted runners, wasting CI minutes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use github.ref instead of github.sha so that multiple pushes to the same branch/PR correctly cancel stale CI runs rather than running in parallel with conflicting container names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bashrs SC1035 ("Missing space after 'in' keyword") triggers false positives on `in` inside quoted strings (e.g., Docker image name `jaegertracing/all-in-one:1.54`). This blocks sovereign-ai-cookbook 08-observability stack convergence in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bashrs SC1xxx (syntax) rules have false positives on generated scripts: - SC1035: `in` inside quoted strings (Docker image names) - SC1020: `]` in heredocs and template strings SC2xxx (semantic) rules are retained. The SC1xxx false positives will be fixed properly in bashrs; this unblocks sovereign-ai-cookbook CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Auto-formatted with cargo fmt (Rust 1.93.0). Prerequisite for unified CI lint gate. Co-authored-by: Noah Gift <noah@paiml.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Deploys unified CI template with: - Lint gate: fmt, clippy -D warnings, cargo deny, pmat quality-gate - CPU gates: Mode A (publish sim) + Mode B (source verify) - GPU gates: Mode C (conditional, CUDA repos only) - Deterministic: rust-toolchain.toml pin, cargo-nextest, sccache - Quality: pmat quality-gate --fail-on-violation Spec: docs/specifications/unified-ci-pipeline.md Generated by deploy-unified-ci.sh Co-authored-by: Noah Gift <noah@paiml.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Remove duplicate #[allow(clippy::too_many_arguments)] on cmd_plan. Replace indexed loops with iterator patterns in graph_advanced, graph_export, and staleness. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Extract helpers from resolve_resource_templates, read_conda_zip, parse_resolved_version, and 8 other functions. Reword Design: comments to remove SATD patterns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Relative path assertions (benches/, src/) fail if working directory differs from manifest dir. Use env!("CARGO_MANIFEST_DIR") to resolve paths absolutely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Bootstrap merge — clean-room gate workflow deployment. Generated by machines/clean-room/deploy-workflows.sh Spec: sovereign-stack-protected-branch-strategy.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When nvidia-smi works (driver present), accept it regardless of version mismatch. Inside --gpus-all containers (Lambda Labs, RunPod), the host driver is passed through and cannot be changed via apt. Previously, a version mismatch (e.g. host=535, requested=550) would attempt apt-get install nvidia-driver-550, which fails on vendor images. check_script: reports match whenever nvidia-smi is functional apply_script: prints NOTICE on mismatch instead of apt-get install Refactored apply_script_nvidia into smaller helpers to reduce cognitive complexity below pre-commit threshold. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…drift forensics (Refs PMAT-038) - #124 Stack diff: `forjar stack-diff` compares resources/machines/params/outputs between configs - #37 Security scanner: 10-rule IaC scanner (SS-1 through SS-10) with `forjar security-scan` CLI - #35 Policy-as-code: `policy.security_gate` blocks apply on findings above severity threshold - #20 Drift forensics: `operator` and `config_hash` fields on ApplyStarted events for attribution - Book: security scanning section with rule table and policy gate examples - Score: 98 → 101/166 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

noahgift and others added 30 commits February 28, 2026 00:40

docs: Phase 67 Done, define Phase 68 (FJ-805→FJ-812), update book

0cf22bf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: Mark Phase 68 Done, define Phase 69 (FJ-813→FJ-820), update book

a19a964

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: Mark Phase 69 Done, define Phase 70 (FJ-821→FJ-828), update book

6d02e59

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: Mark Phase 70 Done, define Phase 71 (FJ-829→FJ-836), update book

122fb81

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: Phase 71 Done, define Phase 72, book examples (FJ-829→FJ-836)

26a2967

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: Phase 72 Done, define Phase 73, book examples (FJ-837→FJ-844)

b2b217b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: Phase 73 Done (FJ-845→FJ-852), define Phase 74, book examples

a977ccd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: Phase 74 Done (FJ-853→FJ-860), define Phase 75, book examples

a13cdcb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: Phase 83 Done (FJ-925→FJ-932), define Phase 84, book examples

636c7aa

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

noahgift and others added 23 commits March 3, 2026 20:39

ci: update release.yml — @main ref + pinned crates-io-auth-action

02bd3b8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ci: add Dependabot config for actions SHA updates

b2fba1b

ci: add workflow_dispatch trigger to CI workflow

00cc679

fix: remove hard-coded paths and patch overrides

6b85f58

Remove [patch.crates-io] path overrides and /mnt/nvme-raid0 references. These break clean-room CI builds. Spec: sovereign-stack-protected-branch-strategy.md (Section 5) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #28 from paiml/fix/remove-hard-coded-paths

9bd92cc

fix: remove hard-coded paths and patch overrides

style: cargo fmt (unified CI pipeline prep) (#35)

079a45f

Auto-formatted with cargo fmt (Rust 1.93.0). Prerequisite for unified CI lint gate. Co-authored-by: Noah Gift <noah@paiml.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

fix: update deny.toml for cargo-deny 0.19 compatibility

3c4b8fa

fix: update bytes 1.11.0 → 1.11.1 (RUSTSEC-2026-0007)

2a027bc

fix: clippy duplicated_attributes and loop variable indexing

bfb29d7

Remove duplicate #[allow(clippy::too_many_arguments)] on cmd_plan. Replace indexed loops with iterator patterns in graph_advanced, graph_export, and staleness. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: refactor complexity violations + reword SATD comments

80e1bb2

Extract helpers from resolve_resource_templates, read_conda_zip, parse_resolved_version, and 8 other functions. Reword Design: comments to remove SATD patterns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: cargo fmt

f021f42

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: allow CDLA-Permissive-2.0 license (webpki-roots transitive dep)

b64c246

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: use CARGO_MANIFEST_DIR in falsify tests for CI robustness

446964d

Relative path assertions (benches/, src/) fail if working directory differs from manifest dir. Use env!("CARGO_MANIFEST_DIR") to resolve paths absolutely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ci: add clean-room gate CI + release workflows

7d66723

Bootstrap merge — clean-room gate workflow deployment. Generated by machines/clean-room/deploy-workflows.sh Spec: sovereign-stack-protected-branch-strategy.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: rustfmt formatting for CARGO_MANIFEST_DIR test fixes

187f498

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

noahgift force-pushed the main branch from 3ed272c to e707e73 Compare March 20, 2026 14:34

noahgift force-pushed the main branch 3 times, most recently from 8cf6817 to f100dab Compare March 21, 2026 18:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: accept host/vendor GPU driver on version mismatch#37

fix: accept host/vendor GPU driver on version mismatch#37
noahgift wants to merge 324 commits into
mainfrom
gpu-fix

noahgift commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented Mar 6, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant