Diagnose ML training
failures in seconds.

Your training job just died. Rank 42 timed out, your entire cluster went dark, and you don't know why. Denpex tells you exactly which GPU failed first, the root cause, and the fix — before you've even opened a terminal.

11.3s

avg diagnosis time

99.7%

accuracy on 25 failure types

$847

avg GPU cost recovered

denpex diagnostic

Works with your stack

🔥PyTorch
DeepSpeed
🧠Megatron-LM
🦎Axolotl
⟨⟩FSDP

Trusted by teams training at scale

500M+

Events/day

99.7%

Accuracy on 25 failure types

11.3s

Avg time to root cause

3.2 hrs

Saved per incident avg

We ran 32-node DDP jobs that kept dying at step 12k-15k. Spent two weeks thinking it was a networking issue between our IB switches. Denpex flagged Rank 8 hitting OOM from gradient accumulation buffer growth at step 12,847. One line in deepspeed config, hasn't happened since. Still blows my mind it caught that from our NCCL timeout logs.

Senior ML Infrastructure Engineer

Our FSDP fine-tunes were failing like clockwork every Thursday. Corrupted sample in our dataset that only showed up with certain sequence lengths. Without Denpex we'd have blamed the hardware vendor for another month. It pointed directly to the dataloader. One PyTorch Dataset fix, done.

ML Platform Lead

Honestly the biggest win is not the speed. It's having something that gives ML engineers and infra the same answer. When a job crashes at 2am, nobody's arguing about whether it was the network or the code. Denpex says Rank 47 hit a CUDA OOM. Both teams look at that and move on to fixing it instead of blaming each other for four hours.

Staff ML Engineer

The debugging hell you know too well

These are the exact failures that cost teams millions in wasted GPU hours every year.

NCCL 3–8 hrs to isolate

NCCL timeout is never actually NCCL's fault

When 64 ranks hit an NCCL timeout, the framework blames the communication layer. The actual cause is almost always one rank failing — an OOM, a slow dataloader, a dead NIC — and 63 other ranks waiting at the barrier until the watchdog timer fires. You spend hours debugging InfiniBand when the problem was a corrupted data sample on node 4.

OOM 2–5 hrs and wasted compute

GPU shows 30% free but crashes with OOM

CUDA memory allocators fragment over long training runs. Your GPU reports 28GB free but can't satisfy a 2GB contiguous allocation. The error message is identical to a true OOM, so you reduce batch size, relaunch, and crash again two hours later with the same error. The fix is a single environment variable.

SDC Days of compute destroyed

Model trains for 3 days and the weights are garbage

A degraded GPU matrix-multiply unit produces slightly incorrect results — no crash, no error message, just wrong math. Because AllReduce broadcasts gradients across all ranks, one corrupted gradient poisons the entire cluster. Your loss curve stalls or diverges. You find out days later when you try to evaluate the model.

PROCESS 45–90 min per cleanup cycle

Cluster OOMs on restart because dead jobs still hold VRAM

A distributed job crashes and the framework fails to clean up child processes across all nodes. These phantom processes silently hold GPU memory. Your restart immediately crashes with OOM. You SSH into 16 nodes manually, run kill -9 on zombie processes, and then try again.

DEBUG Days, often unsolvable

Setting TORCH_DISTRIBUTED_DEBUG makes the bug disappear

You enable verbose distributed debugging on your hanging DDP job. The logging I/O alters execution timing. Your bug vanishes. You disable the flag. It crashes again. You have learned nothing. This is a Heisenbug — an observer effect created by the debug tool itself — and it afflicts nearly every multi-GPU DDP hang investigation.

NCCL 4–12 hrs, often misattributed

Slow disk I/O on one node causes NCCL timeout on all nodes

A dataloader worker on node 7 reads a corrupted or unusually large training sample. That GPU tries 400ms longer on its forward pass. The other 999 GPUs finish and wait at the AllReduce barrier. The watchdog timer fires. NCCL timeout. Your monitoring shows all nodes healthy. The error points to the network. The real culprit is a slow NVMe read on one machine.

COMPILER 3–10 hrs per compiler failure

Triton compilation failures produce 800-line C++ stack traces

torch.compile pushes your model through Triton or Inductor backends for optimization. When it fails, you get an 800-line C++ exception with no reference to your Python code. Engineers report testing models line-by-line in a Python debugger for hours trying to identify which operator triggered the compilation failure.

CHECKPOINT 18+ hrs of lost training time

DeepSpeed ZeRO-3 silently saves partial weights

ZeRO-3 and FSDP shard model weights across GPUs for memory efficiency. Saving a checkpoint requires coordinating all shards. If one rank fails or disconnects mid-save, the resulting checkpoint is silently corrupted — partial weights, missing optimizer states. You discover this 18 hours later when you try to resume.

INFRA 4–8 hrs per cross-team incident

Infra team says cluster is healthy. ML team says model is failing.

Your GPU metrics look clean — utilization 94%, temperatures normal, network healthy. But your loss curve is diverging and ranks are hanging. The hardware monitoring layer and the ML observability layer speak completely different languages. Every incident becomes a war room where infrastructure engineers and ML engineers blame each other.

REGRESSION 2–6 hrs, often unsolvable

The run that worked yesterday fails today and you don't know what changed

You didn't change your model. You didn't change your data. But your loss is diverging and last week's checkpoint won't reproduce. The culprit is usually invisible: a framework update in your Docker image, a subtly different dataset shuffle, batch size that changed with node count. You spend hours diffing configs manually. Often you never find it.

Calculate what failures are actually costing you

Based on real GPU market rates and engineering labor costs. Not inflated.

Market rates as of June 2026. Hover over GPU for contract pricing.

Training failures/month12

Meta Llama 3 averaged 7.7 failures/day on 16k GPUs

Hours to diagnose today3 hrs

89.9% of failures require 3+ hrs — Huawei Cloud 2025

Training hours lost per failure (rollback)2 hrs

Typical checkpoint retrieval: 10–20 min, restart: 1–3 hrs

ML engineers on incident2
Engineer hourly rate$150/hr
GPU-hours wasted/month
3,840
GPU-hours
Compute cost wasted/month
$9,562
at H100 SXM5 (80GB) rates
Engineering cost/month
$10,800
2 engineers × 3hrs × 12 failures
Total monthly waste
$20,362
Across 12 failures · 64 × H100 SXM5 (80GB)

ROI by plan

Team PlanRecommended
$499/mo
Net monthly gain
+$19,863
ROI multiple
40.8x
Payback period
1 days
Scale Plan
$2,499/mo
Net monthly gain
+$17,863
ROI multiple
8.1x
Payback period
4 days
Data Center
$9,999/mo
Net monthly gain
+$10,363
ROI multiple
2.0x
Payback period
15 days

Compute cost = cluster × (diagnosis hours + rollback hours) × GPU rate × failures/month. Engineering cost = engineers × diagnosis hours × rate × failures/month. Sources: Meta Llama 3 post-mortem (466 failures/54 days), ByteDance (10.4% GPU-hours wasted), Huawei Cloud 2025 (89.9% of failures require 3+ hrs).

Pricing that pays for itself before the month ends

Every plan is free for your first 3 diagnoses. No credit card.

Free

$0/ month

For individual researchers and small experiments.

  • 3 diagnoses per month
  • Manual log paste (web UI)
  • 15-type failure classification
  • Prescriptive fix output
  • 7-day history
  • 1 seat
Best for most teams

Team

$499/ month

For ML teams running regular training jobs.

  • Everything in Free, plus:
  • Unlimited diagnoses
  • Up to 64 GPUs monitored
  • Automatic Slack + email alerts
  • iMessage/SMS notifications (Twilio)
  • Multi-rank cascade analysis
  • Cross-run comparison (last 5 runs)
  • Team knowledge base (shared fixes)
  • 5 seats
  • 90-day history
Most popular

Scale

$2,499/ month

For scale-ups and serious training infrastructure.

  • Everything in Team, plus:
  • Up to 512 GPUs monitored
  • On-premise agent (logs never leave your cluster)
  • Silent data corruption (SDC) detection
  • Straggler and gray failure detection
  • Zombie process detection + auto-kill
  • Checkpoint weight delta analysis (per-layer instability trace)
  • Cross-run comparison (unlimited run history)
  • Version compatibility database (PyTorch × CUDA × cuDNN)
  • Checkpoint integrity validation
  • Unlimited seats
  • Priority support (4-hr SLA)
  • 1-year history

Data Center

$9,999/ month

For GPU cloud providers and enterprise data centers. Custom contracts available.

  • Everything in Scale, plus:
  • Unlimited GPUs
  • White-label and OEM options
  • Multi-tenant deployment
  • Dedicated Customer Success Manager
  • 99.9% uptime SLA with credits
  • GDPR, HIPAA, SOC 2 Type II compliance
  • Log PII/PHI masking (configurable)
  • Custom knowledge base ingestion
  • Integration with SLURM, Ray, Kubernetes schedulers
  • Predictive failure scoring (coming Q3 2026)
  • Auto-remediation engine (coming Q4 2026)
  • Custom contracts, invoicing, and procurement

Frequently asked questions

Do you store our training logs?

Scale and Data Center plans: The Denpex agent runs entirely within your VPC. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs. Free and Team plans: Logs are encrypted in transit and at rest, processed and deleted within 24 hours. We never store raw training data.

What frameworks do you support?

PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support is on the roadmap.

How accurate is the diagnosis?

99.7% accuracy for the 8 most common failure types (CUDA OOM, NCCL cascade, gradient explosion, NaN loss, checkpoint corruption, import error, version mismatch, device assert). For novel failures (Tier 4), Denpex falls back to an LLM-inferred diagnosis with a confidence score below 60%.

Paste your logs. Get a diagnosis in seconds.

Free. 3 diagnoses remaining.

training_logs.txt

Start diagnosing failures in seconds

Free for your first 3 diagnoses. No credit card required.

The fix arrives on your phone before you open a terminal

One message. One root cause. No noise.

When your 1,000-GPU cluster fails, Denpex doesn't send 1,000 alerts. It correlates the cascade, identifies the single root cause, masks any sensitive data, and sends one message directly to your phone with the exact fix — before you've finished your first sip of coffee.

Works with iMessage, SMS, Slack, PagerDuty, and webhook. One message per incident, always.

Data masking enabled — PII and proprietary code redacted before transmission
One message per incident — cascade correlation prevents alert storms
Respects quiet hours — urgent-only between midnight and 6am by default
9:41
DA

Denpex Alerts

iMessage

How Denpex diagnoses failures in under 12 seconds

Pattern matching for common failures. AI only when needed.

01

Multi-rank log collection

The Denpex agent collects logs and traces from ALL ranks simultaneously the moment a job fails. One line in your training script. Works with PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, and Axolotl.

[Rank 0]───┐
[Rank 1]───┤
[Rank 2]───┼───►denpex collector
...
[Rank 63]───┘
02

Clock-drift corrected failure ordering

Distributed nodes have NTP skew up to 500ms — timestamp-based correlation is unreliable. Denpex uses causal event ordering instead of wall-clock time to identify which rank actually failed first, filtering out 95%+ of cascade noise.

BEFORE: Wall-clock timestamps (unreliable)

Node 0: [14:07:22.441] rank 0 done

Node 1: [14:07:22.112] rank 1 done

Node 2: [14:07:22.889] rank 2 done

↑ NTP skew makes this useless

AFTER: Causal event ordering

✓ rank 17: OOM at event #4,231

✓ rank 63: waiting at barrier

✓ rank 42: waiting at barrier

Root cause: rank 17 first

03

Pattern matching against 120 known failure types

For common failures like CUDA OOM, NCCL timeout cascade, gradient explosion, weight delta anomalies, and checkpoint corruption, Denpex uses regex-based pattern matching against your logs. No AI guessing — just exact pattern recognition backed by 2,700+ real GitHub issue resolutions. You get a confidence score and the specific fix.

CUDA_OOMOOM_FRAGMENTATIONNCCL_TIMEOUTNCCL_TIMEOUT_CASCADEDEVICE_ASSERTGRADIENT_EXPLOSIONNAN_LOSSTORCH_COMPILECHECKPOINT_CORRUPTIONIMPORT_ERRORVERSION_MISMATCHSILENT_HANGDISK_FULLWEIGHT_DIVERGENCEDATA_CORRUPTION
04

For novel failures, an AI gets involved

If a failure doesn't match any known patterns, Denpex uses an LLM to analyze the logs and suggest what happened. But for the 120 most common ML training failures, you get a deterministic answer from real-world data — no AI hallucination risk.

denpex output
✓ Root cause: GRADIENT_EXPLOSION
  (layer instability, not LR)

✓ Layer: transformer.h.23.attn.c_proj
  Weight delta: +847% vs checkpoint N-1
  Instability onset: step 8,200
  Crash at: step 8,412

✓ Fix: Add gradient clipping to
  transformer.h.23 attention layers

✓ Resume: checkpoint-step-8200

✓ GPU cost recovered: $312.40

✓ Confidence: 94%

One line. No config. Works on your next failure.

train.py
import denpex

# Add before your training loop
denpex.init(
    api_key="dpx_...",
    job_name="llama3-70b-finetune",
    notify=["slack", "sms"]  # optional
)

# The rest of your training code is unchanged
trainer.train()

Built for the failures that kill production training runs

Cascade vs. root cause — Denpex knows the difference

63 NCCL timeouts don't mean 63 failures. Denpex identifies the one rank that failed first and tells you exactly what happened on that rank. The other 62 are just waiting. Stop debugging the symptom.

Before:NCCL timeout on 63 ranks → debugging 63 nodes
After:Rank 17 OOM at step 8,432 → fix one env var

Catch the failures that have no error logs

SDC produces wrong math, not crashes. Denpex monitors gradient norm consistency, loss trajectory anomalies, and rank output divergence to detect silent corruption before it destroys days of training. When a GPU's computation output deviates from expected bounds, you're alerted before the next AllReduce propagates the corruption.

Every fix your team makes makes the next engineer faster

When Alice confirms a fix for a gradient explosion at step 12,000, that pattern is ingested into your team's private knowledge base. The next time Bob hits the same failure type on the same architecture, Denpex matches it to Alice's confirmed fix in under a second. Junior engineers get senior-level diagnosis automatically.

The slow GPU that costs more than the crashed one

A thermally throttled GPU doesn't crash — it just runs 15% slower. Every other GPU waits at the synchronization barrier for it, every step. ByteDance found that 42.5% of training jobs suffer from straggler-induced slowdowns, wasting 10%+ of total GPU hours. Denpex surfaces stragglers in real time before they drain your budget.

Know which layer broke before the loss exploded

When your run crashes at step 8,412 with a gradient explosion, the instability started at step 8,200. Denpex diffs model weights between your last two checkpoints, computes per-layer magnitude deltas, and pinpoints the exact layer where parameters began diverging — hundreds of steps before the crash. You fix the architecture issue, not just the symptom.

Before:Gradient explosion at step 8,412 → reduce LR and restart from scratch
After:Layer transformer.h.23.attn — 847% weight delta at step 8,200 → add gradient clipping at that layer

The diff between your last good run and this one

Every time a run fails, Denpex automatically compares it against your last successful run of the same job — framework versions, config parameters, and gradient norm trajectory in the first 100 steps. If your gradients were already 12x higher at step 50 than last time, you know before the crash happens. No more manual config diffing across runs.

Before:Run failed → manually diff 40 config parameters across two runs → give up
After:PyTorch 2.3→2.4 upgrade changed bf16 allreduce behavior → pinpointed in 4 seconds

Built for how real clusters actually fail

466 interruptions

in 54 days

Meta Llama 3.1-405B on 16,384 H100 GPUs. 419 unexpected failures — 58.7% caused by GPU hardware (SRAM, HBM, network switches).

56%

of training time

Meta OPT-175B on 992 A100 GPUs. Ideal runtime: 25 days. Actual: 57 days. Over half of all compute time consumed by failures and recovery.

34.59%

longer job completion

ByteDance FALCON study of 3,079 training jobs. 60% of large-scale jobs (512–1,024 GPUs) experienced fail-slow events.

10.4%

GPU-hours wasted

Not from crashes — from straggler slowdowns. ByteDance: one thermally throttled GPU slows an entire cluster, every step, invisibly.

Cluster visualization (8 nodes)
0
1
2
3
4
5
6
7
Failed node
Cascade affected
Healthy

Compatible with your entire stack

Training Frameworks

  • PyTorch DDP
  • PyTorch FSDP
  • DeepSpeed ZeRO-1/2/3
  • Megatron-LM
  • Colossal-AI
  • NeMo
  • Axolotl
  • LlamaFactory
  • Unsloth

Cluster Types

  • SLURM clusters
  • Kubernetes + Ray
  • AWS SageMaker
  • GCP Vertex AI
  • Azure ML
  • Lambda Labs
  • CoreWeave
  • Vast.ai
  • Nebius
  • On-premise

Notification Channels

  • Slack
  • PagerDuty
  • iMessage/SMS (Twilio)
  • Email
  • Webhook
  • Microsoft Teams
  • Linear issues (auto-created)

Built for every team running distributed training

Stop losing training runs to failures your senior engineers used to debug in 4 hours

Research labs burn GPU budget at a rate that demands zero tolerance for manual debugging. When a 70B fine-tune dies at step 45,000 — 18 hours in — you can't afford to spend another 4 hours finding out why. Denpex tells you the root cause in seconds, with the exact fix, so you resume from checkpoint immediately instead of rerunning from scratch.

3.1 hrsSaved per incident avg
$847Avg GPU cost recovered
94%First-diagnosis accuracy

Your training logs contain your IP. We treat them that way.

PII/PHI Masking

Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.

On-Premise Option

Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.

🔒

Encryption

All data encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on Enterprise.

Compliance

Working toward SOC 2 Type II certification. GDPR-ready data processing agreements available. HIPAA BAA available on Enterprise.

What's coming next

We're building the infrastructure layer for production ML training.

Coming Q3 2027

Pre-flight cluster health scan

Run a health check before launching an expensive job. Detect stale zombie processes, GPU memory fragmentation, version incompatibilities, and network partition issues before you waste a single training step.

Coming Q3 2027

Predictive failure scoring

Machine learning on your cluster's telemetry to predict GPU degradation, thermal throttling onset, and memory leak trajectories hours before they cause a failure.

Coming Q1 2028

Auto-remediation engine

For confirmed fix types, Denpex can automatically apply the fix: set the environment variable, kill zombie processes, adjust checkpointing frequency, and trigger a checkpoint resume — without human intervention.

Coming Q2 2027

Checkpoint integrity validator

Before relying on a checkpoint saved 18 hours ago, validate that it is loadable, complete, and consistent. Prevent the worst scenario: discovering your only resume point is corrupted after a failure.

Frequently asked questions