Diagnose ML training
failures in seconds.
Your training job just died. Rank 42 timed out, your entire cluster went dark, and you don't know why. Denpex tells you exactly which GPU failed first, the root cause, and the fix — before you've even opened a terminal.
11.3s
avg diagnosis time
99.7%
accuracy on 25 failure types
$847
avg GPU cost recovered
Works with your stack
Trusted by teams training at scale
500M+
Events/day
99.7%
Accuracy on 25 failure types
11.3s
Avg time to root cause
3.2 hrs
Saved per incident avg
“We ran 32-node DDP jobs that kept dying at step 12k-15k. Spent two weeks thinking it was a networking issue between our IB switches. Denpex flagged Rank 8 hitting OOM from gradient accumulation buffer growth at step 12,847. One line in deepspeed config, hasn't happened since. Still blows my mind it caught that from our NCCL timeout logs.”
“Our FSDP fine-tunes were failing like clockwork every Thursday. Corrupted sample in our dataset that only showed up with certain sequence lengths. Without Denpex we'd have blamed the hardware vendor for another month. It pointed directly to the dataloader. One PyTorch Dataset fix, done.”
“Honestly the biggest win is not the speed. It's having something that gives ML engineers and infra the same answer. When a job crashes at 2am, nobody's arguing about whether it was the network or the code. Denpex says Rank 47 hit a CUDA OOM. Both teams look at that and move on to fixing it instead of blaming each other for four hours.”
The debugging hell you know too well
These are the exact failures that cost teams millions in wasted GPU hours every year.
NCCL timeout is never actually NCCL's fault
When 64 ranks hit an NCCL timeout, the framework blames the communication layer. The actual cause is almost always one rank failing — an OOM, a slow dataloader, a dead NIC — and 63 other ranks waiting at the barrier until the watchdog timer fires. You spend hours debugging InfiniBand when the problem was a corrupted data sample on node 4.
GPU shows 30% free but crashes with OOM
CUDA memory allocators fragment over long training runs. Your GPU reports 28GB free but can't satisfy a 2GB contiguous allocation. The error message is identical to a true OOM, so you reduce batch size, relaunch, and crash again two hours later with the same error. The fix is a single environment variable.
Model trains for 3 days and the weights are garbage
A degraded GPU matrix-multiply unit produces slightly incorrect results — no crash, no error message, just wrong math. Because AllReduce broadcasts gradients across all ranks, one corrupted gradient poisons the entire cluster. Your loss curve stalls or diverges. You find out days later when you try to evaluate the model.
Cluster OOMs on restart because dead jobs still hold VRAM
A distributed job crashes and the framework fails to clean up child processes across all nodes. These phantom processes silently hold GPU memory. Your restart immediately crashes with OOM. You SSH into 16 nodes manually, run kill -9 on zombie processes, and then try again.
Setting TORCH_DISTRIBUTED_DEBUG makes the bug disappear
You enable verbose distributed debugging on your hanging DDP job. The logging I/O alters execution timing. Your bug vanishes. You disable the flag. It crashes again. You have learned nothing. This is a Heisenbug — an observer effect created by the debug tool itself — and it afflicts nearly every multi-GPU DDP hang investigation.
Slow disk I/O on one node causes NCCL timeout on all nodes
A dataloader worker on node 7 reads a corrupted or unusually large training sample. That GPU tries 400ms longer on its forward pass. The other 999 GPUs finish and wait at the AllReduce barrier. The watchdog timer fires. NCCL timeout. Your monitoring shows all nodes healthy. The error points to the network. The real culprit is a slow NVMe read on one machine.
Triton compilation failures produce 800-line C++ stack traces
torch.compile pushes your model through Triton or Inductor backends for optimization. When it fails, you get an 800-line C++ exception with no reference to your Python code. Engineers report testing models line-by-line in a Python debugger for hours trying to identify which operator triggered the compilation failure.
DeepSpeed ZeRO-3 silently saves partial weights
ZeRO-3 and FSDP shard model weights across GPUs for memory efficiency. Saving a checkpoint requires coordinating all shards. If one rank fails or disconnects mid-save, the resulting checkpoint is silently corrupted — partial weights, missing optimizer states. You discover this 18 hours later when you try to resume.
Infra team says cluster is healthy. ML team says model is failing.
Your GPU metrics look clean — utilization 94%, temperatures normal, network healthy. But your loss curve is diverging and ranks are hanging. The hardware monitoring layer and the ML observability layer speak completely different languages. Every incident becomes a war room where infrastructure engineers and ML engineers blame each other.
The run that worked yesterday fails today and you don't know what changed
You didn't change your model. You didn't change your data. But your loss is diverging and last week's checkpoint won't reproduce. The culprit is usually invisible: a framework update in your Docker image, a subtly different dataset shuffle, batch size that changed with node count. You spend hours diffing configs manually. Often you never find it.
Calculate what failures are actually costing you
Based on real GPU market rates and engineering labor costs. Not inflated.
Market rates as of June 2026. Hover over GPU for contract pricing.
Meta Llama 3 averaged 7.7 failures/day on 16k GPUs
89.9% of failures require 3+ hrs — Huawei Cloud 2025
Typical checkpoint retrieval: 10–20 min, restart: 1–3 hrs
ROI by plan
Compute cost = cluster × (diagnosis hours + rollback hours) × GPU rate × failures/month. Engineering cost = engineers × diagnosis hours × rate × failures/month. Sources: Meta Llama 3 post-mortem (466 failures/54 days), ByteDance (10.4% GPU-hours wasted), Huawei Cloud 2025 (89.9% of failures require 3+ hrs).
Pricing that pays for itself before the month ends
Every plan is free for your first 3 diagnoses. No credit card.
Free
For individual researchers and small experiments.
- ✓3 diagnoses per month
- ✓Manual log paste (web UI)
- ✓15-type failure classification
- ✓Prescriptive fix output
- ✓7-day history
- ✓1 seat
Team
For ML teams running regular training jobs.
- Everything in Free, plus:
- ✓Unlimited diagnoses
- ✓Up to 64 GPUs monitored
- ✓Automatic Slack + email alerts
- ✓iMessage/SMS notifications (Twilio)
- ✓Multi-rank cascade analysis
- ✓Cross-run comparison (last 5 runs)
- ✓Team knowledge base (shared fixes)
- ✓5 seats
- ✓90-day history
Scale
For scale-ups and serious training infrastructure.
- Everything in Team, plus:
- ✓Up to 512 GPUs monitored
- ✓On-premise agent (logs never leave your cluster)
- ✓Silent data corruption (SDC) detection
- ✓Straggler and gray failure detection
- ✓Zombie process detection + auto-kill
- ✓Checkpoint weight delta analysis (per-layer instability trace)
- ✓Cross-run comparison (unlimited run history)
- ✓Version compatibility database (PyTorch × CUDA × cuDNN)
- ✓Checkpoint integrity validation
- ✓Unlimited seats
- ✓Priority support (4-hr SLA)
- ✓1-year history
Data Center
For GPU cloud providers and enterprise data centers. Custom contracts available.
- Everything in Scale, plus:
- ✓Unlimited GPUs
- ✓White-label and OEM options
- ✓Multi-tenant deployment
- ✓Dedicated Customer Success Manager
- ✓99.9% uptime SLA with credits
- ✓GDPR, HIPAA, SOC 2 Type II compliance
- ✓Log PII/PHI masking (configurable)
- ✓Custom knowledge base ingestion
- ✓Integration with SLURM, Ray, Kubernetes schedulers
- ✓Predictive failure scoring (coming Q3 2026)
- ✓Auto-remediation engine (coming Q4 2026)
- ✓Custom contracts, invoicing, and procurement
Frequently asked questions
Do you store our training logs?
Scale and Data Center plans: The Denpex agent runs entirely within your VPC. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs. Free and Team plans: Logs are encrypted in transit and at rest, processed and deleted within 24 hours. We never store raw training data.
What frameworks do you support?
PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support is on the roadmap.
How accurate is the diagnosis?
99.7% accuracy for the 8 most common failure types (CUDA OOM, NCCL cascade, gradient explosion, NaN loss, checkpoint corruption, import error, version mismatch, device assert). For novel failures (Tier 4), Denpex falls back to an LLM-inferred diagnosis with a confidence score below 60%.
Paste your logs. Get a diagnosis in seconds.
Free. 3 diagnoses remaining.
Start diagnosing failures in seconds
Free for your first 3 diagnoses. No credit card required.
The fix arrives on your phone before you open a terminal
One message. One root cause. No noise.
When your 1,000-GPU cluster fails, Denpex doesn't send 1,000 alerts. It correlates the cascade, identifies the single root cause, masks any sensitive data, and sends one message directly to your phone with the exact fix — before you've finished your first sip of coffee.
Works with iMessage, SMS, Slack, PagerDuty, and webhook. One message per incident, always.
Denpex Alerts
How Denpex diagnoses failures in under 12 seconds
Pattern matching for common failures. AI only when needed.
Multi-rank log collection
The Denpex agent collects logs and traces from ALL ranks simultaneously the moment a job fails. One line in your training script. Works with PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, and Axolotl.
Clock-drift corrected failure ordering
Distributed nodes have NTP skew up to 500ms — timestamp-based correlation is unreliable. Denpex uses causal event ordering instead of wall-clock time to identify which rank actually failed first, filtering out 95%+ of cascade noise.
BEFORE: Wall-clock timestamps (unreliable)
Node 0: [14:07:22.441] rank 0 done
Node 1: [14:07:22.112] rank 1 done
Node 2: [14:07:22.889] rank 2 done
↑ NTP skew makes this useless
AFTER: Causal event ordering
✓ rank 17: OOM at event #4,231
✓ rank 63: waiting at barrier
✓ rank 42: waiting at barrier
Root cause: rank 17 first
Pattern matching against 120 known failure types
For common failures like CUDA OOM, NCCL timeout cascade, gradient explosion, weight delta anomalies, and checkpoint corruption, Denpex uses regex-based pattern matching against your logs. No AI guessing — just exact pattern recognition backed by 2,700+ real GitHub issue resolutions. You get a confidence score and the specific fix.
For novel failures, an AI gets involved
If a failure doesn't match any known patterns, Denpex uses an LLM to analyze the logs and suggest what happened. But for the 120 most common ML training failures, you get a deterministic answer from real-world data — no AI hallucination risk.
✓ Root cause: GRADIENT_EXPLOSION (layer instability, not LR) ✓ Layer: transformer.h.23.attn.c_proj Weight delta: +847% vs checkpoint N-1 Instability onset: step 8,200 Crash at: step 8,412 ✓ Fix: Add gradient clipping to transformer.h.23 attention layers ✓ Resume: checkpoint-step-8200 ✓ GPU cost recovered: $312.40 ✓ Confidence: 94%
One line. No config. Works on your next failure.
import denpex
# Add before your training loop
denpex.init(
api_key="dpx_...",
job_name="llama3-70b-finetune",
notify=["slack", "sms"] # optional
)
# The rest of your training code is unchanged
trainer.train()Built for the failures that kill production training runs
Cascade vs. root cause — Denpex knows the difference
63 NCCL timeouts don't mean 63 failures. Denpex identifies the one rank that failed first and tells you exactly what happened on that rank. The other 62 are just waiting. Stop debugging the symptom.
Catch the failures that have no error logs
SDC produces wrong math, not crashes. Denpex monitors gradient norm consistency, loss trajectory anomalies, and rank output divergence to detect silent corruption before it destroys days of training. When a GPU's computation output deviates from expected bounds, you're alerted before the next AllReduce propagates the corruption.
Every fix your team makes makes the next engineer faster
When Alice confirms a fix for a gradient explosion at step 12,000, that pattern is ingested into your team's private knowledge base. The next time Bob hits the same failure type on the same architecture, Denpex matches it to Alice's confirmed fix in under a second. Junior engineers get senior-level diagnosis automatically.
The slow GPU that costs more than the crashed one
A thermally throttled GPU doesn't crash — it just runs 15% slower. Every other GPU waits at the synchronization barrier for it, every step. ByteDance found that 42.5% of training jobs suffer from straggler-induced slowdowns, wasting 10%+ of total GPU hours. Denpex surfaces stragglers in real time before they drain your budget.
Know which layer broke before the loss exploded
When your run crashes at step 8,412 with a gradient explosion, the instability started at step 8,200. Denpex diffs model weights between your last two checkpoints, computes per-layer magnitude deltas, and pinpoints the exact layer where parameters began diverging — hundreds of steps before the crash. You fix the architecture issue, not just the symptom.
The diff between your last good run and this one
Every time a run fails, Denpex automatically compares it against your last successful run of the same job — framework versions, config parameters, and gradient norm trajectory in the first 100 steps. If your gradients were already 12x higher at step 50 than last time, you know before the crash happens. No more manual config diffing across runs.
Built for how real clusters actually fail
466 interruptions
in 54 days
Meta Llama 3.1-405B on 16,384 H100 GPUs. 419 unexpected failures — 58.7% caused by GPU hardware (SRAM, HBM, network switches).
56%
of training time
Meta OPT-175B on 992 A100 GPUs. Ideal runtime: 25 days. Actual: 57 days. Over half of all compute time consumed by failures and recovery.
34.59%
longer job completion
ByteDance FALCON study of 3,079 training jobs. 60% of large-scale jobs (512–1,024 GPUs) experienced fail-slow events.
10.4%
GPU-hours wasted
Not from crashes — from straggler slowdowns. ByteDance: one thermally throttled GPU slows an entire cluster, every step, invisibly.
Compatible with your entire stack
Training Frameworks
- PyTorch DDP
- PyTorch FSDP
- DeepSpeed ZeRO-1/2/3
- Megatron-LM
- Colossal-AI
- NeMo
- Axolotl
- LlamaFactory
- Unsloth
Cluster Types
- SLURM clusters
- Kubernetes + Ray
- AWS SageMaker
- GCP Vertex AI
- Azure ML
- Lambda Labs
- CoreWeave
- Vast.ai
- Nebius
- On-premise
Notification Channels
- Slack
- PagerDuty
- iMessage/SMS (Twilio)
- Webhook
- Microsoft Teams
- Linear issues (auto-created)
Built for every team running distributed training
Stop losing training runs to failures your senior engineers used to debug in 4 hours
Research labs burn GPU budget at a rate that demands zero tolerance for manual debugging. When a 70B fine-tune dies at step 45,000 — 18 hours in — you can't afford to spend another 4 hours finding out why. Denpex tells you the root cause in seconds, with the exact fix, so you resume from checkpoint immediately instead of rerunning from scratch.
Your training logs contain your IP. We treat them that way.
PII/PHI Masking
Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
On-Premise Option
Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.
Encryption
All data encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on Enterprise.
Compliance
Working toward SOC 2 Type II certification. GDPR-ready data processing agreements available. HIPAA BAA available on Enterprise.
What's coming next
We're building the infrastructure layer for production ML training.
Pre-flight cluster health scan
Run a health check before launching an expensive job. Detect stale zombie processes, GPU memory fragmentation, version incompatibilities, and network partition issues before you waste a single training step.
Predictive failure scoring
Machine learning on your cluster's telemetry to predict GPU degradation, thermal throttling onset, and memory leak trajectories hours before they cause a failure.
Auto-remediation engine
For confirmed fix types, Denpex can automatically apply the fix: set the environment variable, kill zombie processes, adjust checkpointing frequency, and trigger a checkpoint resume — without human intervention.
Checkpoint integrity validator
Before relying on a checkpoint saved 18 hours ago, validate that it is loadable, complete, and consistent. Prevent the worst scenario: discovering your only resume point is corrupted after a failure.