Pricing that pays for itself before the month ends
Every plan is free for your first 3 diagnoses. No credit card required.
Free
For individual researchers and small experiments.
- ✓3 diagnoses per month
- ✓Manual log paste (web UI)
- ✓15-type failure classification
- ✓Prescriptive fix output
- ✓7-day history
- ✓1 seat
Team
For ML teams running regular training jobs.
- Everything in Free, plus:
- ✓Unlimited diagnoses
- ✓Up to 64 GPUs monitored
- ✓Automatic Slack + email alerts
- ✓iMessage/SMS notifications (Twilio)
- ✓Multi-rank cascade analysis
- ✓Cross-run comparison (last 5 runs)
- ✓Team knowledge base (shared fixes)
- ✓5 seats
- ✓90-day history
Scale
For scale-ups and serious training infrastructure.
- Everything in Team, plus:
- ✓Up to 512 GPUs monitored
- ✓On-premise agent (logs never leave your cluster)
- ✓Silent data corruption (SDC) detection
- ✓Straggler and gray failure detection
- ✓Zombie process detection + auto-kill
- ✓Checkpoint weight delta analysis (per-layer instability trace)
- ✓Cross-run comparison (unlimited run history)
- ✓Version compatibility database (PyTorch × CUDA × cuDNN)
- ✓Checkpoint integrity validation
- ✓Unlimited seats
- ✓Priority support (4-hr SLA)
- ✓1-year history
Data Center
For GPU cloud providers and enterprise data centers. Custom contracts available.
- Everything in Scale, plus:
- ✓Unlimited GPUs
- ✓White-label and OEM options
- ✓Multi-tenant deployment
- ✓Dedicated Customer Success Manager
- ✓99.9% uptime SLA with credits
- ✓GDPR and HIPAA support; SOC 2 Type II certification in progress
- ✓Log PII/PHI masking (configurable)
- ✓Custom knowledge base ingestion
- ✓Integration with SLURM, Ray, Kubernetes schedulers
- ✓Predictive failure scoring (coming Q3 2026)
- ✓Auto-remediation engine (coming Q4 2026)
- ✓Custom contracts, invoicing, and procurement
Frequently asked questions
Do you store our training logs?
Scale and Data Center plans: The Denpex agent runs entirely within your VPC. Only anonymized failure signatures and resolution metadata are transmitted, never raw logs.
Free and Team plans: Logs are encrypted in transit and at rest, then processed and deleted within 24 hours. We never store raw training data.
What frameworks do you support?
PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support is on the roadmap.
How accurate is the diagnosis?
99.7% accuracy on the 8 most common failure types (CUDA OOM, NCCL cascade, gradient explosion, NaN loss, checkpoint corruption, import error, version mismatch, device assert). For novel failures (Tier 4), Denpex falls back to an LLM-inferred diagnosis and flags it with a confidence score below 60%.
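To make the two paths concrete, here is a minimal sketch of rule-first classification with a fallback for anything unrecognized. It is not Denpex's actual rule set: the pattern list, the `Diagnosis` type, and the `classify` function are hypothetical names chosen for illustration, and only a few of the failure types above are covered.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative patterns only (assumed, not Denpex's real rules).
KNOWN_FAILURES = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("device_assert", re.compile(r"device-side assert triggered", re.IGNORECASE)),
    ("nccl_error", re.compile(r"\bNCCL\b.*\b(error|timeout)\b", re.IGNORECASE)),
    ("import_error", re.compile(r"\b(ModuleNotFoundError|ImportError)\b")),
    ("nan_loss", re.compile(r"\bloss\b.*\bnan\b", re.IGNORECASE)),
]


@dataclass
class Diagnosis:
    failure_type: str
    source: str  # "rule" for a pattern match, "llm_fallback" otherwise
    matched_line: Optional[str]


def classify(log_text: str) -> Diagnosis:
    """Return the first known failure found in the log, scanning top to bottom."""
    for line in log_text.splitlines():
        for name, pattern in KNOWN_FAILURES:
            if pattern.search(line):
                return Diagnosis(name, "rule", line.strip())
    # Nothing matched: a real system would hand the log to a slower,
    # lower-confidence path here (for example an LLM).
    return Diagnosis("unknown", "llm_fallback", None)
```

Calling `classify` on a log containing `RuntimeError: CUDA out of memory` returns a rule-based `cuda_oom` diagnosis, while a log with no recognized pattern is routed to the fallback path.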
Your training logs contain your IP. We treat them that way.
PII/PHI Masking
Before any log is transmitted or processed, Denpex's client-side masking engine scans it for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before it leaves your environment.
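As a rough illustration of client-side masking, the sketch below replaces pattern matches with the [MASKED] token before a log line goes anywhere. The patterns are assumptions for the example rather than Denpex's shipped rules: the `MRN` format is hypothetical, and names are left out because they cannot be matched reliably with a regular expression alone.

```python
import re

MASK = "[MASKED]"

# Illustrative patterns (assumed for this sketch).
PATTERNS = [
    # Email addresses
    re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # US Social Security numbers written as NNN-NN-NNNN
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Hypothetical medical record number format: "MRN" followed by 6-10 digits
    re.compile(r"\bMRN[:=\s]*\d{6,10}\b", re.IGNORECASE),
]


def mask_line(line: str) -> str:
    """Replace every pattern match in a single log line with the mask token."""
    for pattern in PATTERNS:
        line = pattern.sub(MASK, line)
    return line


def mask_log(text: str) -> str:
    """Mask a whole log, line by line, before it leaves the machine."""
    return "\n".join(mask_line(line) for line in text.splitlines())
```

For example, `mask_line("user=jane.doe@example.com ssn=123-45-6789")` returns `user=[MASKED] ssn=[MASKED]`.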
On-Premise Option
Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.
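One way to picture an anonymized failure signature: strip run-specific details from the error line, then hash what remains. This is a sketch under stated assumptions, not the Denpex agent's actual format; `normalize` and `failure_signature` are hypothetical helpers, and a production scheme would need more than the three substitutions shown here.

```python
import hashlib
import re


def normalize(error_line: str) -> str:
    """Remove details that vary per run or could identify a user or machine."""
    s = re.sub(r"(/[\w.\-]+)+", "<PATH>", error_line)  # file system paths
    s = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", s)          # memory addresses
    s = re.sub(r"\d+(\.\d+)?", "<NUM>", s)             # sizes, line numbers, ranks
    return re.sub(r"\s+", " ", s).strip()


def failure_signature(error_line: str) -> str:
    """Short, stable identifier for a class of failure; the raw text is not included."""
    digest = hashlib.sha256(normalize(error_line).encode("utf-8")).hexdigest()
    return digest[:16]
```

Two out-of-memory errors raised from different paths, with different allocation sizes, normalize to the same string and so produce the same signature, which is what lets fixes be matched across runs without sending the log itself.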
Encryption
All data is encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on the Data Center plan.
Compliance
We are working toward SOC 2 Type II certification. GDPR-ready data processing agreements are available. A HIPAA BAA is available on the Data Center plan.