Pricing that pays for itself before the month ends

Every plan is free for your first 3 diagnoses. No credit card required.

Free

$0/month

For individual researchers and small experiments.

  • 3 diagnoses per month
  • Manual log paste (web UI)
  • 15-type failure classification
  • Prescriptive fix output
  • 7-day history
  • 1 seat
Best for most teams

Team

$499/month

For ML teams running regular training jobs.

  • Everything in Free, plus:
  • Unlimited diagnoses
  • Up to 64 GPUs monitored
  • Automatic Slack + email alerts
  • iMessage/SMS notifications (Twilio)
  • Multi-rank cascade analysis
  • Cross-run comparison (last 5 runs)
  • Team knowledge base (shared fixes)
  • 5 seats
  • 90-day history
Most popular

Scale

$2,499/month

For scale-ups and serious training infrastructure.

  • Everything in Team, plus:
  • Up to 512 GPUs monitored
  • On-premise agent (logs never leave your cluster)
  • Silent data corruption (SDC) detection
  • Straggler and gray failure detection
  • Zombie process detection + auto-kill
  • Checkpoint weight delta analysis (per-layer instability trace)
  • Cross-run comparison (unlimited run history)
  • Version compatibility database (PyTorch × CUDA × cuDNN)
  • Checkpoint integrity validation
  • Unlimited seats
  • Priority support (4-hr SLA)
  • 1-year history

Data Center

$9,999/month

For GPU cloud providers and enterprise data centers. Custom contracts available.

  • Everything in Scale, plus:
  • Unlimited GPUs
  • White-label and OEM options
  • Multi-tenant deployment
  • Dedicated Customer Success Manager
  • 99.9% uptime SLA with credits
  • GDPR, HIPAA, SOC 2 Type II compliance (SOC 2 Type II certification in progress)
  • Log PII/PHI masking (configurable)
  • Custom knowledge base ingestion
  • Integration with SLURM, Ray, Kubernetes schedulers
  • Predictive failure scoring (coming Q3 2026)
  • Auto-remediation engine (coming Q4 2026)
  • Custom contracts, invoicing, and procurement

Frequently asked questions

Do you store our training logs?

Scale and Data Center plans: The Denpex agent runs entirely within your VPC. Only anonymized failure signatures and resolution metadata are transmitted, never raw logs.

Free and Team plans: Logs are encrypted in transit and at rest, then processed and deleted within 24 hours. We never store raw training data.

What frameworks do you support?

PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support is on the roadmap.

How accurate is the diagnosis?

Denpex reaches 99.7% accuracy on the 8 most common failure types (CUDA OOM, NCCL cascade, gradient explosion, NaN loss, checkpoint corruption, import error, version mismatch, device assert). For novel failures (Tier 4), Denpex falls back to an LLM-inferred diagnosis, reported with a confidence score below 60%.

Your training logs contain your IP. We treat them that way.

PII/PHI Masking

Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
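
To make the idea concrete, here is a minimal sketch of pattern-based masking in Python. It is an illustration only, not Denpex's actual engine: the pattern list and the function names (mask_line, mask_log) are assumptions, and it covers just emails and US-style SSNs. Names and medical record patterns need more than simple regular expressions and are left out.

```python
import re

# Hypothetical patterns for illustration; not Denpex's real rule set.
MASK_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),  # email addresses
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # US SSN format
]

MASK_TOKEN = "[MASKED]"


def mask_line(line: str) -> str:
    """Replace every match of every pattern in one log line with the mask token."""
    for pattern in MASK_PATTERNS:
        line = pattern.sub(MASK_TOKEN, line)
    return line


def mask_log(text: str) -> str:
    """Mask a multi-line log before it leaves the local environment."""
    return "\n".join(mask_line(line) for line in text.splitlines())
```

Running mask_log on a log locally, before upload, means the matched values never appear in what is transmitted. Lines without a match pass through unchanged.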

On-Premise Option

Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.


Encryption

All data is encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on Enterprise.

Compliance

We are working toward SOC 2 Type II certification. GDPR-ready data processing agreements are available. A HIPAA BAA is available on Enterprise.
