Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pipeline parallelism,...
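A minimal ZeRO sketch, assuming `pip install deepspeed`, a CUDA GPU, and launching via the `deepspeed` CLI so the distributed environment is set up; the toy model and hyperparameters are placeholders:

```python
# Launch with: deepspeed train.py
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    # Stage 1 shards optimizer states, stage 2 adds gradients, stage 3 adds parameters.
    "zero_optimization": {"stage": 2},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
loss = engine(x).float().pow(2).mean()
engine.backward(loss)  # DeepSpeed handles loss scaling and gradient reduction
engine.step()
```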
Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies...
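Megatron-Core training is normally driven by full launch scripts, so this is only a hedged sketch of the entry point: initializing the parallel state under `torchrun` with 8 GPUs. The 2x2 tensor/pipeline split is illustrative.

```python
# Launch with: torchrun --nproc_per_node=8 setup_parallel.py
# Assumes `pip install megatron-core`; model and optimizer setup omitted.
import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")

# 8 GPUs split as tensor parallel 2 x pipeline parallel 2 x data parallel 2.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
)
```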
Expert guidance for Fully Sharded Data Parallel training with PyTorch FSDP - parameter sharding, mixed precision,...
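A minimal FSDP sketch with bf16 mixed precision, assuming PyTorch 2.x and a `torchrun` launch; the toy MLP is a placeholder for a real network:

```python
# Launch with: torchrun --nproc_per_node=2 train_fsdp.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()

# Compute and communication run in bf16; the sharded master params stay fp32.
mp = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=mp)  # shards parameters across ranks

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
opt.step()
```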
High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks...
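A minimal Trainer sketch using the `lightning.pytorch` API (assumes `pip install lightning`); the model and synthetic data are placeholders:

```python
import torch
import lightning as L

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)  # routed to whatever logger is attached
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

dataset = torch.utils.data.TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

# Swapping strategy for "ddp", "fsdp", or "deepspeed" changes the distribution backend.
trainer = L.Trainer(max_epochs=1, accelerator="auto", devices="auto", strategy="auto")
trainer.fit(LitModel(), loader)
```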
Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of...
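A minimal Ray Train sketch, assuming `pip install "ray[train]"`; the same loop body would apply to a real model, with `prepare_model` wrapping PyTorch objects for distributed execution:

```python
import torch
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Moves the model to the right device and wraps it for distributed training.
    model = ray.train.torch.prepare_model(torch.nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(10):
        loss = model(torch.randn(8, 32)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# Scaling out is just a ScalingConfig change; the loop itself stays the same.
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```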
Reserved and on-demand GPU cloud instances for ML training and inference. Use when you need dedicated GPU instances...
Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without...
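A minimal Modal sketch, assuming `pip install modal` and `modal setup` for auth; the app name and GPU type are illustrative:

```python
# Run with: modal run gpu_demo.py
import modal

app = modal.App("gpu-demo")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def check_gpu() -> str:
    # This body executes in a GPU container Modal provisions on demand.
    import torch
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main():
    print(check_gpu.remote())
```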
Multi-cloud orchestration for ML workloads with automatic cost optimization. Use when you need to run training or...
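A hedged sketch of SkyPilot's Python API, assuming `pip install skypilot` and at least one configured cloud; the accelerator spec and cluster name are placeholders:

```python
import sky

task = sky.Task(
    setup="pip install torch",
    run="python -c 'import torch; print(torch.cuda.is_available())'",
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# SkyPilot searches configured clouds for the cheapest offering that fits the request.
sky.launch(task, cluster_name="demo")
```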
Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when...
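A minimal AutoAWQ sketch, assuming `pip install autoawq` and a CUDA GPU; the model ID and quant settings are illustrative, not prescriptive:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # placeholder model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# AWQ rescales the most activation-salient weight channels before 4-bit quantization.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```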
Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is...
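A minimal 4-bit loading sketch via the transformers bitsandbytes integration, assuming `pip install transformers accelerate bitsandbytes` and a CUDA GPU; the model ID is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # quantizes the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model
    quantization_config=bnb_config,
    device_map="auto",
)
```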
Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when...
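A minimal sketch of enabling FlashAttention-2 through transformers, assuming `pip install flash-attn` built for your GPU and a recent transformers release; the model ID is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder model
    torch_dtype=torch.bfloat16,              # flash-attn requires fp16 or bf16
    attn_implementation="flash_attention_2", # swap in the fused attention kernels
    device_map="auto",
)
```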
GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer...
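A minimal GGUF sketch with llama-cpp-python, assuming `pip install llama-cpp-python`; the repo ID is illustrative, and the quant level (here Q4_K_M, a common quality/size tradeoff) is chosen to fit RAM or VRAM:

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # placeholder repo
    filename="*Q4_K_M.gguf",  # glob for the desired quantization level
    n_gpu_layers=-1,          # offload all layers if a GPU backend is available
)
out = llm("Q: What is 2+2? A:", max_tokens=8)
print(out["choices"][0]["text"])
```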
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on...
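A minimal GPTQ sketch via the transformers integration (backed by optimum and a GPTQ kernel package), assuming a CUDA GPU; the small model ID is illustrative, and the same flow applies at 70B+ scale:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantization runs layer by layer over a small calibration set ("c4" here).
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
model.save_pretrained("opt-125m-gptq")
```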
Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision...
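A minimal HQQ sketch via transformers' HqqConfig, assuming `pip install hqq` and a recent transformers release; note that no calibration dataset appears anywhere, since HQQ solves for quantization parameters directly from the weights:

```python
from transformers import AutoModelForCausalLM, HqqConfig

hqq_config = HqqConfig(nbits=4, group_size=64)  # 4-bit, grouped quantization
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # placeholder model
    quantization_config=hqq_config,
    device_map="auto",
)
```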
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when...
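The harness itself is CLI-driven, so as a self-contained illustration of the pass@k metric it reports, here is Hugging Face's `evaluate` code_eval metric instead; executing model-generated code is unsafe by default, hence the opt-in environment variable:

```python
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # opt in to running untrusted generated code

import evaluate

code_eval = evaluate.load("code_eval")
pass_at_k, results = code_eval.compute(
    references=["assert add(2, 3) == 5"],                # one test per problem
    predictions=[["def add(a, b):\n    return a + b"]],  # k samples per problem
    k=[1],
)
print(pass_at_k)  # {'pass@1': 1.0}
```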
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking...
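A minimal sketch of the harness's Python API, assuming `pip install lm-eval`; the model and task are illustrative:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])  # per-task accuracy and stderr
```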
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend...
Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment,...
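A minimal CPU/Apple Silicon inference sketch with llama-cpp-python (on macOS the Metal backend is used when available); the model path is a placeholder for a local GGUF file:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model.Q4_K_M.gguf",  # placeholder local GGUF file
    n_ctx=4096,
    n_gpu_layers=-1,  # offload to Metal/CUDA if present; 0 forces pure CPU
)
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Name three uses of edge inference."}],
    max_tokens=128,
)
print(reply["choices"][0]["message"]["content"])
```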
Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs,...
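A minimal sketch of SGLang's frontend DSL, assuming `pip install "sglang[all]"` and a server already launched via `python -m sglang.launch_server` on its default port 30000; the regex constrains decoding so "age" can only ever be digits:

```python
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def character(s, name):
    s += "Name: " + name + "\n"
    # Constrained decoding: the model can only emit strings matching the regex.
    s += "Age: " + sgl.gen("age", regex=r"[0-9]{1,3}") + "\n"

state = character.run(name="Ada")
print(state["age"])
```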
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production...
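A hedged sketch of the high-level LLM API in recent TensorRT-LLM releases, which mirrors vLLM's interface; assumes NVIDIA hardware with `pip install tensorrt-llm`, and the model ID is illustrative:

```python
from tensorrt_llm import LLM, SamplingParams

# Engine compilation happens on first load, trading startup time for fast kernels.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```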
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production...
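A minimal offline-batch sketch, assuming `pip install vllm` and a CUDA GPU; the same model can be served online with an OpenAI-compatible API via `vllm serve <model>`:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=32)

# PagedAttention and continuous batching are handled internally by the engine.
for out in llm.generate(["The future of inference is"], params):
    print(out.outputs[0].text)
```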
Track ML experiments, manage model registry with versioning, deploy models to production, and reproduce experiments...
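A minimal tracking sketch, assuming `pip install mlflow`; runs log to ./mlruns by default, `mlflow ui` browses them, and the model registry adds versioning on top:

```python
import mlflow

mlflow.set_experiment("demo")
with mlflow.start_run():
    mlflow.log_param("lr", 1e-3)           # hyperparameters
    for step in range(3):
        mlflow.log_metric("loss", 1.0 / (step + 1), step=step)  # time series
```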
Visualize training metrics, debug models with histograms, compare experiments, visualize model graphs, and profile...
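A minimal logging sketch using PyTorch's SummaryWriter, assuming `pip install tensorboard`; view the results with `tensorboard --logdir runs`:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # writes event files under ./runs/
for step in range(100):
    writer.add_scalar("loss", 1.0 / (step + 1), step)          # scalar curves
    writer.add_histogram("weights", torch.randn(1000), step)   # distribution drift
writer.close()
```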
Track ML experiments with automatic logging, visualize training in real-time, optimize hyperparameters with sweeps,...
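A minimal W&B sketch, assuming `pip install wandb` and `wandb login`; metrics stream to the dashboard in real time, and sweeps reuse the same log calls for hyperparameter search:

```python
import wandb

run = wandb.init(project="demo", config={"lr": 1e-3, "epochs": 3})
for epoch in range(run.config.epochs):
    wandb.log({"epoch": epoch, "loss": 1.0 / (epoch + 1)})
run.finish()
```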