# zechenzhangAGI / llama-cpp

# Install this skill:
npx skills add zechenzhangAGI/AI-research-SKILLs --skill "llama-cpp"

Installs the specified skill from a multi-skill repository.

# Description

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

# SKILL.md


name: llama-cpp
description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
dependencies: [llama-cpp-python]


llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp

Use llama.cpp when:
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python

Use TensorRT-LLM instead when:
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA

Use vLLM instead when:
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention

Quick start

Installation

# macOS/Linux
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIP=1

Download model

# Download from HuggingFace (GGUF format)
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
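
The same download can be scripted with the huggingface_hub package (a sketch, assuming pip install huggingface_hub; repo and filename match the CLI example above):

# Sketch: download a GGUF file programmatically
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="models/",
)
print(path)  # local path to the downloaded .gguf file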

Run inference

# Simple chat
./llama-cli \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    -p "Explain quantum computing" \
    -n 256  # Max tokens

# Interactive chat
./llama-cli \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --interactive
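
The same run can be scripted with the llama-cpp-python bindings listed under dependencies (pip install llama-cpp-python). A minimal sketch, assuming the GGUF file downloaded above:

# Minimal sketch using llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,  # context window
)

out = llm("Explain quantum computing", max_tokens=256)
print(out["choices"][0]["text"])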

Server mode

# Start OpenAI-compatible server
./llama-server \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 32  # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
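
Because llama-server exposes an OpenAI-compatible API, the standard openai Python client can be pointed at it as well. A sketch, assuming pip install openai; the api_key is a placeholder since the server does not require one unless configured to:

# Sketch: call llama-server through its OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-2-7b-chat",  # informational for llama-server
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(resp.choices[0].message.content)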

Quantization formats

GGUF format overview

| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|--------|------|-----------|-------|---------|----------|
| Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |

Choosing quantization

# General use (balanced)
Q4_K_M  # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S  # Lower bits to fit in memory
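
A rough rule of thumb for whether a quantization fits in memory: file size ≈ parameters × bits per weight / 8, plus KV-cache and runtime overhead. A back-of-the-envelope sketch (the approx_size_gb helper is illustrative, not part of llama.cpp; real GGUF files are slightly larger because some tensors stay at higher precision):

# Back-of-the-envelope GGUF size estimate (illustrative helper)
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # 1e9 params cancels against GB

print(f"{approx_size_gb(7, 4.5):.1f} GB")   # ~3.9 GB; the table above lists 4.1 GB with overhead
print(f"{approx_size_gb(70, 4.5):.1f} GB")  # ~39 GB, which is why 70B models often need lower-bit quants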

Hardware acceleration

Apple Silicon (Metal)

# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999  # Offload all layers

# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)

NVIDIA GPUs (CUDA)

# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest

AMD GPUs (ROCm)

# Build with ROCm
make LLAMA_HIP=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
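
In the Python bindings, -ngl corresponds to the n_gpu_layers parameter. A sketch, assuming llama-cpp-python was built with the matching backend (Metal, CUDA, or ROCm):

# Sketch: GPU offload from Python; -1 offloads all layers (like -ngl 999)
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,  # or e.g. 20 for hybrid CPU+GPU on large models
)
print(llm("Hello", max_tokens=32)["choices"][0]["text"])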

Common patterns

Batch processing

# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
    -m model.gguf \
    --batch-size 512 \
    -n 100
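
For scripted batch runs, a simple Python loop keeps the model loaded across prompts instead of restarting the process each time. A sketch, assuming a prompts.txt with one prompt per line:

# Sketch: reuse one loaded model across many prompts
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf")

with open("prompts.txt") as f:
    for prompt in f:
        out = llm(prompt.strip(), max_tokens=100)
        print(out["choices"][0]["text"])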

Constrained generation

# JSON output with grammar
./llama-cli \
    -m model.gguf \
    -p "Generate a person: " \
    --grammar-file grammars/json.gbnf

# Outputs valid JSON only
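
The same GBNF grammar can be applied through llama-cpp-python (a sketch; LlamaGrammar.from_file loads the .gbnf file and the grammar argument constrains sampling):

# Sketch: constrain output to the JSON grammar shipped with llama.cpp
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf")
grammar = LlamaGrammar.from_file("grammars/json.gbnf")

out = llm("Generate a person: ", grammar=grammar, max_tokens=128)
print(out["choices"][0]["text"])  # valid JSON only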

Context size

# Increase context (default 512)
./llama-cli \
    -m model.gguf \
    -c 4096  # 4K context window

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768  # 32K context
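
In llama-cpp-python the equivalent setting is the n_ctx constructor argument (a sketch):

# -c maps to n_ctx when loading the model from Python
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=4096)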

Performance benchmarks

CPU performance (Llama 2-7B Q4_K_M)

| CPU | Threads | Speed | Cost |
|-----|---------|-------|------|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |

GPU acceleration (Llama 2-7B Q4_K_M)

| GPU | Speed | vs CPU | Cost |
|-----|-------|--------|------|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |

Supported models

LLaMA family:
- Llama 2 (7B, 13B, 70B)
- Llama 3 / 3.1 (8B, 70B, 405B)
- Code Llama

Mistral family:
- Mistral 7B
- Mixtral 8x7B, 8x22B

Other:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision); Whisper (audio) is handled by the companion whisper.cpp project

Find models: https://huggingface.co/models?library=gguf

References

Resources

  • GitHub: https://github.com/ggerganov/llama.cpp
  • Models: https://huggingface.co/models?library=gguf
  • Discord: https://discord.gg/llama-cpp

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.