LLM Models Guide

TopicScanner uses different LLM models for different pipeline tasks. This guide covers which models are needed, their hardware requirements, and how to set them up.

Pipeline Tasks & Model Requirements

TopicScanner uses 5 specialized tasks, each with its own model configuration:

Task            Purpose                                         Default (Ollama)    Min VRAM    Fallback (Cloud)
Relevance       Score topic relevance to categories (0.0–1.0)   qwen3:14b           ~10 GB      gpt-4o-mini
Classification  Categorize topics into themes                   qwen3:14b           ~10 GB      gpt-4o-mini
Summarization   Summarize extracted content                     qwen3:32b           ~20 GB      gpt-4o
Embedding       Generate vector embeddings for dedup            nomic-embed-text    ~0.5 GB     text-embedding-3-small
Generation      Generate blog posts, scripts, newsletters       qwen3:32b           ~20 GB      gpt-4o
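Each task's model can be overridden independently through environment variables (the `LLM_OLLAMA_MODEL_*` names listed under Alternative Models). For example, pinning the defaults explicitly:

```shell
# Pin each pipeline task to its default model
export LLM_OLLAMA_MODEL_RELEVANCE=qwen3:14b
export LLM_OLLAMA_MODEL_CLASSIFICATION=qwen3:14b
export LLM_OLLAMA_MODEL_SUMMARIZATION=qwen3:32b
export LLM_OLLAMA_MODEL_EMBEDDING=nomic-embed-text
export LLM_OLLAMA_MODEL_GENERATION=qwen3:32b
```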

Ollama Setup (Primary Provider)

1. Install Ollama

# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

2. Pull Required Models

Pull all 3 models (qwen3:14b, qwen3:32b, nomic-embed-text):

# Embedding model (small, fast — pull first)
ollama pull nomic-embed-text

# Relevance + Classification model (14B parameters)
ollama pull qwen3:14b

# Summarization + Generation model (32B parameters)
ollama pull qwen3:32b

3. Verify Models

ollama list
# Should show:
# NAME               ID     SIZE     MODIFIED
# qwen3:14b          ...    9.0 GB   ...
# qwen3:32b          ...    19 GB    ...
# nomic-embed-text   ...    274 MB   ...
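A small loop can flag anything that didn't pull correctly (requires a running Ollama instance, so it is sketched here rather than guaranteed output):

```shell
# Warn about any required model missing from `ollama list`
for m in nomic-embed-text qwen3:14b qwen3:32b; do
  ollama list | grep -q "^$m" || echo "missing: $m"
done
```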

4. Test Models

# Test chat model
ollama run qwen3:14b "What is Kubernetes?"

# Test embedding model
curl http://localhost:11434/api/embed -d '{"model":"nomic-embed-text","input":"test"}'
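If jq is installed, you can also confirm the embedding width, which must later match the pgvector dimensions setting (this assumes the Ollama server is running locally):

```shell
# nomic-embed-text returns 768-dimensional vectors
curl -s http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text","input":"test"}' |
  jq '.embeddings[0] | length'
```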

Hardware Requirements

Minimum (relevance + classification + embedding only)

  • GPU: 12 GB VRAM (e.g., RTX 3060 12GB, RTX 4070)
  • RAM: 16 GB system RAM
  • Models: qwen3:14b + nomic-embed-text
  • Note: Summarization and generation will fall back to cloud provider

Recommended (all tasks local)
  • GPU: 24 GB VRAM (e.g., RTX 3090, RTX 4090, A5000)
  • RAM: 32 GB system RAM
  • Models: All 3 models
  • Note: 32B model requires ~20 GB VRAM; Ollama swaps to RAM if VRAM insufficient (slower)

Apple Silicon

  • M1/M2 Pro (16 GB): Can run 14B + embedding. 32B will be slow (memory pressure).
  • M1/M2 Max (32 GB): All models run well. Unified memory is shared with GPU.
  • M1/M2 Ultra (64+ GB): Ideal. Can run multiple models concurrently.
  • M3/M4 variants: Same guidelines based on unified memory.

CPU-Only (No GPU)

Ollama can run on CPU but will be significantly slower:

  • nomic-embed-text: Fast on CPU (~100ms per embedding)
  • qwen3:14b: Usable (~10-30s per response)
  • qwen3:32b: Very slow (~60-120s per response) — recommend cloud fallback

Kubernetes Deployment

When running Ollama in Kubernetes, you have two options:

Option A: External Ollama Host

Run Ollama on a GPU machine accessible from the cluster:

# Helm values
llm:
  provider: ollama
  ollama:
    url: http://ollama-host.internal:11434
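To verify the cluster can actually reach the external host (using the hostname from the values above), a throwaway curl pod works; this is a sketch that assumes the host resolves via cluster DNS:

```shell
# One-off pod that queries the external Ollama host's model list
kubectl run ollama-check --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s http://ollama-host.internal:11434/api/tags
```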

Option B: Ollama in Kubernetes

Deploy Ollama as a pod with GPU access:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1  # Requires NVIDIA device plugin
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi  # Enough for all 3 models

Then pull models inside the pod:

kubectl exec -it deploy/ollama -- ollama pull nomic-embed-text
kubectl exec -it deploy/ollama -- ollama pull qwen3:14b
kubectl exec -it deploy/ollama -- ollama pull qwen3:32b
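After pulling, a quick smoke test from your workstation confirms the in-cluster service works end to end (sketch; requires cluster access):

```shell
# Forward the Ollama service locally, then list models through it
kubectl port-forward svc/ollama 11434:11434 & PF_PID=$!
sleep 2
curl -s http://localhost:11434/api/tags
kill "$PF_PID"  # stop the port-forward
```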

Alternative Models

You can swap models by overriding environment variables:

Task            Env Variable                       Alternatives
Relevance       LLM_OLLAMA_MODEL_RELEVANCE         llama3.1:8b, mistral:7b, gemma2:9b
Classification  LLM_OLLAMA_MODEL_CLASSIFICATION    llama3.1:8b, mistral:7b, gemma2:9b
Summarization   LLM_OLLAMA_MODEL_SUMMARIZATION     llama3.1:70b, qwen3:72b, mixtral:8x7b
Embedding       LLM_OLLAMA_MODEL_EMBEDDING         mxbai-embed-large, all-minilm
Generation      LLM_OLLAMA_MODEL_GENERATION        llama3.1:70b, qwen3:72b, mixtral:8x7b

Model Selection Tips

  • Smaller models (7-8B): Faster, less VRAM, slightly lower quality. Good for relevance/classification.
  • Larger models (32-72B): Better quality for summarization and content generation. Need more VRAM.
  • Embedding models: nomic-embed-text is the best balance of quality/speed. mxbai-embed-large is slightly better but larger.
  • Quantization: Ollama models are quantized (Q4_K_M by default). For better quality at the cost of more VRAM, pull a higher-precision tag such as an fp16 variant (e.g., qwen3:14b-fp16, where available).

Cloud Fallback Configuration

When Ollama is unavailable or a task fails, TopicScanner falls back to a cloud provider:

OpenAI

# Helm values
llm:
  fallback:
    enabled: true
    provider: openai
    apiKey: "sk-..."

Or via environment:

LLM_CLOUD_FALLBACK=openai
OPENAI_API_KEY=sk-...

Anthropic (Claude)

LLM_CLOUD_FALLBACK=claude
ANTHROPIC_API_KEY=sk-ant-...

Cost Estimates (Cloud Fallback)

Assuming ~1000 topics scanned per day:

Task            Model                    Est. Tokens/Day    Est. Cost/Day
Relevance       gpt-4o-mini              ~500K              ~$0.08
Classification  gpt-4o-mini              ~300K              ~$0.05
Summarization   gpt-4o                   ~2M                ~$10.00
Embedding       text-embedding-3-small   ~1M                ~$0.02
Generation      gpt-4o                   ~3M                ~$15.00
Total                                                       ~$25/day

Note: Running locally with Ollama costs $0/day after hardware investment. Cloud fallback is for reliability, not primary use.

Troubleshooting

Ollama Not Responding

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Restart Ollama
systemctl restart ollama  # Linux
brew services restart ollama  # macOS
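If a restart doesn't help, the service logs usually show why (Linux install managed by systemd):

```shell
# Last 50 log lines from the Ollama service
journalctl -u ollama -n 50 --no-pager
```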

Model Too Slow

  • Check GPU utilization: nvidia-smi (NVIDIA) or sudo powermetrics --samplers gpu_power (macOS)
  • If running on CPU, consider using smaller models or enabling cloud fallback
  • Reduce max-results-per-scan to process fewer topics per cycle

Out of Memory

  • Use smaller models: replace qwen3:32b with qwen3:14b for summarization/generation
  • Enable cloud fallback for the heavy tasks (summarization, generation) and run only relevance/classification/embedding locally
  • Set OLLAMA_NUM_PARALLEL=1 to prevent concurrent model loading
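Note that OLLAMA_NUM_PARALLEL must be visible to the Ollama server process itself, not just your shell; with the systemd install on Linux, a drop-in is the usual way to set it:

```shell
# Create a systemd drop-in that sets the variable for the service
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
sudo systemctl restart ollama
```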

Embedding Dimension Mismatch

If you switch embedding models, the vector dimensions must match the database:

# Helm values — must match your embedding model's output dimensions
pgvector:
  dimensions: 768  # nomic-embed-text = 768, text-embedding-3-small = 1536

If you change models, you need to re-embed all existing topics or reset the embeddings table.