Configuration

Application Configuration

The pipeline-service reads its configuration from `application.yml`. All values can be overridden via environment variables.
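For example, any setting can be overridden at startup by exporting the corresponding variable (names are listed in the tables below); a quick sketch:

```shell
# Override two pipeline settings via environment variables,
# then start the service as usual.
export PIPELINE_RELEVANCE_THRESHOLD=0.8
export PIPELINE_POLL_INTERVAL=15
echo "relevance=$PIPELINE_RELEVANCE_THRESHOLD poll=$PIPELINE_POLL_INTERVAL"
```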

Pipeline Settings

```yaml
topicscanner:
  pipeline:
    scan-cron: "0 0 6 * * *"          # When to run daily scans
    poll-interval-seconds: 30           # Job queue polling interval
    max-results-per-scan: 25            # Max topics per source per scan
    stale-job-minutes: 10               # Reset stuck jobs after this
    relevance-threshold: 0.7            # LLM relevance score cutoff (0.0–1.0)
    quality-threshold: 0.5              # Quality score cutoff (0.0–1.0)
    min-content-length: 200             # Minimum extracted content chars
    dedup-similarity-threshold: 0.95    # Embedding cosine similarity cutoff
    max-retries: 3                      # Max retry attempts per job
```

| Env Variable | Config Key | Default |
|---|---|---|
| `PIPELINE_SCAN_CRON` | `topicscanner.pipeline.scan-cron` | `0 0 6 * * *` |
| `PIPELINE_POLL_INTERVAL` | `topicscanner.pipeline.poll-interval-seconds` | `30` |
| `PIPELINE_MAX_RESULTS_PER_SCAN` | `topicscanner.pipeline.max-results-per-scan` | `25` |
| `PIPELINE_STALE_JOB_MINUTES` | `topicscanner.pipeline.stale-job-minutes` | `10` |
| `PIPELINE_RELEVANCE_THRESHOLD` | `topicscanner.pipeline.relevance-threshold` | `0.7` |
| `PIPELINE_QUALITY_THRESHOLD` | `topicscanner.pipeline.quality-threshold` | `0.5` |
| `PIPELINE_MIN_CONTENT_LENGTH` | `topicscanner.pipeline.min-content-length` | `200` |
| `PIPELINE_DEDUP_THRESHOLD` | `topicscanner.pipeline.dedup-similarity-threshold` | `0.95` |
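As an illustration of how the dedup cutoff behaves, a minimal sketch (not the service's actual code) applying a 0.95 cosine-similarity threshold to embedding vectors:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_duplicate(candidate, existing, threshold=0.95):
    # Mirrors dedup-similarity-threshold: at or above the cutoff,
    # the candidate topic is treated as a duplicate of an existing one.
    return cosine_similarity(candidate, existing) >= threshold

# Near-identical embeddings exceed the cutoff...
print(is_duplicate([0.1, 0.9, 0.4], [0.1, 0.9, 0.41]))   # True
# ...while unrelated (orthogonal) ones do not.
print(is_duplicate([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))    # False
```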

LLM Settings

Each LLM provider supports task-specific models — use a cheap model for scoring and a powerful model for generation.

```yaml
topicscanner:
  llm:
    primary: ollama                     # Primary provider: ollama | openai | claude
    cloud-fallback: openai              # Fallback when primary fails

    ollama:
      url: http://localhost:11434
      timeout-seconds: 120
      models:
        relevance: qwen2.5:14b
        classification: qwen2.5:14b
        summarization: qwen2.5:32b
        embedding: nomic-embed-text
        generation: qwen2.5:32b

    openai:
      api-key: ${OPENAI_API_KEY:}
      timeout-seconds: 60
      models:
        relevance: gpt-4o-mini
        classification: gpt-4o-mini
        summarization: gpt-4o
        embedding: text-embedding-3-small
        generation: gpt-4o

    claude:
      api-key: ${ANTHROPIC_API_KEY:}
      timeout-seconds: 60
      models:
        relevance: claude-haiku-4-5-20251001
        classification: claude-haiku-4-5-20251001
        summarization: claude-sonnet-4-6
        generation: claude-sonnet-4-6
```

| Env Variable | Description |
|---|---|
| `LLM_PRIMARY` | Primary LLM provider |
| `LLM_CLOUD_FALLBACK` | Fallback provider |
| `LLM_OLLAMA_URL` | Ollama server URL |
| `OPENAI_API_KEY` | OpenAI API key |
| `ANTHROPIC_API_KEY` | Anthropic Claude API key |

**Note:** Claude does not support embeddings. When using Claude as the primary provider, configure Ollama or OpenAI as the fallback for embedding tasks.
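For instance, a sketch of a Claude-primary configuration that keeps OpenAI available as the fallback (key names taken from the block above):

```yaml
topicscanner:
  llm:
    primary: claude
    cloud-fallback: openai        # also covers embedding tasks, which Claude lacks
    claude:
      api-key: ${ANTHROPIC_API_KEY:}
    openai:
      api-key: ${OPENAI_API_KEY:}
```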

Database Settings

```yaml
spring:
  datasource:
    url: ${DATABASE_URL:jdbc:postgresql://localhost:5432/topicscanner}
    username: ${DATABASE_USERNAME:topicscanner}
    password: ${DATABASE_PASSWORD:topicscanner}
```
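In a containerized setup, the same variables can be supplied via the container environment; a docker-compose sketch (service, image, and host names are assumptions):

```yaml
services:
  pipeline-service:
    image: pipeline-service:latest    # image name is an example
    environment:
      DATABASE_URL: jdbc:postgresql://postgres:5432/topicscanner
      DATABASE_USERNAME: topicscanner
      DATABASE_PASSWORD: ${DB_PASSWORD}   # example: injected from the host
```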

Scanner Plugin Directory

```yaml
topicscanner:
  scanner:
    plugins-dir: ${SCANNER_PLUGINS_DIR:plugins}
```

Helm Values

See Deployment for the full Helm values reference.

Key values:

| Value | Description | Default |
|---|---|---|
| `llm.provider` | LLM provider | `ollama` |
| `llm.model` | Model name | `llama3` |
| `llm.apiKey` | API key | `""` |
| `llm.ollama.url` | Ollama URL | `http://ollama:11434` |
| `llm.fallback.enabled` | Enable fallback LLM | `false` |
| `pgvector.enabled` | Enable embeddings | `true` |
| `pgvector.dimensions` | Vector dimensions | `1536` |
| `scanners.reddit.enabled` | Enable Reddit | `false` |
| `scanners.stackoverflow.enabled` | Enable StackOverflow | `true` |
| `scanners.medium.enabled` | Enable Medium | `true` |
| `scanners.devto.enabled` | Enable Dev.to | `true` |
| `scanners.hashnode.enabled` | Enable Hashnode | `true` |
| `scanners.youtube.enabled` | Enable YouTube | `false` |
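These values can be overridden at install time; for example, with an override file (file name and key choices are examples, using keys from the table above):

```yaml
# values-prod.yaml -- example Helm overrides
llm:
  provider: openai
  apiKey: "<your-api-key>"
scanners:
  reddit:
    enabled: true
```

Applied with `helm upgrade --install pipeline-service ./chart -f values-prod.yaml` (release name and chart path are assumptions).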