open source · python · stdlib http · mit

one prompt. two models. one diff.

see which llm does your task better. tokens, latency, cost, all lined up. run both, eyeball it, pick one. lol that's it.

      prompt                 prompt
        ↓                       ↓
 ┌─────────────────┐   ┌─────────────────┐
 │ openai/gpt-4o   │   │ anthropic/claude│
 │                 │   │                 │
 │ response ≈≈≈≈≈≈ │   │ response ≈≈≈≈≈≈ │
 │ ≈≈≈≈≈≈≈≈≈≈≈≈≈≈≈ │   │ ≈≈≈≈≈≈≈≈≈≈      │
 │                 │   │                 │
 │ 84t · 2.1s      │   │ 67t · 1.4s      │
 │ $0.00021        │   │ $0.00018        │
 └─────────────────┘   └─────────────────┘

               same prompt

what it actually prints

real output from llmdiff "explain tcp vs udp in 3 sentences". two columns, aligned, metrics at the bottom.

$ llmdiff "explain tcp vs udp in 3 sentences"

openai/gpt-4o-mini                       anthropic/claude-3-5-haiku
─────────────────────────────────        ─────────────────────────────────
TCP is connection-oriented.              TCP establishes a reliable,
Guarantees ordered, reliable             ordered, connection-based link
delivery. UDP is connectionless,         between hosts. UDP is connection-
faster, no guarantees.                   less and best-effort — fire
                                         and forget.

tokens in:  14                           tokens in:  14
tokens out: 84                           tokens out: 67
latency:    2.1s                         latency:    1.4s
cost (est): $0.00021                     cost (est): $0.00018

openai + anthropic out of the box

no config gymnastics. drop in your keys, pick a pair of models, run. default is gpt-4o-mini vs claude-3-5-haiku.
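"drop in your keys" means two env vars, one per provider. a sketch of what that key loading amounts to — load_keys is illustrative, not llmdiff's actual source, but the env var names are the standard ones each provider documents:

```python
import os

# illustrative helper: read both provider keys from the environment,
# failing loudly if one is missing. not the real llmdiff code.
def load_keys(env=None):
    env = os.environ if env is None else env
    keys = {}
    for provider, var in [("openai", "OPENAI_API_KEY"),
                          ("anthropic", "ANTHROPIC_API_KEY")]:
        try:
            keys[provider] = env[var]
        except KeyError:
            raise SystemExit(f"missing {var} — export it and rerun")
    return keys
```

export both, run `llmdiff "your prompt"`, done.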

side-by-side output with real metrics

two columns, same width, aligned. tokens in, tokens out, wall-clock latency. no scrolling back and forth between runs.
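the layout itself is a small stdlib trick: wrap each response to a fixed width, pad the shorter column, zip the rows. a minimal sketch — the helper name side_by_side is made up, the real code may differ:

```python
import textwrap

# wrap two strings to fixed-width columns and join them row by row,
# so both responses line up without scrolling between runs.
def side_by_side(left, right, width=40, gap=4):
    lcol = textwrap.wrap(left, width) or [""]
    rcol = textwrap.wrap(right, width) or [""]
    rows = max(len(lcol), len(rcol))
    lcol += [""] * (rows - len(lcol))   # pad the shorter column
    rcol += [""] * (rows - len(rcol))
    pad = " " * gap
    return "\n".join(f"{l.ljust(width)}{pad}{r}" for l, r in zip(lcol, rcol))
```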

cost + concurrency, by default

both providers called in parallel, so wall-clock is max(a, b), not a + b. cost estimate comes from a small price table in the source, so you stop blowing $30 on evals without noticing. prices go stale sometimes; a PR fixes that.
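the shape of that is roughly a price table plus a two-worker thread pool. a sketch under stated assumptions — the per-million-token prices below are placeholders, not the table llmdiff actually ships, and they go stale just the same:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# placeholder price table, dollars per million tokens (illustrative only)
PRICE_PER_MTOK = {
    "gpt-4o-mini":      {"in": 0.15, "out": 0.60},
    "claude-3-5-haiku": {"in": 0.80, "out": 4.00},
}

def estimate_cost(model, tokens_in, tokens_out):
    p = PRICE_PER_MTOK[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

def call_both(call_a, call_b):
    # dispatch both provider calls concurrently; total wall-clock
    # is the slower of the two, not their sum
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa, fb = pool.submit(call_a), pool.submit(call_b)
        results = fa.result(), fb.result()
    return results, time.monotonic() - start
```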

no sdks, --json if you want it

stdlib http.client only. no openai package, no anthropic package, no httpx. two small provider adapters you can read in a coffee break. --json pipes straight into jq, a spreadsheet, whatever eval harness you have.
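a provider adapter over http.client really is that small. a hedged sketch, not llmdiff's actual adapters — the request body follows OpenAI's documented chat completions wire format, but error handling and the anthropic twin are elided:

```python
import json
import http.client

def build_openai_request(prompt, model="gpt-4o-mini"):
    # assemble path, headers, and JSON body for a chat completions call
    body = json.dumps({"model": model,
                       "messages": [{"role": "user", "content": prompt}]})
    headers = {"Content-Type": "application/json",
               # placeholder — substitute your real key here
               "Authorization": "Bearer $OPENAI_API_KEY"}
    return "/v1/chat/completions", headers, body

def post(host, path, headers, body):
    # one stdlib round trip: open, request, parse, done
    conn = http.client.HTTPSConnection(host, timeout=60)
    conn.request("POST", path, body=body, headers=headers)
    return json.loads(conn.getresponse().read())
```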

yeah, you could curl both yourself. two requests, two jq incantations, a stopwatch, a calculator for the cost. i wrote it so i'd stop doing that.

install

not public yet

python 3.9+, macos and linux when it ships. repo isn't up yet. email bennett@frkhd.com if you want early access.