open source · python · stdlib http · mit

one prompt. two models. one diff.

see which llm does your task better. tokens, latency, cost, all lined up. run both, eyeball it, pick one. lol that's it.

      prompt                 prompt
        ↓                       ↓
 ┌─────────────────┐   ┌─────────────────┐
 │ openai/gpt-4o   │   │ anthropic/claude│
 │                 │   │                 │
 │ response ≈≈≈≈≈≈ │   │ response ≈≈≈≈≈≈ │
 │ ≈≈≈≈≈≈≈≈≈≈≈≈≈≈≈ │   │ ≈≈≈≈≈≈≈≈≈≈      │
 │                 │   │                 │
 │ 84t · 2.1s      │   │ 67t · 1.4s      │
 │ $0.00021        │   │ $0.00018        │
 └─────────────────┘   └─────────────────┘

               same prompt

what it actually prints

real output from llmdiff "explain tcp vs udp in 3 sentences". two columns, aligned, metrics at the bottom.

$ llmdiff "explain tcp vs udp in 3 sentences"

openai/gpt-4o-mini                       anthropic/claude-3-5-haiku
─────────────────────────────────        ─────────────────────────────────
TCP is connection-oriented.              TCP establishes a reliable,
Guarantees ordered, reliable             ordered, connection-based link
delivery. UDP is connectionless,         between hosts. UDP is connection-
faster, no guarantees.                   less and best-effort — fire
                                         and forget.

tokens in:  14                           tokens in:  14
tokens out: 84                           tokens out: 67
latency:    2.1s                         latency:    1.4s
cost (est): $0.00021                     cost (est): $0.00018

openai + anthropic out of the box

no config gymnastics. drop in your keys, pick a pair of models, run. default is gpt-4o-mini vs claude-3-5-haiku.
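"drop in your keys" means two env vars, one per provider. a sketch of what that key loading amounts to — load_keys is illustrative, not llmdiff's actual source, but the env var names are the standard ones each provider documents:

```python
import os

# illustrative helper: read both provider keys from the environment,
# failing loudly if one is missing. not the real llmdiff code.
def load_keys(env=None):
    env = os.environ if env is None else env
    keys = {}
    for provider, var in [("openai", "OPENAI_API_KEY"),
                          ("anthropic", "ANTHROPIC_API_KEY")]:
        try:
            keys[provider] = env[var]
        except KeyError:
            raise SystemExit(f"missing {var} — export it and rerun")
    return keys
```

export both, run `llmdiff "your prompt"`, done.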

side-by-side output with real metrics

two columns, same width, aligned. tokens in, tokens out, wall-clock latency. no scrolling back and forth between runs.
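the layout itself is a small stdlib trick: wrap each response to a fixed width, pad the shorter column, zip the rows. a minimal sketch — the helper name side_by_side is made up, the real code may differ:

```python
import textwrap

# wrap two strings to fixed-width columns and join them row by row,
# so both responses line up without scrolling between runs.
def side_by_side(left, right, width=40, gap=4):
    lcol = textwrap.wrap(left, width) or [""]
    rcol = textwrap.wrap(right, width) or [""]
    rows = max(len(lcol), len(rcol))
    lcol += [""] * (rows - len(lcol))   # pad the shorter column
    rcol += [""] * (rows - len(rcol))
    pad = " " * gap
    return "\n".join(f"{l.ljust(width)}{pad}{r}" for l, r in zip(lcol, rcol))
```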

cost + concurrency, by default

both providers called in parallel, so wall-clock is max(a, b), not a + b. cost estimate comes from a small price table in the source, so you stop blowing $30 on evals without noticing. prices go stale sometimes; a PR fixes that.
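the shape of that is roughly a price table plus a two-worker thread pool. a sketch under stated assumptions — the per-million-token prices below are placeholders, not the table llmdiff actually ships, and they go stale just the same:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# placeholder price table, dollars per million tokens (illustrative only)
PRICE_PER_MTOK = {
    "gpt-4o-mini":      {"in": 0.15, "out": 0.60},
    "claude-3-5-haiku": {"in": 0.80, "out": 4.00},
}

def estimate_cost(model, tokens_in, tokens_out):
    p = PRICE_PER_MTOK[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

def call_both(call_a, call_b):
    # dispatch both provider calls concurrently; total wall-clock
    # is the slower of the two, not their sum
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa, fb = pool.submit(call_a), pool.submit(call_b)
        results = fa.result(), fb.result()
    return results, time.monotonic() - start
```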

no sdks, --json if you want it

stdlib http.client only. no openai package, no anthropic package, no httpx. two small provider adapters you can read in a coffee break. --json pipes straight into jq, a spreadsheet, whatever eval harness you have.
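a provider adapter over http.client really is that small. a hedged sketch, not llmdiff's actual adapters — the request body follows OpenAI's documented chat completions wire format, but error handling and the anthropic twin are elided:

```python
import json
import http.client

def build_openai_request(prompt, model="gpt-4o-mini"):
    # assemble path, headers, and JSON body for a chat completions call
    body = json.dumps({"model": model,
                       "messages": [{"role": "user", "content": prompt}]})
    headers = {"Content-Type": "application/json",
               # placeholder — substitute your real key here
               "Authorization": "Bearer $OPENAI_API_KEY"}
    return "/v1/chat/completions", headers, body

def post(host, path, headers, body):
    # one stdlib round trip: open, request, parse, done
    conn = http.client.HTTPSConnection(host, timeout=60)
    conn.request("POST", path, body=body, headers=headers)
    return json.loads(conn.getresponse().read())
```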

yeah, you could curl both yourself. two requests, two jq incantations, a stopwatch, a calculator for the cost. i wrote it so i'd stop doing that.

install

not public yet

python 3.9+, macos and linux when it ships. repo isn't up yet. email bennett@frkhd.com if you want early access.