# How Idle GPUs Make Cheap Inference Possible
By Lucas Ewing
## TL;DR
We serve Kimi K2.6 at $0.70/M input, $0.20/M cached input, $3.50/M output by running models on idle enterprise GPUs. No contracts, no minimums. Get API access or email contact@getlilac.com.
## Why idle GPUs matter
The average enterprise GPU cluster runs at about 50% utilization. Training jobs finish, inference traffic dips, and the hardware just sits there. The power, cooling, and depreciation are already paid for, but the GPUs aren't doing anything.
Lilac's Kubernetes operator finds that spare capacity and spins up inference workloads on it. When the cluster's own jobs need GPUs back, our operator steps aside immediately. The GPU owner's workloads always come first.
Since the fixed costs are already covered, the cost of serving inference on that spare capacity is much lower than renting dedicated GPUs. We pass those savings through to you.
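The arithmetic behind this is simple. A back-of-envelope sketch, using purely illustrative numbers (these are assumptions for the example, not Lilac's actual costs):

```python
# Back-of-envelope sketch of idle-GPU economics.
# All numbers below are illustrative assumptions, not Lilac's real costs.

HOURLY_FIXED_COST = 2.50   # assumed all-in cost per GPU-hour (power, cooling, depreciation)
UTILIZATION = 0.50         # assumed fraction of hours the owner's own jobs use the GPU

# The owner pays for every hour, but only half of them do useful work,
# so each *productive* GPU-hour effectively costs twice the sticker price.
effective_cost_per_busy_hour = HOURLY_FIXED_COST / UTILIZATION
print(effective_cost_per_busy_hour)  # 5.0

# Inference served on the idle half only has to cover marginal costs
# (extra power draw, networking), since the fixed costs are already sunk.
ASSUMED_MARGINAL_COST = 0.30  # assumed marginal cost per idle GPU-hour
print(ASSUMED_MARGINAL_COST / HOURLY_FIXED_COST)  # ~12% of a dedicated hour
```

The exact figures vary by cluster, but the shape of the argument is the same: the idle hours are nearly free to the owner, so inference served on them can be priced far below dedicated capacity.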
## What we serve today
We're currently serving Kimi K2.6, Moonshot AI's latest Kimi model for coding, tools, and long-context work.
| Model | Input Price | Cached Input Price | Output Price |
|---|---|---|---|
| Kimi K2.6 | $0.70 / M tokens | $0.20 / M tokens | $3.50 / M tokens |
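These per-million-token rates translate directly into a per-request cost. A quick sketch (`request_cost` is our illustrative helper, not part of any Lilac SDK):

```python
# Cost of a single request at the listed Kimi K2.6 rates.
INPUT_PER_M = 0.70    # $ per million input tokens
CACHED_PER_M = 0.20   # $ per million cached input tokens
OUTPUT_PER_M = 3.50   # $ per million output tokens

def request_cost(input_tokens, output_tokens, cached_tokens=0):
    """Dollar cost of one request; cached_tokens is the cache-hit part of the prompt."""
    uncached = input_tokens - cached_tokens
    return (uncached * INPUT_PER_M
            + cached_tokens * CACHED_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# 10k-token prompt (8k of it a cache hit), 2k-token reply:
print(round(request_cost(10_000, 2_000, cached_tokens=8_000), 6))  # 0.01
```

So a fairly long, mostly-cached request comes out around a cent.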
The API is OpenAI-compatible. Switching takes one line:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.getlilac.com/v1",
    api_key="lilac_sk_...",
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
## How we compare
We benchmark our Kimi K2.6 deployment with NVIDIA AIPerf and route production traffic to warm, shared capacity on idle enterprise GPUs.
Lilac's pricing changes with the model we host, but the underlying economics stay the same: route traffic to already-paid-for idle GPUs and pass the reserved-capacity savings through to customers.
## How the operator works
GPU providers install our Kubernetes operator with a single `kubectl apply`. It does four things:
- Monitors node utilization and finds reclaimable GPU capacity
- Deploys inference servers (vLLM) onto idle nodes
- Routes API requests to healthy instances
- Preempts inference workloads when primary jobs need GPUs back
Providers choose which node pools to expose and set their own availability windows. Their workloads always take priority.
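The scheduling rule above can be sketched as a tiny decision function. This is a toy model of the behavior, not the operator's real API; the class and field names are hypothetical:

```python
# Toy model of the operator's rule: inference runs only on capacity the
# owner's workloads are not using, and is evicted the moment a primary
# job needs GPUs back. Names are illustrative, not the real operator API.

from dataclasses import dataclass

@dataclass
class Node:
    gpus_total: int
    gpus_used_by_owner: int       # GPUs claimed by the provider's own jobs
    in_availability_window: bool  # provider-configured exposure window

def schedulable_gpus(node: Node) -> int:
    """GPUs the operator may borrow for inference right now."""
    if not node.in_availability_window:
        return 0
    return node.gpus_total - node.gpus_used_by_owner

def must_preempt(node: Node, inference_gpus_in_use: int) -> bool:
    """Evict inference pods whenever the owner's demand leaves no room."""
    return inference_gpus_in_use > schedulable_gpus(node)

node = Node(gpus_total=8, gpus_used_by_owner=3, in_availability_window=True)
print(schedulable_gpus(node))                        # 5
print(must_preempt(node, inference_gpus_in_use=4))   # False

# The owner's training job scales up and claims 6 GPUs:
node.gpus_used_by_owner = 6
print(must_preempt(node, inference_gpus_in_use=4))   # True -> step aside
```

In a real cluster this kind of policy would typically ride on Kubernetes pod priority and preemption rather than a hand-rolled check, but the decision it encodes is the same: the owner's workloads always win.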
Further reading:
- More on idle GPU economics: The GPU Scarcity Paradox
- Broader pricing comparison: GPU Inference API Pricing Compared
- Demand-side entry points: Cheap Inference API and Kimi K2.6 API