Welcome to the Team.
Hey! Sit down, grab a coffee. I'm glad you're joining the inference platform team. Today, I'm going to walk you through our crown jewel: Inferscale.
I know you've probably worked with standard Kubernetes tooling like the Horizontal Pod Autoscaler (HPA) or KEDA. They are great for microservices, but when we started serving Large Language Models (LLMs) at scale, those tools fell completely flat. Our users were getting terrible P99 latencies, and our GPU costs were through the roof.
So, we threw out the playbook and built Inferscale from scratch. By the end of this tutorial, you're going to understand every single module, why the math looks the way it does, and how all the pieces snap together into a beautiful, predictive Kubernetes Operator.
The Coffee Shop Analogy
Before we look at any Python code or mathematical formulas, let's build an intuition for Queueing Theory. Imagine a busy coffee shop.
- ☕ Baristas = Our GPU Pods (Servers)
- 🚶 Customers = LLM Inference Requests (Arrival Rate)
- ⏱️ Making a Latte = Token Generation (Service Rate)
- 🤬 Customers walking out angry = SLA Violations (Timeout)
If customers arrive faster than the baristas can make lattes, the line grows without bound. That's an Unstable System. But even if the baristas are fast enough on average, a sudden rush of 10 people off a bus will cause a temporary line. If that line gets too long, people wait too long, and they leave (SLA Violation).
Inferscale's job is to predict when that bus is arriving and proactively hire more baristas before the line gets out of hand, ensuring nobody ever waits longer than our promised SLA.
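To make the "unstable system" idea concrete, here is a minimal sketch that tracks the length of the line second by second. It is a deterministic approximation that ignores randomness, and the function name and parameters are purely illustrative; nothing here comes from the Inferscale codebase.

```python
def backlog_after(arrivals_per_sec: float, lattes_per_barista_per_sec: float,
                  baristas: int, seconds: int) -> float:
    """Return the queue length after `seconds`, ignoring randomness."""
    capacity = lattes_per_barista_per_sec * baristas
    backlog = 0.0
    for _ in range(seconds):
        # The line grows by the surplus of arrivals over capacity, never below zero.
        backlog = max(0.0, backlog + arrivals_per_sec - capacity)
    return backlog


# 5 customers/s against 2 baristas at 2 lattes/s each: the line grows by 1 every second.
understaffed = backlog_after(5.0, 2.0, baristas=2, seconds=60)
# A third barista lifts capacity to 6/s, so no line ever forms.
staffed = backlog_after(5.0, 2.0, baristas=3, seconds=60)
```

One extra barista is the difference between a line of 60 people and no line at all, which is why the replica count matters so much near the stability boundary.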
Why Standard Autoscaling Failed Us (Post-Mortem)
Let's talk about why we couldn't just use CPU usage to scale LLMs.
Incident Report: Black Friday Outage
Symptoms: Complete platform unavailability for 8 minutes. 100,000+ Gateway Timeout errors.
Root Cause Analysis: Our standard Horizontal Pod Autoscaler (HPA) detected a CPU spike. It immediately requested 50 new GPU pods. However, pulling the 40GB LLM model took 4 minutes per pod. The HPA reacted too late. By the time the pods were healthy, the requests had already timed out and the users had left angry.
An LLM isn't a simple stateless microservice. Loading a 70B parameter model from disk (or object storage) into GPU VRAM can take anywhere from 30 to 120 seconds. If a sudden spike in traffic hits, and we react *after* it happens, the new Pods will take 2 minutes to boot. By then, the user has already given up, and your SLA is ruined.
Second, LLMs are fundamentally constrained by GPU memory capacity (specifically the KV Cache), not just compute (CPU). A standard HPA looks at CPU hitting 80% and scales up. But an LLM might have a 10% CPU usage while its VRAM is completely saturated, leaving requests queueing up forever.
To fix this, we realized we needed two things:
- Prediction: We need to scale based on what traffic will be in 2 minutes (when the Pod is actually ready).
- Queueing Theory: We need mathematically guaranteed bounds on tail latency (P95/P99) based on request arrival rates, not arbitrary CPU thresholds.
System Architecture
Before we dive into the math, here is a 10,000-foot view of how the system operates inside the
cluster. Inferscale runs as a Kubernetes Operator (via the kopf framework) and
reconciles custom LLMPredictiveAutoscaler resources.
+-------------------------------------------------------------+
|                     Kubernetes Cluster                      |
|                                                             |
|  +----------------+        +--------------------------+     |
|  |                |        | `LLMPredictiveAutoscaler`|     |
|  | Prometheus /   |        | Custom Resource (CRD)    |     |
|  | Metrics Server |        +------------+-------------+     |
|  |                |                     | watch/reconcile   |
|  +-------+--------+                     v                   |
|          |                 +--------------------------+     |
|          | query rate(t)   | Inferscale Operator      |     |
|          +---------------->| (Python / kopf)          |     |
|                            |                          |     |
|                            | 1. Forecaster (EMA)      |     |
|                            | 2. Queueing-Math SLA Math|     |
|                            | 3. Warm Pool Optimizer   |     |
|                            +------------+-------------+     |
|                                         | patch status/scale|
|                                         v                   |
|                            +--------------------------+     |
|                            | Target Deployment (vLLM) |     |
|                            | - Pod (GPU)              |     |
|                            | - Pod (GPU)              |     |
|                            +--------------------------+     |
+-------------------------------------------------------------+
Seeing the Future (forecaster.py)
Let's look at forecaster.py. Its job is simple: give me an estimate of traffic $S$
seconds from now, where $S$ is our cold start delay.
We implemented Holt's Linear Trend method (double exponential smoothing, which the
code labels "EMA"). Why this approach?
Because it's computationally dirt cheap, needs minimal state, and tracks sudden trend
changes without the wild oscillation of naive polynomial extrapolation. Let's break down the
Forecaster class function by function.
__init__(self, model_type: str = "EMA", alpha: float = 0.3)
The constructor initializes the forecaster. We default to "EMA" (Exponential Moving Average) with an alpha of 0.3. Alpha determines how much weight we give to the most recent observation versus the historical average. A higher alpha reacts faster but is more jittery.
We also initialize self.level (the baseline) and self.trend (the slope) to
None. self.beta is set to half of alpha, controlling how quickly the trend
adapts. We also keep a short self.history list just in case we want to plug in a more
complex model like ARIMA later.
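The state described above is small enough to sketch in a few lines. This is a reconstruction from the description, not the actual source: the class name ForecasterSketch and the model_type attribute name are assumptions.

```python
from typing import List, Optional


class ForecasterSketch:
    """Hypothetical reconstruction of the Forecaster constructor state."""

    def __init__(self, model_type: str = "EMA", alpha: float = 0.3):
        self.model_type = model_type          # which model update()/forecast() dispatch to
        self.alpha = alpha                    # weight of the newest observation in the level
        self.beta = alpha / 2.0               # trend adapts at half the speed of the level
        self.level: Optional[float] = None    # baseline, unset until the first update
        self.trend: Optional[float] = None    # slope per update step
        self.history: List[float] = []        # recent raw rates, kept for future models


forecaster = ForecasterSketch()
```

With the default alpha of 0.3, beta comes out at 0.15, so the trend estimate is deliberately more sluggish than the level.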
update(self, current_rate: float)
This is the ingestion function. The Kubernetes Operator calls this periodically (e.g., every 15
seconds) passing in the latest requests/sec scraped from Prometheus. It appends the rate to our
history list (keeping only the last 100 to prevent memory leaks) and then delegates the math to the
specific model's update function (like _update_ema_holt).
_update_ema_holt(self, current_rate: float)
This is where the Holt's Linear Trend math actually happens.
def _update_ema_holt(self, current_rate: float):
    if self.level is None:
        # First observation: seed the level and assume a flat trend
        self.level = current_rate
        self.trend = 0.0
        return
    last_level = self.level
    self.level = self.alpha * current_rate + (1 - self.alpha) * (last_level + self.trend)
    self.trend = self.beta * (self.level - last_level) + (1 - self.beta) * self.trend
If it's the first time we see data, we just set the level to the current rate and trend to 0.
Otherwise, we calculate the new level by blending the current observation with the previous
expectation (last_level + trend). Then we update the trend by blending the
change in level with the previous trend.
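It helps to run the update by hand once. The sketch below restates the same two equations as a pure function (holt_step is an illustrative name, not part of forecaster.py) and feeds it three observations with alpha = 0.3 and beta = 0.15.

```python
from typing import Optional, Tuple


def holt_step(level: Optional[float], trend: Optional[float], current_rate: float,
              alpha: float = 0.3, beta: float = 0.15) -> Tuple[float, float]:
    """One Holt update; returns the new (level, trend)."""
    if level is None or trend is None:
        # First observation seeds the level with a flat trend.
        return current_rate, 0.0
    new_level = alpha * current_rate + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    return new_level, new_trend


level, trend = None, None
for rate in [10.0, 12.0, 14.0]:
    level, trend = holt_step(level, trend, rate)
# After 10 -> 12 -> 14 req/s: level is about 11.68 and trend about 0.24 per step.
```

Notice that the level (about 11.68) lags well behind the latest observation (14). That lag is the price of smoothing, and the positive trend term is what lets the forecast catch up.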
forecast(self, horizon_seconds: float, step_seconds: float = 15.0) -> float
This is the output function. The Autoscaler asks: "What will the traffic be
horizon_seconds from now?" If we don't have any history, we return 0. Otherwise, we
delegate to the specific model's forecast implementation.
_forecast_ema_holt(self, horizon_seconds: float, step_seconds: float) -> float
To predict the future using our Holt model, we figure out how many "steps" into the future we are
looking (e.g., 60 seconds / 15 second intervals = 4 steps). We then take our current baseline
level, and add the trend multiplied by the number of steps. Finally, we do
a quick bounds check max(0.0, forecasted_rate) because you can't have negative web
traffic!
def _forecast_ema_holt(self, horizon_seconds: float, step_seconds: float) -> float:
    # Convert the time horizon into a number of update steps
    steps_ahead = max(0.0, horizon_seconds / step_seconds)
    forecasted_rate = self.level + (steps_ahead * self.trend)
    return max(0.0, forecasted_rate)
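Two quick numeric cases show both the extrapolation and the clamp. The standalone function below mirrors the method above but takes level and trend as arguments; the name holt_forecast and the sample values are made up for illustration.

```python
def holt_forecast(level: float, trend: float, horizon_seconds: float,
                  step_seconds: float = 15.0) -> float:
    """Linear extrapolation of level + trend, clamped at zero."""
    steps_ahead = max(0.0, horizon_seconds / step_seconds)
    return max(0.0, level + steps_ahead * trend)


# 120 s cold start at 15 s per step = 8 steps: 100 + 8 * 5 = 140 req/s.
rising = holt_forecast(level=100.0, trend=5.0, horizon_seconds=120.0)
# A steep downward trend would extrapolate to -10 req/s, so the clamp returns 0.
falling = holt_forecast(level=10.0, trend=-5.0, horizon_seconds=60.0)
```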
The Heart of the Beast (queueing_math.py)
This is my favorite part of the codebase. We treat our Deployment as a massive M/M/k
queue: requests arrive randomly (a Poisson process),
service times are exponentially distributed,
and there are $k$ active servers (Pods). Let's dissect the
QueueingMath class function by function.
The core of the class is the Erlang C formula. Under the M/M/k model, it gives us the probability that a request will be forced to wait in the queue because all $k$ servers are busy.
calculate_probability_of_wait(arrival_rate: float, service_rate: float, servers: int) -> float
This calculates the base probability $P(W > 0)$: the chance a user has to wait *at all*. First we
calculate the traffic intensity a = arrival_rate / service_rate, then the server utilization
rho = a / servers. If rho >= 1.0, traffic is arriving at least as fast as
we can drain it. The queue will grow without bound, so the probability of waiting is 1.0 (100%).
def calculate_probability_of_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    a = arrival_rate / service_rate
    rho = a / servers
    if rho >= 1.0:
        return 1.0  # Unstable system
    sum_term = sum((a ** i) / math.factorial(i) for i in range(servers))
    last_term = ((a ** servers) / math.factorial(servers)) * (servers / (servers - a))
    return last_term / (sum_term + last_term)
The rest is substitution into the Erlang C formula: we compute the sum term for the denominator, add the last term to it, and return the ratio of the last term to that total.
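For reference, here is the formula the code transcribes, written out term by term. The symbols are our shorthand rather than names from the codebase: $\lambda$ is arrival_rate, $\mu$ is service_rate, $k$ is servers, and $a = \lambda / \mu$.

$$P(W > 0) = \frac{\dfrac{a^k}{k!} \cdot \dfrac{k}{k - a}}{\displaystyle\sum_{i=0}^{k-1} \dfrac{a^i}{i!} + \dfrac{a^k}{k!} \cdot \dfrac{k}{k - a}}, \qquad a < k$$

As a quick sanity check with $a = 2$ and $k = 3$: the sum term is $1 + 2 + 2 = 5$, the last term is $\frac{8}{6} \cdot \frac{3}{3 - 2} = 4$, so $P(W > 0) = 4 / (5 + 4) = 4/9 \approx 0.44$.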
calculate_probability_wait_longer_than_sla(arrival_rate, service_rate, servers, sla_seconds)
Management usually doesn't care if a user waits 0.01 seconds. They care if the user waits longer than our contracted Service Level Agreement (SLA, e.g., 0.5s).
def calculate_probability_wait_longer_than_sla(arrival_rate, service_rate, servers, sla_seconds):
    p_wait_gt_0 = QueueingMath.calculate_probability_of_wait(arrival_rate, service_rate, servers)
    drain_rate = (servers * service_rate) - arrival_rate
    if drain_rate <= 0:
        return 1.0  # Unstable system: a positive exponent would push the result above 1
    return p_wait_gt_0 * math.exp(-drain_rate * sla_seconds)
This function takes the base probability of waiting ($P(W > 0)$) from the previous function and
multiplies it by an exponential decay term whose rate is the spare capacity,
servers * service_rate - arrival_rate. The result shrinks exponentially as
sla_seconds increases, meaning waiting 5 seconds is exponentially less likely than
waiting 1 second.
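To see the decay in numbers, here is a self-contained sketch of both functions as plain module-level helpers. The names p_wait and p_wait_longer_than are illustrative, and the early return for an overloaded system is an assumption on our part rather than confirmed behavior of queueing_math.py. Note that the quantity is time spent waiting in the queue, not total request latency.

```python
import math


def p_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Erlang C: probability that an arriving request has to queue."""
    a = arrival_rate / service_rate
    if a >= servers:
        return 1.0  # unstable: every request ends up waiting
    sum_term = sum(a ** i / math.factorial(i) for i in range(servers))
    last_term = (a ** servers / math.factorial(servers)) * (servers / (servers - a))
    return last_term / (sum_term + last_term)


def p_wait_longer_than(arrival_rate: float, service_rate: float,
                       servers: int, sla_seconds: float) -> float:
    """P(W > t) for an M/M/k queue: Erlang C times an exponential tail."""
    spare_capacity = servers * service_rate - arrival_rate
    if spare_capacity <= 0:
        return 1.0  # assumption: treat an overloaded system as a certain violation
    return p_wait(arrival_rate, service_rate, servers) * math.exp(-spare_capacity * sla_seconds)


# 2 req/s arriving, each pod serves 1 req/s, 3 pods: spare capacity is 1 req/s.
p_any = p_wait_longer_than(2.0, 1.0, 3, sla_seconds=0.0)   # 4/9, about 0.44
p_1s = p_wait_longer_than(2.0, 1.0, 3, sla_seconds=1.0)    # about 0.16
p_5s = p_wait_longer_than(2.0, 1.0, 3, sla_seconds=5.0)    # about 0.003
```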
required_replicas_for_sla(...)
This is the money function. It inverts the question. Instead of giving us the probability of
failure for a known number of servers, it searches for the smallest number of servers that
keeps our violation probability below max_violation_probability.
It first calculates the absolute minimum servers needed for stability,
k_min_stable = math.floor(arrival_rate / service_rate) + 1, which is the smallest integer
strictly greater than the traffic intensity. Then it runs a simple
linear search from k_min_stable up to max_servers,
evaluating calculate_probability_wait_longer_than_sla at each step. The moment the
probability drops below our threshold, it returns that number of replicas!
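Since the real signature is elided above, the following is only a sketch of the search as described. The function name required_replicas_sketch, its parameter list, and the fallback of returning max_servers when no replica count meets the target are all assumptions.

```python
import math


def p_wait_longer_than(arrival_rate: float, service_rate: float,
                       servers: int, sla_seconds: float) -> float:
    """P(W > t) for an M/M/k queue (Erlang C times an exponential tail)."""
    a = arrival_rate / service_rate
    if a >= servers:
        return 1.0
    sum_term = sum(a ** i / math.factorial(i) for i in range(servers))
    last_term = (a ** servers / math.factorial(servers)) * (servers / (servers - a))
    p_wait = last_term / (sum_term + last_term)
    return p_wait * math.exp(-(servers * service_rate - arrival_rate) * sla_seconds)


def required_replicas_sketch(arrival_rate: float, service_rate: float, sla_seconds: float,
                             max_violation_probability: float, max_servers: int) -> int:
    """Smallest k whose P(W > sla_seconds) is below the target (hypothetical signature)."""
    k_min_stable = math.floor(arrival_rate / service_rate) + 1
    for k in range(k_min_stable, max_servers + 1):
        if p_wait_longer_than(arrival_rate, service_rate, k, sla_seconds) < max_violation_probability:
            return k
    return max_servers  # assumption: fall back to the cap if the target is unreachable


# 2 req/s, 1 req/s per pod, 1 s SLA.
loose = required_replicas_sketch(2.0, 1.0, 1.0, max_violation_probability=0.2, max_servers=20)
strict = required_replicas_sketch(2.0, 1.0, 1.0, max_violation_probability=0.1, max_servers=20)
```

With three pods the violation probability is about 0.16, which satisfies a 20% target but not a 10% one, so the stricter target costs one extra pod.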
Balancing the Books (optimizer.py)
Sometimes, it makes financial sense to keep extra GPUs running ("warm pool"). If violating an SLA causes users to churn (losing us money), we need to balance the cost of a running GPU instance against the mathematical penalty of SLA violations.
optimize_warm_pool_size(arrival_rate, service_rate, sla_seconds, cost_per_gpu_hour, ...) -> Tuple[int, float]
This is the sole method in the optimizer. It finds the optimal number of warm_replicas
that minimizes the total aggregate cost.
Our loss function is: $J(k) = (\text{Cost}_{\text{GPU}} \times k) + (\text{Penalty} \times P_{\text{violation}})$.
First, we find our stable floor
k_stable = math.floor(arrival_rate / service_rate) + 1. Then we iterate:
for k in range(start_k, max_replicas + 1):
    p_violation = QueueingMath.calculate_probability_wait_longer_than_sla(...)
    gpu_cost = cost_per_gpu_hour * k
    sla_cost = sla_penalty_per_hour * p_violation
    total_cost = gpu_cost + sla_cost
    if total_cost < min_cost:
        min_cost = total_cost
        best_k = k
    elif total_cost > min_cost:
        # It's a convex function. If it goes up, we found the bottom!
        break
Because adding GPUs costs money linearly but drops SLA violations exponentially, the total cost
curve has a single "valley". By checking if total_cost > min_cost, we implement an early-exit
optimization: once the cost starts going back up, we know we've passed the bottom of the valley, and
we can stop searching.
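Here is a runnable sketch of the whole search with toy prices. The function name optimize_warm_pool_sketch, the exact parameter list, and the choice of starting the loop at the stable floor are assumptions (the real signature is partly elided above); the loop body follows the snippet.

```python
import math
from typing import Tuple


def p_violation(arrival_rate: float, service_rate: float, k: int, sla_seconds: float) -> float:
    """P(W > sla_seconds) for an M/M/k queue."""
    a = arrival_rate / service_rate
    if a >= k:
        return 1.0
    sum_term = sum(a ** i / math.factorial(i) for i in range(k))
    last_term = (a ** k / math.factorial(k)) * (k / (k - a))
    return (last_term / (sum_term + last_term)) * math.exp(-(k * service_rate - arrival_rate) * sla_seconds)


def optimize_warm_pool_sketch(arrival_rate: float, service_rate: float, sla_seconds: float,
                              cost_per_gpu_hour: float, sla_penalty_per_hour: float,
                              max_replicas: int) -> Tuple[int, float]:
    """Return (best_k, min_cost) for J(k) = GPU cost + expected SLA penalty."""
    start_k = math.floor(arrival_rate / service_rate) + 1  # assumption: start at the stable floor
    best_k, min_cost = start_k, float("inf")
    for k in range(start_k, max_replicas + 1):
        total_cost = (cost_per_gpu_hour * k
                      + sla_penalty_per_hour * p_violation(arrival_rate, service_rate, k, sla_seconds))
        if total_cost < min_cost:
            min_cost, best_k = total_cost, k
        elif total_cost > min_cost:
            break  # cost is rising again: we are past the bottom of the valley
    return best_k, min_cost


# Toy numbers: $1 per GPU-hour, $100 per hour penalty weight, 2 req/s, 1 req/s per pod, 1 s SLA.
best_k, min_cost = optimize_warm_pool_sketch(2.0, 1.0, 1.0, 1.0, 100.0, max_replicas=20)
```

With these numbers the cost falls from roughly 19.4 at three pods to roughly 5.3 at five pods, then rises again at six, so the search stops there and returns five.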
Putting It Together (autoscaler.py)
The Autoscaler class is the orchestrator that sits between the Kubernetes machinery and
the math modules. Let's see how it operates.
__init__(self, config: Dict[str, Any])
Initializes the state. It extracts the configuration blocks from our Custom Resource Definition,
initializes a ModelProfile (which holds our Forecaster), and sets up
variables for stability tracking like self.last_scale_time.
observe_metrics(self, current_rate: float)
A simple wrapper. The Operator passes the latest request rate scraped from Prometheus into this
function, which in turn calls forecaster.update(current_rate).
calculate_desired_replicas(self, current_replicas: int) -> int
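This is the main pipeline, and it chains the modules we just covered: the Forecaster predicts traffic at the cold-start horizon, QueueingMath and the Optimizer turn that forecast into a raw replica count, and _apply_stability_controls clamps and rate-limits the result before it is returned.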
_apply_stability_controls(self, raw_desired: int, current_replicas: int) -> int
Without stability controls, autoscalers yo-yo up and down endlessly, destroying cluster stability. Scaling up can happen instantly, but you must wait before terminating an expensive GPU Pod.
This function clamps the desired replicas between the hard configured minWarmReplicas
and maxReplicas.
Crucially, it implements a scale-down cooldown. If the math says we should drop
replicas (e.g., scale from 5 GPUs down to 2), the function checks
now - self.last_scale_time. If the cooldown period hasn't elapsed, it rejects the
scale-down request to prevent replica thrashing.
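The clamp-then-cooldown logic can be sketched as below. The class name, the field names (other than last_scale_time), the explicit now argument, and the rule that last_scale_time is refreshed on every applied change are assumptions made so the example is deterministic and testable.

```python
from dataclasses import dataclass


@dataclass
class StabilitySketch:
    """Hypothetical stand-in for the stability state kept by the Autoscaler."""
    min_warm_replicas: int
    max_replicas: int
    scale_down_cooldown_seconds: float
    last_scale_time: float = 0.0

    def apply(self, raw_desired: int, current_replicas: int, now: float) -> int:
        # 1. Clamp into the configured [minWarmReplicas, maxReplicas] range.
        desired = max(self.min_warm_replicas, min(self.max_replicas, raw_desired))
        # 2. Scale-ups pass straight through; scale-downs must wait out the cooldown.
        if desired < current_replicas:
            if now - self.last_scale_time < self.scale_down_cooldown_seconds:
                return current_replicas
        # 3. Assumption: any applied change restarts the cooldown clock.
        if desired != current_replicas:
            self.last_scale_time = now
        return desired


controls = StabilitySketch(min_warm_replicas=1, max_replicas=10, scale_down_cooldown_seconds=300.0)
scaled_up = controls.apply(raw_desired=50, current_replicas=2, now=1000.0)    # clamped to 10
held = controls.apply(raw_desired=2, current_replicas=10, now=1100.0)         # only 100 s elapsed
scaled_down = controls.apply(raw_desired=2, current_replicas=10, now=1400.0)  # 400 s elapsed
```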
Hooking into the Matrix (operator_main.py)
All this beautiful math is useless if it doesn't talk to Kubernetes. We wrote a custom operator
using kopf. We defined a Custom Resource Definition (CRD) called
LLMPredictiveAutoscaler.
Instead of a massive, nasty Go codebase, kopf lets us write exactly what we mean using
decorators. Here are the core lifecycle hooks.
create_fn(spec, name, namespace, **kwargs)
This function fires exactly once when a user runs kubectl apply -f my-autoscaler.yaml.
It constructs a unique key (namespace + name) and instantiates the core
Autoscaler(spec) engine, storing it in memory in the autoscalers
dictionary. Finally, it initializes the CRD status block so tools like
kubectl get lpa show valid data.
delete_fn(name, namespace, **kwargs)
Memory cleanup. If a user deletes the custom resource, this hook catches it and deletes the
in-memory Autoscaler engine instance from our dictionary.
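The bookkeeping in these two hooks boils down to a dictionary keyed by namespace and name. The sketch below leaves out the kopf decorators and the status initialization so that it runs without a cluster; AutoscalerStub and the "namespace/name" key format are assumptions, not the real implementation.

```python
from typing import Any, Dict


class AutoscalerStub:
    """Stand-in for the real Autoscaler(spec) engine."""

    def __init__(self, spec: Dict[str, Any]):
        self.spec = spec


# One engine per custom resource, held in operator memory.
autoscalers: Dict[str, AutoscalerStub] = {}


def _key(namespace: str, name: str) -> str:
    return f"{namespace}/{name}"  # assumed key format


def create_fn(spec, name, namespace, **kwargs):
    """Register an engine for a newly created resource."""
    autoscalers[_key(namespace, name)] = AutoscalerStub(dict(spec))


def delete_fn(name, namespace, **kwargs):
    """Drop the engine; pop() with a default tolerates a missing key."""
    autoscalers.pop(_key(namespace, name), None)


create_fn(spec={"targetDeployment": "llm"}, name="demo", namespace="default")
registered = "default/demo" in autoscalers
delete_fn(name="demo", namespace="default")
```

Because the registry lives only in memory, a restarted operator Pod starts with an empty dictionary, so any reconcile path that looks up an engine should be prepared to rebuild it.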
reconcile_autoscaler(spec, name, namespace, status, patch, **kwargs)
This is the heartbeat of Inferscale. It's decorated with
@kopf.timer(..., interval=15.0), meaning kopf invokes it every 15 seconds for each LLMPredictiveAutoscaler resource.
- It uses the Python kubernetes client to fetch the current state of the targetDeployment.
- It observes metrics. (In our code, we mock this with a sine wave, but in prod, we'd fire a PromQL query across the network).
- It calls engine.calculate_desired_replicas(current_replicas), which triggers that whole pipeline of Forecaster -> QueueingMath -> Optimizer we just learned about.
- It patches the CRD's status with the new desired target and the forecasted traffic.
- The Action: If desired_replicas != current_replicas, it literally patches the Kubernetes Deployment object via the API server, scaling the LLM up or down.
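One tick of that loop can be sketched without any Kubernetes calls. Everything named here is hypothetical: the sine-wave parameters, the status field names, and the decide callback that stands in for the engine. The point is only to show the "observe, decide, patch if changed" shape.

```python
import math
from typing import Callable, Dict, Tuple


def mock_request_rate(elapsed_seconds: float, base_rate: float = 50.0,
                      amplitude: float = 30.0, period_seconds: float = 600.0) -> float:
    """Sine-wave traffic mock; never returns a negative rate."""
    phase = 2.0 * math.pi * elapsed_seconds / period_seconds
    return max(0.0, base_rate + amplitude * math.sin(phase))


def reconcile_tick(current_replicas: int, current_rate: float,
                   decide: Callable[[float, int], Tuple[int, float]]) -> Dict[str, object]:
    """Return the status to publish and the replica count to patch (None means no change)."""
    desired, forecast = decide(current_rate, current_replicas)
    return {
        "status": {"desiredReplicas": desired, "forecastedRate": forecast},
        "scale_to": desired if desired != current_replicas else None,
    }


def toy_decide(rate: float, current_replicas: int) -> Tuple[int, float]:
    """Toy stand-in for the engine: one pod per 10 req/s, forecast equals the input."""
    return max(1, math.ceil(rate / 10.0)), rate


# At the peak of the wave (quarter period) the mock rate is about 80 req/s.
peak_tick = reconcile_tick(3, mock_request_rate(150.0), toy_decide)
steady_tick = reconcile_tick(8, 80.0, toy_decide)
```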
Let's briefly look at the Kubernetes infrastructure we use to get this Python code running on the cluster.
The Infrastructure (.yaml)
Code is nothing without the infrastructure to run it. Here are the four foundational YAML files we created to deploy our custom operator.
crd.yaml (Custom Resource Definition)
This is where we extend Kubernetes. Kubernetes doesn't know what an
LLMPredictiveAutoscaler is. We have to teach the Kubernetes API server our custom
schema using OpenAPI v3 specifications. This file defines exactly what fields the autoscaler
requires (like targetDeployment, serviceRatePerPod, and
coldStartSeconds) and the types (e.g., string, number). It effectively creates a new
native API endpoint in the cluster!
rbac.yaml (Role-Based Access Control)
By default, Pods running inside Kubernetes have almost no API permissions. Our operator's Python process needs privileges to read its own custom resources and to mutate Deployments (scaling them up and down).
We created a ServiceAccount for our operator, a ClusterRole granting verbs
like get, watch, patch, update against Deployments and our custom APIs, and finally a
ClusterRoleBinding linking the two together. Without this, our operator gets an
immediate `403 Forbidden` from the API Server.
deployment.yaml
This tells Kubernetes to actually run our Python code. We spin up a Deployment consisting of one
replica (we only want one operator reconciling the cluster to prevent split-brain collisions). It
points to our packaged Docker image (inferscale-operator:latest) and runs the entry
point: kopf run /app/operator_main.py --all-namespaces, telling kopf to
listen across the whole cluster.
example.yaml
The testing dummy. This file contains a mock LLM target Deployment (for example, an NGINX container
standing in for a vLLM server) alongside an actual instance of our newly defined
LLMPredictiveAutoscaler custom resource. When we run kubectl apply -f example.yaml,
our Python operator catches the event and immediately begins its
reconciliation loop.
Take your time reading the code repo. Now that you know the purpose of every class, function, and YAML manifest, you're officially ready to start building. Grab a ticket from the board. You're going to do great things here.
Standard Tool Comparison
Where standard reactive autoscalers fail, Inferscale thrives. Here is how we stack up against the Kubernetes HorizontalPodAutoscaler (HPA) and KEDA.
"When to use what?" Decision Tree
As a junior engineer, navigating architectural trade-offs is hard. You might think you need Inferscale for a simple Nginx web server. Here is a simple guide:
- 🤷 Are your pods lightweight and boot in <2 seconds? ➡️ Use standard HPA.
- 📦 Do you scale based on Kafka queue depth rather than math? ➡️ Use KEDA.
- 🧠 Are you serving massive AI models that take minutes to load, and SLA is critical? ➡️ Use Inferscale.
| Feature | HorizontalPodAutoscaler | KEDA | Inferscale |
|---|---|---|---|
| Paradigm | Reactive (CPU/Memory thresholds) | Reactive (Custom metrics/queues) | Predictive Queueing Theory |
| Cold Start | Very vulnerable (Scale up during spike violates SLA) | Vulnerable | Compensates automatically |
| Tail Latency | Best-effort | Best-effort | Mathematically guaranteed bound |
| Cost Opt | None | None | Minimizes GPU cost function |
Using CPU utilization for LLMs is fundamentally flawed because GPU memory pressure (model weights and
the KV cache), not CPU load, dictates throughput. Feeding the measured request rate (PromQL rate())
and queue depths into the Queueing-Math model yields robust scaling.
Running Locally
Want to see it in action on your own machine? We use Minikube for local testing.
# 1. Install dependencies
pip install kopf kubernetes
# 2. Start your local cluster
minikube start
# 3. Install the infrastructure (CRDs and RBAC)
kubectl apply -f crd.yaml
kubectl apply -f rbac.yaml
# 4. Run the operator locally (it connects to your current kubeconfig context)
kopf run operator_main.py --all-namespaces
# 5. In a new terminal tab, apply the dummy LLM deployment and CRD:
kubectl apply -f example.yaml
Watch the logs in your kopf terminal. You will see the operator initialize, ingest the
mock metrics, perform the Queueing-Math calculations, and automatically maintain the desired
replicas on
the dummy deployment.