Chapter 1

Welcome to the Team.

Hey! Sit down, grab a coffee. I'm glad you're joining the inference platform team. Today, I'm going to walk you through our crown jewel: Inferscale.

I know you've probably worked with standard Kubernetes tooling like the Horizontal Pod Autoscaler (HPA) or KEDA. They are great for microservices, but when we started serving Large Language Models (LLMs) at scale, those tools fell completely flat. Our users were getting terrible P99 latencies, and our GPU costs were through the roof.

So, we threw out the playbook and built Inferscale from scratch. By the end of this tutorial, you're going to understand every single module, why the math looks the way it does, and how all the pieces snap together into a beautiful, predictive Kubernetes Operator.

🐍 Junior Engineer Bonus: Are you looking to level up your Python skills? We wrote a dedicated Python Learning Curriculum that breaks down the Inferscale codebase to teach you Production-Grade Patterns, OOP, and Event-Driven architecture!
Chapter 2

The Coffee Shop Analogy

Before we look at any Python code or mathematical formulas, let's build an intuition for Queueing Theory. Imagine a busy coffee shop.

  • Baristas = Our GPU Pods (Servers)
  • 🚶 Customers = LLM Inference Requests (Arrival Rate)
  • ⏱️ Making a Latte = Token Generation (Service Rate)
  • 🤬 Customers walking out angry = SLA Violations (Timeout)

If customers arrive faster than the baristas can make lattes, the line grows infinitely long. That's an Unstable System. But even if the baristas are fast enough on average, a sudden rush of 10 people off a bus will cause a temporary line. If that line gets too long, people wait too long, and they leave (SLA Violation).

Inferscale's job is to predict when that bus is arriving and proactively hire more baristas before the line gets out of hand, ensuring nobody ever waits longer than our promised SLA.

Chapter 3

Why Standard Autoscaling Failed Us (Post-Mortem)

Let's talk about why we couldn't just use CPU usage to scale LLMs.

Incident Report: Black Friday Outage

Symptoms: Complete platform unavailability for 8 minutes. 100,000+ Gateway Timeout errors.

Root Cause Analysis: Our standard Horizontal Pod Autoscaler (HPA) detected a CPU spike. It immediately requested 50 new GPU pods. However, pulling the 40GB LLM model took 4 minutes per pod. The HPA reacted too late. By the time the pods were healthy, the requests had already timed out and the users had left angry.

Grafana Post Mortem
The Cold Start Trap
An LLM isn't a simple stateless microservice. Loading a 70B parameter model from disk (or object storage) into GPU VRAM can take anywhere from 30 to 120 seconds. If a sudden spike in traffic hits, and we react *after* it happens, the new Pods will take 2 minutes to boot. By then, the user has already given up, and your SLA is ruined.

Second, LLMs are fundamentally constrained by GPU memory capacity (specifically the KV Cache), not just compute (CPU). A standard HPA looks at CPU hitting 80% and scales up. But an LLM might have a 10% CPU usage while its VRAM is completely saturated, leaving requests queueing up forever.

To fix this, we realized we needed two things:

  1. Prediction: We need to scale based on what traffic will be in 2 minutes (when the Pod is actually ready).
  2. Queueing Theory: We need mathematically guaranteed bounds on tail latency (P95/P99) based on request arrival rates, not arbitrary CPU thresholds.

Chapter 4

System Architecture

Before we dive into the math, here is a 10,000-foot view of how the system operates inside the cluster. Inferscale runs as a Kubernetes Operator (via the kopf framework) and reconciles custom LLMPredictiveAutoscaler resources.

+-------------------------------------------------------------+
|                     Kubernetes Cluster                      |
|                                                             |
|  +----------------+        +--------------------------+     |
|  |                |        | `LLMPredictiveAutoscaler`|     |
|  | Prometheus /   |        | Custom Resource (CRD)    |     |
|  | Metrics Server |        +------------+-------------+     |
|  |                |                     | watch/reconcile   |
|  +-------+--------+                     v                   |
|          |                 +--------------------------+     |
|          | query rate(t)   | Inferscale Operator      |     |
|          +---------------->| (Python / kopf)          |     |
|                            |                          |     |
|                            | 1. Forecaster (EMA)      |     |
|                            | 2. Queueing-Math SLA Math|     |
|                            | 3. Warm Pool Optimizer   |     |
|                            +------------+-------------+     |
|                                         | patch status/scale|
|                                         v                   |
|                            +--------------------------+     |
|                            | Target Deployment (vLLM) |     |
|                            | - Pod (GPU)              |     |
|                            | - Pod (GPU)              |     |
|                            +--------------------------+     |
+-------------------------------------------------------------+
Chapter 4

Seeing the Future (forecaster.py)

Let's look at forecaster.py. Its job is simple: give me an estimate of traffic $S$ seconds from now, where $S$ is our cold start delay.

We implemented Holt's Linear Trend method (double exponential smoothing). Why an EMA-style model? Because it's computationally dirt cheap, keeps minimal state, and tracks sudden trend changes without the wild oscillation you get from naive polynomial extrapolation. Let's break down the Forecaster class function by function.

__init__(self, model_type: str = "EMA", alpha: float = 0.3)

The constructor initializes the forecaster. We default to "EMA" (Exponential Moving Average) with an alpha of 0.3. Alpha determines how much weight we give to the most recent observation versus the historical average. A higher alpha reacts faster but is more jittery.

We also initialize self.level (the baseline) and self.trend (the slope) to None. self.beta is set to half of alpha, controlling how quickly the trend adapts. We also keep a short self.history list just in case we want to plug in a more complex model like ARIMA later.

update(self, current_rate: float)

This is the ingestion function. The Kubernetes Operator calls this periodically (e.g., every 15 seconds), passing in the latest requests/sec scraped from Prometheus. It appends the rate to our history list (keeping only the last 100 samples so memory stays bounded) and then delegates the math to the specific model's update function (like _update_ema_holt).

_update_ema_holt(self, current_rate: float)

This is where the Holt's Linear Trend math actually happens.

def _update_ema_holt(self, current_rate: float):
    if self.level is None:
        self.level = current_rate
        self.trend = 0.0
        return

    last_level = self.level
    self.level = self.alpha * current_rate + (1 - self.alpha) * (last_level + self.trend)
    self.trend = self.beta * (self.level - last_level) + (1 - self.beta) * self.trend

If it's the first time we see data, we just set the level to the current rate and trend to 0. Otherwise, we calculate the new level by blending the current observation with the previous expectation (last_level + trend). Then we update the trend by blending the change in level with the previous trend.
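To make the two blends concrete, here is one update step worked by hand. The numbers are illustrative assumptions: $\alpha = 0.3$, $\beta = 0.15$ (half of alpha), a previous level of 10 req/s, a previous trend of 0, and a new observation of 20 req/s.

```latex
\[
\begin{aligned}
\text{level} &= 0.3 \times 20 + 0.7 \times (10 + 0) = 13 \\
\text{trend} &= 0.15 \times (13 - 10) + 0.85 \times 0 = 0.45
\end{aligned}
\]
```

The level moves only part of the way toward the new observation, and the trend picks up a small positive slope of 0.45 req/s per step.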

forecast(self, horizon_seconds: float, step_seconds: float = 15.0) -> float

This is the output function. The Autoscaler asks: "What will the traffic be horizon_seconds from now?" If we don't have any history, we return 0. Otherwise, we delegate to the specific model's forecast implementation.

_forecast_ema_holt(self, horizon_seconds: float, step_seconds: float) -> float

To predict the future using our Holt model, we figure out how many "steps" into the future we are looking (e.g., 60 seconds / 15 second intervals = 4 steps). We then take our current baseline level, and add the trend multiplied by the number of steps. Finally, we do a quick bounds check max(0.0, forecasted_rate) because you can't have negative web traffic!

def _forecast_ema_holt(self, horizon_seconds: float, step_seconds: float) -> float:
    steps_ahead = max(0.0, horizon_seconds / step_seconds)
    forecasted_rate = self.level + (steps_ahead * self.trend)
    return max(0.0, forecasted_rate)
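The pieces above can be combined into a small runnable sketch. This is not the repo's Forecaster: the class name HoltForecaster, the max_history parameter, and the use of a deque are assumptions made for illustration, and only the Holt path is included.

```python
from collections import deque
from typing import Optional


class HoltForecaster:
    """Illustrative stand-in for Forecaster (hypothetical name and defaults)."""

    def __init__(self, alpha: float = 0.3, max_history: int = 100):
        self.alpha = alpha
        self.beta = alpha / 2.0  # trend smoothing factor, half of alpha
        self.level: Optional[float] = None
        self.trend: Optional[float] = None
        # A deque with maxlen drops the oldest sample once it is full
        self.history = deque(maxlen=max_history)

    def update(self, current_rate: float) -> None:
        self.history.append(current_rate)
        if self.level is None:
            # First observation: no trend information yet
            self.level, self.trend = current_rate, 0.0
            return
        last_level = self.level
        self.level = self.alpha * current_rate + (1 - self.alpha) * (last_level + self.trend)
        self.trend = self.beta * (self.level - last_level) + (1 - self.beta) * self.trend

    def forecast(self, horizon_seconds: float, step_seconds: float = 15.0) -> float:
        if self.level is None:
            return 0.0  # no data yet
        steps_ahead = max(0.0, horizon_seconds / step_seconds)
        return max(0.0, self.level + steps_ahead * self.trend)


forecaster = HoltForecaster()
for rate in (10.0, 20.0, 30.0):  # traffic ramping up by 10 req/s per scrape
    forecaster.update(rate)
prediction = forecaster.forecast(horizon_seconds=60.0)  # 4 steps ahead

falling = HoltForecaster()
for rate in (10.0, 0.0):  # traffic collapsing to zero
    falling.update(rate)
clamped = falling.forecast(horizon_seconds=600.0)  # extrapolates below zero, clamped
```

With these inputs the prediction comes out at roughly 23.2 req/s. Note how the smoothing makes early forecasts lag a steep ramp, which is the price of a low alpha.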
Chapter 5

The Heart of the Beast (queueing_math.py)

This is my favorite part of the codebase. We treat our Deployment as a massive M/M/k queue. Requests arrive randomly (a Poisson process), service times are exponentially distributed, and there are $k$ active servers (Pods). Let's dissect the QueueingMath class function by function.

The Queueing-Math Formula
This is the Erlang C formula from queueing theory. Under the M/M/k assumptions, it gives us the exact probability that a request will be forced to wait in the queue because all $k$ servers are busy.

calculate_probability_of_wait(arrival_rate: float, service_rate: float, servers: int) -> float

This calculates the base probability $P(W > 0)$, the chance a user has to wait *at all*. First we calculate the traffic intensity a = arrival_rate / service_rate. Then the server utilization rho = a / servers. If rho >= 1.0, traffic is arriving at least as fast as we can drain it. The queue will grow without bound, so the probability of waiting is 1.0 (100%).
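Written out, with $\lambda$ standing for arrival_rate, $\mu$ for service_rate, and $k$ for servers (symbols chosen here for illustration), the quantity this function computes is the standard Erlang C probability:

```latex
\[
P(W > 0) =
\frac{\dfrac{a^{k}}{k!} \cdot \dfrac{k}{k - a}}
     {\displaystyle\sum_{i=0}^{k-1} \frac{a^{i}}{i!} + \dfrac{a^{k}}{k!} \cdot \dfrac{k}{k - a}},
\qquad a = \frac{\lambda}{\mu}, \quad \rho = \frac{a}{k} < 1
\]
```

As a quick check, $a = 2$ and $k = 3$ give a sum of $1 + 2 + 2 = 5$ and a last term of $\tfrac{8}{6} \cdot 3 = 4$, so $P(W > 0) = 4/9 \approx 0.44$.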

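import math

def calculate_probability_of_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    a = arrival_rate / service_rate  # Traffic intensity (offered load)
    rho = a / servers                # Per-server utilization

    if rho >= 1.0:
        return 1.0  # Unstable system: every request waits

    # Unnormalized weight of the states where fewer than `servers` pods are busy
    sum_term = sum((a ** i) / math.factorial(i) for i in range(servers))
    # Unnormalized weight of the states where all pods are busy and requests queue
    last_term = ((a ** servers) / math.factorial(servers)) * (servers / (servers - a))

    return last_term / (sum_term + last_term)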

The rest is substituting into the brutal Queueing-Math formula. We calculate the sum term for the denominator, attach the last term, and return the ratio.
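One caveat with the direct formula: a ** i and math.factorial(i) grow very quickly, so a large pool can overflow floating point. A minimal sketch of one alternative, assuming we are free to restructure the math, builds the same probability from the Erlang B recursion, which only ever handles values between 0 and 1. The function names erlang_c_direct and erlang_c_stable are illustrative and not from the repo.

```python
import math


def erlang_c_direct(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Textbook form, mirroring the function shown above."""
    a = arrival_rate / service_rate
    if a / servers >= 1.0:
        return 1.0
    sum_term = sum((a ** i) / math.factorial(i) for i in range(servers))
    last_term = ((a ** servers) / math.factorial(servers)) * (servers / (servers - a))
    return last_term / (sum_term + last_term)


def erlang_c_stable(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Same probability via the Erlang B recursion, with no factorials or large powers."""
    a = arrival_rate / service_rate
    if a / servers >= 1.0:
        return 1.0
    blocking = 1.0  # Erlang B blocking probability with zero servers
    for n in range(1, servers + 1):
        blocking = (a * blocking) / (n + a * blocking)
    # Convert Erlang B (blocking) into Erlang C (probability of waiting)
    return (servers * blocking) / (servers - a * (1.0 - blocking))


small_direct = erlang_c_direct(2.0, 1.0, 3)
small_stable = erlang_c_stable(2.0, 1.0, 3)
large_stable = erlang_c_stable(150.0, 1.0, 200)  # a pool size where the direct form may overflow
```

Both functions agree on small inputs, and the recursive one keeps working as the number of servers grows.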

calculate_probability_wait_longer_than_sla(arrival_rate, service_rate, servers, sla_seconds)

Management usually doesn't care if a user waits 0.01 seconds. They care if the user waits longer than our contracted Service Level Agreement (SLA, e.g., 0.5s).

def calculate_probability_wait_longer_than_sla(arrival_rate, service_rate, servers, sla_seconds):
    p_wait_gt_0 = QueueingMath.calculate_probability_of_wait(arrival_rate, service_rate, servers)
    drain_rate = (servers * service_rate) - arrival_rate
    if drain_rate <= 0:
        return 1.0  # Unstable system: without this guard the result could exceed 1.0
    return p_wait_gt_0 * math.exp(-drain_rate * sla_seconds)

This function takes the base probability of waiting ($P(W > 0)$) from the previous function and multiplies it by an exponential decay term that models the queue draining at rate servers * service_rate - arrival_rate. The result decays exponentially as sla_seconds increases, meaning a wait longer than 5 seconds is exponentially less likely than a wait longer than 1 second.
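In symbols, using $\lambda$ for arrival_rate, $\mu$ for service_rate, $k$ for servers, and $t$ for sla_seconds (notation assumed here for illustration), the waiting-time tail of a stable M/M/k queue is:

```latex
\[
P(W > t) = P(W > 0)\, e^{-(k\mu - \lambda)\, t}, \qquad k\mu > \lambda
\]
```

For example, with $\lambda = 2$, $\mu = 1$, $k = 3$, and $t = 0.5$, we have $P(W > 0) = 4/9$ and an exponent of $-(3 - 2) \times 0.5 = -0.5$, so $P(W > 0.5) = \tfrac{4}{9} e^{-0.5} \approx 0.27$.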

required_replicas_for_sla(...)

This is the money function. It inverts the question. Instead of giving us the probability of failure for a known number of servers, it searches for the number of servers required to keep our failure rate below max_violation_probability.

It calculates the absolute minimum servers needed for stability k_min_stable = math.floor(arrival_rate / service_rate) + 1. Then, it uses a simple linear search loop starting from k_min_stable up to max_servers, evaluating the calculate_probability_wait_longer_than_sla function. The moment the probability drops below our threshold, it returns that number of replicas!
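The search described above can be sketched as follows. The source elides the real signature, so the parameter list here is a guess, as is the choice to return max_servers when no pool size meets the target. The helper p_wait_exceeds_sla is a compact, hypothetical stand-in for the two QueueingMath functions.

```python
import math


def p_wait_exceeds_sla(arrival_rate, service_rate, servers, sla_seconds):
    """P(wait > sla_seconds) for an M/M/k queue; 1.0 if the system is unstable."""
    a = arrival_rate / service_rate
    if a >= servers:
        return 1.0
    sum_term = sum((a ** i) / math.factorial(i) for i in range(servers))
    last_term = ((a ** servers) / math.factorial(servers)) * (servers / (servers - a))
    p_wait = last_term / (sum_term + last_term)
    return p_wait * math.exp(-(servers * service_rate - arrival_rate) * sla_seconds)


def required_replicas_for_sla(arrival_rate, service_rate, sla_seconds,
                              max_violation_probability, max_servers):
    # Smallest pool for which utilization is strictly below 1
    k_min_stable = math.floor(arrival_rate / service_rate) + 1
    for k in range(k_min_stable, max_servers + 1):
        risk = p_wait_exceeds_sla(arrival_rate, service_rate, k, sla_seconds)
        if risk < max_violation_probability:
            return k  # first pool size that meets the SLA target
    return max_servers  # assumption: fall back to the configured ceiling


# 2 req/s, 1 req/s per pod, 0.5s SLA, at most 1% of requests may exceed it
replicas = required_replicas_for_sla(2.0, 1.0, 0.5, 0.01, max_servers=20)
capped = required_replicas_for_sla(2.0, 1.0, 0.5, 0.01, max_servers=4)
```

With these illustrative inputs the search passes over 3, 4, and 5 pods (risks of about 27%, 6.4%, and 1.3%) and settles on 6.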

Interactive Math Playground

Drag the sliders to see how adding requests, changing service times, or adding active pods completely alters the SLA violation probability mathematically.

Junior Engineer Guided Scenarios

  • Scenario 1: Set Arrival Rate to 100 and Pods to 12. Notice the red Unstable warning? Your CPU is fine, but math proves your queue is infinite. Try adding 2 more pods.
  • Scenario 2: Scroll down to the CFO Simulator below. Set GPU Cost to $15/hr. See how the Optimizer shrinks the warm pool to save money while risking slight SLA drops? This teaches you the intuition behind the numbers.

The playground reports two readouts, Base P(Wait) and SLA Violation Risk, plus a Stable/Unstable status indicator.

Chapter 6

Balancing the Books (optimizer.py)

Sometimes, it makes financial sense to keep extra GPUs running ("warm pool"). If violating an SLA causes users to churn (losing us money), we need to balance the cost of a running GPU instance against the mathematical penalty of SLA violations.

optimize_warm_pool_size(arrival_rate, service_rate, sla_seconds, cost_per_gpu_hour, ...) -> Tuple[int, float]

This is the sole method in the optimizer. It finds the optimal number of warm_replicas that minimizes the total aggregate cost.

Our loss function is: $J(k) = (\text{Cost}_{\text{GPU}} \times k) + (\text{Penalty} \times P_{\text{violation}})$. First, we find our stable floor k_stable = math.floor(arrival_rate / service_rate) + 1. Then we iterate:

for k in range(start_k, max_replicas + 1):
    p_violation = QueueingMath.calculate_probability_wait_longer_than_sla(...)
    
    gpu_cost = cost_per_gpu_hour * k
    sla_cost = sla_penalty_per_hour * p_violation
    total_cost = gpu_cost + sla_cost
    
    if total_cost < min_cost:
        min_cost = total_cost
        best_k = k
    elif total_cost > min_cost:
        # It's a convex function. If it goes up, we found the bottom!
        break

Because adding GPUs costs money linearly, but drops SLA violations exponentially, there is a single "valley" in the curve. By checking if total_cost > min_cost, we implement an early-exit optimization: once the cost starts going back up, we know we've passed the bottom of the valley, and we can stop searching.
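Here is a self-contained sketch of that loop so the valley can be seen with real numbers. The parameters after cost_per_gpu_hour are elided in the source, so sla_penalty_per_hour and max_replicas in the signature are assumptions, as is starting the search at k_stable. The helper p_wait_exceeds_sla is a hypothetical stand-in for the QueueingMath call.

```python
import math
from typing import Tuple


def p_wait_exceeds_sla(arrival_rate, service_rate, servers, sla_seconds):
    """P(wait > sla_seconds) for an M/M/k queue; 1.0 if the system is unstable."""
    a = arrival_rate / service_rate
    if a >= servers:
        return 1.0
    sum_term = sum((a ** i) / math.factorial(i) for i in range(servers))
    last_term = ((a ** servers) / math.factorial(servers)) * (servers / (servers - a))
    p_wait = last_term / (sum_term + last_term)
    return p_wait * math.exp(-(servers * service_rate - arrival_rate) * sla_seconds)


def optimize_warm_pool_size(arrival_rate, service_rate, sla_seconds, cost_per_gpu_hour,
                            sla_penalty_per_hour, max_replicas) -> Tuple[int, float]:
    k_stable = math.floor(arrival_rate / service_rate) + 1
    best_k, min_cost = k_stable, float("inf")
    for k in range(k_stable, max_replicas + 1):
        p_violation = p_wait_exceeds_sla(arrival_rate, service_rate, k, sla_seconds)
        total_cost = cost_per_gpu_hour * k + sla_penalty_per_hour * p_violation
        if total_cost < min_cost:
            min_cost, best_k = total_cost, k
        elif total_cost > min_cost:
            break  # cost is rising again, so the minimum is behind us
    return best_k, min_cost


# $2/hr per GPU, $100/hr penalty weight, 2 req/s arriving, 1 req/s per pod, 0.5s SLA
best_k, min_cost = optimize_warm_pool_size(2.0, 1.0, 0.5, 2.0, 100.0, max_replicas=20)
```

With these made-up prices the hourly cost falls from about 32.96 at 3 pods to about 11.33 at 5 pods, then rises to about 12.24 at 6 pods, so the loop exits early and returns 5.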

The CFO Simulator

Enter your Cloud Provider's GPU hourly cost and see how different autoscaling paradigms hit your monthly bottom line. (Assuming 1M requests/month with heavy spiky traffic).

The simulator compares the monthly cost of three strategies: Static Provisioning (Always Peak), Reactive HPA (Lost Customers + Compute), and Inferscale (Optimized Math).

Chapter 7

Putting It Together (autoscaler.py)

The Autoscaler class is the orchestrator that sits between the Kubernetes machinery and the math modules. Let's see how it operates.

__init__(self, config: Dict[str, Any])

Initializes the state. It extracts the configuration blocks from our Custom Resource Definition, initializes a ModelProfile (which holds our Forecaster), and sets up variables for stability tracking like self.last_scale_time.
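As a rough illustration of the config extraction step, the sketch below parses a spec dictionary into a typed object. The field names come from elsewhere in this tutorial (targetDeployment, serviceRatePerPod, coldStartSeconds, minWarmReplicas, maxReplicas), but the flat layout, the defaults, the validation rules, and the AutoscalerConfig class itself are all assumptions, not the repo's actual structure.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AutoscalerConfig:
    """Hypothetical typed view of an LLMPredictiveAutoscaler spec."""
    target_deployment: str
    service_rate_per_pod: float
    cold_start_seconds: float
    min_warm_replicas: int
    max_replicas: int

    @classmethod
    def from_spec(cls, spec: Dict[str, Any]) -> "AutoscalerConfig":
        config = cls(
            target_deployment=str(spec["targetDeployment"]),
            service_rate_per_pod=float(spec["serviceRatePerPod"]),
            cold_start_seconds=float(spec["coldStartSeconds"]),
            # Defaults below are illustrative guesses, not the CRD's real defaults
            min_warm_replicas=int(spec.get("minWarmReplicas", 1)),
            max_replicas=int(spec.get("maxReplicas", 10)),
        )
        if config.service_rate_per_pod <= 0:
            raise ValueError("serviceRatePerPod must be positive")
        if config.min_warm_replicas > config.max_replicas:
            raise ValueError("minWarmReplicas cannot exceed maxReplicas")
        return config


config = AutoscalerConfig.from_spec({
    "targetDeployment": "vllm-demo",  # made-up Deployment name
    "serviceRatePerPod": 2.5,
    "coldStartSeconds": 120,
    "maxReplicas": 8,
})
```

Validating up front means a bad manifest fails when the resource is created rather than deep inside the math, where a zero service rate would surface as a division error.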

observe_metrics(self, current_rate: float)

A simple wrapper. The Operator passes the latest request rate scraped from Prometheus into this function, which in turn calls forecaster.update(current_rate).

calculate_desired_replicas(self, current_replicas: int) -> int

This is the main aggregation pipeline where the magic happens.

Execution Pipeline Visualizer

The trace below follows one Prometheus metric through the codebase to the final Operator patch action.

  1. forecaster.py
     Input: 10 req/s (Current)
     Step: Applying Double EMA
     Output: 15 req/s expected in 60s
  2. queueing_math.py
     Input: 15 req/s (Forecasted)
     Step: Searching for P(wait > 0.5s) < 1%
     Output: 4 Pods needed for SLA stability
  3. optimizer.py
     Input: 4 Pods (Minimum SLA bound)
     Step: Minimizing Cost Function J(k)
     Output: Keep 5 Pods warm (Extra $2 saves $10 penalty)
  4. operator_main.py
     Current State: 2 Pods running
     Step: Patch the target Deployment
     Action: Scale UP to 5 Pods

_apply_stability_controls(self, raw_desired: int, current_replicas: int) -> int

Stability First!
Without stability controls, autoscalers yo-yo up and down endlessly, destroying cluster stability. You can scale up instantly, but you must wait before terminating an expensive GPU Pod.

This function clamps the desired replicas between the hard configured minWarmReplicas and maxReplicas.

Crucially, it implements a scale-down cooldown. If the math says we should drop replicas (e.g., scale from 5 GPUs down to 2), the function checks now - self.last_scale_time. If the cooldown period hasn't elapsed, it rejects the scale-down request to prevent replica thrashing.
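The clamp and cooldown logic can be sketched like this. The class name, the scale_down_cooldown_seconds parameter, the injectable clock, and the rule that any replica change resets last_scale_time are assumptions for illustration; the real method lives on Autoscaler.

```python
import time
from typing import Callable, Optional


class StabilityControls:
    """Clamp replica targets and delay scale-downs (hypothetical helper)."""

    def __init__(self, min_warm_replicas: int, max_replicas: int,
                 scale_down_cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.min_warm_replicas = min_warm_replicas
        self.max_replicas = max_replicas
        self.scale_down_cooldown_seconds = scale_down_cooldown_seconds
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.last_scale_time: Optional[float] = None

    def apply(self, raw_desired: int, current_replicas: int) -> int:
        # Hard bounds always win over the math
        desired = max(self.min_warm_replicas, min(self.max_replicas, raw_desired))
        now = self.clock()
        if desired < current_replicas and self.last_scale_time is not None:
            if now - self.last_scale_time < self.scale_down_cooldown_seconds:
                return current_replicas  # still cooling down: hold the current size
        if desired != current_replicas:
            self.last_scale_time = now  # assumption: any change restarts the cooldown
        return desired


fake_now = [0.0]
controls = StabilityControls(1, 10, 300.0, clock=lambda: fake_now[0])
scaled_up = controls.apply(raw_desired=5, current_replicas=2)    # t=0s: scale up at once
fake_now[0] = 60.0
held = controls.apply(raw_desired=2, current_replicas=5)         # t=60s: inside cooldown
fake_now[0] = 400.0
scaled_down = controls.apply(raw_desired=2, current_replicas=5)  # t=400s: cooldown over
```

Scale-ups pass straight through, while the scale-down request is held at 5 until the 300-second cooldown has elapsed.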

Chapter 8

Hooking into the Matrix (operator_main.py)

All this beautiful math is useless if it doesn't talk to Kubernetes. We wrote a custom operator using kopf. We defined a Custom Resource Definition (CRD) called LLMPredictiveAutoscaler.

Instead of a massive, nasty Go codebase, kopf lets us write exactly what we mean using decorators. Here are the core lifecycle hooks.

create_fn(spec, name, namespace, **kwargs)

This function fires once when the resource is created, for example when a user first runs kubectl apply -f my-autoscaler.yaml. It constructs a unique key (namespace + name) and instantiates the core Autoscaler(spec) engine, storing it in memory in the autoscalers dictionary. Finally, it initializes the CRD status block so tools like kubectl get lpa show valid data.

delete_fn(name, namespace, **kwargs)

Memory cleanup. If a user deletes the custom resource, this hook catches it and deletes the in-memory Autoscaler engine instance from our dictionary.

reconcile_autoscaler(spec, name, namespace, status, patch, **kwargs)

This is the heartbeat of Inferscale. It's decorated with @kopf.timer(..., interval=15.0), meaning kopf invokes it every 15 seconds for each LLMPredictiveAutoscaler resource.

  1. It uses the Python kubernetes client to fetch the current state of the targetDeployment.
  2. It observes metrics. (In our code, we mock this with a sine wave, but in prod, we'd fire a PromQL query across the network).
  3. It calls engine.calculate_desired_replicas(current_replicas) which triggers that whole pipeline of Forecaster -> QueueingMath -> Optimizer we just learned about.
  4. It patches the CRD's status with the new desired target and the forecasted traffic.
  5. The Action: If desired_replicas != current_replicas, it literally patches the Kubernetes Deployment object via the API server, scaling the LLM up or down.
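Two of the steps above are easy to sketch without a cluster: the mocked metric from step 2 and the scale decision from step 5. Everything here is illustrative. The function names, the sine-wave parameters, and the choice to represent "no action" as an empty dict are assumptions, and the patch body uses the usual {"spec": {"replicas": N}} shape that the Kubernetes API accepts for Deployments.

```python
import math


def mock_request_rate(elapsed_seconds: float, base: float = 50.0,
                      amplitude: float = 30.0, period_seconds: float = 600.0) -> float:
    """Synthetic requests/sec following a sine wave, never negative."""
    phase = 2.0 * math.pi * elapsed_seconds / period_seconds
    return max(0.0, base + amplitude * math.sin(phase))


def plan_scale_action(current_replicas: int, desired_replicas: int) -> dict:
    """Return the patch body to send, or an empty dict when nothing changes."""
    if desired_replicas == current_replicas:
        return {}
    return {"spec": {"replicas": desired_replicas}}


peak = mock_request_rate(150.0)    # a quarter period in: crest of the wave
trough = mock_request_rate(450.0)  # three quarters in: bottom of the wave
patch_body = plan_scale_action(current_replicas=2, desired_replicas=5)
no_op = plan_scale_action(current_replicas=5, desired_replicas=5)
```

In the real operator, a non-empty body would be sent through the kubernetes client, for instance with AppsV1Api.patch_namespaced_deployment_scale.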

Let's briefly look at the Kubernetes infrastructure we use to get this Python code running on the cluster.

Chapter 9

The Infrastructure (.yaml)

Code is nothing without the infrastructure to run it. Here are the four foundational YAML files we created to deploy our custom operator.

crd.yaml (Custom Resource Definition)

This is where we extend Kubernetes. Kubernetes doesn't know what an LLMPredictiveAutoscaler is. We have to teach the Kubernetes API server our custom schema using OpenAPI v3 specifications. This file defines exactly what fields the autoscaler requires (like targetDeployment, serviceRatePerPod, and coldStartSeconds) and their types (e.g., string, number). It effectively creates a new native API endpoint in the cluster!

rbac.yaml (Role-Based Access Control)

By default, Pods running inside Kubernetes have almost no permissions against the API server. Our operator Python script needs privileges to read its own CRDs and to mutate Deployments (scaling them up and down).

We created a ServiceAccount for our operator, a ClusterRole granting verbs like get, watch, patch, update against Deployments and our custom APIs, and finally a ClusterRoleBinding linking the two together. Without this, our operator gets an immediate `403 Forbidden` from the API Server.

deployment.yaml

This tells Kubernetes to actually run our Python code. We spin up a Deployment consisting of one replica (we only want one operator reconciling the cluster to prevent split-brain collisions). It points to our packaged Docker image (inferscale-operator:latest) and runs the entry point: kopf run /app/operator_main.py --all-namespaces, telling kopf to listen across the whole cluster.

example.yaml

The testing dummy. This file contains a mock LLM target Deployment (like an NGINX container masquerading as a vLLM cache) alongside an actual instance of our newly defined LLMPredictiveAutoscaler CRD. When we run kubectl apply -f example.yaml, our Python operator catches the event and immediately begins the queueing-math reconciliation loop.

Take your time reading the code repo. Now that you know the purpose of every class, function, and YAML manifest, you're officially ready to start building. Grab a ticket from the board. You're going to do great things here.

Chapter 10

Standard Tool Comparison

Where standard reactive autoscalers fail, Inferscale thrives. Here is how we stack up against the Kubernetes HorizontalPodAutoscaler (HPA) and KEDA.

"When to use what?" Decision Tree

As a junior engineer, navigating architectural trade-offs is hard. You might think you need Inferscale for a simple Nginx web server. Here is a simple guide:

  • 🤷 Are your pods lightweight and boot in <2 seconds?
    ➡️ Use standard HPA.
  • 📦 Do you scale based on Kafka queue depth rather than math?
    ➡️ Use KEDA.
  • 🧠 Are you serving massive AI models that take minutes to load, and SLA is critical?
    ➡️ Use Inferscale.

Feature           | HorizontalPodAutoscaler                              | KEDA                             | Inferscale
------------------+------------------------------------------------------+----------------------------------+--------------------------------
Paradigm          | Reactive (CPU/Memory thresholds)                     | Reactive (Custom metrics/queues) | Predictive Queueing Theory
Cold Start        | Very vulnerable (Scale up during spike violates SLA) | Vulnerable                       | Compensates automatically
Tail Latency      | Best-effort                                          | Best-effort                      | Mathematically guaranteed bound
Cost Optimization | None                                                 | None                             | Minimizes GPU cost function

The Cold Start Race

Watch how HPA reacts *after* the spike hits (leading to dropped traffic during the boot time), compared to Inferscale which predicts the spike and boots exactly early enough to catch it.

  • Reactive HPA: Waiting -> 🚀 Traffic Spike Hits! -> Booting Pod (60s), Dropping Requests -> Pod Ready (Too Late)
  • Predictive Inferscale: Waiting -> 📈 Spike Predicted! -> Booting Pod (60s) -> Pod Ready -> 🚀 Traffic Spike Hits! (Handled)

Using CPU utilization to scale LLMs is fundamentally flawed because GPU memory, in particular the KV cache, dictates throughput. Scaling on the measured request rate (rate()) and queue depth, evaluated against the queueing model, is far more robust.

Chapter 11

Running Locally

Want to see it in action on your own machine? We use Minikube for local testing.

# 1. Install dependencies
pip install kopf kubernetes

# 2. Start your local cluster
minikube start

# 3. Install the infrastructure (CRDs and RBAC)
kubectl apply -f crd.yaml
kubectl apply -f rbac.yaml

# 4. Run the operator locally (it connects to your current kubeconfig context)
kopf run operator_main.py --all-namespaces

# 5. In a new terminal tab, apply the dummy LLM deployment and CRD:
kubectl apply -f example.yaml

Watch the logs in your kopf terminal. You will see the operator initialize, ingest the mock metrics, perform the Queueing-Math calculations, and automatically maintain the desired replicas on the dummy deployment.