Tool names such as Prometheus, Alertmanager, and Grafana are used here only as examples; any equivalent observability stack, managed or self-hosted, can be used. The sizing figures assume a representative stack for capacity planning, so adjust them for the platform you select (for example, Azure Monitor or Datadog).

| Workload | Type | Pool |
|---|---|---|
| cb-ai-service | Application (Python) | User pool |
| Observability and monitoring tools | Observability (traces & telemetry pipeline); monitoring (metrics collection & alerting) | User pool |
| OTEL Collector | Telemetry pipeline | User pool |
| CoreDNS, kube-proxy, metrics-server | Kubernetes system components | System pool |
| Azure Policy, OMS Agent, Defender, CSI Driver | Azure platform, security, and compliance agents | System pool |

Components that run as DaemonSets, such as kube-proxy and the CSI node driver, also place one pod on every user pool node, so their per-node overhead counts against user pool capacity as well.

| Pool | Purpose | VM Family | Autoscaling |
|---|---|---|---|
| System pool | Kubernetes internals and Azure agents | Dasv5 series (AMD, cost-optimized) | Yes (2–3 nodes) |
| User pool | Application and observability workloads | Dasv5 series (AMD, cost-optimized) | Yes (3–12 nodes, depending on profile) |

| Profile | Concurrent users |
|---|---|
| Small (S) | Up to 100 |
| Medium (M) | 100–300 |
| Large (L) | 300–700 |

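A minimal sketch of mapping an expected peak of concurrent users to a profile. The table lists 100 as both the Small ceiling and the Medium floor (and 300 likewise), so the sketch assumes each upper bound is inclusive. `select_profile` is a hypothetical helper, not part of the deployment code.

```python
def select_profile(concurrent_users: int) -> str:
    """Return the sizing profile for an expected peak of concurrent users.

    Upper bounds are treated as inclusive (assumption): 100 -> Small, 300 -> Medium.
    """
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    # (inclusive upper bound, profile) pairs from the profile table
    for upper, profile in ((100, "Small"), (300, "Medium"), (700, "Large")):
        if concurrent_users <= upper:
            return profile
    # The document defines no profile above 700 concurrent users
    raise ValueError(f"{concurrent_users} users exceeds the Large profile (700)")


if __name__ == "__main__":
    for users in (80, 100, 250, 650):
        print(users, select_profile(users))
```

Above 700 users the sketch raises an error rather than guessing, because no larger profile is defined.
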
| Parameter | Small | Medium | Large |
|---|---|---|---|
| User Pool VM | Standard_D8as_v5 (8 vCPU / 32 GiB) | Standard_D8as_v5 (8 vCPU / 32 GiB) | Standard_D8as_v5 (8 vCPU / 32 GiB) |
| User Pool Min / Max Nodes | 3 / 6 | 3 / 10 | 3 / 12 |
| System Pool VM | Standard_D2as_v5 (2 vCPU / 8 GiB) | Standard_D2as_v5 (2 vCPU / 8 GiB) | Standard_D2as_v5 (2 vCPU / 8 GiB) |
| System Pool Min / Max Nodes | 2 / 3 | 2 / 3 | 2 / 3 |
| Total App Pod Replicas | 2–10 | 2–10 | 2–10 |
| Prometheus (or equivalent) Replicas | 1 | 2 (HA) | 2 (HA) |
| Grafana (or equivalent) Replicas | 1 | 1 | 2 (HA) |
| OTEL Collector Replicas | 1 | 2 | 3 |

| cb-ai-service (per pod) | Small | Medium | Large |
|---|---|---|---|
| CPU Request | 500m (0.5 vCPU) | 1000m (1 vCPU) | 1500m (1.5 vCPU) |
| Memory Request | 1 GiB | 1.5 GiB | 3 GiB |
| HPA Target CPU | 70% | 70% | 70% |
| HPA Min / Max Replicas | 2 / 10 | 2 / 10 | 2 / 10 |

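HPA measures CPU utilization against each pod's CPU request. The 70% target therefore corresponds to about 350m of average usage per pod on the Small 500m request (700m for Medium, 1,050m for Large). With the default 10% tolerance, scale-out begins only once usage exceeds about 77% of the request. The sketch below applies the documented Kubernetes formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), and clamps the result to the 2/10 bounds. It ignores stabilization windows, scaling policies, and missing-metric handling, so treat it as an approximation. `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,        # average CPU usage as % of the pod's CPU request
    target_utilization: float = 70.0,  # HPA Target CPU from the table
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,            # Kubernetes default HPA tolerance
) -> int:
    """Approximate a single HPA evaluation for a CPU-utilization target."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # within tolerance: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # HPA never scales outside the configured min/max replicas
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% of their CPU request against a 70% target
    print(desired_replicas(4, 90.0))  # 6
```
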
Prometheus is used here as an example metrics solution; any equivalent metrics and alerting solution, managed or self-hosted, can be used. Size CPU, memory, and the PVC based on your retention period and active series count.

| Prometheus (per replica) | Small | Medium | Large |
|---|---|---|---|
| Replicas | 1 | 2 (HA) | 2 (HA) |
| CPU Request | 500m (0.5 vCPU) | 1000m (1 vCPU) | 2000m (2 vCPU) |
| Memory Request | 1 GiB | 2 GiB | 4 GiB |

| Concurrent Users | Active Series | Ingestion Rate | 7-day PVC | 15-day PVC | 30-day PVC |
|---|---|---|---|---|---|
| ~100 (Small) | 20–50k | ~500 samples/sec | ~20 GiB | ~40 GiB | ~80 GiB |
| ~300 (Medium) | 50–100k | ~1,500 samples/sec | ~50 GiB | ~100 GiB | ~200 GiB |
| ~700 (Large) | 100–200k | ~3,000 samples/sec | ~100 GiB | ~200 GiB | ~400 GiB |

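The Prometheus storage documentation gives a capacity formula: needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample. It also notes that Prometheus averages 1–2 bytes per sample. This result is the raw on-disk block size only. The write-ahead log, compaction, and growth in series count add to it. For the ingestion rates listed above, the raw estimate is far smaller than the PVC sizes in the table, which leaves generous room for those factors.

The sketch below uses 2 bytes per sample. It also estimates the ingestion rate from the active series count and a scrape interval. The table does not state a scrape interval, so the 30 s value is an assumption. Both helper names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_GIB = 1024 ** 3


def ingestion_rate(active_series: int, scrape_interval_s: float) -> float:
    """Approximate samples/sec: each active series yields one sample per scrape."""
    return active_series / scrape_interval_s


def tsdb_size_gib(retention_days: float, samples_per_sec: float,
                  bytes_per_sample: float = 2.0) -> float:
    """Prometheus sizing formula:
    retention_seconds * ingested_samples_per_second * bytes_per_sample.
    2 bytes/sample is the upper end of the documented 1-2 bytes range.
    """
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_sec * bytes_per_sample / BYTES_PER_GIB


if __name__ == "__main__":
    # Ingestion rates from the table above
    for profile, rate in (("Small", 500), ("Medium", 1_500), ("Large", 3_000)):
        sizes = [tsdb_size_gib(days, rate) for days in (7, 15, 30)]
        print(profile, ", ".join(f"{s:.1f} GiB" for s in sizes))
    # Hypothetical: 50k active series scraped every 30 s
    print(f"{ingestion_rate(50_000, 30):.0f} samples/sec")
```

Recompute with the ingestion rate you measure (for example, from `prometheus_tsdb_head_samples_appended_total`) before committing to a disk size.
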
| Grafana (per replica) | Small | Medium | Large |
|---|---|---|---|
| Replicas | 1 | 1 | 2 (HA) |
| CPU Request | 100m (0.1 vCPU) | 250m (0.25 vCPU) | 500m (0.5 vCPU) |
| Memory Request | 128 MiB | 256 MiB | 512 MiB |
| PVC | 5 GiB | 10 GiB | 10 GiB |

| OTEL Collector (per replica) | Small | Medium | Large |
|---|---|---|---|
| Mode | Deployment (gateway) | Deployment (gateway) | Deployment (gateway) |
| Replicas | 1 | 2 | 3 |
| CPU Request | 250m (0.25 vCPU) | 500m (0.5 vCPU) | 1000m (1 vCPU) |
| Memory Request | 512 MiB | 1 GiB | 2 GiB |
| PVC | 5 GiB | 5 GiB | 10 GiB |

| Component | CPU Request | Memory Request | Type |
|---|---|---|---|
| Node Exporter | 50–100m per node | 30–64 MiB per node | DaemonSet (runs on every user pool node) |
| kube-state-metrics | 50–200m | 64–256 MiB | Single Deployment |

| Workload (Small profile) | Min Pods | Max Pods | CPU Request (Min) | CPU Request (Max) | Memory Request (Min) | Memory Request (Max) |
|---|---|---|---|---|---|---|
| cb-ai-service | 2 | 10 | 1000m (1 vCPU) | 5000m (5 vCPU) | 2 GiB | 10 GiB |
| Prometheus | 1 | 1 | 500m | 500m | 1 GiB | 1 GiB |
| Alertmanager | 1 | 1 | 50m | 50m | 64 MiB | 64 MiB |
| Node Exporter (DaemonSet) | 3 | 6 | 150m | 300m | 90 MiB | 180 MiB |
| kube-state-metrics | 1 | 1 | 50m | 50m | 64 MiB | 64 MiB |
| Grafana | 1 | 1 | 100m | 100m | 128 MiB | 128 MiB |
| OTEL Collector | 1 | 1 | 250m | 250m | 512 MiB | 512 MiB |
| Subtotal | 10 | 21 | 2,100m (~2.1 vCPU) | 6,250m (~6.3 vCPU) | ~3.8 GiB | ~11.9 GiB |
| + 25% headroom | | | ~2,625m (~2.6 vCPU) | ~7,810m (~7.8 vCPU) | ~4.8 GiB | ~14.9 GiB |

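Each subtotal is the sum of per-pod request × pod count at the minimum and maximum pod counts, and the headroom row multiplies that sum by 1.25. The sketch below reproduces this arithmetic for the Small profile. It uses the per-pod requests from the component tables, the Small replica counts, and one Node Exporter pod per user pool node (3–6 nodes) at the low end of its per-node range (50m, 30 MiB). Alertmanager has no component table, so its per-pod values come from this profile table. The names `Workload`, `totals`, and `SMALL` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    min_pods: int
    max_pods: int
    cpu_m: int    # CPU request per pod, millicores
    mem_mib: int  # memory request per pod, MiB


def totals(workloads, headroom=0.25):
    """Sum requests at min and max pod counts, then apply headroom."""
    cpu_min = sum(w.min_pods * w.cpu_m for w in workloads)
    cpu_max = sum(w.max_pods * w.cpu_m for w in workloads)
    mem_min = sum(w.min_pods * w.mem_mib for w in workloads)
    mem_max = sum(w.max_pods * w.mem_mib for w in workloads)
    factor = 1 + headroom
    return {
        "cpu_m": (cpu_min, cpu_max),
        "cpu_m_with_headroom": (cpu_min * factor, cpu_max * factor),
        "mem_gib": (mem_min / 1024, mem_max / 1024),
        "mem_gib_with_headroom": (mem_min * factor / 1024, mem_max * factor / 1024),
    }


SMALL = [
    Workload("cb-ai-service", 2, 10, 500, 1024),
    Workload("Prometheus", 1, 1, 500, 1024),
    Workload("Alertmanager", 1, 1, 50, 64),
    # DaemonSet: one pod per user pool node (3-6 nodes)
    Workload("Node Exporter", 3, 6, 50, 30),
    Workload("kube-state-metrics", 1, 1, 50, 64),
    Workload("Grafana", 1, 1, 100, 128),
    Workload("OTEL Collector", 1, 1, 250, 512),
]

if __name__ == "__main__":
    for key, (low, high) in totals(SMALL).items():
        print(f"{key}: {low:,.1f} / {high:,.1f}")
```

For the Medium or Large profile, swap in that profile's per-pod requests and replica counts, and set the Node Exporter pod range to the profile's user pool min/max node count.
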
| Workload (Medium profile) | Min Pods | Max Pods | CPU Request (Min) | CPU Request (Max) | Memory Request (Min) | Memory Request (Max) |
|---|---|---|---|---|---|---|
| cb-ai-service | 2 | 10 | 2000m (2 vCPU) | 10000m (10 vCPU) | 3 GiB | 15 GiB |
| Prometheus | 2 (HA) | 2 (HA) | 2000m (2 vCPU) | 2000m (2 vCPU) | 4 GiB | 4 GiB |
| Alertmanager | 2 (HA) | 2 (HA) | 100m | 100m | 128 MiB | 128 MiB |
| Node Exporter (DaemonSet) | 3 | 10 | 300m | 1000m | 180 MiB | 600 MiB |
| kube-state-metrics | 1 | 1 | 100m | 100m | 128 MiB | 128 MiB |
| Grafana | 1 | 1 | 250m | 250m | 256 MiB | 256 MiB |
| OTEL Collector | 2 | 2 | 1000m (1 vCPU) | 1000m (1 vCPU) | 2 GiB | 2 GiB |
| Subtotal | 13 | 28 | 5,750m (~5.8 vCPU) | 14,450m (~14.5 vCPU) | ~9.7 GiB | ~22.1 GiB |
| + 25% headroom | | | ~7,190m (~7.2 vCPU) | ~18,060m (~18.1 vCPU) | ~12.1 GiB | ~27.6 GiB |

| Workload (Large profile) | Min Pods | Max Pods | CPU Request (Min) | CPU Request (Max) | Memory Request (Min) | Memory Request (Max) |
|---|---|---|---|---|---|---|
| cb-ai-service | 2 | 10 | 3000m (3 vCPU) | 15000m (15 vCPU) | 6 GiB | 30 GiB |
| Prometheus | 2 (HA) | 2 (HA) | 4000m (4 vCPU) | 4000m (4 vCPU) | 8 GiB | 8 GiB |
| Alertmanager | 2 (HA) | 2 (HA) | 200m | 200m | 256 MiB | 256 MiB |
| Node Exporter (DaemonSet) | 3 | 12 | 300m | 1200m | 180 MiB | 720 MiB |
| kube-state-metrics | 1 | 1 | 200m | 200m | 256 MiB | 256 MiB |
| Grafana | 2 (HA) | 2 (HA) | 1000m (1 vCPU) | 1000m (1 vCPU) | 1 GiB | 1 GiB |
| OTEL Collector | 3 | 3 | 3000m (3 vCPU) | 3000m (3 vCPU) | 6 GiB | 6 GiB |
| Subtotal | 15 | 32 | 11,700m (~11.7 vCPU) | 24,600m (~24.6 vCPU) | ~21.7 GiB | ~46.2 GiB |
| + 25% headroom | | | ~14,625m (~14.6 vCPU) | ~30,750m (~30.8 vCPU) | ~27.1 GiB | ~57.8 GiB |

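Whether a profile's requests fit the user pool depends on each node's allocatable capacity, not the raw VM size. AKS reserves part of every node's CPU and memory for Kubernetes system daemons and eviction thresholds. The 50-pod-per-node limit also counts DaemonSet and system pods.

The sketch below gives a rough lower bound on the number of user pool nodes by taking the largest of the CPU-, memory-, and pod-count-driven node counts. It ignores bin-packing losses. The helper `nodes_needed` is hypothetical, and the allocatable fractions (90% CPU, 80% memory) are placeholder assumptions. Read the real allocatable values from `kubectl describe node`.

```python
import math


def nodes_needed(
    cpu_request_m: float,
    mem_request_gib: float,
    pods: int,
    *,
    node_vcpu: int = 8,           # Standard_D8as_v5
    node_mem_gib: int = 32,       # Standard_D8as_v5
    alloc_cpu_frac: float = 0.9,  # hypothetical share of vCPU left after reservations
    alloc_mem_frac: float = 0.8,  # hypothetical share of memory left after reservations
    max_pods_per_node: int = 50,  # hardcoded in modules/aks/main.tf
) -> int:
    """Lower bound on user pool nodes needed to fit the given requests."""
    cpu_nodes = cpu_request_m / (node_vcpu * 1000 * alloc_cpu_frac)
    mem_nodes = mem_request_gib / (node_mem_gib * alloc_mem_frac)
    pod_nodes = pods / max_pods_per_node
    return math.ceil(max(cpu_nodes, mem_nodes, pod_nodes))


if __name__ == "__main__":
    # Illustrative inputs: 30 vCPU and 60 GiB of requests across 32 pods
    needed = nodes_needed(30_000, 60, 32)
    print(needed)  # 5
    max_count = 12  # aks_user_pool_max_count for the Large profile
    print("fits" if needed <= max_count else "raise aks_user_pool_max_count")
```

Feed it the "+ 25% headroom" maximums of a profile and compare the result with that profile's `aks_user_pool_max_count`. Real scheduling usually needs at least one node more than this bound.
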
| Sizing Decision | Terraform Variable | File |
|---|---|---|
| User pool VM size | aks_user_pool_vm_size | infra.tfvars |
| User pool min/max nodes | aks_user_pool_min_count / aks_user_pool_max_count | infra.tfvars |
| System pool VM size | aks_system_pool_vm_size | infra.tfvars |
| System pool min/max nodes | aks_system_pool_min_count / aks_system_pool_max_count | infra.tfvars |
| OpenAI model capacity (TPM) | openai_gpt5_mini_capacity / openai_gpt5_nano_capacity | infra.tfvars |
| OpenAI deployment SKU | openai_gpt5_mini_sku_name / openai_gpt5_nano_sku_name | infra.tfvars |
| Max pods per node | None (hardcoded to 50) | modules/aks/main.tf |

| SKU Type | Examples | Billing | Use Case |
|---|---|---|---|
| Standard (pay-as-you-go) | DataZoneStandard, GlobalStandard | Pay per token | Development, variable workloads |
| Provisioned (PTU) | DataZoneProvisionedManaged, GlobalProvisionedManaged, ProvisionedManaged | Reserved throughput, billed hourly per PTU (monthly or yearly reservations available) | Production, predictable workloads |

```hcl
# Apply exactly one profile in infra.tfvars. Terraform rejects a variable that is
# assigned twice in the same file, so keep the unused profiles commented out.

# ── Small (up to 100 concurrent users) ───────────────
aks_user_pool_vm_size     = "Standard_D8as_v5"
aks_user_pool_min_count   = 3
aks_user_pool_max_count   = 6
openai_gpt5_mini_capacity = 3000
openai_gpt5_nano_capacity = 3000

# ── Medium (up to 300 concurrent users) ──────────────
# aks_user_pool_vm_size     = "Standard_D8as_v5"
# aks_user_pool_min_count   = 3
# aks_user_pool_max_count   = 10
# openai_gpt5_mini_capacity = 6000
# openai_gpt5_nano_capacity = 6000

# ── Large (up to 700 concurrent users) ───────────────
# aks_user_pool_vm_size     = "Standard_D8as_v5"
# aks_user_pool_min_count   = 3
# aks_user_pool_max_count   = 12
# openai_gpt5_mini_capacity = 12000
# openai_gpt5_nano_capacity = 12000
```

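A rough way to derive the OpenAI capacity values is to estimate peak tokens per minute from concurrency. The sketch below assumes that each active user sends `requests_per_user_per_min` requests of `tokens_per_request` tokens (prompt plus completion). Both are hypothetical inputs to replace with measured values. It also assumes that the `openai_gpt5_*_capacity` variables map to Azure OpenAI deployment capacity, where one unit of a Standard deployment equals 1,000 TPM. Provisioned (PTU) deployments are sized in PTUs, so this conversion does not apply to them. Both helper names are hypothetical.

```python
import math


def required_tpm(concurrent_users: int, requests_per_user_per_min: float,
                 tokens_per_request: int, burst_factor: float = 1.0) -> float:
    """Peak tokens per minute (prompt + completion) for a given concurrency."""
    return concurrent_users * requests_per_user_per_min * tokens_per_request * burst_factor


def capacity_units(tpm: float, tpm_per_unit: int = 1_000) -> int:
    """Round TPM up to whole capacity units (assumed 1 unit = 1,000 TPM)."""
    return math.ceil(tpm / tpm_per_unit)


if __name__ == "__main__":
    # Hypothetical workload: 300 users, 2 requests/min each, 3,000 tokens/request
    tpm = required_tpm(300, 2, 3_000)
    print(f"{tpm:,.0f} TPM -> capacity {capacity_units(tpm)}")
```

Run it separately for each model deployment (gpt5-mini and gpt5-nano), using the share of traffic each model serves. Confirm that the result stays within your subscription's regional quota.
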

| Topic | Official documentation link |
|---|---|
| Azure OpenAI Quotas & Limits | |
| AKS Service Quotas & Limits | |
| Azure Subscription & Service Limits (master list) | |
| Azure Cognitive Services Limits | |
| AKS Node Pool Constraints | |

| Topic | Official documentation link |
|---|---|
| PTU Overview & Sizing | |
| PTU Calculator (Capacity Planning) | |
| Understanding PTU Allocation | |
| PTU Getting Started Guide | |
| Monitor PTU Utilization | |