Infrastructure Sizing Guide
This guide provides production-grade sizing guidance for deploying Codebeamer AI on AKS, including:
The primary application workload, cb-ai-service
The full observability stack: the OpenTelemetry Collector plus observability and monitoring tools such as Prometheus and Grafana
AKS system components
Azure OpenAI model deployment capacity
Note: Tool names such as Prometheus, Alertmanager, and Grafana are used here as examples only. Any equivalent observability stack, whether managed or self-hosted, can be used. Sizing figures assume a representative stack for capacity planning. Adjust values based on your selected platform, such as Azure Monitor or Datadog.
Scope & Assumptions
Workload types included
| Workload | Type | Pool |
| --- | --- | --- |
| cb-ai-service | Application (Python) | User pool |
| Observability and monitoring tools | Observability (traces & telemetry pipeline); monitoring (metrics collection & alerting) | User pool |
| OTEL Collector | Observability (traces & telemetry pipeline) | User pool |
| CoreDNS, kube-proxy, metrics-server | Kubernetes system components | System pool |
| Azure Policy, OMS Agent, Defender, CSI Driver | Azure security and compliance agents | System pool |
Sizing Inputs Checklist
Before selecting a sizing profile, gather the following information.
Traffic and concurrency
Expected peak concurrent users
Peak-to-average ratio (typical: 2–3×)
Availability
Target uptime SLA (99.9% recommended)
Maintenance and upgrade window preferences
AKS Architecture Baseline
Node Pools
| Pool | Purpose | VM Family | Autoscaling |
| --- | --- | --- | --- |
| System pool | Kubernetes internals and Azure agents | Dasv5 series (AMD, cost-optimized) | Yes (2–3 nodes) |
| User pool | Application and observability workloads | Dasv5 series (AMD, cost-optimized) | Yes (3–12 nodes) |
Why two pools?
The system pool runs AKS-managed pods (CoreDNS, Defender, and so on) and is isolated from application load.
The user pool scales with application demand. The cluster autoscaler adds or removes nodes as traffic changes.
The CriticalAddonsOnly taint prevents application pods from running on system nodes.
Sizing profiles
| Profile | Concurrent users |
| --- | --- |
| Small (S) | Up to 100 |
| Medium (M) | 100–300 |
| Large (L) | 300–700 |
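As a rough illustration of how the checklist inputs feed into profile selection, the sketch below derives peak concurrency from an average and a peak-to-average ratio, then picks the smallest profile that covers it. The function names are hypothetical, and assigning the boundary values (100 and 300) to the smaller profile is an assumption, since the ranges in the table overlap at those points.

```python
import math

# Upper bounds on concurrent users per profile, taken from the table above.
PROFILE_LIMITS = [("Small", 100), ("Medium", 300), ("Large", 700)]


def peak_concurrent_users(average_users: float, peak_to_average: float) -> int:
    """Estimate peak concurrency from an average and a peak-to-average ratio."""
    return math.ceil(average_users * peak_to_average)


def select_profile(peak_users: int) -> str:
    """Return the smallest profile whose upper bound covers peak_users."""
    for name, limit in PROFILE_LIMITS:
        if peak_users <= limit:
            return name
    raise ValueError(f"{peak_users} concurrent users exceeds the Large profile (700)")


# Example: 60 average users with a 2.5x peak-to-average ratio.
peak = peak_concurrent_users(average_users=60, peak_to_average=2.5)
print(peak, select_profile(peak))  # 150 Medium
```

Workloads above 700 concurrent users fall outside the profiles in this guide, so the sketch raises an error rather than guessing.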
Profile summary
| Parameter | Small | Medium | Large |
| --- | --- | --- | --- |
| User Pool VM | Standard_D8as_v5 (8 vCPU / 32 GiB) | Standard_D8as_v5 (8 vCPU / 32 GiB) | Standard_D8as_v5 (8 vCPU / 32 GiB) |
| User Pool Min / Max Nodes | 3 / 6 | 3 / 10 | 3 / 12 |
| System Pool VM | Standard_D2as_v5 (2 vCPU / 8 GiB) | Standard_D2as_v5 (2 vCPU / 8 GiB) | Standard_D2as_v5 (2 vCPU / 8 GiB) |
| System Pool Min / Max Nodes | 2 / 3 | 2 / 3 | 2 / 3 |
| Total App Pod Replicas | 2–10 | 2–10 | 2–10 |
| Prometheus (or similar tool) Replicas | 1 | 2 (HA) | 2 (HA) |
| Grafana (or similar tool) Replicas | 1 | 1 | 2 (HA) |
| OTEL Collector Replicas | 1 | 2 | 3 |
Workload resource sizing
cb-ai-service (Application)
| Parameter | Small | Medium | Large |
| --- | --- | --- | --- |
| CPU Request | 500m (0.5 vCPU) | 1000m (1 vCPU) | 1500m (1.5 vCPU) |
| Memory Request | 1 GiB | 1.5 GiB | 3 GiB |
| HPA Target CPU | 70% | 70% | 70% |
| HPA Min / Max Replicas | 2 / 10 | 2 / 10 | 2 / 10 |
Prometheus server
Note: Prometheus is referenced as an example metrics solution. Any equivalent metrics and alerting solution (managed or self-hosted) can be used. Size CPU, memory, and PVC based on your retention and series count.
| Parameter | Small | Medium | Large |
| --- | --- | --- | --- |
| Replicas | 1 | 2 (HA) | 2 (HA) |
| CPU Request | 500m (0.5 vCPU) | 1000m (1 vCPU) | 2000m (2 vCPU) |
| Memory Request | 1 GiB | 2 GiB | 4 GiB |
Storage (PVC) and retention
PVC size and retention period depend on the metrics volume and compliance requirements of the deployment.
Estimation Guide
| Concurrent Users | Active Series | Ingestion Rate | 7-day PVC | 15-day PVC | 30-day PVC |
| --- | --- | --- | --- | --- | --- |
| ~100 (Small) | 20–50k | ~500 samples/sec | ~20 GiB | ~40 GiB | ~80 GiB |
| ~300 (Medium) | 50–100k | ~1,500 samples/sec | ~50 GiB | ~100 GiB | ~200 GiB |
| ~700 (Large) | 100–200k | ~3,000 samples/sec | ~100 GiB | ~200 GiB | ~400 GiB |
Recommendations:
Set retention based on your incident response SLA. 7 days is sufficient for most development and test environments. Production typically needs 15–30 days.
Always provision the PVC 20% larger than the calculated estimate to account for label cardinality spikes.
For retention beyond 30 days, use remote storage solutions such as Azure Monitor managed Prometheus, Thanos, or Cortex.
Monitor Prometheus disk usage with prometheus_tsdb_storage_size_bytes and set alerts at 80% PVC utilization.
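The 20% margin and the 80% alert threshold from the recommendations above can be combined into one small calculation. The following is a minimal sketch; the function name and the choice to round the provisioned size up to a whole GiB are assumptions made for illustration.

```python
import math

BYTES_PER_GIB = 1024 ** 3


def pvc_plan(estimated_gib: float, margin: float = 0.20, alert_ratio: float = 0.80) -> dict:
    """Apply the sizing margin to an estimate and derive the disk-usage alert threshold."""
    # Round before ceil so floating-point noise cannot push the result up by 1 GiB.
    provisioned_gib = math.ceil(round(estimated_gib * (1 + margin), 6))
    return {
        "provisioned_gib": provisioned_gib,
        # Value to compare against prometheus_tsdb_storage_size_bytes.
        "alert_threshold_bytes": round(provisioned_gib * alert_ratio * BYTES_PER_GIB),
    }


# Example: Medium profile with 15-day retention (~100 GiB estimate).
plan = pvc_plan(100)
print(plan["provisioned_gib"])  # 120
```

For the Medium 15-day estimate, this yields a 120 GiB PVC with an alert once stored data reaches 96 GiB.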
Grafana
| Parameter | Small | Medium | Large |
| --- | --- | --- | --- |
| Replicas | 1 | 1 | 2 (HA) |
| CPU Request | 100m (0.1 vCPU) | 250m (0.25 vCPU) | 500m (0.5 vCPU) |
| Memory Request | 128 Mi | 256 Mi | 512 Mi |
| PVC | 5 GiB | 10 GiB | 10 GiB |
Grafana is lightweight during idle operation but may experience CPU spikes during concurrent dashboard rendering. For high availability at large scale, use an external PostgreSQL database instead of SQLite-backed persistent volumes.
OpenTelemetry Collector
| Parameter | Small | Medium | Large |
| --- | --- | --- | --- |
| Mode | Deployment (gateway) | Deployment (gateway) | Deployment (gateway) |
| Replicas | 1 | 2 | 3 |
| CPU Request | 250m (0.25 vCPU) | 500m (0.5 vCPU) | 1000m (1 vCPU) |
| Memory Request | 512 Mi | 1 GiB | 2 GiB |
| PVC | 5 GiB | 5 GiB | 10 GiB |
Monitoring Add-ons
| Component | CPU Request | Memory Request | Type |
| --- | --- | --- | --- |
| Node Exporter | 50–100m per node | 30–64 Mi per node | DaemonSet (runs on every user pool node) |
| kube-state-metrics | 50–200m | 64–256 Mi | Single Deployment |
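Because Node Exporter runs as a DaemonSet, its total request grows with the node count. The sketch below multiplies a per-node request by the number of user pool nodes; the function name is hypothetical, and the 50m / 30 Mi inputs are the lower end of the ranges in the table above.

```python
def daemonset_request(per_node_cpu_m: int, per_node_mem_mi: int, nodes: int) -> tuple:
    """Total CPU (millicores) and memory (Mi) requested by a DaemonSet across nodes."""
    return per_node_cpu_m * nodes, per_node_mem_mi * nodes


# Lower-bound per-node requests on 3 and 5 user pool nodes.
print(daemonset_request(50, 30, 3))  # (150, 90)
print(daemonset_request(50, 30, 5))  # (250, 150)
```

These results match the Node Exporter row of the Small capacity table later in this guide.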
AKS system components
These run on the system pool and are managed by Azure. The system pool headroom for Standard_D2as_v5 (2 vCPU / 8 GiB) is as follows:
Kubelet/OS reservation: ~0.3 vCPU / 1 GiB
System pods: ~1.2 vCPU / 2.5 GiB
Remaining: ~0.5 vCPU / 4.5 GiB per node
A two-node minimum provides sufficient capacity.
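The headroom figures above follow from simple subtraction. The sketch below reproduces them in millicores and MiB so the arithmetic stays exact; the constant and function names are illustrative only.

```python
# Standard_D2as_v5 capacity and the approximate reservations listed above.
NODE_CPU_M, NODE_MEM_MIB = 2000, 8 * 1024
RESERVED_CPU_M, RESERVED_MEM_MIB = 300, 1024        # kubelet/OS reservation
SYSTEM_PODS_CPU_M, SYSTEM_PODS_MEM_MIB = 1200, 2560  # system pods (2.5 GiB)


def remaining_per_node() -> tuple:
    """Return (CPU millicores, memory MiB) left on one system pool node."""
    cpu = NODE_CPU_M - RESERVED_CPU_M - SYSTEM_PODS_CPU_M
    mem = NODE_MEM_MIB - RESERVED_MEM_MIB - SYSTEM_PODS_MEM_MIB
    return cpu, mem


free_cpu_m, free_mem_mib = remaining_per_node()
print(free_cpu_m, free_mem_mib / 1024)  # 500 4.5
```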
Total capacity calculations
Small (~100 concurrent users)
| Workload | Min Pods | Max Pods | CPU Request (Min) | CPU Request (Max) | Memory Request (Min) | Memory Request (Max) |
| --- | --- | --- | --- | --- | --- | --- |
| cb-ai-service | 2 | 10 | 1000m (1 vCPU) | 5000m (5 vCPU) | 2 GiB | 10 GiB |
| Prometheus | 1 | 1 | 500m | 500m | 1 GiB | 1 GiB |
| Alertmanager | 1 | 1 | 50m | 50m | 64 MiB | 64 MiB |
| Node Exporter (DaemonSet) | 3 | 5 | 150m | 250m | 90 MiB | 150 MiB |
| kube-state-metrics | 1 | 1 | 50m | 50m | 64 MiB | 64 MiB |
| Grafana | 1 | 1 | 100m | 100m | 128 MiB | 128 MiB |
| OTEL Collector | 2 | 2 | 500m | 500m | 1 GiB | 1 GiB |
| Subtotal | 11 | 21 | 2,350m (~2.4 vCPU) | 6,450m (~6.5 vCPU) | ~4.3 GiB | ~12.4 GiB |
| + 25% headroom | | | ~3,000m (~3 vCPU) | ~8,060m (~8.1 vCPU) | ~5.4 GiB | ~15.5 GiB |
Nodes: 1–2 at baseline, 2–3 at peak (Standard_D8as_v5: 8 vCPU, 32 GiB)
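The subtotal and headroom rows can be reproduced by summing the per-workload requests and adding 25%. The sketch below does this for the Small profile, using the row values from the table above; the data layout and function name are illustrative, not part of the deployment tooling.

```python
# (workload, min CPU m, max CPU m, min memory MiB, max memory MiB), Small profile.
SMALL_ROWS = [
    ("cb-ai-service", 1000, 5000, 2 * 1024, 10 * 1024),
    ("Prometheus", 500, 500, 1024, 1024),
    ("Alertmanager", 50, 50, 64, 64),
    ("Node Exporter", 150, 250, 90, 150),
    ("kube-state-metrics", 50, 50, 64, 64),
    ("Grafana", 100, 100, 128, 128),
    ("OTEL Collector", 500, 500, 1024, 1024),
]
HEADROOM = 0.25


def capacity(rows, headroom=HEADROOM):
    """Sum each request column, then return (subtotals, subtotals plus headroom)."""
    subtotals = [sum(row[i] for row in rows) for i in range(1, 5)]
    with_headroom = [value * (1 + headroom) for value in subtotals]
    return subtotals, with_headroom


subtotals, padded = capacity(SMALL_ROWS)
print(subtotals[:2])                              # [2350, 6450] millicores
print([round(m / 1024, 1) for m in subtotals[2:]])  # [4.3, 12.4] GiB
print([round(m / 1024, 1) for m in padded[2:]])     # [5.4, 15.5] GiB
```

The padded CPU values come out at 2,937.5m and 8,062.5m, which the table rounds to ~3,000m and ~8,060m.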
Medium (~300 concurrent users)
| Workload | Min Pods | Max Pods | CPU Request (Min) | CPU Request (Max) | Memory Request (Min) | Memory Request (Max) |
| --- | --- | --- | --- | --- | --- | --- |
| cb-ai-service | 2 | 10 | 1000m (1 vCPU) | 5000m (5 vCPU) | 2 GiB | 10 GiB |
| Prometheus | 2 (HA) | 2 (HA) | 2000m (2 vCPU) | 2000m (2 vCPU) | 4 GiB | 4 GiB |
| Alertmanager | 2 (HA) | 2 (HA) | 100m | 100m | 128 MiB | 128 MiB |
| Node Exporter (DaemonSet) | 4 | 8 | 400m | 800m | 240 MiB | 480 MiB |
| kube-state-metrics | 1 | 1 | 100m | 100m | 128 MiB | 128 MiB |
| Grafana | 1 | 1 | 200m | 200m | 256 MiB | 256 MiB |
| OTEL Collector | 2 | 2 | 1000m (1 vCPU) | 1000m (1 vCPU) | 2 GiB | 2 GiB |
| Subtotal | 14 | 26 | 4,800m (~4.8 vCPU) | 9,200m (~9.2 vCPU) | ~8.7 GiB | ~17 GiB |
| + 25% headroom | | | ~6,000m (~6 vCPU) | ~11,500m (~11.5 vCPU) | ~10.9 GiB | ~21.2 GiB |
Nodes: 2 at baseline, 3–4 at peak (Standard_D8as_v5: 8 vCPU, 32 GiB)
Large (~700 concurrent users)
| Workload | Min Pods | Max Pods | CPU Request (Min) | CPU Request (Max) | Memory Request (Min) | Memory Request (Max) |
| --- | --- | --- | --- | --- | --- | --- |
| cb-ai-service | 2 | 10 | 1000m (1 vCPU) | 5000m (5 vCPU) | 2 GiB | 10 GiB |
| Prometheus | 2 (HA) | 2 (HA) | 4000m (4 vCPU) | 4000m (4 vCPU) | 8 GiB | 8 GiB |
| Alertmanager | 2 (HA) | 2 (HA) | 200m | 200m | 256 MiB | 256 MiB |
| Node Exporter (DaemonSet) | 6 | 12 | 600m | 1200m | 360 MiB | 720 MiB |
| kube-state-metrics | 1 | 1 | 200m | 200m | 256 MiB | 256 MiB |
| Grafana | 2 | 2 | 500m | 500m | 512 MiB | 512 MiB |
| OTEL Collector | 3 | 3 | 3000m (3 vCPU) | 3000m (3 vCPU) | 6 GiB | 6 GiB |
| Subtotal | 18 | 32 | 9,500m (~9.5 vCPU) | 14,100m (~14.1 vCPU) | ~17.3 GiB | ~25.7 GiB |
| + 25% headroom | | | ~11,875m (~11.9 vCPU) | ~17,625m (~17.6 vCPU) | ~21.7 GiB | ~32.1 GiB |
Nodes: 3 at baseline, 6–8 at peak (Standard_D8as_v5: 8 vCPU, 32 GiB)
Autoscaling strategy
The autoscaling approach keeps the application responsive and efficient without manual tuning.
The Cluster Autoscaler adjusts the node count as scheduling demand changes: pending pods get capacity, and idle nodes are removed.
The Horizontal Pod Autoscaler (HPA) adjusts pod replicas based on real-time load to keep the service responsive.
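To make the HPA behavior concrete, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas × current metric / target metric), and clamps the result to the 2 / 10 replica bounds used in this guide. It is a simplification: the controller's tolerance band and stabilization windows are omitted, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Simplified HPA rule: scale proportionally to load, within min/max bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 95))    # 6  (scale out: 4 * 95 / 70 = 5.43, rounded up)
print(desired_replicas(4, 20))    # 2  (scale in, held at the minimum)
print(desired_replicas(10, 100))  # 10 (capped at the maximum)
```

When the HPA requests more replicas than the current nodes can schedule, the resulting pending pods are what prompt the Cluster Autoscaler to add nodes.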
Implementation mapping (Terraform)
Which Variables Map to Sizing Decisions
| Sizing Decision | Terraform Variable | File |
| --- | --- | --- |
| User pool VM size | aks_user_pool_vm_size | infra.tfvars |
| User pool min/max nodes | aks_user_pool_min_count / aks_user_pool_max_count | infra.tfvars |
| System pool VM size | aks_system_pool_vm_size | infra.tfvars |
| System pool min/max nodes | aks_system_pool_min_count / aks_system_pool_max_count | infra.tfvars |
| OpenAI model capacity (TPM) | openai_gpt5_mini_capacity / openai_gpt5_nano_capacity | infra.tfvars |
| OpenAI deployment SKU | openai_gpt5_mini_sku_name / openai_gpt5_nano_sku_name | infra.tfvars |
| Max pods per node | Hardcoded to 50 | modules/aks/main.tf |
SKU types
| SKU Type | Examples | Billing | Use Case |
| --- | --- | --- | --- |
| Pay-as-you-go services | DataZoneStandard, GlobalStandard | Pay per token | Development, variable workloads |
| PTU | DataZoneProvisionedManaged, GlobalProvisionedManaged, ProvisionedManaged | Reserved capacity | Production, predictable workloads |
Example Parameter Sets
# ── Small (100 concurrent users) ─────────────────────
aks_user_pool_vm_size = "Standard_D8as_v5"
aks_user_pool_min_count = 3
aks_user_pool_max_count = 6
openai_gpt5_mini_capacity = 3000
openai_gpt5_nano_capacity = 3000
# ── Medium (300 concurrent users) ────────────────────
aks_user_pool_vm_size = "Standard_D8as_v5"
aks_user_pool_min_count = 3
aks_user_pool_max_count = 10
openai_gpt5_mini_capacity = 6000
openai_gpt5_nano_capacity = 6000
# ── Large (700 concurrent users) ─────────────────────
aks_user_pool_vm_size = "Standard_D8as_v5"
aks_user_pool_min_count = 3
aks_user_pool_max_count = 12
openai_gpt5_mini_capacity = 12000
openai_gpt5_nano_capacity = 12000
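Since the three parameter sets differ only in the user pool maximum and the OpenAI capacity values, they can be generated from a small lookup. The sketch below is one hypothetical way to emit the lines for a chosen profile; the helper and dictionary keys are illustrative, and the values are copied from the sets above.

```python
# Per-profile values copied from the example parameter sets above.
PROFILES = {
    "small": {"max_count": 6, "openai_capacity": 3000},
    "medium": {"max_count": 10, "openai_capacity": 6000},
    "large": {"max_count": 12, "openai_capacity": 12000},
}


def render_tfvars(profile: str) -> str:
    """Return the infra.tfvars lines for the given sizing profile."""
    values = PROFILES[profile.lower()]
    lines = [
        'aks_user_pool_vm_size = "Standard_D8as_v5"',
        "aks_user_pool_min_count = 3",
        f"aks_user_pool_max_count = {values['max_count']}",
        f"openai_gpt5_mini_capacity = {values['openai_capacity']}",
        f"openai_gpt5_nano_capacity = {values['openai_capacity']}",
    ]
    return "\n".join(lines)


print(render_tfvars("medium"))
```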
Recommendation
Start by estimating user traffic, availability, and observability requirements. Then select the small, medium, or large profile and set the corresponding Terraform variables, for example aks_user_pool_vm_size and aks_user_pool_min_count / aks_user_pool_max_count. Rather than choosing ad hoc values, select the profile that aligns with your needs.
Start with the small profile (100 concurrent users). It provides:
Production-grade reliability—Three nodes, High Availability for critical components, 25% headroom.
Growth path—Scale to medium or large by changing the required Terraform variables.
The cluster scales automatically within your chosen minimum and maximum limits. Start with a profile that fits your needs and adjust min_count and max_count over time. Redeployment is not required.
Restrictions and boundaries of configurable settings
How to calculate PTU based on token consumption
For information on token usage for Codebeamer AI, see Customer-Hosted Deployment Codebeamer AI Token Usage.
PTU calculation summary from Microsoft Docs
1. Gather your workload metrics.
Average input tokens per request
Average output tokens per request
Requests per minute (RPM) during peak usage
2. Use the Azure PTU Calculator.
Select your model (for example, gpt-5-mini).
Enter your workload characteristics.
The calculator provides the recommended PTU count.
3. Validate with monitoring.
After deployment, monitor the ProvisionedManagedUtilizationV2 metric.
If utilization is consistently above 80%, increase PTU or enable spillover.
If utilization is below 30%, consider reducing PTU or switching to Pay-As-You-Go services.
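The validation thresholds in step 3 can be expressed as a simple check over a window of ProvisionedManagedUtilizationV2 readings. The sketch below is illustrative only: the function name is hypothetical, and treating "consistently" as "every sample in the window" is an assumption you may want to relax, for example to a percentile.

```python
def ptu_recommendation(utilization_pct: list, high: float = 80.0, low: float = 30.0) -> str:
    """Classify a window of PTU utilization samples (percent) against the thresholds."""
    if not utilization_pct:
        raise ValueError("at least one utilization sample is required")
    if all(sample > high for sample in utilization_pct):
        return "increase PTU or enable spillover"
    if all(sample < low for sample in utilization_pct):
        return "consider reducing PTU or switching to pay-as-you-go"
    return "keep current PTU"


print(ptu_recommendation([85, 91, 88]))  # increase PTU or enable spillover
print(ptu_recommendation([12, 25, 18]))  # consider reducing PTU or switching to pay-as-you-go
print(ptu_recommendation([45, 82, 60]))  # keep current PTU
```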
Key excerpt from Microsoft: "Provisioned throughput units (PTU) are generic units of model processing capacity that you use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions." For more information, refer to What is provisioned throughput for Foundry Models?.