Infrastructure Sizing Guide

This guide provides production-grade sizing guidance for deploying Codebeamer AI on AKS, including:

• The primary application workload, cb-ai-service

• The full observability stack: the OpenTelemetry Collector plus observability and monitoring tools such as Prometheus and Grafana

• AKS system components

• Azure OpenAI model deployment capacity

Tool names such as Prometheus, Alertmanager, and Grafana are used here as examples only. Any equivalent observability stack, whether managed or self-hosted, can be used. The sizing figures assume a representative stack for capacity planning. Adjust the values for your selected platform, such as Azure Monitor or Datadog.

Scope & Assumptions

Workload types included

Workload	Type	Pool
cb-ai-service	Application (Python)	User pool
Observability and monitoring tools	Monitoring (metrics collection & alerting)	User pool
OTEL Collector	Observability (traces & telemetry pipeline)	User pool
CoreDNS, kube-proxy, metrics-server	Kubernetes system components	System pool
Azure Policy, OMS Agent, Defender, CSI Driver	Azure security and compliance agents	System pool

Sizing Inputs Checklist

Before selecting a sizing profile, gather the following information.

Traffic and concurrency

◦ Expected peak concurrent users

◦ Peak-to-average ratio (typical: 2–3×)

Availability

◦ Target uptime SLA (99.9% recommended)

◦ Maintenance and upgrade window preferences

AKS Architecture Baseline

Node Pools

Pool	Purpose	VM Family	Autoscaling
System pool	Kubernetes internals and Azure agents	Dasv5 series (AMD, cost-optimized)	Yes (2–3 nodes)
User pool	Application and observability workloads	Dasv5 series (AMD, cost-optimized)	Yes (3–12 nodes)

Why two pools?

◦ System pool runs AKS-managed pods (CoreDNS, Defender, etc.), and is isolated from application load.

◦ User pool scales with application demand. The cluster autoscaler adds or removes nodes as traffic changes.

◦ The CriticalAddonsOnly taint prevents application pods from running on system nodes.

Sizing profiles

Profile	Concurrent users
Small (S)	Up to 100
Medium (M)	100–300
Large (L)	300–700
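
The mapping from the checklist inputs to a profile can be sketched as follows. This is a minimal illustration, not part of the product: the function name `select_profile` is hypothetical, and treating exactly 100 users as Small (the table ranges overlap at the boundaries) is an assumption.

```python
import math

# Upper bounds on peak concurrent users, taken from the sizing profiles table.
PROFILE_LIMITS = [("Small", 100), ("Medium", 300), ("Large", 700)]


def select_profile(average_users: int, peak_to_average: float = 1.0) -> str:
    """Return the smallest profile whose limit covers the estimated peak.

    Hypothetical helper: boundary values (100, 300) are assigned to the
    smaller profile, which is an assumption rather than documented behavior.
    """
    peak_users = math.ceil(average_users * peak_to_average)
    for name, limit in PROFILE_LIMITS:
        if peak_users <= limit:
            return name
    raise ValueError(f"{peak_users} peak users exceeds the Large profile (700)")


# 90 average users with a 2.5x peak-to-average ratio gives 225 peak users.
print(select_profile(90, 2.5))
```

If the estimated peak sits close to a profile boundary, choosing the larger profile leaves more room for growth.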

Profile summary

Parameter	Small	Medium	Large
User Pool VM	Standard_D8as_v5 (8 vCPU / 32 GiB)	Standard_D8as_v5 (8 vCPU / 32 GiB)	Standard_D8as_v5 (8 vCPU / 32 GiB)
User Pool Min / Max Nodes	3 / 6	3 / 10	3 / 12
System Pool VM	Standard_D2as_v5 (2 vCPU / 8 GiB)	Standard_D2as_v5 (2 vCPU / 8 GiB)	Standard_D2as_v5 (2 vCPU / 8 GiB)
System Pool Min / Max Nodes	2 / 3	2 / 3	2 / 3
Total App Pod Replicas	2–10	2–10	2–10
Prometheus (or similar tool) Replicas	1	2 (HA)	2 (HA)
Grafana (or similar tool) Replicas	1	1	2 (HA)
OTEL Collector Replicas	1	2	3

Workload resource sizing

cb-ai-service (Application)

	Small	Medium	Large
CPU Request	500m (0.5 vCPU)	1000m (1 vCPU)	1500m (1.5 vCPU)
Memory Request	1 GiB	1.5 GiB	3 GiB
HPA Target CPU	70%	70%	70%
HPA Min / Max Replicas	2 / 10	2 / 10	2 / 10
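
To see how the 70% CPU target and the 2 / 10 replica bounds interact, the sketch below applies the standard Kubernetes HPA scaling rule, desired = ceil(current replicas × current utilization / target utilization), clamped to the configured bounds. The function name is hypothetical, and the sketch ignores the HPA tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Apply the HPA rule and clamp the result to the min/max bounds.

    Simplified: the real controller also applies a tolerance and
    scale-down stabilization, which are not modeled here.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# Average CPU is measured relative to the pod CPU request.
print(desired_replicas(2, 105))   # load above target: scale out
print(desired_replicas(4, 35))    # load below target: scale in, not below 2
print(desired_replicas(8, 140))   # demand above the ceiling: capped at 10
```

Because utilization is measured against the CPU request, raising the request for a profile also raises the absolute CPU level at which scale-out begins.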

Prometheus server

Prometheus is referenced as an example metrics solution. Any equivalent metrics and alerting solution (managed or self-hosted) can be used. Size CPU, memory, and PVC based on your retention period and series count.

	Small	Medium	Large
Replicas	1	2 (HA)	2 (HA)
CPU Request	500m (0.5 vCPU)	1000m (1 vCPU)	2000m (2 vCPU)
Memory Request	1 GiB	2 GiB	4 GiB

Storage (PVC) and retention

PVC size and retention period depend on the metrics volume and compliance requirements of the deployment.

Estimation Guide

Concurrent Users	Active Series	Ingestion Rate	7-day PVC	15-day PVC	30-day PVC
~100 (Small)	20–50k	~500 samples/sec	~20 GiB	~40 GiB	~80 GiB
~300 (Medium)	50–100k	~1,500 samples/sec	~50 GiB	~100 GiB	~200 GiB
~700 (Large)	100–200k	~3,000 samples/sec	~100 GiB	~200 GiB	~400 GiB

Recommendations:

◦ Set retention based on your incident response SLA. 7 days is sufficient for most development and test environments. Production typically needs 15–30 days.

◦ Always provision the PVC 20% larger than the calculated estimate to account for label cardinality spikes.

◦ For retention beyond 30 days, use remote storage solutions such as Azure Monitor managed Prometheus, Thanos, or Cortex.

◦ Monitor Prometheus disk usage with prometheus_tsdb_storage_size_bytes and set alerts at 80% PVC utilization.
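
The 20% buffer and the 80% alert threshold from the recommendations above can be combined into one small calculation. This is a sketch only; `pvc_plan` is a hypothetical helper, and the input is an estimate taken from the table above.

```python
import math

GIB = 2**30  # bytes per GiB


def pvc_plan(estimated_gib: int, buffer_pct: int = 20, alert_pct: int = 80) -> dict:
    """Return the PVC size to provision and the alert threshold in bytes.

    The threshold is expressed in bytes so it can be compared with
    prometheus_tsdb_storage_size_bytes.
    """
    provisioned_gib = math.ceil(estimated_gib * (100 + buffer_pct) / 100)
    alert_at_bytes = provisioned_gib * GIB * alert_pct // 100
    return {"provisioned_gib": provisioned_gib, "alert_at_bytes": alert_at_bytes}


# Medium profile with 15-day retention: ~100 GiB estimated.
print(pvc_plan(100))
```

For the 100 GiB estimate this yields a 120 GiB PVC with an alert at 96 GiB of TSDB usage.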

Grafana

	Small	Medium	Large
Replicas	1	1	2 (HA)
CPU Request	100m (0.1 vCPU)	250m (0.25 vCPU)	500m (0.5 vCPU)
Memory Request	128 Mi	256 Mi	512 Mi
PVC	5 GiB	10 GiB	10 GiB

Grafana is lightweight during idle operation but may experience CPU spikes during concurrent dashboard rendering. For high availability at large scale, use an external PostgreSQL database instead of SQLite-backed persistent volumes.

OpenTelemetry Collector

	Small	Medium	Large
Mode	Deployment (gateway)	Deployment (gateway)	Deployment (gateway)
Replicas	1	2	3
CPU Request	250m (0.25 vCPU)	500m (0.5 vCPU)	1000m (1 vCPU)
Memory Request	512 Mi	1 GiB	2 GiB
PVC	5 GiB	5 GiB	10 GiB

Monitoring Add-ons

Component	CPU Request	Memory Request	Type
Node Exporter	50–100m per node	30–64 Mi per node	DaemonSet (runs on every user pool node)
kube-state-metrics	50–200m	64–256 Mi	Single Deployment

AKS system components

These run on the system pool and are managed by Azure. The capacity of each Standard_D2as_v5 system pool node (2 vCPU / 8 GiB) breaks down as follows:

• Kubelet/OS reservation: ~0.3 vCPU / 1 GiB

• System pods: ~1.2 vCPU / 2.5 GiB

• Remaining: ~0.5 vCPU / 4.5 GiB per node

• The 2-node minimum provides sufficient capacity.
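
The per-node headroom figures follow from simple subtraction. The sketch below reproduces them in millicores and MiB to avoid rounding; the approximate reservation values are the ones listed above, and the helper name is hypothetical.

```python
def system_node_headroom(total_cpu_m: int = 2000, total_mem_mib: int = 8 * 1024,
                         reserved_cpu_m: int = 300, reserved_mem_mib: int = 1024,
                         pods_cpu_m: int = 1200, pods_mem_mib: int = 2560) -> tuple:
    """Return (remaining CPU in millicores, remaining memory in MiB) per node."""
    cpu_left = total_cpu_m - reserved_cpu_m - pods_cpu_m
    mem_left = total_mem_mib - reserved_mem_mib - pods_mem_mib
    return cpu_left, mem_left


cpu_m, mem_mib = system_node_headroom()
# Per node, then across the 2-node minimum system pool.
print(cpu_m, mem_mib / 1024)
print(2 * cpu_m, 2 * mem_mib / 1024)
```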

Total capacity calculations

Small (~100 concurrent users)

Workload	Min Pods	Max Pods	CPU Request (Min)	CPU Request (Max)	Memory Request (Min)	Memory Request (Max)
cb-ai-service	2	10	1000m (1 vCPU)	5000m (5 vCPU)	2 GiB	10 GiB
Prometheus	1	1	500m	500m	1 GiB	1 GiB
Alertmanager	1	1	50m	50m	64 MiB	64 MiB
Node Exporter (DaemonSet)	3	5	150m	250m	90 MiB	150 MiB
kube-state-metrics	1	1	50m	50m	64 MiB	64 MiB
Grafana	1	1	100m	100m	128 MiB	128 MiB
OTEL Collector	2	2	500m	500m	1 GiB	1 GiB
Subtotal	11	21	2,350m (~2.4 vCPU)	6,450m (~6.5 vCPU)	~4.3 GiB	~12.4 GiB
+ 25% headroom			~3,000m (~3 vCPU)	~8,060m (~8.1 vCPU)	~5.4 GiB	~15.5 GiB

Nodes: 1–2 at baseline, 2–3 at peak (Standard_D8as_v5: 8 vCPU, 32 GiB)
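
The subtotal and headroom rows can be reproduced from the per-workload requests. The sketch below uses the Small profile figures from the table above; the node count it returns is a lower bound that assumes the full 8 vCPU / 32 GiB of each node is schedulable, which is an assumption (kubelet and OS reservations reduce it in practice).

```python
import math

# (workload, CPU request in millicores, memory request in MiB), Small profile.
SMALL_MIN = [
    ("cb-ai-service", 1000, 2 * 1024), ("Prometheus", 500, 1024),
    ("Alertmanager", 50, 64), ("Node Exporter", 150, 90),
    ("kube-state-metrics", 50, 64), ("Grafana", 100, 128),
    ("OTEL Collector", 500, 1024),
]
SMALL_MAX = [
    ("cb-ai-service", 5000, 10 * 1024), ("Prometheus", 500, 1024),
    ("Alertmanager", 50, 64), ("Node Exporter", 250, 150),
    ("kube-state-metrics", 50, 64), ("Grafana", 100, 128),
    ("OTEL Collector", 500, 1024),
]


def totals(rows, headroom_pct: int = 25) -> dict:
    """Sum the requests and add the headroom percentage."""
    cpu_m = sum(row[1] for row in rows)
    mem_mib = sum(row[2] for row in rows)
    factor = (100 + headroom_pct) / 100
    return {"cpu_m": cpu_m, "mem_mib": mem_mib,
            "cpu_m_headroom": cpu_m * factor, "mem_mib_headroom": mem_mib * factor}


def min_nodes(cpu_m: float, mem_mib: float,
              node_cpu_m: int = 8000, node_mem_mib: int = 32 * 1024) -> int:
    """Lower bound on node count; ignores per-node reservations."""
    return max(math.ceil(cpu_m / node_cpu_m), math.ceil(mem_mib / node_mem_mib))


for label, rows in (("baseline", SMALL_MIN), ("peak", SMALL_MAX)):
    t = totals(rows)
    print(label, t, min_nodes(t["cpu_m_headroom"], t["mem_mib_headroom"]))
```

The lower bounds (1 node at baseline, 2 at peak) correspond to the low end of the node ranges stated above; the upper end leaves room for reservations and uneven pod placement.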

Medium (~300 concurrent users)

Workload	Min Pods	Max Pods	CPU Request (Min)	CPU Request (Max)	Memory Request (Min)	Memory Request (Max)
cb-ai-service	2	10	1000m (1 vCPU)	5000m (5 vCPU)	2 GiB	10 GiB
Prometheus	2 (HA)	2 (HA)	2000m (2 vCPU)	2000m (2 vCPU)	4 GiB	4 GiB
Alertmanager	2 (HA)	2 (HA)	100m	100m	128 MiB	128 MiB
Node Exporter (DaemonSet)	4	8	400m	800m	240 MiB	480 MiB
kube-state-metrics	1	1	100m	100m	128 MiB	128 MiB
Grafana	1	1	200m	200m	256 MiB	256 MiB
OTEL Collector	2	2	1000m (1 vCPU)	1000m (1 vCPU)	2 GiB	2 GiB
Subtotal	14	26	4,800m (~4.8 vCPU)	9,200m (~9.2 vCPU)	~8.7 GiB	~17 GiB
+ 25% headroom			~6,000m (~6 vCPU)	~11,500m (~11.5 vCPU)	~10.9 GiB	~21.2 GiB

Nodes: 2 at baseline, 3–4 at peak (Standard_D8as_v5: 8 vCPU, 32 GiB)

Large (~700 concurrent users)

Workload	Min Pods	Max Pods	CPU Request (Min)	CPU Request (Max)	Memory Request (Min)	Memory Request (Max)
cb-ai-service	2	10	1000m (1 vCPU)	5000m (5 vCPU)	2 GiB	10 GiB
Prometheus	2 (HA)	2 (HA)	4000m (4 vCPU)	4000m (4 vCPU)	8 GiB	8 GiB
Alertmanager	2 (HA)	2 (HA)	200m	200m	256 MiB	256 MiB
Node Exporter (DaemonSet)	6	12	600m	1200m	360 MiB	720 MiB
kube-state-metrics	1	1	200m	200m	256 MiB	256 MiB
Grafana	2	2	500m	500m	512 MiB	512 MiB
OTEL Collector	3	3	3000m (3 vCPU)	3000m (3 vCPU)	6 GiB	6 GiB
Subtotal	18	32	9,500m (~9.5 vCPU)	14,100m (~14.1 vCPU)	~17.3 GiB	~25.7 GiB
+ 25% headroom			~11,875m (~11.9 vCPU)	~17,625m (~17.6 vCPU)	~21.7 GiB	~32.1 GiB

Nodes: 3 at baseline, 6–8 at peak (Standard_D8as_v5: 8 vCPU, 32 GiB)

Autoscaling strategy

This autoscaling approach keeps the application responsive and efficient without manual tuning.

• Cluster Autoscaler adjusts the node count when scheduling demand changes. As a result, pending pods get capacity and idle nodes are removed.

• HPA automatically adjusts pod replicas based on real-time load to keep the service responsive.

Implementation mapping (Terraform)

Which Variables Map to Sizing Decisions

Sizing Decision	Terraform Variable	File
User pool VM size	aks_user_pool_vm_size	infra.tfvars
User pool min/max nodes	aks_user_pool_min_count / aks_user_pool_max_count	infra.tfvars
System pool VM size	aks_system_pool_vm_size	infra.tfvars
System pool min/max nodes	aks_system_pool_min_count / aks_system_pool_max_count	infra.tfvars
OpenAI model capacity (TPM)	openai_gpt5_mini_capacity / openai_gpt5_nano_capacity	infra.tfvars
OpenAI deployment SKU	openai_gpt5_mini_sku_name / openai_gpt5_nano_sku_name	infra.tfvars
Max pods per node	Hardcoded to 50	modules/aks/main.tf

SKU types

SKU Type	Examples	Billing	Use Case
Pay-as-you-go services	DataZoneStandard, GlobalStandard	Pay per token	Development, variable workloads
PTU	DataZoneProvisionedManaged, GlobalProvisionedManaged, ProvisionedManaged	Reserved capacity	Production, predictable workloads

Example Parameter Sets

# Copy only one of the following sets into infra.tfvars.
# ── Small (100 concurrent users) ─────────────────────
aks_user_pool_vm_size   = "Standard_D8as_v5"
aks_user_pool_min_count = 3
aks_user_pool_max_count = 6
openai_gpt5_mini_capacity = 3000
openai_gpt5_nano_capacity = 3000
# ── Medium (300 concurrent users) ────────────────────
aks_user_pool_vm_size   = "Standard_D8as_v5"
aks_user_pool_min_count = 3
aks_user_pool_max_count = 10
openai_gpt5_mini_capacity = 6000
openai_gpt5_nano_capacity = 6000
# ── Large (700 concurrent users) ─────────────────────
aks_user_pool_vm_size   = "Standard_D8as_v5"
aks_user_pool_min_count = 3
aks_user_pool_max_count = 12
openai_gpt5_mini_capacity = 12000
openai_gpt5_nano_capacity = 12000
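
One way to avoid copying values by hand is to keep the three parameter sets in a lookup table and render only the selected profile. The sketch below is illustrative: the variable names and values are the ones shown above, while the `render_tfvars` helper and the lowercase profile keys are hypothetical.

```python
PROFILES = {
    "small": {"aks_user_pool_vm_size": "Standard_D8as_v5",
              "aks_user_pool_min_count": 3, "aks_user_pool_max_count": 6,
              "openai_gpt5_mini_capacity": 3000, "openai_gpt5_nano_capacity": 3000},
    "medium": {"aks_user_pool_vm_size": "Standard_D8as_v5",
               "aks_user_pool_min_count": 3, "aks_user_pool_max_count": 10,
               "openai_gpt5_mini_capacity": 6000, "openai_gpt5_nano_capacity": 6000},
    "large": {"aks_user_pool_vm_size": "Standard_D8as_v5",
              "aks_user_pool_min_count": 3, "aks_user_pool_max_count": 12,
              "openai_gpt5_mini_capacity": 12000, "openai_gpt5_nano_capacity": 12000},
}


def render_tfvars(profile: str) -> str:
    """Render one profile as tfvars-style 'name = value' lines."""
    lines = []
    for name, value in PROFILES[profile].items():
        # Strings are quoted; numbers are written as-is.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines)


print(render_tfvars("medium"))
```

The output can be pasted into infra.tfvars, so that each variable is defined exactly once.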

Recommendation

Start by estimating user traffic, availability, and observability requirements. Select a small, medium, or large profile and configure the Terraform variables accordingly, for example aks_user_pool_vm_size and aks_user_pool_min_count / aks_user_pool_max_count. Avoid picking ad hoc numbers; select the profile that aligns with your needs.

Start with the small profile (100 concurrent users). It provides:

• Production-grade reliability: three nodes, high availability for critical components, and 25% headroom.

• Growth path: scale to medium or large by changing the required Terraform variables.

The cluster scales automatically within your chosen minimum and maximum limits. Start with a profile that fits your needs and adjust min_count and max_count over time. Redeployment is not required.

Restrictions and boundaries of configurable settings

Topic	Official documentation link
Azure OpenAI Quotas & Limits	https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
AKS Service Quotas & Limits	https://learn.microsoft.com/en-us/azure/aks/quotas-skus-regions
Azure Subscription & Service Limits (master list)	https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits
Azure Cognitive Services Limits	https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks
AKS Node Pool Constraints	https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#node-pools

How to calculate PTU based on token consumption

For information on token usage for Codebeamer AI, see Customer-Hosted Deployment Codebeamer AI Token Usage.

Topic	Official documentation link
PTU Overview & Sizing	https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput
PTU Calculator (Capacity Planning)	https://oai.azure.com/portal/calculator
Understanding PTU Allocation	https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/provisioned-throughput-onboarding
PTU Getting Started Guide	https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/provisioned-get-started
Monitor PTU Utilization	https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/monitoring

PTU calculation summary from Microsoft Docs

1. Gather your workload metrics.

▪ Average input tokens per request

▪ Average output tokens per request

▪ Requests per minute (RPM) during peak usage

2. Use the Azure PTU Calculator.

▪ Go to: https://oai.azure.com/portal/calculator.

▪ Select your model (for example, gpt-5-mini).

▪ Enter your workload characteristics.

▪ The calculator provides the recommended PTU count.

3. Validate with monitoring.

▪ After deployment, monitor the ProvisionedManagedUtilizationV2 metric.

▪ If utilization is consistently above 80%, increase PTU or enable spillover.

▪ If utilization is consistently below 30%, consider reducing PTU or switching to pay-as-you-go.
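
The workload metrics from step 1 and the utilization thresholds from step 3 can be sketched as follows. Both helpers are hypothetical. The token figure is only an input for the calculator, not a PTU count, and interpreting "consistently" as "every sample in the observation window" is an assumption.

```python
def peak_tokens_per_minute(avg_input_tokens: int, avg_output_tokens: int,
                           peak_rpm: int) -> int:
    """Total tokens per minute at peak: (input + output) x requests per minute."""
    return (avg_input_tokens + avg_output_tokens) * peak_rpm


def ptu_action(utilization_samples_pct: list) -> str:
    """Suggest an action from ProvisionedManagedUtilizationV2 samples (percent)."""
    if not utilization_samples_pct:
        raise ValueError("at least one utilization sample is required")
    if all(sample > 80 for sample in utilization_samples_pct):
        return "increase PTU or enable spillover"
    if all(sample < 30 for sample in utilization_samples_pct):
        return "consider reducing PTU or switching to pay-as-you-go"
    return "keep current PTU"


# Example workload: 1,500 input and 500 output tokens per request at 60 RPM.
print(peak_tokens_per_minute(1500, 500, 60))
print(ptu_action([85, 92, 88]))
```

Mixed samples, such as occasional spikes above 80%, return "keep current PTU" in this sketch; a longer observation window gives a more reliable signal before changing capacity.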

Key excerpt from Microsoft: "Provisioned throughput units (PTU) are generic units of model processing capacity that you use to size provisioned deployments to achieve the required throughput for processing prompts and generating completions." For more information, refer to What is provisioned throughput for Foundry Models?.