Infrastructure Component Updates
To update any resource in the existing infrastructure, reapply the infrastructure Terraform with the required changes.
AKS version upgrade
AKS cluster version upgrades are managed via infrastructure Terraform by updating the kubernetes_version parameter in the infra.tfvars file.
Versioning strategy
AKS is configured with the automatic_upgrade_channel = "patch" setting.
This ensures that patch updates, for example 1.34.x, are applied automatically by Azure, so no manual intervention is required for patch upgrades.
PTC recommends specifying only the major.minor version, for example 1.34. Do not specify a patch version.
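The major.minor recommendation can be checked before Terraform is run. The following is a minimal sketch, not part of the PTC templates: the helper name is_major_minor is hypothetical, and it tests only the format of the string, not whether AKS currently offers that version.

```shell
# Hypothetical helper (not part of the PTC templates).
# Succeeds only when the argument has the form major.minor, such as 1.34.
is_major_minor() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+$'
}

# Example: warn when a patch version has been pinned.
version="1.34.2"
if ! is_major_minor "$version"; then
  echo "kubernetes_version should be major.minor only, got: $version"
fi
```

A check of this kind could be run against the kubernetes_version value before terraform plan.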
Upgrade Process
1. Run the following command to list the available upgrade versions, and then update kubernetes_version in terraform/deployment-profiles/infra-templates/infra.tfvars accordingly.
az aks get-upgrades \
--resource-group <RESOURCE_GROUP_NAME> \
--name <AKS_CLUSTER_NAME>
# change version to 1.35
kubernetes_version = "1.35"
2. Run Terraform plan.
terraform plan -var-file=infra.tfvars
3. Run Terraform apply.
terraform apply -var-file=infra.tfvars
AKS supports upgrading one minor version at a time, for example 1.34 to 1.35.
Skipping minor versions and downgrading are not supported.
For more information, refer to the Microsoft documentation.
AKS Version Downgrade Is Not Supported
AKS follows a forward-only upgrade model, and the Kubernetes control plane and node version-skew rules are designed for upgrade paths, not rollback. Downgrading can cause API and workload incompatibilities, increase the risk of instability or outage, and may leave the cluster in an unsupported state.
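The one-minor-version rule and the no-downgrade rule can be expressed as a single comparison. The following is a sketch under stated assumptions: the helper name is_single_minor_step is hypothetical, and it compares only the major and minor parts of two version strings. It does not query AKS.

```shell
# Hypothetical helper (not part of the PTC templates).
# $1 = current version, $2 = target version; a patch suffix such as 1.34.2 is ignored.
# Succeeds only when the target is exactly one minor version above the current one.
is_single_minor_step() {
  cur_major=${1%%.*}; cur_rest=${1#*.}; cur_minor=${cur_rest%%.*}
  tgt_major=${2%%.*}; tgt_rest=${2#*.}; tgt_minor=${tgt_rest%%.*}
  [ "$cur_major" -eq "$tgt_major" ] && [ "$tgt_minor" -eq $((cur_minor + 1)) ]
}

is_single_minor_step "1.34" "1.35" && echo "1.34 -> 1.35: allowed"
is_single_minor_step "1.34" "1.36" || echo "1.34 -> 1.36: skips a minor version"
is_single_minor_step "1.35" "1.34" || echo "1.35 -> 1.34: downgrade"
```

The current version can be read with az aks show using --query kubernetesVersion -o tsv, and the target is the value planned for kubernetes_version.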
Update OpenAI Model Capacity
The capacity of the OpenAI models deployed in CHD can be updated using variables in the following file:
terraform/deployment-profiles/infra-templates/infra.tfvars
Configuration
1. Update the values as required.
# Modify capacity parameters according to your selection: PayGo or PTU
openai_gpt5_mini_capacity = <value>
openai_gpt5_nano_capacity = <value>

openai_gpt5_mini_spillover_capacity = <value>
openai_gpt5_nano_spillover_capacity = <value>
2. Run Terraform plan.
terraform plan -var-file=infra.tfvars
3. Run Terraform apply.
terraform apply -var-file=infra.tfvars
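Before planning, it can help to confirm that no capacity variable still holds an unfilled placeholder. The following is a minimal sketch, not part of the PTC templates: the helper name invalid_capacities is hypothetical, and it assumes that capacity values are whole numbers, as in the examples in this section.

```shell
# Hypothetical helper (not part of the PTC templates).
# Prints the name of every openai_*capacity variable in the given tfvars file
# whose value is not a whole number (for example, a leftover <value>).
invalid_capacities() {
  awk -F= '/^[ \t]*openai_[a-z0-9_]*capacity[ \t]*=/ {
    name = $1; value = $2
    sub(/#.*/, "", value)        # drop a trailing comment
    gsub(/[ \t]/, "", name)      # trim whitespace
    gsub(/[ \t]/, "", value)
    if (value !~ /^[0-9]+$/) print name
  }' "$1"
}

# Example: invalid_capacities infra.tfvars
```

Empty output means every capacity variable found in the file has a numeric value.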
OpenAI Model SKU Migration
This section describes how to switch Azure OpenAI deployments between PayGo (pay-per-token) and PTU (Provisioned Throughput Units).
All PTU capacities, TPM values, and SKU names below are examples. Select values based on workload needs, region availability, current Azure OpenAI offerings, and your subscription quotas.
Scenario 1: PayGo → PTU
1. Update the infra.tfvars file.
# Before (PayGo)
openai_gpt5_mini_sku_name = "DataZoneStandard"
openai_gpt5_mini_capacity = 3000 # TPM
# After (PTU with spillover)
openai_gpt5_mini_sku_name = "DataZoneProvisionedManaged"
openai_gpt5_mini_capacity = 50 # PTU units
openai_gpt5_mini_spillover_sku_name = "DataZoneStandard"
openai_gpt5_mini_spillover_capacity = 3000 # TPM for spillover
2. Run Terraform plan and apply.
terraform plan -var-file="infra.tfvars"
terraform apply -var-file="infra.tfvars"
Terraform destroys the PayGo deployment.
Terraform creates a new PayGo spillover deployment (gpt-5-mini-2025-08-07-spillover).
Terraform creates the PTU deployment with spilloverDeploymentName pointing to the spillover deployment.
The expected downtime depends on Azure provisioning time and is approximately 5 to 20 minutes during the deployment switch.
Scenario 2: PTU → PayGo
1. Update infra.tfvars.
# Before (PTU)
openai_gpt5_mini_sku_name = "DataZoneProvisionedManaged"
openai_gpt5_mini_capacity = 50
openai_gpt5_mini_spillover_sku_name = "DataZoneStandard"
openai_gpt5_mini_spillover_capacity = 3000
# After (PayGo)
openai_gpt5_mini_sku_name = "DataZoneStandard"
openai_gpt5_mini_capacity = 3000 # TPM
# Spillover settings are ignored for PayGo SKUs
2. Run Terraform plan and apply.
terraform plan -var-file="infra.tfvars"
terraform apply -var-file="infra.tfvars"
Terraform destroys the PTU deployment.
Terraform destroys the spillover deployment.
Terraform creates a new PayGo deployment.
The expected downtime depends on Azure provisioning time and is approximately 2 to 5 minutes during the deployment switch.
Scenario 3: Changing SKU Type
This scenario covers switching between deployment types for PTU deployments, for example from DataZoneProvisionedManaged to GlobalProvisionedManaged.
a. Update infra.tfvars.
# Before
openai_gpt5_mini_sku_name = "DataZoneProvisionedManaged"
openai_gpt5_mini_capacity = 50
# After (different PTU type)
openai_gpt5_mini_sku_name = "GlobalProvisionedManaged"
openai_gpt5_mini_capacity = 50
# Keep the spillover settings unchanged
b. Run Terraform plan and apply.
terraform plan -var-file="infra.tfvars"
terraform apply -var-file="infra.tfvars"
For a PayGo-to-PayGo change, follow the same approach and update only the SKU name and capacity as required.
Only the PTU deployment is replaced.
The spillover deployment remains unchanged.
The new PTU deployment links to the existing spillover deployment.
The expected downtime depends on Azure provisioning time and is approximately 3 to 10 minutes.
Scenario 4: Changing the Spillover Model SKU
A two-step process is required to change the spillover model SKU: destroy the PTU deployment first, and then apply the new configuration.
Azure blocks deletion of a spillover deployment while a PTU deployment references it, so changing the spillover SKU in a single Terraform apply always fails.
Perform the following steps to change the spillover SKU from DataZoneStandard to GlobalStandard or vice versa.
1. Destroy the PTU deployment.
terraform destroy -target="module.cognitive_deployment-gpt5-mini.azapi_resource.deployment_with_spillover[0]" -var-file="infra.tfvars"
This removes the PTU deployment and its reference to the spillover.
2. Update the spillover SKU in infra.tfvars.
# Change spillover SKU
openai_gpt5_mini_spillover_sku_name = "GlobalStandard" # Changed from DataZoneStandard
3. Run Terraform apply to recreate the deployments.
terraform apply -var-file="infra.tfvars"
This creates a new spillover deployment with the new SKU, and then a new PTU deployment linked to the new spillover deployment.
The expected downtime depends on Azure provisioning time. Steps 1 and 3 may take approximately 5 to 15 minutes or more.
Post SKU Update Verification
1. Verify Terraform outputs.
Run the following command:
terraform output
Confirm that the deployment names match the intended SKU configuration.
Confirm that spillover names appear for PTU deployments and are null for PayGo deployments.
2. Verify Azure deployment state.
Run the following command:
az cognitiveservices account deployment list \
--resource-group <rg> \
--name <cognitive-account-name> \
--query "[].{name:name, sku:sku.name, state:properties.provisioningState}" -o table
All deployments should show provisioningState = Succeeded.
Each SKU should match the expected value (DataZoneStandard or a PTU variant).
If a deployment shows Creating or Failed instead, wait and retry.
PTU deployment time depends on Azure provisioning. It can take several minutes even after Terraform completes.
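The wait-and-retry guidance can be scripted as a simple polling loop. The following is a sketch, not part of the PTC tooling: the function names wait_until_succeeded and list_deployment_states are hypothetical, and the attempt count and delay in the example are illustrative values only.

```shell
# Hypothetical helper (not part of the PTC templates).
# $1 = maximum attempts, $2 = seconds between attempts, remaining arguments = a
# command that prints one provisioning state per line.
# Returns 0 once every line reads Succeeded, 1 if attempts run out, 2 if the command fails.
wait_until_succeeded() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    states=$("$@") || return 2
    # Succeed only when there is output and no line differs from "Succeeded".
    if [ -n "$states" ] && ! printf '%s\n' "$states" | grep -qvx 'Succeeded'; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Prints the provisioning state of every deployment, one per line.
list_deployment_states() {
  az cognitiveservices account deployment list \
    --resource-group "<rg>" \
    --name "<cognitive-account-name>" \
    --query "[].properties.provisioningState" -o tsv
}

# Example (illustrative values): up to 20 checks, 30 seconds apart.
# wait_until_succeeded 20 30 list_deployment_states
```

A deployment that stays in the Failed state does not recover by waiting alone, so a non-zero result from this loop still needs manual investigation.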
3. Verify private endpoint DNS resolution.
Run the following command from inside the VNet (AKS pod):
kubectl run dns-test --rm -it --image=busybox -- nslookup \
<cognitive-account>.openai.azure.com
The name should resolve to a private IP address (10.x.x.x).
If it resolves to a public IP address, there is an issue with the private DNS zone link. If nslookup reports "server can't find", this indicates a DNS propagation delay. Wait 5 to 10 minutes and retry.
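Whether a resolved address is private can be checked mechanically. The following is a minimal sketch: the helper name is_private_ipv4 is hypothetical, and it assumes that the private endpoint address falls in one of the RFC 1918 ranges, of which 10.x.x.x is one.

```shell
# Hypothetical helper (not part of the PTC templates).
# Succeeds when the address starts with an RFC 1918 private prefix:
# 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16.
# It matches on the prefix only and does not validate the full address.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

for ip in 10.0.4.5 20.50.1.1; do
  if is_private_ipv4 "$ip"; then echo "$ip: private"; else echo "$ip: public"; fi
done
```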
4. Verify cb-ai-service pods.
kubectl get pods -n cb-ai-service
All pods should be in the Running state.
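For namespaces with many pods, the check can be reduced to listing only the pods that are not running. The following is a sketch: the helper name not_running_pods is hypothetical, and it assumes the default column order of kubectl get pods (NAME, READY, STATUS, and so on).

```shell
# Hypothetical helper (not part of the PTC templates).
# Reads `kubectl get pods --no-headers` output on stdin
# and prints the name of every pod whose STATUS column is not Running.
not_running_pods() {
  awk '$3 != "Running" { print $1 }'
}

# Example: kubectl get pods -n cb-ai-service --no-headers | not_running_pods
```

Empty output means every listed pod reports the Running status. The READY column is not inspected here.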
5. Verify the spillover deployment (PTU only).
az cognitiveservices account deployment list \
--resource-group <rg> \
--name <cognitive-account-name> \
--query "[?contains(name,'spillover')].{name:name, sku:sku.name, capacity:sku.capacity}" -o table
The expected outcomes are as follows:
Spillover deployments exist with a PayGo SKU, for example DataZoneStandard.
The capacity matches the configured spillover capacity.
If the spillover deployment is missing, the primary deployment may not be PTU.
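To tell whether a primary deployment is PTU, the SKU name in the table output can be classified by its suffix. The following is a sketch based only on the SKU names used in this section: the helper name sku_kind is hypothetical, and other Azure SKU names may not follow this pattern.

```shell
# Hypothetical helper (not part of the PTC templates).
# Classifies a SKU name by suffix, using only the names that appear in this section:
#   ...ProvisionedManaged -> PTU, ...Standard -> PayGo.
sku_kind() {
  case "$1" in
    *ProvisionedManaged) echo "PTU" ;;
    *Standard) echo "PayGo" ;;
    *) echo "unknown" ;;
  esac
}

sku_kind "DataZoneProvisionedManaged"
sku_kind "GlobalStandard"
```

A spillover deployment is expected only when the primary SKU is classified as PTU.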
6. Verify that cb-ai-service is running.
Verify the cb-ai-service deployments using the same steps that are applied after service deployment. See Deployment of Service.