Troubleshooting

This topic describes common failure scenarios when deploying Codebeamer AI in a customer-hosted environment using Terraform, and explains the correct recovery actions for each case.

Terraform deployment rollback and state recovery

When you use Terraform with a remote Azure Storage backend, deployments can fail for various reasons. Knowing whether the Terraform state is still valid is critical to choosing the correct recovery approach.

In most cases, rolling back Terraform code and reapplying the configuration is the correct approach. Blob versioning must only be used when the state file itself becomes corrupted or unreadable.

Terraform failure scenarios and correct recovery approach

The following scenarios describe how Terraform behaves during failures and the correct recovery method.

Scenario	State status	Should previous state be restored?	Correct recovery approach
Terraform apply fails completely. Resources are not created.	State unchanged	No	Revert code and run Terraform plan.
Apply partially succeeds. Some resources are created.	State partially updated but accurate	No	Fix the issue and rerun Terraform apply, or revert code and rerun Terraform apply.
Apply crashes during state write.	State corrupted or unreadable	Yes	Restore the previous blob version and rerun Terraform.
Apply succeeds but deployment must be undone.	State valid	No	Revert code and rerun Terraform apply.

Scenario 1 – Terraform apply fails completely

Terraform attempts to create a resource, but Azure rejects the request before any resources are created.

Common causes

◦ Invalid configuration

◦ Quota exceeded

◦ Invalid model name

◦ Permission issues

Example situation

Component	Status
Code	Contains new resource
State	Unchanged
Azure	No resource created

Terraform updates the state only after successful operations, so the state remains unchanged.

Example

Code: gpt-5-nano deployment added
State: V1 (unchanged)
Azure: resource does not exist

Resolution

Revert the code change.

# If you use Git, check out the earlier branch or tag:
git checkout <earlier_branch/earlier_tag>
terraform plan -var-file=infra.tfvars

Expected result: No changes
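
The "No changes" check can also be scripted. The following is a minimal sketch, assuming a POSIX shell and the same infra.tfvars file; the function name plan_status is illustrative. It relies on the -detailed-exitcode option of terraform plan, which makes the command exit with 0 when there are no changes, 1 on error, and 2 when changes are pending.

```shell
# Hypothetical helper: report whether the reverted code matches the state.
# Relies on `terraform plan -detailed-exitcode`:
#   0 = no changes, 1 = error, 2 = changes pending.
plan_status() {
  rc=0
  terraform plan -detailed-exitcode -var-file=infra.tfvars >/dev/null || rc=$?
  case "$rc" in
    0) echo "no-changes" ;;
    2) echo "changes-pending" ;;
    *) echo "error" ;;
  esac
  return "$rc"
}
```

In this scenario, running plan_status after the checkout should print no-changes.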

State Restore Required: No

Scenario 2 – Terraform apply partially succeeds

Terraform successfully creates some resources and then fails. Terraform writes state after every successful operation, so the state accurately reflects the deployed resources.

Example scenario

Component	Status
Code	Contains resource1, resource2, resource3
State	Contains resource1 and resource2
Azure	resource1 and resource2 exist, resource3 was not created

For example:

Code: resource1, resource2, resource3
State: resource1, resource2
Azure: resource1, resource2

The deployment failed when Terraform attempted to create resource3.

Option 1 – Fix the issue and rerun Terraform

Use this option when the failure was caused by a temporary or correctable issue, such as a quota limit, temporary API error, or misconfiguration.

terraform apply -var-file=infra.tfvars

Terraform then:

1. Detects resource1 and resource2 already exist in state.

2. Skips resource1 and resource2.

3. Attempts to create resource3 again.

Result:

Component	Status
State	resource1, resource2, resource3
Azure	resource1, resource2, resource3

This continues the deployment from where it failed.

Option 2 – Revert the code

Use this option when the deployment itself is incorrect. For example, a model deployment was mistakenly added.

# If you use Git, check out the earlier branch or tag:
git checkout <earlier_branch/earlier_tag>
terraform apply -var-file=infra.tfvars

After the code is reverted and before the apply runs, the components are in the following state:

Component	Status
Code	resource1 and resource2 removed
State	resource1 and resource2 exist
Azure	resource1 and resource2 exist

Terraform detects that the resources exist in state but no longer exist in code, so it destroys them.

Terraform plan shows the following:

- destroy resource1
- destroy resource2

After apply, the result is as follows:

Component	Status
State	clean
Azure	resource1 and resource2 removed
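
Both options follow from how Terraform compares the resource addresses in code with those in state. The sketch below is a simplified model of that comparison, not Terraform's actual planner, which also refreshes against Azure and compares attributes. The function name plan_sketch and the two input files (one resource address per line) are assumptions for illustration.

```shell
# Simplified model of a Terraform plan: compare addresses in code vs. state.
# $1 = file listing addresses declared in code
# $2 = file listing addresses recorded in state
plan_sketch() (
  LC_ALL=C
  export LC_ALL
  code_sorted=$(mktemp) && state_sorted=$(mktemp) || exit 1
  sort -u "$1" > "$code_sorted"
  sort -u "$2" > "$state_sorted"
  # In code but not in state: Terraform would create these.
  comm -23 "$code_sorted" "$state_sorted" | sed 's/^/+ create /'
  # In state but not in code: Terraform would destroy these.
  comm -13 "$code_sorted" "$state_sorted" | sed 's/^/- destroy /'
  rm -f "$code_sorted" "$state_sorted"
)
```

With code listing resource1, resource2, and resource3 and state listing resource1 and resource2, the sketch prints "+ create resource3" (Option 1). With the code reverted so that it lists none of them, it prints "- destroy resource1" and "- destroy resource2" (Option 2). On a real deployment, terraform state list prints the addresses tracked in state.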

Scenario 3 – Terraform state corruption recovery using Azure Blob Versioning

Terraform relies on the state file to map infrastructure resources. If the state file becomes corrupted or unreadable, Terraform commands such as plan, apply, and destroy fail.

To recover from this situation, the previous version of the state file must be restored from Azure Blob Versioning. Blob versioning preserves historical copies of the state file every time it is modified, allowing safe recovery from corruption.

Possible causes

◦ Terraform process interruption during state write

◦ Terminal crash during terraform apply

◦ Network interruption during state upload

This results in the following state:

Component	Status
Code	May contain new resources
State	Corrupted or unreadable
Azure	Resources may or may not exist

Example of Terraform error

terraform plan -var-file=infra.tfvars
╷
│ Error: Unsupported state file format
│
│ The version in the state file is string. A positive whole number is required.
╵
╷
│ Error: Unsupported state file format
│
│ The state file does not have a "version" attribute, which is required to identify the format version.
╵

Or:

terraform plan -var-file=infra.tfvars
╷
│ Error: Unsupported state file format
│
│ The state file could not be parsed as JSON: syntax error at byte offset 7805.
╵

This indicates the state file is no longer valid JSON.

Recovery procedure

1. Confirm state corruption.

terraform plan -var-file=infra.tfvars

Expected error: one of the "Unsupported state file format" errors shown in the example above.

This confirms that the state file is corrupted.

2. List available blob versions.

Blob versioning automatically retains previous versions of the state file.

az storage blob list \
  --container-name <container_name> \
  --account-name <storage_account_name> \
  --include v \
  --auth-mode login \
  --query "[?name=='terraform.tfstate'].{name:name, Version:versionId, Modified:properties.lastModified}" \
  --output table

Example output:

Name	Version	Modified
terraform.tfstate	2026-03-13T11:00:00.0000000Z	2026-03-13T11:00:00
terraform.tfstate	2026-03-13T11:05:00.0000000Z	2026-03-13T11:05:00

Identify and copy the version ID of the last known good state. In this example, the older version is the valid state and the latest version is the corrupted state.

3. Download the last known good version.

Download the valid state file from blob versioning.

az storage blob download \
  --container-name <container_name> \
  --account-name <storage_account_name> \
  --name terraform.tfstate \
  --version-id <version_id> \
  --file recovered-state.json \
  --auth-mode login

4. Restore the state file.

Upload the recovered state file to replace the corrupted blob.

az storage blob upload \
  --container-name <container_name> \
  --account-name <storage_account_name> \
  --name terraform.tfstate \
  --file recovered-state.json \
  --overwrite \
  --auth-mode login

This restores the working Terraform state.

5. Verify Terraform operation.

terraform plan -var-file=infra.tfvars

If the plan runs without a state file format error, the state recovery was successful.

6. Confirm version history is preserved.

List blob versions again.

az storage blob list \
  --container-name <container_name> \
  --account-name <storage_account_name> \
  --include v \
  --auth-mode login \
  --query "[?name=='terraform.tfstate'].{name:name, Version:versionId, Modified:properties.lastModified}" \
  --output table

Example:

Version	Description
V1	Original valid state
V2	Corrupted state
V3	Restored valid state

Blob versioning preserves the entire audit trail of state changes.
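
Steps 2 to 4 can be made safer with two small checks before the upload: pick the version that precedes the newest one, and confirm that the downloaded file parses as a state file. The following is a minimal sketch, assuming a POSIX shell with python3 on the PATH; the function names previous_version and is_valid_state are illustrative. It assumes that version IDs are timestamp strings as in the example output, so that sorting them lexically orders them by time, and that only the newest version is corrupted, which may not hold in every case.

```shell
# Print the second-newest version ID from a list of version IDs on stdin
# (one per line). Fails if fewer than two versions are available.
previous_version() (
  sorted=$(LC_ALL=C sort)
  count=$(printf '%s\n' "$sorted" | grep -c . || true)
  [ "$count" -ge 2 ] || exit 1
  printf '%s\n' "$sorted" | tail -n 2 | head -n 1
)

# Succeed only if the file is valid JSON with a positive integer "version".
is_valid_state() {
  python3 - "$1" <<'PY'
import json, sys
try:
    with open(sys.argv[1]) as f:
        state = json.load(f)
except (OSError, ValueError):
    sys.exit(1)
version = state.get("version") if isinstance(state, dict) else None
ok = isinstance(version, int) and not isinstance(version, bool) and version > 0
sys.exit(0 if ok else 1)
PY
}

# Example use with the file downloaded in step 3:
#   is_valid_state recovered-state.json && echo "safe to upload"
```

Checking recovered-state.json with is_valid_state before step 4 avoids overwriting the blob with another unreadable file.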

Test Validation Results

Test step	Expected outcome
Corrupt state file	Terraform fails to read state
Check blob versions	Previous state version exists
Download valid version	Valid JSON state retrieved
Restore state file	Blob replaced successfully
Run Terraform plan	Terraform fully operational

Result: State successfully recovered using blob versioning without manual resource import.

Scenario 4 – Deployment succeeded but must be reverted

Terraform apply succeeded, but the deployment must be undone.

Example

Component	Status
Code	Contains gpt-5-nano
State	Contains gpt-5-nano
Azure	gpt-5-nano exists

Resolution

Revert the Terraform code and reapply.

# If you use Git, check out the earlier branch or tag:
git checkout <earlier_branch/earlier_tag>
terraform apply -var-file=infra.tfvars

Terraform detects that the resource exists in state but not in code and destroys it.

Scenario 5 – State lock error

Terraform cannot acquire the state lock because a previous operation was interrupted or is still running.

Users usually encounter this when retrying apply or destroy.

Example error

│ Error: Error acquiring the state lock
│
│ Error message: state blob is already locked
│ Lock Info:
│   ID:        47befb15-5e0e-908e-d1cd-298e3c723f3d
│   Path:      tfstate/infra.tfstate
│   Operation: OperationTypePlan
│   Who:       user@machine
│   Version:   1.14.0
│   Created:   2026-04-09 07:02:45 +0000 UTC

Possible causes

◦ Previous Terraform command was interrupted

◦ Terminal or SSH session crashed during operation

◦ Network disconnection during state write

◦ Another user or pipeline running Terraform concurrently

Resolution

1. Confirm that no other user or pipeline is running Terraform against this state, then break the existing lease.

az storage blob lease break \
  --account-name "<storage-account>" \
  --container-name "<container>" \
  --blob-name "<state-file>.tfstate" \
  --auth-mode login

2. Rerun Terraform.
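
As an alternative to breaking the lease directly, Terraform provides the terraform force-unlock command, which takes the lock ID reported in the error. The following sketch, assuming a POSIX shell, extracts that ID from saved error output; the function name lock_id is illustrative. As with breaking the lease, this is only safe when no other Terraform operation is actually running.

```shell
# Print the first lock ID (a UUID) found in Terraform error output on stdin.
lock_id() {
  sed -n 's/.*ID:[[:space:]]*\([0-9a-fA-F-]\{36\}\).*/\1/p' | head -n 1
}

# Example use (run manually after confirming nobody else holds the lock):
#   id=$(terraform plan -var-file=infra.tfvars 2>&1 | lock_id)
#   terraform force-unlock "$id"
```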

Scenario 6 – Network or TLS errors

Terraform fails due to network connectivity issues with Azure APIs.

Example errors

◦ TLS handshake timeout

│ Error: creating Private Endpoint: Put "https://management.azure.com/...": 
│ net/http: TLS handshake timeout

◦ Connection reset

│ Error: creating AKS Cluster: HTTP response was nil; connection may have been reset

◦ Context deadline exceeded

│ Error: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Possible causes

◦ Unstable network connection

◦ VPN or proxy interference

◦ Azure API temporary issues

◦ Corporate firewall blocking requests

◦ Request timeout too short for slow operations

Resolution

Use whichever of the following resolutions applies.

◦ Retry. This is the most common fix.

terraform apply -var-file="infra.tfvars"

◦ Increase timeout. The example sets the environment variable using PowerShell syntax.

$env:ARM_CLIENT_TIMEOUT_SECONDS = "3600"
terraform apply -var-file="infra.tfvars"

◦ Check network.

▪ Disable VPN temporarily

▪ Check proxy settings

▪ Verify firewall rules allow Azure management endpoints
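
Because transient network errors usually clear on a retry, the retry can be automated with a bounded loop. The following is a minimal sketch, assuming a POSIX shell; the function name retry and the attempt count and delay values are illustrative, not product recommendations.

```shell
# Run a command up to N times, doubling the delay after each failure.
# $1 = maximum attempts, $2 = initial delay in seconds, rest = command to run.
retry() (
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && exit 0
    if [ "$n" -ge "$attempts" ]; then
      echo "Giving up after $n attempts." >&2
      exit 1
    fi
    echo "Attempt $n failed; retrying in ${delay}s." >&2
    sleep "$delay"
    n=$((n + 1))
    delay=$((delay * 2))
  done
)

# Example use:
#   retry 3 30 terraform apply -var-file="infra.tfvars"
```

Each retry reruns the apply against the current state, so resources created by an earlier attempt are not created again.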

Scenario 7 – Resource already exists

Terraform attempts to create a resource that already exists in Azure but is not tracked in state.

Example error

│ Error: a resource with the ID "/subscriptions/xxx/resourceGroups/my-rg/providers/
│ Microsoft.ContainerService/managedClusters/my-cluster" already exists - to be 
│ managed via Terraform this resource needs to be imported into the State.
│
│   with module.aks.azurerm_kubernetes_cluster.aks,
│   on ../../modules/aks/main.tf line 13, in resource "azurerm_kubernetes_cluster" "aks":
│   13: resource "azurerm_kubernetes_cluster" "aks" {

Possible causes

◦ Previous Terraform apply failed midway: the resource was created, but the state was not saved

◦ Resource created manually in Azure Portal

◦ State file was lost, corrupted, or restored to older version

◦ Resource imported in different workspace

Resolution

1. Import the existing resource into state.

terraform import -var-file="infra.tfvars" \
  "module.aks.azurerm_kubernetes_cluster.aks" \
  "/subscriptions/xxx/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-cluster"

2. Continue with apply.

terraform apply -var-file="infra.tfvars"

Scenario 8 – Subscription quota exceeded

This occurs when Azure rejects resource creation because subscription quota is exceeded.

Example error

│ Error: creating Deployment: unexpected status 400 (400 Bad Request) with error: 
│ InsufficientQuota: This operation require 10000 new capacity in quota 
│ One Thousand Tokens Per Minute - gpt-5-mini - DataZoneStandard, which is bigger 
│ than the current available capacity 3650. The current quota usage is 350 and 
│ the quota limit is 4000.

Possible causes

◦ Requested capacity exceeds subscription quota limit

◦ Other deployments consuming available quota

◦ Region-specific quota limits

◦ New subscription with default (low) quotas

Resolution

Use whichever of the following resolutions applies.

◦ Reduce capacity in infra.tfvars.

openai_gpt5_mini_capacity = 3000 # Reduce to fit quota

◦ Request quota increase.

a. Go to Azure portal and select Quotas.

b. Search for the resource type. For example, Cognitive Services.

c. Click Request Increase.

d. Submit request and wait for approval.

◦ Use a different region. Some regions have higher default quotas.
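
The numbers in the error message determine how far the capacity must be reduced: the available capacity is the quota limit minus the current usage. The following is a minimal sketch of that arithmetic, assuming a POSIX shell; the function name fits_quota is illustrative.

```shell
# Succeed if the requested capacity fits within the remaining quota.
# $1 = requested capacity, $2 = quota limit, $3 = current quota usage.
fits_quota() {
  available=$(( $2 - $3 ))
  echo "available=$available requested=$1"
  [ "$1" -le "$available" ]
}

# Example use with the values from the error above:
#   fits_quota 10000 4000 350   # fails: 10000 exceeds 3650
#   fits_quota 3000 4000 350    # succeeds: 3000 fits within 3650
```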

Scenario 9 – Private endpoint or DNS errors

Applications receive HTTP 403 errors when connecting to Azure OpenAI via a private endpoint.

Example error from application logs

{"error": {"code": "403", "message": "Traffic is not from an approved private endpoint."}}

Possible causes

◦ DNS not fully propagated after fresh deployment.

◦ PTU deployment took longer than expected.

◦ Using Global SKU (GlobalProvisionedManaged) with private endpoints.

◦ Private endpoint connection not approved.

◦ VNet DNS link not completed.

Resolution

1. Verify private endpoint status.

az network private-endpoint show \
  --resource-group "<rg>" \
  --name "<pe-name>" \
  --query "privateLinkServiceConnections[0].privateLinkServiceConnectionState.status"

Expected result: Approved.

2. Verify DNS resolution.

kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
  nslookup <account-name>.openai.azure.com

Expected result: The name resolves to a private IP address, for example 10.x.x.x.

3. Wait for DNS propagation.

For fresh PTU deployments, wait 10-15 minutes after terraform apply completes.

4. Use DataZone SKUs.

If using Global SKUs, switch to DataZone variants.

openai_gpt5_mini_sku_name = "DataZoneProvisionedManaged" # Instead of GlobalProvisionedManaged
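
The DNS check in step 2 depends on recognizing a private address. The following is a minimal sketch, assuming a POSIX shell and a well-formed IPv4 address as input; the function name is_private_ip is illustrative. It accepts the three RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16), because a virtual network is not required to use 10.x.x.x.

```shell
# Succeed if the IPv4 address given as $1 is in an RFC 1918 private range.
# Assumes $1 is a well-formed dotted-quad address; it is not fully validated.
is_private_ip() {
  case "$1" in
    10.*|192.168.*)
      return 0
      ;;
    172.*)
      second=${1#172.}
      second=${second%%.*}
      if [ "$second" -ge 16 ] 2>/dev/null && [ "$second" -le 31 ]; then
        return 0
      fi
      return 1
      ;;
    *)
      return 1
      ;;
  esac
}
```

If the address returned by nslookup fails this check, the name is still resolving to the public endpoint.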

Scenario 10 – Spillover deployment deletion blocked

Terraform cannot delete or modify the spillover deployment because the PTU deployment references it.

Example error

│ Error: deleting Deployment: unexpected status 409 (Conflict) with error:
│ DeploymentInUse: The deployment 'gpt-5-mini-2025-08-07-spillover' cannot be 
│ deleted because it is referenced by deployment 'gpt-5-mini-2025-08-07' as 
│ spillover deployment.

Possible causes

◦ Attempting to change the spillover SKU in a single terraform apply.

◦ The PTU deployment still references the spillover deployment via spilloverDeploymentName.

Resolution

1. Destroy the PTU deployment first. This removes the spillover reference.

terraform destroy \
  -target="module.cognitive_deployment-gpt5-mini.azapi_resource.deployment_with_spillover[0]" \
  -var-file="infra.tfvars"

2. Apply to recreate with new configuration.

terraform apply -var-file="infra.tfvars"

Scenario 11 – Authentication errors

These occur when Terraform cannot authenticate with Azure.

Example errors

◦ Not logged in

│ Error: building AzureRM Client: obtain subscription() from Azure CLI: 
│ parsing json result from the Azure CLI: waiting for the Azure CLI: 
│ exit status 1: ERROR: Please run 'az login' to setup account.

◦ Token expired

│ Error: obtaining Authorization Token: AADSTS700024: Client assertion is not 
│ within its valid time range.

◦ Wrong subscription

│ Error: Subscription not found: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

Possible causes

◦ Not logged in to Azure CLI

◦ Azure CLI token expired

◦ Wrong subscription selected

◦ Service principal credentials expired

Resolution

1. Log in to Azure.

az login

2. Set the correct subscription.

az account set --subscription "<subscription-id>"

3. Verify the subscription.

az account show
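
The verification in step 3 can be scripted so that a wrong subscription is caught before Terraform runs. The following is a minimal sketch, assuming a POSIX shell and the Azure CLI; the function name check_subscription is illustrative.

```shell
# Succeed if the active Azure CLI subscription matches the expected ID.
# $1 = expected subscription ID.
check_subscription() {
  current=$(az account show --query id --output tsv 2>/dev/null) || {
    echo "Not logged in: run 'az login'." >&2
    return 2
  }
  if [ "$current" != "$1" ]; then
    echo "Active subscription is $current, expected $1." >&2
    return 1
  fi
}

# Example use:
#   check_subscription "<subscription-id>" && terraform plan -var-file="infra.tfvars"
```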

Scenario 12 – Resource not found during destroy

Terraform attempts to destroy a resource that no longer exists in Azure.

Example error

│ Error: deleting Resource Group: the Resource Group was not found
│ 
│ Resource Group Name: "my-rg"

Possible causes

◦ Resource was manually deleted in Azure portal

◦ Resource deleted by another process or pipeline

◦ Resource name changed outside Terraform

Resolution

1. Remove the missing resource from state.

terraform state rm "azurerm_resource_group.rg"

2. Continue with destroy.

terraform destroy -var-file="infra.tfvars"
