Troubleshooting
This topic describes common failure scenarios when deploying Codebeamer AI in a customer-hosted environment using Terraform, and explains the correct recovery actions for each case.
Terraform deployment rollback and state recovery
When you use Terraform with a remote Azure Storage backend, deployments can fail for various reasons. Understanding whether the Terraform state is valid is critical to choosing the correct recovery approach.
In most cases, rolling back Terraform code and reapplying the configuration is the correct approach. Blob versioning must only be used when the state file itself becomes corrupted or unreadable.
Terraform failure scenarios and correct recovery approach
The following scenarios describe how Terraform behaves during failures and the correct recovery method.
| Scenario | State status | Should previous state be restored? | Correct recovery approach |
| --- | --- | --- | --- |
| Terraform apply fails completely. Resources are not created. | State unchanged | No | Revert code and run Terraform plan. |
| Apply partially succeeds. Some resources are created. | State partially updated but accurate | No | Fix the issue and rerun Terraform apply, or revert code and rerun Terraform apply. |
| Apply crashes during state write. | State corrupted or unreadable | Yes | Restore the previous blob version and rerun Terraform. |
| Apply succeeds but deployment must be undone. | State valid | No | Revert code and rerun Terraform apply. |
Scenario 1 – Terraform apply fails completely
Terraform attempts to create a resource, but Azure rejects the request before any resources are created.
Common causes
Invalid configuration
Quota exceeded
Invalid model name
Permission issues
Example situation
| Component | Status |
| --- | --- |
| Code | Contains new resource |
| State | Unchanged |
| Azure | No resource created |
Terraform updates the state only after successful operations, so the state remains unchanged.
Example
Code: gpt-5-nano deployment added
State: V1 (unchanged)
Azure: resource does not exist
Resolution
Revert the code change.
# If using Git:
git checkout <earlier_branch/earlier_tag>
terraform plan -var-file=infra.tfvars
Expected result: No changes
State Restore Required: No
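The "No changes" check can also be scripted. The following is a minimal sketch, assuming the standard -detailed-exitcode option of terraform plan (exit code 0 for no changes, 1 for an error, 2 for pending changes). The helper name describe_plan_exit is hypothetical and not part of the product.

```shell
# Map the exit code of `terraform plan -detailed-exitcode` to a message.
# describe_plan_exit is a hypothetical helper name used for illustration.
describe_plan_exit() {
  case "$1" in
    0) echo "No changes: code and state match" ;;
    2) echo "Changes pending: code and state differ" ;;
    *) echo "Plan failed: check the error output" ;;
  esac
}

# Usage (not run here, because it needs a configured Terraform backend):
#   terraform plan -detailed-exitcode -var-file=infra.tfvars
#   describe_plan_exit "$?"
```

After the code is reverted in this scenario, the first message is the expected one.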
Scenario 2 – Terraform apply partially succeeds
Terraform successfully creates some resources and then fails. Terraform writes state after every successful operation, so the state accurately reflects the deployed resources.
Example scenario
| Component | Status |
| --- | --- |
| Code | Contains resource1, resource2, resource3 |
| State | Contains resource1 and resource2 |
| Azure | resource1 and resource2 exist, resource3 was not created |
For example:
Code: resource1, resource2, resource3
State: resource1, resource2
Azure: resource1, resource2
The deployment failed when Terraform attempted to create resource3.
Option 1 – Fix the issue and rerun Terraform
Use this option when the failure was caused by a temporary or correctable issue, such as a quota limit, temporary API error, or misconfiguration.
terraform apply -var-file=infra.tfvars
Terraform then:
1. Detects resource1 and resource2 already exist in state.
2. Skips resource1 and resource2.
3. Attempts to create resource3 again.
Result:
| Component | Status |
| --- | --- |
| State | resource1, resource2, resource3 |
| Azure | resource1, resource2, resource3 |
This continues the deployment from where it failed.
Option 2 – Revert the code
Use this option when the deployment itself is incorrect. For example, a model deployment was mistakenly added.
# If using Git:
git checkout <earlier_branch/earlier_tag>
terraform apply -var-file=infra.tfvars
This results in the following state:
| Component | Status |
| --- | --- |
| Code | resource1 and resource2 removed |
| State | resource1 and resource2 exist |
| Azure | resource1 and resource2 exist |
Terraform detects that the resources exist in state but no longer exist in code, so it destroys them.
Terraform plan shows the following:
- destroy resource1
- destroy resource2
After apply, the result is as follows:
| Component | Status |
| --- | --- |
| State | clean |
| Azure | resource1 and resource2 removed |
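Both options in this scenario follow from the same comparison between the resources declared in code and the resources recorded in state. The sketch below is a deliberately simplified model of that comparison, not Terraform's actual planner: it ignores in-place updates, dependencies, and drift. The function name plan_actions and the space-separated list format are assumptions made for illustration.

```shell
# Print the actions a plan would contain, given two space-separated lists:
# the resources declared in code and the resources recorded in state.
# plan_actions is a hypothetical helper, not a Terraform command.
plan_actions() {
  code_list=$1
  state_list=$2
  for r in $code_list; do
    case " $state_list " in
      *" $r "*) ;;                      # in code and in state: nothing to do
      *) echo "+ create $r" ;;          # in code only: create
    esac
  done
  for r in $state_list; do
    case " $code_list " in
      *" $r "*) ;;                      # still declared: keep
      *) echo "- destroy $r" ;;         # in state only: destroy
    esac
  done
}

# Option 1: the code still declares all three resources.
plan_actions "resource1 resource2 resource3" "resource1 resource2"
# Option 2: the reverted code declares none of them.
plan_actions "" "resource1 resource2"
```

The first call prints a single create action for resource3, and the second prints destroy actions for resource1 and resource2, matching the two outcomes described above.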
Scenario 3 – Terraform state corruption recovery using Azure Blob Versioning
Terraform relies on the state file to map infrastructure resources. If the state file becomes corrupted or unreadable, Terraform commands such as plan, apply, and destroy fail.
To recover from this situation, the previous version of the state file must be restored from Azure Blob Versioning. Blob versioning preserves historical copies of the state file every time it is modified, allowing safe recovery from corruption.
Possible causes
Terraform process interruption during state write
Terminal crash during terraform apply
Network interruption during state upload
This results in the following state:
| Component | Status |
| --- | --- |
| Code | May contain new resources |
| State | Corrupted or unreadable |
| Azure | Resources may or may not exist |
Example of Terraform error
terraform plan -var-file=infra.tfvars

│ Error: Unsupported state file format

│ The version in the state file is string. A positive whole number is required.


│ Error: Unsupported state file format

│ The state file does not have a "version" attribute, which is required to identify the format version.
OR
$ terraform plan -var-file=infra.tfvars
│ Error: Unsupported state file format

│ The state file could not be parsed as JSON: syntax error at
│ byte offset 7805.
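Each of these errors indicates that the state file is no longer a valid Terraform state document: either it is not valid JSON, or its version attribute is missing or malformed.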
Recovery procedure
1. Confirm state corruption.
terraform plan -var-file=infra.tfvars
Expected result: Terraform fails with one of the "Unsupported state file format" errors shown in the example above.
This confirms that the state file is corrupted.
2. List available blob versions.
Blob versioning automatically retains previous versions of the state file.
az storage blob list \
--container-name <container_name> \
--account-name <storage_account_name> \
--include v \
--auth-mode login \
--query "[?name=='terraform.tfstate'].{Name:name, Version:versionId, Modified:properties.lastModified}" \
--output table
Example output:
| Name | Version | Modified |
| --- | --- | --- |
| terraform.tfstate | 2026-03-13T11:00:00.0000000Z | 2026-03-13T11:00:00 |
| terraform.tfstate | 2026-03-13T11:05:00.0000000Z | 2026-03-13T11:05:00 |
Identify and copy the last known good version ID. In this example, the older version is the valid state and the latest version is the corrupted state.
3. Download the last known good version.
Download the valid state file from blob versioning.
az storage blob download \
--container-name <container_name> \
--account-name <storage_account_name> \
--name terraform.tfstate \
--version-id <version_id> \
--file recovered-state.json \
--auth-mode login
4. Restore the state file.
Upload the recovered state file to replace the corrupted blob.
az storage blob upload \
--container-name <container_name> \
--account-name <storage_account_name> \
--name terraform.tfstate \
--file recovered-state.json \
--overwrite \
--auth-mode login
This restores the working Terraform state.
5. Verify Terraform operation.
terraform plan -var-file=infra.tfvars
If the plan completes without a state file format error, the state recovery was successful.
6. Confirm version history is preserved.
List blob versions again.
az storage blob list \
--container-name <container_name> \
--account-name <storage_account_name> \
--include v \
--auth-mode login \
--query "[?name=='terraform.tfstate'].{Name:name, Version:versionId, Modified:properties.lastModified}" \
--output table
Example:
| Version | Description |
| --- | --- |
| V1 | Original valid state |
| V2 | Corrupted state |
| V3 | Restored valid state |
Blob versioning preserves the entire audit trail of state changes.
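Selecting the last known good version in step 2 can be sketched as a small filter. This is only an illustration under two assumptions: version IDs are timestamps, so sorting them as text orders them by time, and only the newest version is corrupted. The helper name previous_version is hypothetical. Always inspect the downloaded file before you restore it.

```shell
# Read version IDs (one per line) on standard input and print the
# second-newest one. Fails when there is no earlier version to fall back to.
# previous_version is a hypothetical helper name used for illustration.
previous_version() {
  versions=$(sort)
  count=$(printf '%s\n' "$versions" | grep -c . || true)
  if [ "$count" -lt 2 ]; then
    echo "no earlier version available" >&2
    return 1
  fi
  printf '%s\n' "$versions" | tail -n 2 | head -n 1
}

# Usage (not run here, because it needs access to the storage account):
#   az storage blob list --container-name <container_name> \
#     --account-name <storage_account_name> --include v --auth-mode login \
#     --query "[?name=='terraform.tfstate'].versionId" --output tsv \
#     | previous_version
```

If more than one recent version is corrupted, the second-newest version is not enough, and you must choose an older version ID manually from the list.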
Test validation results
| Test step | Expected outcome |
| --- | --- |
| Corrupt state file | Terraform fails to read state |
| Check blob versions | Previous state version exists |
| Download valid version | Valid JSON state retrieved |
| Restore state file | Blob replaced successfully |
| Run Terraform plan | Terraform fully operational |
Result: State successfully recovered using blob versioning without manual resource import.
Scenario 4 – Deployment succeeded but must be reverted
Terraform apply succeeded, but the deployment must be undone.
Example
| Component | Status |
| --- | --- |
| Code | Contains gpt-5-nano |
| State | Contains gpt-5-nano |
| Azure | gpt-5-nano exists |
Resolution
Revert the Terraform code and reapply.
# If using Git:
git checkout <earlier_branch/earlier_tag>
terraform apply -var-file=infra.tfvars
Terraform detects that the resource exists in state but not in code and deletes it.
Scenario 5 – State lock error
Terraform cannot acquire the state lock because a previous operation was interrupted or is still running.
Users usually encounter this when retrying apply or destroy.
Example error
│ Error: Error acquiring the state lock

│ Error message: state blob is already locked
│ Lock Info:
│ ID: 47befb15-5e0e-908e-d1cd-298e3c723f3d
│ Path: tfstate/infra.tfstate
│ Operation: OperationTypePlan
│ Who: user@machine
│ Version: 1.14.0
│ Created: 2026-04-09 07:02:45 +0000 UTC
Possible causes
Previous Terraform command was interrupted
Terminal or SSH session crashed during operation
Network disconnection during state write
Another user or pipeline running Terraform concurrently
Resolution
1. Confirm that no other user or pipeline is still running Terraform against this state, then break the existing lease.
az storage blob lease break \
--account-name "<storage-account>" \
--container-name "<container>" \
--blob-name "<state-file>.tfstate" \
--auth-mode login
2. Rerun Terraform.
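Before breaking the lease, it helps to check who holds the lock. The following is a minimal sketch that pulls a single field out of the Lock Info text in the error message. The helper name lock_field is hypothetical, and the sketch assumes each field appears on its own line in the form shown in the example error.

```shell
# Extract one field (for example ID or Who) from Terraform's "Lock Info"
# text read on standard input. lock_field is a hypothetical helper name.
lock_field() {
  sed -n "s/^.*[[:space:]]$1:[[:space:]]*//p" | head -n 1
}

# Usage (not run here, because it needs a locked Terraform backend):
#   terraform plan -var-file=infra.tfvars 2>&1 | lock_field Who
```

If the Who value belongs to a colleague or a pipeline, confirm that their run has stopped before you break the lease. Terraform also provides the terraform force-unlock command, which takes the lock ID reported in the same message.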
Scenario 6 – Network or TLS errors
Terraform fails due to network connectivity issues with Azure APIs.
Example errors
TLS handshake timeout
│ Error: creating Private Endpoint: Put "https://management.azure.com/...": 
│ net/http: TLS handshake timeout
Connection reset
│ Error: creating AKS Cluster: HTTP response was nil; connection may have been reset
Context deadline exceeded
│ Error: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Possible causes
Unstable network connection
VPN or proxy interference
Azure API temporary issues
Corporate firewall blocking requests
Request timeout too short for slow operations
Resolution
Use whichever of the following resolutions applies.
Retry. This is the most common fix.
terraform apply -var-file="infra.tfvars"
Increase timeout. The following example uses PowerShell syntax to set the environment variable.
$env:ARM_CLIENT_TIMEOUT_SECONDS = "3600"
terraform apply -var-file="infra.tfvars"
Check network.
Disable VPN temporarily
Check proxy settings
Verify firewall rules allow Azure management endpoints
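Because retrying is the most common fix, the retry can be wrapped in a small helper. This is a sketch, not part of the product: the function name retry, the attempt count, and the delay are assumptions. It retries on any non-zero exit status, so use it only when the failure is known to be transient.

```shell
# Run a command up to max_attempts times, waiting delay_seconds between tries.
# retry is a hypothetical helper; it retries on any non-zero exit status.
retry() {
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
}

# Usage (not run here, because it needs a configured Terraform backend):
#   retry 3 30 terraform apply -var-file="infra.tfvars"
```

A fixed delay keeps the sketch short. If an earlier attempt was interrupted while it held the state lock, resolve the lock first as described in Scenario 5.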
Scenario 7 – Resource already exists
Terraform attempts to create a resource that already exists in Azure but is not tracked in state.
Example error
│ Error: a resource with the ID "/subscriptions/xxx/resourceGroups/my-rg/providers/
│ Microsoft.ContainerService/managedClusters/my-cluster" already exists - to be
│ managed via Terraform this resource needs to be imported into the State.

│ with module.aks.azurerm_kubernetes_cluster.aks,
│ on ../../modules/aks/main.tf line 13, in resource "azurerm_kubernetes_cluster" "aks":
│ 13: resource "azurerm_kubernetes_cluster" "aks" {
Possible causes
A previous Terraform apply failed midway: the resource was created, but the state was not saved
Resource created manually in Azure Portal
State file was lost, corrupted, or restored to older version
Resource imported in different workspace
Resolution
1. Import the existing resource into state.
terraform import -var-file="infra.tfvars" \
"module.aks.azurerm_kubernetes_cluster.aks" \
"/subscriptions/xxx/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-cluster"
2. Continue with apply.
terraform apply -var-file="infra.tfvars"
Scenario 8 – Subscription quota exceeded
This occurs when Azure rejects resource creation because subscription quota is exceeded.
Example error
│ Error: creating Deployment: unexpected status 400 (400 Bad Request) with error: 
│ InsufficientQuota: This operation require 10000 new capacity in quota
│ One Thousand Tokens Per Minute - gpt-5-mini - DataZoneStandard, which is bigger
│ than the current available capacity 3650. The current quota usage is 350 and
│ the quota limit is 4000.
Possible causes
Requested capacity exceeds subscription quota limit
Other deployments consuming available quota
Region-specific quota limits
New subscription with default (low) quotas
Resolution
Use whichever of the following resolutions applies.
Reduce capacity in infra.tfvars.
openai_gpt5_mini_capacity = 3000 # Reduce to fit quota
Request quota increase.
a. Go to Azure portal and select Quotas.
b. Search for the resource type. For example, Cognitive Services.
c. Click Request Increase.
d. Submit request and wait for approval.
Use a different region. Some regions have higher default quotas.
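The numbers in the example error show how to choose a capacity that fits: the available capacity is the quota limit minus the current usage. The following sketch reproduces that arithmetic. The function names are hypothetical, and all values are assumed to be in the same capacity units that the error message reports.

```shell
# Remaining quota = quota limit - current usage.
# Both helper names are hypothetical and used for illustration only.
available_capacity() {
  echo $(( $1 - $2 ))
}

# Succeed when the requested capacity fits into the remaining quota.
# Arguments: requested capacity, quota limit, current usage.
fits_quota() {
  requested=$1
  [ "$requested" -le "$(available_capacity "$2" "$3")" ]
}

available_capacity 4000 350   # prints 3650, as in the example error
```

With these values, a requested capacity of 10000 does not fit, while the reduced value of 3000 from the infra.tfvars example does.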
Scenario 9 – Private endpoint or DNS errors
Applications receive HTTP 403 errors when connecting to Azure OpenAI via a private endpoint.
Example error from application logs
{"error": {"code": "403", "message": "Traffic is not from an approved private endpoint."}}
Possible causes
DNS not fully propagated after fresh deployment.
PTU deployment took longer than expected.
Using Global SKU (GlobalProvisionedManaged) with private endpoints.
Private endpoint connection not approved.
VNet DNS link not completed.
Resolution
1. Verify private endpoint status.
az network private-endpoint show \
--resource-group "<rg>" \
--name "<pe-name>" \
--query "privateLinkServiceConnections[0].privateLinkServiceConnectionState.status"
Expected result: Approved.
2. Verify DNS resolution.
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
nslookup <account-name>.openai.azure.com
Expected result: The name resolves to a private IP address, for example 10.x.x.x.
3. Wait for DNS propagation.
For fresh PTU deployments, wait 10-15 minutes after terraform apply completes.
4. Use DataZone SKUs.
If using Global SKUs, switch to DataZone variants.
openai_gpt5_mini_sku_name = "DataZoneProvisionedManaged" # Instead of GlobalProvisionedManaged
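The DNS check in step 2 passes only when the name resolves to a private address. The following is a minimal sketch of that test. It assumes that any RFC 1918 range counts as private (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), because the address range of your virtual network may differ from the 10.x.x.x example. The helper name is_private_ipv4 is hypothetical, and the function does not validate that its argument is a well-formed address.

```shell
# Succeed when the argument starts like an RFC 1918 private IPv4 address.
# is_private_ipv4 is a hypothetical helper name used for illustration.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: pass the address that nslookup reports for the account name.
#   is_private_ipv4 "<resolved-ip>" && echo "private endpoint DNS is active"
```

A public address in the nslookup output means that traffic bypasses the private endpoint, which matches the 403 error shown above.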
Scenario 10 – Spillover deployment deletion blocked
Terraform cannot delete or modify the spillover deployment because the PTU deployment references it.
Example error
│ Error: deleting Deployment: unexpected status 409 (Conflict) with error:
│ DeploymentInUse: The deployment 'gpt-5-mini-2025-08-07-spillover' cannot be
│ deleted because it is referenced by deployment 'gpt-5-mini-2025-08-07' as
│ spillover deployment.
Possible causes
Attempting to change the spillover SKU in a single terraform apply.
PTU deployment still references the spillover via spilloverDeploymentName.
Resolution
1. Destroy the PTU deployment first. This removes the spillover reference.
terraform destroy \
-target="module.cognitive_deployment-gpt5-mini.azapi_resource.deployment_with_spillover[0]" \
-var-file="infra.tfvars"
2. Apply to recreate with new configuration.
terraform apply -var-file="infra.tfvars"
Scenario 11 – Authentication errors
These occur when Terraform cannot authenticate with Azure.
Example errors
Not logged in
│ Error: building AzureRM Client: obtain subscription() from Azure CLI: 
│ parsing json result from the Azure CLI: waiting for the Azure CLI:
│ exit status 1: ERROR: Please run 'az login' to setup account.
Token expired
│ Error: obtaining Authorization Token: AADSTS700024: Client assertion is not 
│ within its valid time range.
Wrong subscription
│ Error: Subscription not found: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
Possible causes
Not logged in to Azure CLI
Azure CLI token expired
Wrong subscription selected
Service principal credentials expired
Resolution
1. Log in to Azure.
az login
2. Set correct subscription.
az account set --subscription "<subscription-id>"
3. Verify the subscription.
az account show
Scenario 12 – Resource not found during destroy
Terraform attempts to destroy a resource that no longer exists in Azure.
Example error
│ Error: deleting Resource Group: the Resource Group was not found

│ Resource Group Name: "my-rg"
Possible causes
Resource was manually deleted in Azure portal
Resource deleted by another process or pipeline
Resource name changed outside Terraform
Resolution
1. Remove the missing resource from state.
terraform state rm "azurerm_resource_group.rg"
2. Continue with destroy.
terraform destroy -var-file="infra.tfvars"