Reference Monitoring Configuration

Welcome to Codebeamer AI Help Center > Setting Up the Codebeamer AI > Customer-hosted Codebeamer AI > Recommendations and Best Practices > Reference Monitoring Configuration

Terraform for customer-hosted deployment ( customer hosted deployment) deploys observability components for Azure resources. You can use these components to create dashboards and alerts that monitor infrastructure health.

You can define tenant-specific thresholds, routing rules, and escalation policies. To implement production-grade observability, configure Azure Monitor dashboards and alerts to track platform health, workload reliability, latency, error rates, and cost governance.

This topic provides a baseline catalog and reference setup. Your Site Reliability Engineering teams must implement, tune, and operate these controls.

Dashboards

Configure the following dashboards in Azure Monitor or Azure Workbooks. Configure charts and thresholds based on your environment and workloads.

Platform health

Use this dashboard to check overall platform health.

Monitor the following:

◦ Resource Health status for AKS and Cognitive Services or OpenAI resources

◦ Azure Service Health incidents in your region

◦ Activity Log events from the last 24 hours

◦ Count of active and open Azure alerts by severity

AKS operations

Use this dashboard to monitor AKS cluster capacity and workload stability.

Monitor the following:

◦ Node CPU and memory utilization

◦ Pod restarts and workload distribution

◦ CrashLoopBackOff and failed pod trends

OpenAI and Cognitive Services operations

Use this dashboard to monitor OpenAI and Cognitive Services API reliability and performance.

Monitor the following:

◦ Request volume, success rate, and 4xx and 5xx response distribution

◦ Throttling trends and latency percentiles (p50, p95, p99)

◦ Error spikes by operation or endpoint

Cost monitoring

Use built-in or custom Azure dashboards for cost analysis.

Alerts

Assign Azure Action Groups to Azure health alerts. Start with the following alerts and tune thresholds for your environment.

Severity 0 (paging)

Use this severity for platform-down issues that affect most users and require immediate attention.

Configure alerts for the following conditions:

◦ Sustained OpenAI or Cognitive Services 5xx error rates for 5–10 minutes

◦ Sustained OpenAI or Cognitive Services throttling with user impact

◦ Significant diagnostic ingestion drops

◦ AKS node unavailability for 10 minutes during scaling operations

Severity 1 (urgent)

Use this severity for high-risk conditions that can lead to blocking incidents if not addressed.

Configure alerts for the following conditions:

◦ Pod restart spikes in production environments

◦ User node pool CPU or memory saturation lasting 15 minutes

◦ Private endpoint or DNS resolution failures for critical services

◦ AKS control plane or cluster operation failures, such as upgrade or scaling failures

Severity 2 (ticket)

Use this severity for trends that require investigation but do not require immediate paging.

Configure alerts for the following conditions:

◦ Increasing 4xx response trends on OpenAI or Cognitive Services calls

◦ Latency p95 or p99 regressions on OpenAI or Cognitive Services calls

◦ Cost or ingestion anomalies, such as sudden increases or drops in observability cost or data volume

Reference KQL queries

This section provides reference KQL queries that help identify common issues in Azure environments. For more information on KQL queries, refer to the Microsoft documentation: Kusto Query Language (KQL) overview - Kusto | Microsoft Learn.

AKS restart hot spots

KubePodInventory
| where TimeGenerated > ago(30m)
| summarize Restarts = max(ContainerRestartCount) by Namespace, PodName, ContainerName
| where Restarts > 3
| order by Restarts desc

Container error trends

ContainerLog
| where TimeGenerated > ago(30m)
| where LogEntry has_any ("error","exception","fail","timeout","throttle")
| summarize Count = count() by bin(TimeGenerated, 5m), Name
| order by TimeGenerated desc

Change events near incidents

AzureActivity
| where TimeGenerated > ago(2h)
| summarize Changes = count() by OperationNameValue, Caller, ResourceGroup, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

Operational tips

• Start with conservative thresholds for one to two weeks, and then tune them based on observed noise and incident patterns.

• Use paging alerts (Severity 0) only for customer-impacting or platform-critical failures.

• Review noisy alerts regularly and refine or remove them.

• Use separate dashboards and Action Groups for each environment (development, test, and production).

Reference documents

Use the following Microsoft Learn documentation as setup guidance for AKS, Azure Monitor, Service Health, Cognitive Services or OpenAI metrics, and cost management.

Core Azure Monitor and alerting

◦ Azure Monitor overview: Azure Monitor overview - Azure Monitor

◦ Create metric alerts: Create Azure Monitor metric alert rules - Azure Monitor

◦ Create log alerts: Create Azure Monitor log search alert rules - Azure Monitor

◦ Azure Monitor Workbooks: Azure Workbooks overview - Azure Monitor

AKS operational dashboards and alerts

◦ Kubernetes monitoring in Azure Monitor: Kubernetes monitoring in Azure Monitor - Azure Monitor

◦ Enable monitoring for AKS: Enable Monitoring for Azure Kubernetes Service (AKS) Clusters - Azure Monitor

◦ AKS supported metrics: Supported metrics - Microsoft.ContainerService/managedClusters - Azure Monitor

◦ Prometheus and Grafana integration for AKS: Enable Monitoring for Azure Kubernetes Service (AKS) Clusters - Azure Monitor

Service and resource health alerts

◦ Service Health overview: What is Azure Service Health? - Azure Service Health