Reference Monitoring Configuration
Terraform for customer-hosted deployment (CHD) deploys observability components for Azure resources. You can use these components to create dashboards and alerts that monitor infrastructure health.
You can define tenant-specific thresholds, routing rules, and escalation policies. To implement production-grade observability, configure Azure Monitor dashboards and alerts to track platform health, workload reliability, latency, error rates, and cost governance.
This topic provides a baseline catalog and reference setup. Your Site Reliability Engineering teams must implement, tune, and operate these controls.
Dashboards
Build the following dashboards in Azure Monitor or Azure Workbooks, and set charts and thresholds to match your environment and workloads.
Platform health
Use this dashboard to check overall platform health.
Monitor the following:
Resource Health status for AKS and Cognitive Services or OpenAI resources
Azure Service Health incidents in your region
Activity Log events from the last 24 hours
Count of active and open Azure alerts by severity
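As a starting point for the health-event tiles, the following query is a minimal sketch that counts Service Health and Resource Health events per hour over the last 24 hours. It assumes that the subscription Activity Log is exported to the Log Analytics workspace you query. The one-hour bin size is an illustrative choice, not a recommendation.

```kusto
// Sketch only: assumes the Activity Log is routed to this workspace.
// Counts Service Health and Resource Health events per hour for the last 24 hours.
AzureActivity
| where TimeGenerated > ago(24h)
| where CategoryValue in ("ServiceHealth", "ResourceHealth")
| summarize Events = count() by CategoryValue, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```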
AKS operations
Use this dashboard to monitor AKS cluster capacity and workload stability.
Monitor the following:
Node CPU and memory utilization
Pod restarts and workload distribution
CrashLoopBackOff and failed pod trends
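The CrashLoopBackOff and failed pod trend can be sketched with a query such as the following. It assumes that Container insights is enabled and populates the KubePodInventory table. The six-hour lookback and 15-minute bins are placeholder values to adjust for your clusters.

```kusto
// Sketch only: assumes Container insights populates KubePodInventory.
// Counts distinct pods that report CrashLoopBackOff or a Failed status, per namespace.
KubePodInventory
| where TimeGenerated > ago(6h)
| where ContainerStatusReason == "CrashLoopBackOff" or PodStatus == "Failed"
| summarize AffectedPods = dcount(Name) by Namespace, bin(TimeGenerated, 15m)
| order by TimeGenerated desc
```

Charting AffectedPods over time by Namespace makes it easier to see whether failures are isolated to one workload or spread across the cluster.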
OpenAI and Cognitive Services operations
Use this dashboard to monitor OpenAI and Cognitive Services API reliability and performance.
Monitor the following:
Request volume, success rate, and 4xx and 5xx response distribution
Throttling trends and latency percentiles (p50, p95, p99)
Error spikes by operation or endpoint
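One way to chart these signals is to query request logs. The following sketch assumes that a diagnostic setting sends the RequestResponse category to a Log Analytics workspace in Azure diagnostics mode, so that records land in the AzureDiagnostics table with the ResultSignature and DurationMs columns. If your resources write to resource-specific tables, the table and column names differ.

```kusto
// Sketch only: assumes RequestResponse logs in the AzureDiagnostics table.
// Groups calls by status class and reports throttling (429) and latency percentiles.
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| extend Status = tostring(ResultSignature)
| extend StatusClass = case(
    Status startswith "2", "2xx",
    Status startswith "4", "4xx",
    Status startswith "5", "5xx",
    "other")
| summarize
    Requests = count(),
    Throttled = countif(Status == "429"),
    percentiles(DurationMs, 50, 95, 99)
    by OperationName, StatusClass, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
```

Grouping by OperationName keeps error spikes attributable to a specific operation rather than averaging them away across the resource.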
Cost monitoring
Use the built-in cost analysis views in Microsoft Cost Management, or custom Azure dashboards, to track spend.
Alerts
Assign Azure Action Groups to Azure health alerts. Start with the following alerts and tune thresholds for your environment.
Severity 0 (paging)
Use this severity for platform-down issues that affect most users and require immediate attention.
Configure alerts for the following conditions:
OpenAI or Cognitive Services 5xx error rates that stay elevated for 5–10 minutes
Sustained OpenAI or Cognitive Services throttling that affects users
Significant drops in diagnostic data ingestion
AKS nodes that remain unavailable for 10 minutes during scaling operations
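The sustained 5xx condition could be expressed as a log search alert along the following lines. This is a sketch that assumes platform metrics for the resource are exported to Log Analytics (the AzureMetrics table) and that the TotalCalls and ServerErrors metrics are emitted for your resource type. The 5 percent threshold is a hypothetical starting value.

```kusto
// Sketch only: assumes platform metrics are exported to AzureMetrics.
// Returns resources whose 5xx rate exceeded the threshold in every 5-minute window.
let threshold_pct = 5.0;  // hypothetical starting threshold
AzureMetrics
| where TimeGenerated > ago(10m)
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where MetricName in ("TotalCalls", "ServerErrors")
| summarize
    TotalCalls = sumif(Total, MetricName == "TotalCalls"),
    ServerErrors = sumif(Total, MetricName == "ServerErrors")
    by Resource, bin(TimeGenerated, 5m)
| where TotalCalls > 0
| extend ServerErrorPct = 100.0 * ServerErrors / TotalCalls
| summarize Windows = count(), BreachedWindows = countif(ServerErrorPct > threshold_pct) by Resource
| where Windows >= 2 and BreachedWindows == Windows
```

Requiring every window to breach the threshold, rather than the average over the whole period, helps avoid paging on a single short burst of errors.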
Severity 1 (urgent)
Use this severity for high-risk conditions that can lead to blocking incidents if not addressed.
Configure alerts for the following conditions:
Pod restart spikes in production environments
User node pool CPU or memory saturation lasting 15 minutes
Private endpoint or DNS resolution failures for critical services
AKS control plane or cluster operation failures, such as upgrade or scaling failures
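For the CPU saturation condition, a query like the following is one possible starting point. It assumes that Container insights writes node counters to the Perf table, and the 90 percent threshold is hypothetical. The query does not filter by node pool; restricting it to user node pools depends on your node naming or labels and is left out here.

```kusto
// Sketch only: assumes Container insights writes K8SNode counters to Perf.
// Returns nodes whose CPU stayed above the threshold in every 5-minute window.
let cpu_threshold_pct = 90.0;  // hypothetical starting threshold
let capacity = Perf
    | where TimeGenerated > ago(1h)
    | where ObjectName == "K8SNode" and CounterName == "cpuCapacityNanoCores"
    | summarize CapacityNanoCores = max(CounterValue) by Computer;
Perf
| where TimeGenerated > ago(15m)
| where ObjectName == "K8SNode" and CounterName == "cpuUsageNanoCores"
| summarize UsageNanoCores = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| join kind=inner (capacity) on Computer
| extend CpuPct = 100.0 * UsageNanoCores / CapacityNanoCores
| summarize MinCpuPct = min(CpuPct), Windows = count() by Computer
| where Windows >= 3 and MinCpuPct > cpu_threshold_pct
```

A memory variant can follow the same shape with the memoryWorkingSetBytes and memoryCapacityBytes counters.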
Severity 2 (ticket)
Use this severity for trends that require investigation but do not require immediate paging.
Configure alerts for the following conditions:
Increasing 4xx response trends on OpenAI or Cognitive Services calls
Latency p95 or p99 regressions on OpenAI or Cognitive Services calls
Cost or ingestion anomalies, such as sudden increases or drops in observability cost or data volume
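Ingestion anomalies can be approximated by comparing daily billable volume per data type with the previous day. The following sketch uses the workspace Usage table. The 50 percent change threshold and 14-day lookback are hypothetical values, and the current partial day is excluded so that it does not appear as a drop.

```kusto
// Sketch only: flags day-over-day changes in billable ingestion per data type.
// Quantity is reported in MB, so dividing by 1000 gives an approximate GB value.
Usage
| where TimeGenerated >= startofday(ago(14d)) and TimeGenerated < startofday(now())
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1000.0 by DataType, bin(TimeGenerated, 1d)
| order by DataType asc, TimeGenerated asc
| extend PreviousGB = prev(IngestedGB), PreviousType = prev(DataType)
| where PreviousType == DataType and PreviousGB > 0
| extend ChangePct = round(100.0 * (IngestedGB - PreviousGB) / PreviousGB, 1)
| where abs(ChangePct) > 50
```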
Reference KQL queries
This section provides reference KQL queries that help identify common issues in Azure environments. For more information about KQL, see the Kusto Query Language (KQL) overview on Microsoft Learn.
AKS restart hot spots
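// Containers with the highest restart counts reported in the last 30 minutes.
// ContainerRestartCount is cumulative, so max() returns the latest total, not the change.
// KubePodInventory stores the pod name in the Name column.
KubePodInventory
| where TimeGenerated > ago(30m)
| summarize Restarts = max(ContainerRestartCount) by Namespace, PodName = Name, ContainerName
| where Restarts > 3
| order by Restarts desc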
Container error trends
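// Counts log lines that contain common failure terms, per container, in 5-minute bins.
// has_any matches whole terms, so "fail" does not match "failed"; add variants as needed.
ContainerLog
| where TimeGenerated > ago(30m)
| where LogEntry has_any ("error","exception","fail","timeout","throttle")
| summarize Count = count() by bin(TimeGenerated, 5m), Name
| order by TimeGenerated desc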
Change events near incidents
AzureActivity
| where TimeGenerated > ago(2h)
| summarize Changes = count() by OperationNameValue, Caller, ResourceGroup, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
Operational tips
Start with conservative thresholds for one to two weeks, and then tune them based on observed noise and incident patterns.
Use paging alerts (Severity 0) only for customer-impacting or platform-critical failures.
Review noisy alerts regularly and refine or remove them.
Use separate dashboards and Action Groups for each environment (development, test, and production).
Reference documents
Use the following Microsoft Learn documentation as setup guidance for AKS, Azure Monitor, Service Health, Cognitive Services or OpenAI metrics, and cost management.
Core Azure Monitor and alerting
Azure Monitor overview: Azure Monitor overview - Azure Monitor
Azure Monitor Workbooks: Azure Workbooks overview - Azure Monitor
AKS operational dashboards and alerts
Kubernetes monitoring in Azure Monitor: Kubernetes monitoring in Azure Monitor - Azure Monitor
Service and resource health alerts
Error rate, latency, and application telemetry
Cognitive Services and OpenAI health and usage
Observability in generative AI: Observability in Generative AI - Microsoft Foundry
Cognitive Services and OpenAI supported metrics: Supported metrics - Microsoft.CognitiveServices/accounts - Azure Monitor
Cost tracking and budget alerts
Cost Management and Billing overview: Overview of Billing - Microsoft Cost Management