troubleshooting-astro-deployments
This skill provides Astro CLI commands for diagnosing production Astronomer deployment issues, including viewing component-specific logs from schedulers, workers, webservers, and triggerers; filtering logs by error severity; and searching logs for specific keywords to identify failures. Use it when investigating deployment health, analyzing task execution problems, examining DAG processing errors, or troubleshooting connectivity and import issues in production environments.
git clone --depth 1 https://github.com/astronomer/agents /tmp/troubleshooting-astro-deployments && cp -r /tmp/troubleshooting-astro-deployments/skills/troubleshooting-astro-deployments ~/.claude/skills/troubleshooting-astro-deploymentsskill.md
# Astro Deployment Troubleshooting This skill helps you diagnose and troubleshoot production Astronomer deployments using the Astro CLI. > **For deployment management**, see the **managing-astro-deployments** skill. > **For local development**, see the **managing-astro-local-env** skill. --- ## Quick Health Check Start with these commands to get an overview: ```bash # 1. List deployments to find target astro deployment list # 2. Get deployment overview astro deployment inspect <DEPLOYMENT_ID> # 3. Check for errors astro deployment logs <DEPLOYMENT_ID> --error -c 50 ``` --- ## Viewing Deployment Logs Use `-c` to control log count (default: 500). Log flags cannot be combined — use one component or level flag per command. ### Component-Specific Logs View logs from specific Airflow components: ```bash # Scheduler logs (DAG processing, task scheduling) astro deployment logs <DEPLOYMENT_ID> --scheduler -c 50 # Worker logs (task execution) astro deployment logs <DEPLOYMENT_ID> --workers -c 30 # Webserver logs (UI access, health checks) astro deployment logs <DEPLOYMENT_ID> --webserver -c 30 # Triggerer logs (deferrable operators) astro deployment logs <DEPLOYMENT_ID> --triggerer -c 30 ``` ### Log Level Filtering Filter by severity: ```bash # Error logs only (most useful for troubleshooting) astro deployment logs <DEPLOYMENT_ID> --error -c 30 # Warning logs astro deployment logs <DEPLOYMENT_ID> --warn -c 50 # Info-level logs astro deployment logs <DEPLOYMENT_ID> --info -c 50 ``` ### Search Logs Search for specific keywords: ```bash # Search for specific error astro deployment logs <DEPLOYMENT_ID> --keyword "ConnectionError" # Search for specific DAG astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100 # Find import errors astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError" # Find task failures astro deployment logs <DEPLOYMENT_ID> --error --keyword "Task failed" ``` --- ## Complete Investigation Workflow ### Step 1: Identify the Problem ```bash # List deployments with status astro deployment list # Get deployment details astro deployment inspect <DEPLOYMENT_ID> ``` Look for: - Status: HEALTHY vs UNHEALTHY - Runtime version compatibility - Resource limits (CPU, memory) - Recent deployment timestamp ### Step 2: Check Error Logs ```bash # Start with errors astro deployment logs <DEPLOYMENT_ID> --error -c 50 ``` Look for: - Recurring error patterns - Specific DAGs failing repeatedly - Import errors or syntax errors - Connection or credential errors ### Step 3: Review Scheduler Logs ```bash # Check DAG processing astro deployment logs <DEPLOYMENT_ID> --scheduler -c 30 ``` Look for: - DAG parse errors - Scheduling delays - Task queueing issues ### Step 4: Check Worker Logs ```bash # Check task execution astro deployment logs <DEPLOYMENT_ID> --workers -c 30 ``` Look for: - Task execution failures - Resource exhaustion - Timeout errors ### Step 5: Verify Configuration ```bash # Check environment variables astro deployment variable list --deployment-id <DEPLOYMENT_ID> # Verify deployment settings astro deployment inspect <DEPLOYMENT_ID> ``` Look for: - Missing or incorrect environment variables - Secrets configuration (AIRFLOW__SECRETS__BACKEND) - Connection configuration --- ## Common Investigation Patterns ### Recurring DAG Failures Follow the complete investigation workflow above, then narrow to the specific DAG: ```bash astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100 ``` ### Resource Issues ```bash # 1. Check deployment resource allocation astro deployment inspect <DEPLOYMENT_ID> # Look for: resource_quota_cpu, resource_quota_memory # Worker queue: max_worker_count, worker_type # 2. Check for worker scaling issues astro deployment logs <DEPLOYMENT_ID> --workers -c 50 # 3. Look for out-of-memory errors astro deployment logs <DEPLOYMENT_ID> --error --keyword "memory" ``` ### Configuration Problems ```bash # 1. Review environment variables astro deployment variable list --deployment-id <DEPLOYMENT_ID> # 2. Check for secrets backend configuration # Look for: AIRFLOW__SECRETS__BACKEND, AIRFLOW__SECRETS__BACKEND_KWARGS # 3. Verify deployment settings astro deployment inspect <DEPLOYMENT_ID> # 4. Check webserver logs for auth issues astro deployment logs <DEPLOYMENT_ID> --webserver -c 30 ``` ### Import Errors ```bash # 1. Find import errors astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError" # 2. Check scheduler for parse failures astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "Failed to import" -c 50 # 3. Verify dependencies were deployed astro deployment inspect <DEPLOYMENT_ID> # Check: current_tag, last deployment timestamp ``` --- ## Environment Variables Management ### List Variables ```bash # List all variables for deployment astro deployment variable list --deployment-id <DEPLOYMENT_ID> # Find specific variable astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key AWS_REGION # Export variables to file astro deployment variable list --deployment-id <DEPLOYMENT_ID> --save --env .env.backup ``` ### Create Variables ```bash # Create regular variable astro deployment variable create --deployment-id <DEPLOYMENT_ID> \ --key API_ENDPOINT \ --value https://api.example.com # Create secret (masked in UI and logs) astro deployment variable create --deployment-id <DEPLOYMENT_ID> \ --key API_KEY \ --value secret123 \ --secret ``` ### Update Variables ```bash # Update existing variable astro deployment variable update --deployment-id <DEPLOYMENT_ID> \ --key API_KEY \ --value newsecret ``` ### Delete Variables ```bash # Delete variable astro deployment variable delete --deployment-id <DEPLOYMENT_ID> --key OLD_KEY ``` **Note**: Variables are available to DAGs as environment variables. Changes require no redeployment. --- ## Key Metrics from `deployment inspect` Focus on these fields when troubleshooting: - **status**: HEA
Add a new method to both Airflow adapters
Add a new MCP tool to server.py
Verify code works with both Airflow 2.x and 3.x
Airflow adapter pattern for v2/v3 API compatibility. Use when working with adapters, version detection, or adding new API methods that need to work across Airflow 2.x and 3.x.
Use when the user needs human-in-the-loop workflows in Airflow (approval/reject, form input, or human-driven branching). Covers ApprovalOperator, HITLOperator, HITLBranchOperator, HITLEntryOperator, HITLTrigger. Requires Airflow 3.1+. Does not cover AI/LLM calls (see airflow-ai).
Build Airflow 3.1+ plugins that embed FastAPI apps, custom UI pages, React components, middleware, macros, and operator links directly into the Airflow UI. Use this skill whenever the user wants to create an Airflow plugin, add a custom UI page or nav entry to Airflow, build FastAPI-backed endpoints inside Airflow, serve static assets from a plugin, embed a React app in the Airflow UI, add middleware to the Airflow API server, create custom operator extra links, or call the Airflow REST API from inside a plugin. Also trigger when the user mentions AirflowPlugin, fastapi_apps, external_views, react_apps, plugin registration, or embedding a web app in Airflow 3.1+. If someone is building anything custom inside Airflow 3.1+ that involves Python and a browser-facing interface, this skill almost certainly applies.
Queries, manages, and troubleshoots Apache Airflow using the af CLI. Covers listing DAGs, triggering runs, reading task logs, diagnosing failures, debugging DAG import errors, checking connections, variables, pools, and monitoring health. Also routes to sub-skills for writing DAGs, debugging, deploying, and migrating Airflow 2 to 3. Use when user mentions "Airflow", "DAG", "DAG run", "task log", "import error", "parse error", "broken DAG", or asks to "trigger a pipeline", "debug import errors", "check Airflow health", "list connections", "retry a run", or any Airflow operation. Do NOT use for warehouse/SQL analytics on Airflow metadata tables — use analyzing-data instead.
Queries data warehouse and answers business questions about data. Handles questions requiring database/warehouse queries including "who uses X", "how many Y", "show me Z", "find customers", "what is the count", data lookups, metrics, trends, or SQL analysis.