Prompt Flows Getting Stuck in the Running State


When Azure AI Foundry prompt flows intermittently hang in the ‘Running’ state for prolonged periods – or indefinitely – the pattern suggests an underlying systemic issue rather than an isolated code problem. Since these flows consist of just one LLM call and a Python script, and since they previously worked reliably, the degradation points to potential throttling, resource allocation constraints, or backend service instability.

Microsoft’s own troubleshooting documentation identified several culprits:

  1. Stale concurrent runs: Azure AI Foundry enforces concurrency limits, and old pipeline runs from the previous 45 days can consume the available slots, preventing new runs from starting (a cleanup sketch follows this list).
  2. LLM API throttling: when the Azure OpenAI endpoint returns a 429 error, the flow may queue silently rather than failing fast.
  3. Long-running Python script nodes: a node without explicit timeout handling can stall the entire run.
  4. Transient backend instability: Microsoft acknowledged that these hangs can reflect regional service degradation that may not appear in the Azure Service Health dashboard immediately.
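If stale runs are eating your concurrency slots, a periodic sweep can free them. Below is a minimal sketch, assuming the azure-ai-ml SDK; the one-hour threshold and the workspace identifiers are placeholders to adjust for your environment, not values from Microsoft’s guidance.

```python
# Sketch: cancel runs that have been 'Running' longer than a threshold so they
# stop occupying concurrency slots. Workspace identifiers are placeholders.
from datetime import datetime, timedelta, timezone

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

STALE_AFTER = timedelta(hours=1)  # assumption: anything running this long is suspect

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-or-project-name>",
)

now = datetime.now(timezone.utc)
for job in ml_client.jobs.list():
    if job.status != "Running":
        continue
    created = job.creation_context.created_at if job.creation_context else None
    if created is None:
        continue
    if created.tzinfo is None:                 # normalize naive timestamps to UTC
        created = created.replace(tzinfo=timezone.utc)
    if now - created > STALE_AFTER:
        print(f"Cancelling stale run {job.name} (started {created.isoformat()})")
        ml_client.jobs.begin_cancel(job.name).result()  # block until cancellation finishes
```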

One particularly sharp finding: Microsoft’s own guidance said that if a pipeline remained stuck for more than an hour, the right move was to open a support ticket. For teams with production SLAs, that recommendation was inadequate.

  1. Use Azure Monitor’s ‘Prompt Flow’ workbook to track run durations and detect abnormal patterns before users notice.
  2. Implement a wrapper that calls your flow with a hard timeout and retries on failure – don’t rely solely on the platform’s timeout enforcement (a sketch follows this list).
  3. Subscribe to Azure Service Health alerts for Azure Machine Learning and Azure OpenAI in your deployment regions.
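To make point 2 concrete, here is a minimal sketch of such a wrapper, assuming the flow is exposed as a REST scoring endpoint; the URL, key, payload shape, timeout, and retry policy are all placeholder assumptions to adapt to your deployment.

```python
# Sketch: client-side hard timeout plus retries around a deployed flow endpoint.
import time

import requests

ENDPOINT_URL = "https://<your-endpoint>.inference.ml.azure.com/score"  # placeholder
API_KEY = "<endpoint-key>"                                              # placeholder
HARD_TIMEOUT_S = 60       # fail fast instead of waiting on the platform
MAX_ATTEMPTS = 3
RETRYABLE = {429, 500, 502, 503, 504}


def call_flow(payload: dict) -> dict:
    """Call the flow endpoint with a hard timeout; retry on timeouts and throttling."""
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = requests.post(ENDPOINT_URL, json=payload, headers=headers,
                                 timeout=HARD_TIMEOUT_S)
        except (requests.Timeout, requests.ConnectionError) as exc:
            last_error = exc                       # hang or network blip: retry
        else:
            if resp.status_code not in RETRYABLE:
                resp.raise_for_status()            # non-retryable 4xx fails immediately
                return resp.json()                 # success path
            last_error = requests.HTTPError(f"HTTP {resp.status_code}", response=resp)
        time.sleep(2 ** attempt)                   # exponential backoff: 2s, 4s, 8s
    raise RuntimeError(f"flow call failed after {MAX_ATTEMPTS} attempts") from last_error
```

The useful property is that a hang surfaces on the client as a timeout within a bounded window, rather than as an open-ended ‘Running’ state you only notice later.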

The fundamental problem wasn’t the hangs themselves – it was the invisibility. A timeout that surfaces as an error code can be caught, retried, and logged. A hang that shows ‘Running’ forever burns compute quota, blocks monitoring, and leaves on-call engineers debugging what is essentially an absence of information.

Teams running multi-step flows with human users on the other end had no way to distinguish a slow flow from a broken one. As a workaround, several reported calling the Azure OpenAI API directly instead of going through prompt flows, accepting the loss of the orchestration layer to regain reliability.
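For reference, that direct-call fallback looks roughly like the sketch below using the openai SDK; the endpoint, key, API version, and deployment name are placeholders, and whatever logic lived in the flow’s Python node has to move into your own code.

```python
# Sketch: bypassing prompt flow and calling Azure OpenAI directly.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                        # placeholder
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="<deployment-name>",       # your Azure OpenAI deployment
    messages=[{"role": "user", "content": "Summarize this ticket ..."}],
    timeout=30,                      # per-request timeout in seconds
)
print(response.choices[0].message.content)
```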

  • Implement client-side timeouts independent of the platform’s timeout mechanism.
  • Set up Azure Monitor alerts for runs that exceed 90 seconds – treat these as incidents even before the run formally fails.
  • For flows with multiple nodes, log the start and completion of each node independently so you can identify exactly where a hang occurs (see the sketch after this list).
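A minimal sketch of that per-node logging, assuming your Python tool functions can be wrapped in a decorator; the node name and function below are illustrative, not part of the platform.

```python
# Sketch: log start, completion, and duration of each node so hangs are attributable.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("flow.nodes")


def traced_node(name: str):
    """Wrap a node function with start/completion/duration logging."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.info("node=%s status=started", name)
            start = time.monotonic()
            try:
                result = func(*args, **kwargs)
                log.info("node=%s status=completed duration_s=%.2f",
                         name, time.monotonic() - start)
                return result
            except Exception:
                log.exception("node=%s status=failed duration_s=%.2f",
                              name, time.monotonic() - start)
                raise
        return wrapper
    return decorator


@traced_node("postprocess")          # hypothetical node name
def postprocess(llm_output: str) -> str:
    # ... your existing Python node logic ...
    return llm_output.strip()
```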
