Module 11, Lesson 4

Lesson 4. Fault Tolerance: What to Do When Something Breaks

Hands-on: n8n

Goal: learn to build agents that keep working even when things fail.

What Is Fault Tolerance

Fault tolerance is the ability of a system to keep working when one of its components fails.

Examples:

  • OpenAI API unavailable → agent switches to a backup model (Claude, Gemini)
  • Google Sheets unavailable → agent saves data to a local file and retries later
  • Webhook failed → agent retries in 5 seconds

Common Failures and How to Handle Them

1. API returned an error (500, 503, or 429 rate limit)

What to do:

  • Retry: try again after a few seconds
  • Exponential Backoff: increase delay between attempts (1s, 2s, 4s, 8s)
  • Fallback: switch to another API

Example (n8n):

  1. Add an HTTP Request node (API call)
  2. In node settings enable Retry On Fail:
    • Max Tries: 3
    • Wait Between Tries: 2000ms
  3. If the API call fails, n8n retries it automatically: up to 2 more times (3 attempts in total), waiting 2 seconds between tries
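The Retry On Fail settings above use a fixed wait between tries. Exponential backoff can be sketched outside n8n like this. This is a minimal illustration, not n8n's implementation: the names call_with_retry and backoff_delays and the injectable sleep parameter are assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_tries: int, base_delay: float = 1.0) -> list[float]:
    """Delays before each retry: base, 2*base, 4*base, ... (one fewer than max_tries)."""
    return [base_delay * (2 ** i) for i in range(max_tries - 1)]


def call_with_retry(
    call: Callable[[], T],
    max_tries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests do not really wait
) -> T:
    """Run `call`, retrying on any exception with exponentially growing delays."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    delays = backoff_delays(max_tries, base_delay)
    for attempt in range(max_tries):
        try:
            return call()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")
```

With max_tries=5 and base_delay=1.0 the waits are 1s, 2s, 4s, 8s, matching the sequence listed above. In a real agent you would usually retry only on errors that are likely to be temporary (for example 500, 503, 429) rather than on every exception.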

2. External service unavailable (Google Sheets, Airtable)

What to do:

  • Queue: save the request to a queue and process later
  • Fallback: use backup storage (e.g., save to a local file)

Example:

If Google Sheets is unavailable → agent saves data to a JSON file → later (when Google Sheets is back) the agent reads the JSON and uploads data to the table.
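The save-then-replay flow above might look roughly like this. It is a sketch under assumptions: upload stands in for whatever function writes a row to Google Sheets, the function names are hypothetical, and the queue is stored as JSON Lines (one JSON object per line) so rows can be appended without rewriting the file.

```python
import json
from pathlib import Path
from typing import Callable


def save_or_queue(row: dict, upload: Callable[[dict], None], queue_file: Path) -> bool:
    """Try to upload a row; if that fails, append it to a local queue file.

    Returns True if the row was uploaded, False if it was queued.
    """
    try:
        upload(row)
        return True
    except Exception:
        with queue_file.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
        return False


def flush_queue(upload: Callable[[dict], None], queue_file: Path) -> int:
    """Re-send queued rows; rows that still fail stay in the file.

    Returns the number of rows uploaded.
    """
    if not queue_file.exists():
        return 0
    lines = queue_file.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    remaining = []
    sent = 0
    for row in rows:
        try:
            upload(row)
            sent += 1
        except Exception:
            remaining.append(row)
    queue_file.write_text(
        "".join(json.dumps(r) + "\n" for r in remaining), encoding="utf-8"
    )
    return sent
```

A scheduled run (for example every few minutes) could call flush_queue so queued rows reach the table once the service is back.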

3. Agent didn't understand the user's request

What to do:

  • Clarification: ask the user to rephrase
  • Fallback: hand off to a human operator

Example:

User: "I want that thing"
Agent: "I didn't understand your request. Could you clarify what you need?"

If the user can't clarify → agent: "I'll connect you with an operator."
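The clarify-then-escalate dialogue above can be expressed as a small decision function. This is an illustrative sketch: next_reply, the intent argument (None when the request was not understood), and the limit of one clarification are all assumptions, not part of any particular framework.

```python
from typing import Optional

MAX_CLARIFICATIONS = 1  # assumption: ask the user to rephrase once, then escalate


def next_reply(intent: Optional[str], clarifications_asked: int) -> tuple[str, int]:
    """Return (reply text, updated clarification counter).

    `intent` is None when the agent could not understand the request.
    """
    if intent is not None:
        # Understood: handle the request and reset the counter.
        return f"Handling request: {intent}", 0
    if clarifications_asked < MAX_CLARIFICATIONS:
        return (
            "I didn't understand your request. Could you clarify what you need?",
            clarifications_asked + 1,
        )
    # Still unclear after asking: hand off to a human.
    return "I'll connect you with an operator.", clarifications_asked
```

The counter would be stored per conversation, so the agent escalates instead of asking the same question in a loop.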

4. Agent exceeded the API rate limit

What to do:

  • Throttling: add a delay between requests
  • Fallback: switch to another API

Example:

If OpenAI returns "Rate limit exceeded" → agent switches to Claude API.
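Throttling and fallback can be combined as in the sketch below. Everything here is hypothetical scaffolding: RateLimitError stands in for whatever error your API client raises on a rate limit, and primary and backup are placeholder functions for the two model APIs.

```python
import time
from typing import Callable, Optional


class RateLimitError(Exception):
    """Hypothetical error type standing in for a provider's rate-limit error."""


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def ask(
    prompt: str,
    primary: Callable[[str], str],
    backup: Callable[[str], str],
    throttle: Throttle,
) -> str:
    """Send the prompt to the primary API; on a rate limit, use the backup."""
    throttle.wait()
    try:
        return primary(prompt)
    except RateLimitError:
        return backup(prompt)
```

Throttling reduces how often the limit is hit in the first place; the fallback covers the cases where it is hit anyway.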

How to Implement Retry in n8n

Step 1. Configure Retry for a node

  1. Open the node (e.g., HTTP Request)
  2. In the node panel, open the Settings tab
  3. Enable Retry On Fail
  4. Configure:
    • Max Tries: 3 (how many attempts)
    • Wait Between Tries: 2000ms (delay between attempts)

Step 2. Configure Error Workflow

  1. Create a new workflow that starts with an Error Trigger node
  2. In the main workflow's settings, select this new workflow as its Error Workflow
  3. Add error handling logic:
    • if error is "Rate limit exceeded" → wait 60 seconds and run the request again
    • if error is "Service unavailable" → switch to backup API
    • if other error → send Telegram notification
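The three routing rules above boil down to classifying the error message. A minimal sketch of that decision, assuming the error arrives as a plain text message; the function name and the action labels are made up for this example:

```python
def choose_action(error_message: str) -> str:
    """Map an error message to one of three handling actions."""
    text = error_message.lower()
    if "rate limit" in text:
        return "wait_and_retry"    # e.g. wait 60 seconds, then run again
    if "service unavailable" in text:
        return "switch_to_backup"  # route the request to the backup API
    return "notify"                # anything else: alert a human


for message in ("Rate limit exceeded", "503 Service Unavailable", "Invalid JSON"):
    print(message, "->", choose_action(message))
```

In n8n the same branching could be built with a Switch or IF node placed after the Error Trigger. Matching on message text is fragile, so where a status code is available it is usually the better field to branch on.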

Graceful Degradation

Graceful degradation means the agent keeps working with reduced functionality instead of failing completely.

Example:

  • main function: generate personalized responses via OpenAI
  • if OpenAI is unavailable → agent responds with template answers from the knowledge base
  • user gets a response (not perfect, but still a response)

Implementation:

  1. Main branch: OpenAI → personalized response
  2. Fallback branch: if OpenAI returns error → search knowledge base (Google Sheets) → send template response
  3. Last fallback: if knowledge base is also unavailable → send: "Sorry, the service is temporarily unavailable. Please try again later."
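The three branches above form a chain: try each source in order and return the first answer that succeeds. A minimal sketch, where respond and the handler functions are hypothetical stand-ins for the OpenAI call and the knowledge-base lookup:

```python
from typing import Callable, Sequence

APOLOGY = "Sorry, the service is temporarily unavailable. Please try again later."


def respond(question: str, handlers: Sequence[Callable[[str], str]]) -> str:
    """Try each handler in order; fall back to a fixed apology if all fail."""
    for handler in handlers:
        try:
            return handler(question)
        except Exception:
            continue  # this source is down: degrade to the next one
    return APOLOGY
```

Usage would look like respond(question, [personalized_answer, template_answer]), with the best-quality source first. In production it is worth logging each failure before moving on, so that degraded operation does not go unnoticed.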