Lesson 4. Fault Tolerance: What to Do When Something Breaks
Goal: learn to build agents that keep working even when things fail.
What Is Fault Tolerance
Fault tolerance is the ability of a system to keep working when one of its components fails.
Examples:
- OpenAI API unavailable → agent switches to a backup model (Claude, Gemini)
- Google Sheets unavailable → agent saves data to a local file and retries later
- Webhook failed → agent retries in 5 seconds
Common Failures and How to Handle Them
1. API returned an error (HTTP 500, 503, or a 429 rate-limit response)
What to do:
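- Retry: try the same request again after a few seconds
- Exponential Backoff: double the delay after each failed attempt (1s, 2s, 4s, 8s) so a struggling API is not flooded with retries
- Fallback: switch to another API once the retries are exhausted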
Example (n8n):
- Add an HTTP Request node (API call)
- In the node settings, enable Retry On Fail:
  - Max Tries: 3
  - Wait Between Tries: 2000ms
- With Max Tries set to 3, n8n makes up to 3 attempts in total: the first call plus 2 automatic retries
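Outside n8n, the retry and exponential backoff ideas above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: `TransientError` and `retry_with_backoff` are hypothetical names introduced here, and the sketch assumes your API wrapper raises `TransientError` for retryable responses such as 500, 503, or a rate limit.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (500, 503, rate limit)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_tries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` up to `max_tries` times, doubling the delay after each failure."""
    for attempt in range(max_tries):
        try:
            return call()
        except TransientError:
            if attempt == max_tries - 1:
                raise  # out of attempts: let the caller switch to a fallback
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("max_tries must be at least 1")
```

The `sleep` parameter is injectable only so the function can be tested without real waiting. When the last attempt fails, the error is re-raised so the caller can decide whether to switch to another API.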
2. External service unavailable (Google Sheets, Airtable)
What to do:
- Queue: save the request to a queue and process later
- Fallback: use backup storage (e.g., save to a local file)
Example:
If Google Sheets is unavailable → agent saves data to a JSON file → later (when Google Sheets is back) the agent reads the JSON and uploads data to the table.
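The save-locally-and-upload-later flow might look like the sketch below. It rests on several assumptions: `upload` is a hypothetical function that writes one row to the primary storage and raises `ConnectionError` when the service is down, and the local file uses one JSON object per line (JSON Lines) rather than a single JSON document, so that rows can be appended safely.

```python
import json
from pathlib import Path
from typing import Callable


def save_or_queue(row: dict, upload: Callable[[dict], None], queue_file: Path) -> bool:
    """Try the primary storage; on failure append the row to a local file."""
    try:
        upload(row)
        return True
    except ConnectionError:
        with queue_file.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
        return False


def flush_queue(upload: Callable[[dict], None], queue_file: Path) -> int:
    """Re-upload queued rows and keep those that still fail. Returns rows uploaded."""
    if not queue_file.exists():
        return 0
    lines = queue_file.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line]
    remaining = []
    uploaded = 0
    for row in rows:
        try:
            upload(row)
            uploaded += 1
        except ConnectionError:
            remaining.append(row)  # still unavailable: keep for the next flush
    queue_file.write_text(
        "".join(json.dumps(r) + "\n" for r in remaining), encoding="utf-8"
    )
    return uploaded
```

A scheduled job (for example, every few minutes) could call `flush_queue` so that queued rows reach the table once the service is back.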
3. Agent didn't understand the user's request
What to do:
- Clarification: ask the user to rephrase
- Fallback: hand off to a human operator
Example:
User: "I want that thing"
Agent: "I didn't understand your request. Could you clarify what you need?"
If the user can't clarify → agent: "I'll connect you with an operator."
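The clarify-then-hand-off logic from this dialogue can be sketched as a small function that tracks how many times in a row the agent failed to understand the user. Everything here is illustrative: `detect_intent` is a hypothetical classifier that returns `None` when it cannot understand the message, and the limit of one clarification attempt is an assumption.

```python
from typing import Callable, Optional, Tuple

CLARIFY_MESSAGE = "I didn't understand your request. Could you clarify what you need?"
HANDOFF_MESSAGE = "I'll connect you with an operator."


def next_reply(
    message: str,
    failed_attempts: int,
    detect_intent: Callable[[str], Optional[str]],
    max_clarifications: int = 1,
) -> Tuple[str, int]:
    """Return (reply, updated failed-attempt count) for one user message."""
    intent = detect_intent(message)
    if intent is not None:
        return f"Handling request: {intent}", 0  # understood: reset the counter
    if failed_attempts < max_clarifications:
        return CLARIFY_MESSAGE, failed_attempts + 1  # ask the user to rephrase
    return HANDOFF_MESSAGE, failed_attempts + 1  # give up and hand off to a human
```

The caller stores the returned counter in the conversation state and passes it back in with the next message.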
4. Agent exceeded the API rate limit
What to do:
- Throttling: add a delay between requests
- Fallback: switch to another API
Example:
If OpenAI returns "Rate limit exceeded" → agent switches to Claude API.
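Both techniques, throttling and switching providers, can be sketched together. This is a sketch under stated assumptions: `RateLimitError`, `Throttle`, and `complete_with_fallback` are hypothetical names, and each provider is assumed to be a function that takes a prompt and raises `RateLimitError` when its API reports "Rate limit exceeded". No real OpenAI or Claude client calls are shown.

```python
import time
from typing import Callable, Optional, Sequence


class RateLimitError(Exception):
    """Hypothetical error raised by a provider wrapper on 'Rate limit exceeded'."""


class Throttle:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least `min_interval` seconds passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def complete_with_fallback(
    prompt: str, providers: Sequence[Callable[[str], str]]
) -> str:
    """Try each provider in order; move on when one reports a rate limit."""
    for provider in providers:
        try:
            return provider(prompt)
        except RateLimitError:
            continue  # this provider is rate limited: try the next one
    raise RateLimitError("all providers are rate limited")
```

Calling `throttle.wait()` before every request keeps the agent under the limit in the first place, while `complete_with_fallback` covers the case where the limit is hit anyway.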
How to Implement Retry in n8n
Step 1. Configure Retry for a node
- Open the node (e.g., HTTP Request)
- In node settings (right panel) find the Settings section
- Enable Retry On Fail
- Configure:
  - Max Tries: 3 (total number of attempts)
  - Wait Between Tries: 2000ms (delay between attempts)
Step 2. Configure Error Workflow
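- Create a new workflow that starts with an Error Trigger node
- In the main workflow's settings, select this new workflow as the Error Workflow so that it runs when the main workflow fails
- Add error handling logic:
  - if the error is "Rate limit exceeded" → wait 60 seconds and retry
  - if the error is "Service unavailable" → switch to the backup API
  - for any other error → send a Telegram notification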
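The routing rules above amount to a small decision function. The sketch below expresses it in Python for clarity; it is not n8n code. `Action` and `choose_action` are hypothetical names, and matching on substrings of the error message is an assumption about what the failing API returns.

```python
from enum import Enum


class Action(Enum):
    WAIT_AND_RETRY = "wait 60 seconds and retry"
    SWITCH_TO_BACKUP = "switch to backup API"
    NOTIFY = "send Telegram notification"


def choose_action(error_message: str) -> Action:
    """Map an error message to one of the three handling branches."""
    text = error_message.lower()
    if "rate limit" in text:
        return Action.WAIT_AND_RETRY
    if "service unavailable" in text:
        return Action.SWITCH_TO_BACKUP
    return Action.NOTIFY  # unknown errors go to a human
```

In n8n the same branching would typically be built with an IF or Switch node placed after the Error Trigger.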
Graceful Degradation
Graceful degradation means the agent keeps working with reduced functionality instead of failing completely.
Example:
- main function: generate personalized responses via OpenAI
- if OpenAI is unavailable → agent responds with template answers from the knowledge base
- user gets a response (not perfect, but still a response)
Implementation:
- Main branch: OpenAI → personalized response
- Fallback branch: if OpenAI returns error → search knowledge base (Google Sheets) → send template response
- Last fallback: if knowledge base is also unavailable → send: "Sorry, the service is temporarily unavailable. Please try again later."
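The three branches can be sketched as a chain of attempts, each one tried only if the previous one failed. This is an illustrative sketch: `generate` stands in for the OpenAI call and `lookup_template` for the knowledge-base search; both are hypothetical callables. Treating "no matching template" the same way as "knowledge base unavailable" is an assumption.

```python
from typing import Callable, Optional

APOLOGY = "Sorry, the service is temporarily unavailable. Please try again later."


def answer(
    question: str,
    generate: Callable[[str], str],
    lookup_template: Callable[[str], Optional[str]],
) -> str:
    """Return the best available response, degrading step by step."""
    # Main branch: personalized response from the model.
    try:
        return generate(question)
    except Exception:
        pass  # broad catch is deliberate: any failure triggers the fallback

    # Fallback branch: template answer from the knowledge base.
    try:
        template = lookup_template(question)
        if template is not None:
            return template
    except Exception:
        pass  # knowledge base is down as well

    # Last fallback: a fixed apology, so the user always gets a reply.
    return APOLOGY
```

In a real agent each `except` branch should also log the error, so that a silent fallback does not hide an outage.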