AI Agents for Non-Techies

Lesson 4. Fault Tolerance: What to Do When Something Breaks

Goal: learn to build agents that keep working even when things fail.

What Is Fault Tolerance

Fault tolerance is the ability of a system to keep working when one of its components fails.

Examples:

OpenAI API unavailable → agent switches to a backup model (Claude, Gemini)
Google Sheets unavailable → agent saves data to a local file and retries later
Webhook failed → agent retries in 5 seconds

Common Failures and How to Handle Them

1. API returned an error (500, 503, or 429 "Too Many Requests" when a rate limit is hit)

What to do:

Retry: send the same request again after a few seconds
Exponential Backoff: double the delay after each failed attempt (1s, 2s, 4s, 8s) so a struggling service gets time to recover
Fallback: switch to another API

Example (n8n):

Add an HTTP Request node (API call)
In node settings enable Retry On Fail:
- Max Tries: 3
- Wait Between Tries: 2000ms
If the API returns an error, n8n will automatically try up to 2 more times (3 attempts in total), waiting 2 seconds between attempts
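For readers curious what retry with exponential backoff looks like outside n8n, here is a minimal Python sketch. It is only an illustration, not how n8n implements Retry On Fail: the function name `call_with_retry` and its parameters are made up for this example, and `call` stands for any function that performs the API request.

```python
import time


def call_with_retry(call, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Run `call()` until it succeeds or `max_tries` attempts are used up.

    All names here are hypothetical; `call` is any function that makes
    the API request and raises an exception when the request fails.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_tries:
                raise  # out of attempts: let the caller handle the failure
            # Exponential backoff: wait 1s, then 2s, then 4s, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the waiting can be replaced in a test; in normal use the default `time.sleep` is enough.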

2. External service unavailable (Google Sheets, Airtable)

What to do:

Queue: save the request to a queue and process it later, once the service is back
Fallback: use backup storage (e.g., save to a local file)

Example:

If Google Sheets is unavailable → agent saves data to a JSON file → later (when Google Sheets is back) the agent reads the JSON and uploads data to the table.
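The "save locally, upload later" idea above can be sketched in a few lines of Python. This is an assumption-heavy illustration: `upload` stands for whatever function writes a row to Google Sheets, and `save_row`, `flush_backlog`, and the backlog file are hypothetical names chosen for this example.

```python
import json
from pathlib import Path


def save_row(row, upload, backlog_path):
    """Try the main storage first; if it fails, append the row to a local file."""
    try:
        upload(row)  # hypothetical function that writes to Google Sheets
        return "uploaded"
    except Exception:
        with Path(backlog_path).open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")  # one JSON object per line
        return "queued"


def flush_backlog(upload, backlog_path):
    """Re-send queued rows; rows that still fail stay in the file."""
    path = Path(backlog_path)
    if not path.exists():
        return 0
    lines = path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line]
    remaining = []
    sent = 0
    for row in rows:
        try:
            upload(row)
            sent += 1
        except Exception:
            remaining.append(row)  # keep it for the next run
    path.write_text(
        "".join(json.dumps(r) + "\n" for r in remaining), encoding="utf-8"
    )
    return sent
```

In practice `flush_backlog` would run on a schedule (for example, every few minutes), so queued rows reach the table soon after the service recovers.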

3. Agent didn't understand the user's request

What to do:

Clarification: ask the user to rephrase
Fallback: hand off to a human operator

Example:

User: "I want that thing"
Agent: "I didn't understand your request. Could you clarify what you need?"

If the user can't clarify → agent: "I'll connect you with an operator."
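The dialogue above follows a simple rule: ask for clarification a limited number of times, then hand off to a human. A minimal sketch of that rule is below. The function name, the action labels, and the limit of two clarification attempts are assumptions made for illustration; the lesson itself does not fix a number.

```python
def next_step(understood, clarification_attempts, max_clarifications=2):
    """Decide what the agent does after trying to interpret a message.

    `understood` is True when the agent recognized the request;
    `clarification_attempts` counts how many times it has already
    asked the user to rephrase. All names here are hypothetical.
    """
    if understood:
        return "answer"
    if clarification_attempts < max_clarifications:
        return "ask_to_clarify"
    return "handoff_to_operator"
```

Capping the number of clarification attempts keeps the user from getting stuck in an endless "please rephrase" loop.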

4. Agent exceeded API limit

What to do:

Throttling: add a delay between requests
Fallback: switch to another API

Example:

If OpenAI returns "Rate limit exceeded" → agent switches to Claude API.
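Throttling and fallback can work together: space requests out so the limit is hit less often, and switch to the backup only when the limit is hit anyway. The sketch below assumes two hypothetical functions, `primary` and `backup`, that send a prompt to a model, and a made-up `RateLimitError` that the primary raises on "Rate limit exceeded". It does not use any real provider SDK.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the provider says 'Rate limit exceeded'."""


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)  # too soon: pause before the next request
                now = self.clock()
        self.last_call = now


def ask(prompt, primary, backup, throttle):
    """Throttle the request, then fall back to the backup on a rate limit."""
    throttle.wait()
    try:
        return primary(prompt)
    except RateLimitError:
        return backup(prompt)
```

For example, `Throttle(1.0)` allows at most one request per second; the `clock` and `sleep` parameters are there only so the timing can be simulated in a test.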

How to Implement Retry in n8n

Step 1. Configure Retry for a node

Open the node (e.g., HTTP Request)
In the node's panel, open the Settings tab
Enable Retry On Fail
Configure:
- Max Tries: 3 (how many attempts)
- Wait Between Tries: 2000ms (delay between attempts)

Step 2. Configure Error Workflow

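Create a new workflow that starts with an Error Trigger node
In the main workflow's settings, select this new workflow as the Error Workflow, so it runs whenever the main workflow fails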
Add error handling logic:
- if error is "Rate limit exceeded" → wait 60 seconds and retry
- if error is "Service unavailable" → switch to backup API
- if other error → send Telegram notification
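The three rules above amount to a small decision table. In n8n this routing would typically be built with an IF or Switch node; the Python sketch below only illustrates the logic. The function name `route_error` and the action labels are hypothetical, and matching on the message text is an assumption about what the error looks like.

```python
def route_error(message):
    """Map an error message to one of the three actions listed above."""
    text = message.lower()
    if "rate limit" in text:
        # Temporary limit: wait a minute, then try again
        return {"action": "retry", "wait_seconds": 60}
    if "service unavailable" in text:
        # The service is down: use the backup API instead
        return {"action": "use_backup_api"}
    # Anything unexpected: tell a human via Telegram
    return {"action": "notify_telegram"}
```

Keeping the "other error" case as the final default means no failure goes unnoticed, even one nobody anticipated.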

Graceful Degradation

Graceful degradation means the agent keeps working with reduced functionality instead of failing completely.

Example:

Main function: generate personalized responses via OpenAI
If OpenAI is unavailable → agent responds with template answers from the knowledge base
The user still gets a response (not perfect, but still a response)

Implementation:

Main branch: OpenAI → personalized response
Fallback branch: if OpenAI returns error → search knowledge base (Google Sheets) → send template response
Last fallback: if knowledge base is also unavailable → send: "Sorry, the service is temporarily unavailable. Please try again later."
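The three levels above form a chain: try the best option, and step down one level each time something fails. A minimal Python sketch of that chain follows. The names `generate` (the OpenAI call) and `lookup_template` (the knowledge base search) are hypothetical placeholders for whatever functions the agent actually uses.

```python
APOLOGY = "Sorry, the service is temporarily unavailable. Please try again later."


def respond(question, generate, lookup_template):
    """Try each level in order and return the first answer that works.

    `generate` and `lookup_template` are hypothetical functions that take
    the question and return an answer, or raise an exception on failure.
    """
    for source in (generate, lookup_template):
        try:
            answer = source(question)
            if answer:  # an empty result counts as "no answer at this level"
                return answer
        except Exception:
            continue  # this level failed: degrade to the next one
    return APOLOGY  # last fallback: every level failed
```

Whatever happens, `respond` always returns some text, which is the point of graceful degradation: the user is never left without a reply.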