Lesson 3. Monitoring: How to Keep an Eye on Agent Health
Goal: set up simple monitoring so you know when the agent is down or behaving incorrectly.
What Is Monitoring
Monitoring is continuously watching how the agent works so you notice problems in time:
- agent stopped responding
- agent returns errors
- agent is slow
- agent exceeded API limits
Basic Metrics to Monitor
1. Uptime (availability)
Percentage of time the agent is working.
Example:
If the agent worked 23 hours out of 24 → Uptime = 95.8%
Target: aim for 99%+ (less than 1% downtime)
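To make the uptime arithmetic concrete, here is a minimal Python sketch. The function names are illustrative and do not come from any library; it also shows how much downtime a given target actually allows.

```python
def uptime_percent(hours_up: float, hours_total: float) -> float:
    """Share of the period during which the agent was working, in percent."""
    if hours_total <= 0:
        raise ValueError("hours_total must be positive")
    return hours_up / hours_total * 100


def allowed_downtime_minutes(target_percent: float, period_hours: float) -> float:
    """Downtime budget, in minutes, that still meets the uptime target."""
    return (100 - target_percent) / 100 * period_hours * 60


print(round(uptime_percent(23, 24), 1))            # 95.8
print(round(allowed_downtime_minutes(99, 24), 1))  # 14.4
```

In other words, a 99% target leaves roughly 14 minutes of downtime per day.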
2. Response Time
How long the agent takes to process a request.
Example:
User asked a question → agent replied in 3 seconds → Response Time = 3s
Target: under 5 seconds for text requests
3. Error Rate
Percentage of requests that ended in error.
Example:
Out of 100 requests, 5 failed → Error Rate = 5%
Target: under 1% (99% of requests succeed)
4. Request Rate
How many requests the agent handles per hour / day.
Example:
Agent processed 500 requests in a day → Request Rate = 500/day
Target: track growth (if requests spike, you need to scale)
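If you already keep a request log, the other three metrics can be computed from it. The sketch below assumes each log record is a dict with two hypothetical fields, `status` ("ok" or "failed") and `duration_s` (response time in seconds); adapt the names to whatever your log actually stores.

```python
from statistics import mean


def summarize(records: list[dict], period_hours: float) -> dict:
    """Compute error rate, average response time, and request rate from log records."""
    total = len(records)
    if total == 0:
        return {"error_rate_pct": 0.0, "avg_response_s": None, "requests_per_hour": 0.0}
    failed = sum(1 for r in records if r["status"] == "failed")
    return {
        "error_rate_pct": failed / total * 100,
        "avg_response_s": mean(r["duration_s"] for r in records),
        "requests_per_hour": total / period_hours,
    }


# Hypothetical sample: 100 requests in one day, 5 of them failed, 3 seconds each.
sample = (
    [{"status": "failed", "duration_s": 3.0}] * 5
    + [{"status": "ok", "duration_s": 3.0}] * 95
)
print(summarize(sample, period_hours=24))
```

For this sample the error rate comes out at 5%, which is above the 1% target.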
How to Set Up Simple Monitoring
Option 1: Uptime Monitoring (for webhook bots)
If your agent works via webhook (e.g., Telegram bot on n8n), use an availability checker:
- UptimeRobot (free for up to 50 monitors)
- Pingdom (paid, more powerful)
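- Healthchecks.io (simple, with a free tier; it works in reverse: your agent or workflow pings it on a schedule, and it alerts you when the pings stop)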
How it works:
- You give the service your webhook URL
- The service sends a test request every 5 minutes
- If the webhook doesn't respond → the service sends you a notification (email, SMS, Telegram)
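The check these services perform can be sketched in a few lines of Python. This is only an illustration of the idea, not how any of the services above is implemented. It assumes the endpoint answers a plain GET request (many webhooks accept only POST, so a dedicated health URL may be needed), and `notify` is a hypothetical callback that delivers the alert.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Send a GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code


def is_up(url: str, fetch=http_status) -> bool:
    """Treat any 2xx/3xx answer as "up"; connection errors and timeouts as "down"."""
    try:
        return 200 <= fetch(url) < 400
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        return False


def check_and_notify(url: str, notify, fetch=http_status) -> bool:
    """Run one check and call notify(message) if the endpoint is down."""
    up = is_up(url, fetch)
    if not up:
        notify(f"Agent is not responding: {url}")
    return up
```

A scheduler would call something like `check_and_notify("https://example.com/webhook/health", print)` every 5 minutes; the URL here is a placeholder. The `fetch` parameter exists so the check can be tested without touching the network.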
Setup in UptimeRobot:
- Sign up at uptimerobot.com
- Add a new monitor: your webhook URL, type: HTTP(s)
- Configure notifications (email or Telegram)
- Save
You'll now get notified if the agent stops responding.
Option 2: Log-Based Monitoring
If you log agent actions in Google Sheets / Airtable, set up automatic checks:
Example: error threshold notification
Logic:
- Every hour (or once a day) a workflow runs (Zapier / n8n)
- Workflow reads logs from the last hour
- Counts errors (Status = failed)
- If errors > 10 → sends a Telegram notification: "Attention! 15 errors in the last hour. Check your agent."
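The same logic, written out as a minimal Python sketch. It assumes each log row is a dict with a `Status` column and a `Timestamp` column holding an ISO 8601 string; the `Timestamp` name and format are assumptions, so adjust them to your sheet.

```python
from datetime import datetime, timedelta

ERROR_THRESHOLD = 10  # alert when more than this many errors occur in the window


def count_recent_errors(rows, now, window=timedelta(hours=1)):
    """Count rows with Status == "failed" whose Timestamp falls inside the window."""
    cutoff = now - window
    return sum(
        1
        for row in rows
        if row["Status"] == "failed"
        and cutoff <= datetime.fromisoformat(row["Timestamp"]) <= now
    )


def build_alert(rows, now):
    """Return an alert message, or None if the error count is within limits."""
    errors = count_recent_errors(rows, now)
    if errors > ERROR_THRESHOLD:
        return f"Attention! {errors} errors in the last hour. Check your agent."
    return None
```

Returning `None` when the threshold is not exceeded keeps the "should we notify?" decision separate from the sending step, which mirrors the IF node in a workflow.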
Implementation in n8n:
- Trigger: Cron (every hour)
- Action 1: Google Sheets → Read (logs from last hour)
- Action 2: Function (count Status = failed)
- Action 3: IF (if errors > 10)
- Action 4 (true): Telegram → Send Message ("Attention! ...")
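Outside n8n, the final notification step is a single call to the Telegram Bot API `sendMessage` method. The sketch below uses only the Python standard library; the token and chat ID are placeholders you would replace with your own bot's values.

```python
import json
import urllib.request


def build_telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Prepare a Bot API sendMessage call without sending it."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.telegram.org/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_telegram_message(token: str, chat_id: str, text: str) -> bool:
    """Send the message; return True if Telegram answered with ok=true."""
    request = build_telegram_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response).get("ok", False)
```

Splitting the request builder from the sender makes it possible to inspect exactly what would be sent before any real message goes out.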
Option 3: Built-in Platform Tools
Many platforms have built-in monitoring:
- Zapier: Task History + Email Alerts (Zapier sends email on error)
- Make: History + Notifications
- n8n: Error Workflow (a workflow that runs when an error occurs)
Example: Error Workflow in n8n
- Create a new workflow with an Error Trigger
- Add a Telegram → Send Message node
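- Configure the message: "Error in workflow {{ $json.workflow.name }}. Details: {{ $json.execution.error.message }}"
- Save the workflow
- In each workflow you want to cover, open the workflow settings and select this workflow as its Error Workflow
You'll now get a Telegram notification whenever an execution fails in any workflow that has this Error Workflow selected.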
What to Do When Monitoring Shows a Problem
Problem 1: agent not responding (Uptime = 0%)
Possible causes:
- server down (if self-hosted)
- account balance depleted (if cloud platform)
- webhook broken (wrong URL, expired SSL certificate)
What to do:
- Check server / platform status
- Check account balance
- Check webhook (send a test request manually)
- Restart the workflow / bot
Problem 2: high Error Rate (>5%)
Possible causes:
- API issues (rate limit exceeded, API unavailable)
- invalid data (e.g., wrong email format)
- logic error in the agent
What to do:
- Open logs, find errors
- Read the ErrorMessage field to see what failed
- Fix the issue (increase limits, fix data, fix logic)
- Test
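When there are many failures, grouping them by message shows which cause to fix first. A small sketch, assuming log rows are dicts with `Status` and `ErrorMessage` fields; the sample messages are made up for illustration.

```python
from collections import Counter


def top_errors(rows, limit=3):
    """Return the most frequent ErrorMessage values among failed rows."""
    messages = (row["ErrorMessage"] for row in rows if row["Status"] == "failed")
    return Counter(messages).most_common(limit)


# Hypothetical log rows.
sample_rows = [
    {"Status": "failed", "ErrorMessage": "Rate limit exceeded"},
    {"Status": "failed", "ErrorMessage": "Rate limit exceeded"},
    {"Status": "failed", "ErrorMessage": "Invalid email format"},
    {"Status": "ok", "ErrorMessage": ""},
    {"Status": "failed", "ErrorMessage": "Rate limit exceeded"},
]
print(top_errors(sample_rows))
# [('Rate limit exceeded', 3), ('Invalid email format', 1)]
```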
Problem 3: slow Response Time (>10 seconds)
Possible causes:
- slow API (e.g., OpenAI overloaded)
- too many steps in the workflow
- no caching
What to do:
- Measure time for each step (in n8n this is visible in Executions)
- Find the slowest step
- Optimize (caching, faster API, parallel requests)
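If your agent runs as code rather than on a platform with an execution view, per-step timing can be measured by hand. A minimal sketch: the step names are hypothetical and `time.sleep` stands in for real work such as an API call.

```python
import time


def time_steps(steps):
    """Run each (name, function) pair in order and record its duration in seconds."""
    timings = {}
    for name, func in steps:
        start = time.perf_counter()
        func()
        timings[name] = time.perf_counter() - start
    return timings


def slowest_step(timings):
    """Return the name of the step that took the longest."""
    return max(timings, key=timings.get)


# Hypothetical steps; time.sleep simulates work of different durations.
steps = [
    ("load_context", lambda: time.sleep(0.01)),
    ("call_llm", lambda: time.sleep(0.05)),
    ("send_reply", lambda: time.sleep(0.01)),
]
timings = time_steps(steps)
print(slowest_step(timings), timings)
```

Whichever step dominates the total is the one worth optimizing first; speeding up the others will barely change the overall response time.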