Module 11, Lesson 3

Lesson 3. Monitoring: How to Keep an Eye on Agent Health

Hands-on: n8n

Goal: set up simple monitoring so you know when the agent is down or behaving incorrectly.

What Is Monitoring

Monitoring means continuously observing the agent so that you catch problems early, for example:

  • agent stopped responding
  • agent returns errors
  • agent is slow
  • agent exceeded API limits

Basic Metrics to Monitor

1. Uptime (availability)

The percentage of time the agent is available and responding.

Example:
If the agent worked 23 hours out of 24 → Uptime = 23 / 24 ≈ 95.8%

Target: aim for 99%+ (less than 1% downtime, which is about 14 minutes per day)
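
The uptime arithmetic can be checked in a few lines. This is a minimal Python sketch; the function names `uptime_percent` and `allowed_downtime_minutes` are made up for illustration and do not come from any monitoring tool.

```python
def uptime_percent(up_hours: float, total_hours: float) -> float:
    """Share of the period during which the agent was available, in percent."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * up_hours / total_hours


def allowed_downtime_minutes(target_percent: float, period_hours: float) -> float:
    """Downtime budget (in minutes) that still meets the uptime target."""
    return (100.0 - target_percent) / 100.0 * period_hours * 60.0


print(round(uptime_percent(23, 24), 1))            # 95.8
print(round(allowed_downtime_minutes(99, 24), 1))  # 14.4
```

In other words, one full hour of downtime in a day already misses a 99% target by a wide margin.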

2. Response Time

How long the agent takes to process a request.

Example:
User asked a question → agent replied in 3 seconds → Response Time = 3s

Target: under 5 seconds for text requests

3. Error Rate

Percentage of requests that ended in error.

Example:
Out of 100 requests, 5 failed → Error Rate = 5%

Target: under 1% (at least 99% of requests succeed)

4. Request Rate

How many requests the agent handles per hour / day.

Example:
Agent processed 500 requests in a day → Request Rate = 500/day

Target: there is no fixed number here; track the trend. If requests spike, you may need to scale or may be getting close to API limits.
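
If the agent writes one log row per request, error rate, response time and request rate can be derived from those rows. The sketch below assumes each row has a `status` and a `duration_s` field; both field names and the sample data are hypothetical.

```python
def summarize(rows, period_hours):
    """Compute basic health metrics from a list of per-request log rows."""
    total = len(rows)
    failed = sum(1 for row in rows if row["status"] == "failed")
    return {
        "error_rate_pct": 100.0 * failed / total if total else 0.0,
        "avg_response_s": sum(row["duration_s"] for row in rows) / total if total else 0.0,
        "requests_per_hour": total / period_hours,
    }


# Hypothetical sample: four requests logged over two hours.
rows = [
    {"status": "success", "duration_s": 2.0},
    {"status": "success", "duration_s": 3.0},
    {"status": "failed", "duration_s": 4.0},
    {"status": "success", "duration_s": 3.0},
]
print(summarize(rows, period_hours=2))
# {'error_rate_pct': 25.0, 'avg_response_s': 3.0, 'requests_per_hour': 2.0}
```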

How to Set Up Simple Monitoring

Option 1: Uptime Monitoring (for webhook bots)

If your agent is triggered by a webhook (e.g., a Telegram bot on n8n), use an availability checker:

  • UptimeRobot (free for up to 50 monitors)
  • Pingdom (paid, more powerful)
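  • Healthchecks.io (simple and free; works the other way around: your workflow pings the service on a schedule, and it alerts you when the pings stop)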

How it works:

  1. You give the service your webhook URL
  2. The service sends a test request every 5 minutes
  3. If the webhook doesn't respond → the service sends you a notification (email, SMS, Telegram)
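
What such a service does on each check can be sketched in Python with the standard library. The function name `is_up` and its `opener` parameter are assumptions made for this example (the parameter only exists so the check can be tested without a network); this is not how any particular service is implemented.

```python
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), timeouts and refused connections
        # are all subclasses of OSError.
        return False
```

Calling `is_up("https://example.com/webhook/agent")` (a placeholder URL) every few minutes and alerting on `False` is the whole idea. Note that this sends a GET request: if your webhook accepts only POST, a plain GET check may report it as down, so you may need a monitor that sends POST or a separate health-check URL.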

Setup in UptimeRobot:

  1. Sign up at uptimerobot.com
  2. Add a new monitor: set the type to HTTP(s) and enter your webhook URL
  3. Configure notifications (email or Telegram)
  4. Save

You'll now get notified if the agent stops responding.

Option 2: Log-Based Monitoring

If you log agent actions in Google Sheets / Airtable, set up automatic checks:

Example: error threshold notification

Logic:

  1. Every hour (or once a day) a workflow runs (Zapier / n8n)
  2. Workflow reads logs from the last hour
  3. Counts errors (Status = failed)
  4. If errors > 10 → sends a Telegram notification: "Attention! 15 errors in the last hour. Check your agent."

Implementation in n8n:

  • Trigger: Schedule Trigger, every hour (called Cron in older n8n versions)
  • Action 1: Google Sheets → Read (logs from the last hour)
  • Action 2: Code (called Function in older n8n versions): count rows with Status = failed
  • Action 3: IF (errors > 10)
  • Action 4 (true branch): Telegram → Send Message ("Attention! ...")
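
The counting and threshold logic from the steps above can be sketched in Python. The `Status` column comes from the example; the `Timestamp` column name and the function name `build_alert` are assumptions for this sketch, and your log sheet may be laid out differently.

```python
from datetime import datetime, timedelta

ERROR_THRESHOLD = 10  # alert only when errors > 10, as in the logic above


def build_alert(rows, now, window=timedelta(hours=1), threshold=ERROR_THRESHOLD):
    """Return an alert text if failures inside the window exceed the threshold, else None."""
    cutoff = now - window
    failed = sum(
        1 for row in rows
        if row["Timestamp"] >= cutoff and row["Status"] == "failed"
    )
    if failed > threshold:
        return f"Attention! {failed} errors in the last hour. Check your agent."
    return None


# Hypothetical log: 15 recent failures plus one that is too old to count.
now = datetime(2024, 1, 1, 12, 0)
rows = [{"Timestamp": now - timedelta(minutes=i), "Status": "failed"} for i in range(15)]
rows.append({"Timestamp": now - timedelta(hours=3), "Status": "failed"})
print(build_alert(rows, now))  # Attention! 15 errors in the last hour. Check your agent.
```

Returning `None` when the threshold is not exceeded mirrors the false branch of the IF node: no message is sent.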

Option 3: Built-in Platform Tools

Many platforms have built-in monitoring:

  • Zapier: Task History + Email Alerts (Zapier sends email on error)
  • Make: History + Notifications
  • n8n: Error Workflow (a workflow that runs when an error occurs)

Example: Error Workflow in n8n

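  1. Create a new workflow with an Error Trigger
  2. Add a Telegram → Send Message node
  3. Configure the message: "Error in workflow {{ $json.workflow.name }}. Details: {{ $json.execution.error.message }}"
  4. Save, then open the settings of each workflow you want to monitor and select this workflow as its Error Workflow

Now you'll get a Telegram notification whenever an execution fails in any workflow that has this Error Workflow set.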

What to Do When Monitoring Shows a Problem

Problem 1: agent not responding (the uptime check fails)

Possible causes:

  • server down (if self-hosted)
  • account balance depleted (if cloud platform)
  • webhook broken (wrong URL, expired SSL certificate)

What to do:

  1. Check server / platform status
  2. Check account balance
  3. Check webhook (send a test request manually)
  4. Restart the workflow / bot
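
When you send the manual test request from a script, the type of failure already hints at the cause. Below is a rough Python sketch that maps exceptions raised by `urllib.request.urlopen` to the causes listed above; the function name `diagnose` and the wording of the hints are illustrative assumptions.

```python
import io
import ssl
import urllib.error
from email.message import Message


def diagnose(exc: Exception) -> str:
    """Map an exception raised by urllib.request.urlopen to a likely cause."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"webhook reachable but returned HTTP {exc.code}: check the URL and the workflow"
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, ssl.SSLError):
            return "TLS problem: check the SSL certificate"
        return "server unreachable: check server or platform status"
    return "unexpected error: inspect it manually"


# Simulated failures, so the example runs without a network connection.
not_found = urllib.error.HTTPError(
    "https://example.com/hook", 404, "Not Found", Message(), io.BytesIO()
)
expired = urllib.error.URLError(ssl.SSLCertVerificationError("certificate has expired"))
print(diagnose(not_found))  # webhook reachable but returned HTTP 404: check the URL and the workflow
print(diagnose(expired))    # TLS problem: check the SSL certificate
```

In real use you would wrap the `urlopen` call in `try` / `except Exception as exc` and pass `exc` to `diagnose`.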

Problem 2: high Error Rate (>5%)

Possible causes:

  • API issues (rate limit exceeded, API unavailable)
  • invalid data (e.g., wrong email format)
  • logic error in the agent

What to do:

  1. Open logs, find errors
  2. Check ErrorMessage
  3. Fix the root cause (raise or respect API limits, validate input data, correct the logic)
  4. Test again and confirm that the error rate drops
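
When there are many failed rows, grouping them by message shows which problem to fix first. This is a small Python sketch assuming log rows with `Status` and `ErrorMessage` columns; the sample messages are invented.

```python
from collections import Counter


def top_errors(rows, limit=3):
    """Return the most frequent error messages among failed rows."""
    messages = (row["ErrorMessage"] for row in rows if row["Status"] == "failed")
    return Counter(messages).most_common(limit)


# Hypothetical log rows.
rows = [
    {"Status": "failed", "ErrorMessage": "429 Too Many Requests"},
    {"Status": "failed", "ErrorMessage": "429 Too Many Requests"},
    {"Status": "failed", "ErrorMessage": "Invalid email format"},
    {"Status": "success", "ErrorMessage": ""},
]
print(top_errors(rows))
# [('429 Too Many Requests', 2), ('Invalid email format', 1)]
```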

Problem 3: slow Response Time (>10 seconds)

Possible causes:

  • slow API (e.g., OpenAI overloaded)
  • too many steps in the workflow
  • no caching

What to do:

  1. Measure time for each step (in n8n this is visible in Executions)
  2. Find the slowest step
  3. Optimize (caching, faster API, parallel requests)
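
Outside n8n, the same "measure each step, find the slowest" approach can be sketched in Python with `time.perf_counter`. The step names and the `time.sleep` calls below are hypothetical stand-ins for real workflow steps.

```python
import time


def time_steps(steps):
    """Run each (name, function) pair once and return {name: seconds}."""
    timings = {}
    for name, func in steps:
        start = time.perf_counter()
        func()
        timings[name] = time.perf_counter() - start
    return timings


# Stand-ins for real steps; replace the lambdas with actual calls.
steps = [
    ("read_input", lambda: time.sleep(0.01)),
    ("call_llm_api", lambda: time.sleep(0.05)),
    ("send_reply", lambda: time.sleep(0.01)),
]
timings = time_steps(steps)
slowest = max(timings, key=timings.get)
print(slowest)  # call_llm_api
```

Once the slowest step is known, optimization effort goes there first; speeding up a step that takes 1% of the total time will not be noticeable.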