Module 08Lesson 5

Lesson 5. Principles of Clean Data

Hands-on: Zapier

Lesson 5. Principles of Clean Data#

Goal: learn how to organize data so the agent works reliably and predictably.

What "Clean Data" Means#

Clean data is data that:

  • has a consistent format (dates, phones, statuses)
  • has no empty values in critical fields
  • has no duplicates (two Ivan Petrovs with the same email)
  • has no extra characters (leading/trailing spaces, line breaks)
  • has a predictable structure (the agent always knows where to find each field)

Why This Matters for Agents#

An agent isn't human. It can't "guess." If data is messy:

  • the agent won't find the right record
  • the agent will write data in the wrong format
  • the agent will return an error instead of an answer
  • the integration will break

Example of messy data:

EmailPhoneStatus
ivan@example.com+7 900 123-45-67Active
maria@example.com89007654321active
alex@example.com+7(900)9876543

What's wrong:

  • phones in different formats
  • extra space before email in row 2
  • status "Active" and "active" — the agent treats them as different
  • empty status in row 3

Rules for Clean Data#

1. Consistent format for similar data

  • phones: +79001234567 (always with +7, no spaces or dashes)
  • dates: 2026-01-31 (YYYY-MM-DD format)
  • email: all lowercase, no spaces
  • statuses: predefined list (Active / Lead / Inactive)

2. No empty values in critical fields

If a field matters for the agent (e.g., Email for customer lookup) — it must always be filled.

For optional fields you can leave empty or use a default (e.g., status "Unknown").

3. No duplicates

One customer = one record. If the customer already exists — update their data, don't add a new row.

Use a unique identifier (ID, Email, Phone) to check for existing records before adding.

4. No extra characters

  • leading/trailing spaces
  • line breaks inside cells
  • extra characters (periods, commas, dashes in wrong places)

Use cleanup functions (in Google Sheets: TRIM, CLEAN, LOWER).

5. Headers without special characters

Column names (headers) should be:

  • no spaces (use _ or CamelCase: client_name or ClientName)
  • no special characters (#, %, /, \)
  • no non-Latin characters (if connecting via API, prefer Latin)

Example of clean data:

IDEmailPhoneStatusRegistrationDate
1ivan@example.com+79001234567active2026-01-15
2maria@example.com+79007654321lead2026-01-20
3alex@example.com+79009876543inactive2025-12-10

How to Check Data Before Connecting to an Agent#

Checklist:

  • All critical fields are filled (no empty values)
  • Formats are consistent (phones, dates, email)
  • No duplicates (check by unique field)
  • No extra spaces or characters (use TRIM, CLEAN)
  • Headers without special characters or spaces
  • Statuses/categories from a predefined list

Data Cleanup Tools in Google Sheets#

  • =TRIM(A2) — removes leading and trailing spaces
  • =CLEAN(A2) — removes non-printable characters
  • =LOWER(A2) — converts text to lowercase
  • =SUBSTITUTE(A2, " ", "") — removes all spaces
  • Data → Remove duplicates — removes duplicates