Lesson 5. Principles of Clean Data#
Goal: learn how to organize data so the agent works reliably and predictably.
What "Clean Data" Means#
Clean data is data that:
- has a consistent format (dates, phones, statuses)
- has no empty values in critical fields
- has no duplicates (two Ivan Petrovs with the same email)
- has no extra characters (leading/trailing spaces, line breaks)
- has a predictable structure (the agent always knows where to find each field)
Why This Matters for Agents#
An agent isn't human. It can't "guess." If data is messy:
- the agent won't find the right record
- the agent will write data in the wrong format
- the agent will return an error instead of an answer
- the integration will break
Example of messy data:
| Phone | Status | |
|---|---|---|
| ivan@example.com | +7 900 123-45-67 | Active |
| maria@example.com | 89007654321 | active |
| alex@example.com | +7(900)9876543 |
What's wrong:
- phones in different formats
- extra space before email in row 2
- status "Active" and "active" — the agent treats them as different
- empty status in row 3
Rules for Clean Data#
1. Consistent format for similar data
- phones:
+79001234567(always with+7, no spaces or dashes) - dates:
2026-01-31(YYYY-MM-DD format) - email: all lowercase, no spaces
- statuses: predefined list (Active / Lead / Inactive)
2. No empty values in critical fields
If a field matters for the agent (e.g., Email for customer lookup) — it must always be filled.
For optional fields you can leave empty or use a default (e.g., status "Unknown").
3. No duplicates
One customer = one record. If the customer already exists — update their data, don't add a new row.
Use a unique identifier (ID, Email, Phone) to check for existing records before adding.
4. No extra characters
- leading/trailing spaces
- line breaks inside cells
- extra characters (periods, commas, dashes in wrong places)
Use cleanup functions (in Google Sheets: TRIM, CLEAN, LOWER).
5. Headers without special characters
Column names (headers) should be:
- no spaces (use
_or CamelCase:client_nameorClientName) - no special characters (
#,%,/,\) - no non-Latin characters (if connecting via API, prefer Latin)
Example of clean data:
| ID | Phone | Status | RegistrationDate | |
|---|---|---|---|---|
| 1 | ivan@example.com | +79001234567 | active | 2026-01-15 |
| 2 | maria@example.com | +79007654321 | lead | 2026-01-20 |
| 3 | alex@example.com | +79009876543 | inactive | 2025-12-10 |
How to Check Data Before Connecting to an Agent#
Checklist:
- All critical fields are filled (no empty values)
- Formats are consistent (phones, dates, email)
- No duplicates (check by unique field)
- No extra spaces or characters (use
TRIM,CLEAN) - Headers without special characters or spaces
- Statuses/categories from a predefined list
Data Cleanup Tools in Google Sheets#
=TRIM(A2)— removes leading and trailing spaces=CLEAN(A2)— removes non-printable characters=LOWER(A2)— converts text to lowercase=SUBSTITUTE(A2, " ", "")— removes all spacesData → Remove duplicates— removes duplicates