When people talk about purifying text, they mean one thing: taking content that looks clean on the surface and removing everything that shouldn't be there — invisible characters, non-standard whitespace, encoding artifacts, and typographic substitutions that break downstream systems.
This is a different problem from spell-checking or grammar correction. The words are fine. The characters encoding them are not.
What Does It Mean to Purify Text?
Purifying text is the process of normalizing a string to contain only the characters you expect — no hidden control characters, no Unicode lookalikes, no smart quotes where straight quotes are required.
The need arises most often when text has passed through:
- An AI language model (GPT, Claude, Gemini, Mistral)
- A word processor (Microsoft Word, Google Docs)
- A web page (copy-pasted from a browser)
- A PDF extraction tool
- A multilingual CMS or translation workflow
Each of these sources introduces different classes of artifacts.
The Four Categories of Text Contamination
Zero-width characters
Zero-width characters have no visible representation but occupy positions in the string. The most common:
| Character | Code point | Source |
|---|---|---|
| Zero-width space | U+200B | Word processors, AI models |
| Zero-width non-joiner | U+200C | Arabic/Persian text tools |
| Zero-width joiner | U+200D | Emoji sequences, right-to-left scripts |
| Soft hyphen | U+00AD | Typesetting systems |
| Word joiner | U+2060 | Microsoft Office |
| Zero-width no-break space | U+FEFF | UTF-8 BOM, legacy encodings |
These characters cause hard-to-diagnose bugs: two strings that look identical on screen compare unequal, so regex matches fail and database lookups return no results.
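The failure mode can be reproduced in a few lines. The sketch below is illustrative only: `find_invisible` and the sample strings are hypothetical names, and the set covers just the six code points from the table.

```python
import unicodedata

# The six code points listed in the table above.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u00ad", "\u2060", "\ufeff"}

def find_invisible(text):
    """Return (index, code point, Unicode name) for each zero-width character.

    Illustrative helper, not part of any library.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]

clean = "invoice"
dirty = "inv\u200boice"  # renders like "invoice" in most fonts

print(clean == dirty)          # False
print(len(clean), len(dirty))  # 7 8
print(find_invisible(dirty))   # [(3, 'U+200B', 'ZERO WIDTH SPACE')]
```

Reporting the index and name, rather than silently deleting, is often the more useful first step when debugging a lookup that mysteriously fails.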
Non-standard whitespace
Plain text uses U+0020 (regular space). Text being purified also needs to be checked for:
- U+00A0: non-breaking space (very common in AI output, appears after numbers)
- U+202F: narrow no-break space (French typography: before ; ! ?)
- U+2009: thin space (mathematical typesetting)
- U+3000: ideographic space (CJK content)
These look identical to regular spaces in most editors. They break tokenizers, search indexes, and CSV parsers.
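One way to surface these spaces is to look at each character's Unicode general category. The sketch below assumes that "non-standard" means any space separator (category Zs) other than U+0020; `odd_spaces` and the sample row are hypothetical.

```python
import unicodedata

def odd_spaces(text):
    """List (index, code point) for space separators other than U+0020.

    Illustrative helper: relies on the Unicode general category "Zs".
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Zs" and ch != " "
    ]

row = "price,10\u00a0USD"   # looks like "price,10 USD"

print(row.split(" "))       # a single item: the NBSP is not a split point
print(odd_spaces(row))      # [(8, 'U+00A0')]
```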
Typographic punctuation
AI models prefer typographic punctuation because their training data (books, articles, well-edited web content) uses it consistently:
- “ ” (U+201C, U+201D) instead of "
- ‘ ’ (U+2018, U+2019) instead of '
- — (em dash, U+2014) instead of -
- … (ellipsis, U+2026) instead of ...
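For human reading, these are correct. For machine processing (JSON, YAML, code, SQL, CSV), they are corruption whenever they replace a character the syntax depends on. JSON that uses curly quotes as string delimiters will fail to parse. A SQL statement whose string literal is delimited by smart quotes will raise a syntax error, and a smart apostrophe stored inside a value will no longer match a search for the straight one.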
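A quick check with Python's standard `json` module suggests where the breakage actually occurs. The payloads and the `parses` helper are made up for illustration.

```python
import json

good = '{"title": "Report"}'
# Same payload, but curly quotes (U+201C / U+201D) replaced the delimiters.
bad = "{\u201ctitle\u201d: \u201cReport\u201d}"
# Curly quotes inside a correctly delimited string are still valid JSON.
inner = '{"title": "\u201cReport\u201d"}'

def parses(payload):
    """Return True if json.loads accepts the payload (illustrative helper)."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False

print(parses(good), parses(bad), parses(inner))  # True False True
```

In this sketch the parser rejects curly quotes only when they stand in for the delimiters; inside a value they parse, but they can still defeat later exact-match comparisons.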
Homoglyphs
Homoglyphs are characters from different scripts that look identical to Latin letters:
- Cyrillic а (U+0430) looks like Latin a (U+0061)
- Cyrillic е (U+0435) looks like Latin e (U+0065)
- Cyrillic о (U+043E) looks like Latin o (U+006F)
AI models trained on multilingual data occasionally mix scripts within a word. The result is a string that displays correctly but fails dictionary lookup, spell-checking, and exact-match search. Homoglyph substitution is also used deliberately in phishing and SEO spam — purifying text catches these cases.
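To see why such a string fails exact-match search, it helps to print the code points behind it. The word used here is an arbitrary example, not taken from any real incident.

```python
import unicodedata

latin = "paypal"
mixed = "p\u0430yp\u0430l"  # both "a" letters are Cyrillic U+0430

print(latin == mixed)            # False
print(len(latin) == len(mixed))  # True: same length, different code points

# Dump each character's code point and Unicode name.
for ch in mixed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```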
How to Purify Text Correctly
A correct text purification pipeline does the following in order:
1. Unicode normalization
Run NFC (Canonical Decomposition, followed by Canonical Composition) normalization first. This collapses equivalent representations of the same character into a single canonical form. For example, é can be represented as one code point (U+00E9) or as e + combining accent (U+0065 + U+0301). NFC picks one consistently.
Most languages expose this directly. In JavaScript and Python:
// JavaScript
const normalized = text.normalize('NFC')
# Python
import unicodedata
normalized = unicodedata.normalize('NFC', text)
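A small check makes the effect of NFC concrete, and also shows what it does not do. The strings are illustrative.

```python
import unicodedata

composed = "caf\u00e9"     # é as a single code point (U+00E9)
decomposed = "cafe\u0301"  # e followed by combining acute accent (U+0301)

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFC leaves the other artifact categories untouched:
# non-breaking space, zero-width space, and curly quote all survive.
leftovers = "\u00a0\u200b\u201c"
print(unicodedata.normalize("NFC", leftovers) == leftovers)  # True
```

This is why normalization is only the first step: it unifies equivalent encodings of the same character but does not remove invisible characters or typographic substitutions.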
2. Strip zero-width characters
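After normalization, remove the zero-width category. In plain English prose these characters almost never serve a purpose. If your content includes emoji sequences or Persian and Arabic text, keep U+200D and U+200C, which are meaningful there: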
// JavaScript — remove common zero-width characters
const cleaned = text.replace(
/[\u200B\u200C\u200D\u00AD\u2060\uFEFF]/g,
''
)
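A Python sketch of the same step follows. The `keep_joiners` flag is an assumption added for illustration: it preserves U+200C and U+200D for content where they carry meaning, such as emoji sequences.

```python
ZERO_WIDTH = "\u200b\u200c\u200d\u00ad\u2060\ufeff"
JOINERS = "\u200c\u200d"  # ZWNJ and ZWJ

def strip_zero_width(text, keep_joiners=False):
    """Remove zero-width characters; optionally keep ZWNJ and ZWJ.

    Hypothetical helper; the keep_joiners option is not from the article.
    """
    drop = [c for c in ZERO_WIDTH if not (keep_joiners and c in JOINERS)]
    # Mapping a code point to None in str.translate deletes it.
    return text.translate({ord(c): None for c in drop})

print(strip_zero_width("\ufeffhel\u200blo"))  # hello

# Man + ZWJ + woman + ZWJ + girl: one family emoji built from five code points.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(strip_zero_width(family, keep_joiners=True) == family)  # True
print(len(strip_zero_width(family)))                          # 3
```

With the default settings the family emoji falls apart into three separate emoji, which is usually not what a user-facing pipeline wants.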
3. Normalize whitespace
Replace all non-standard space variants with regular spaces, then collapse multiple spaces:
const cleaned = text
  // every Unicode space separator (category Zs) except U+0020
  .replace(/[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/g, ' ')
  .replace(/ {2,}/g, ' ')
4. Handle typographic punctuation
Whether to replace typographic punctuation depends on the destination:
- Publishing to a CMS or for human reading — leave typographic punctuation alone
- Feeding into an API, database, or code — replace with ASCII equivalents
const ascii = text
.replace(/[\u2018\u2019]/g, "'") // curly single quotes
.replace(/[\u201C\u201D]/g, '"') // curly double quotes
.replace(/\u2014/g, '--') // em dash
.replace(/\u2026/g, '...') // ellipsis
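Steps 1 through 4 can be chained into a single function. The following is a minimal sketch, not a definitive implementation: `purify` and its `ascii_punctuation` flag are hypothetical names, and the punctuation map covers only the characters discussed in this article.

```python
import re
import unicodedata

# Step 2: code points to delete (value None means "remove" in str.translate).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u00ad\u2060\ufeff"))
# Step 3: space separators other than U+0020.
SPACES = re.compile("[\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000]")
# Step 4: typographic punctuation and its ASCII replacement.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2014": "--",                 # em dash
    "\u2026": "...",                # ellipsis
})

def purify(text, ascii_punctuation=False):
    """Apply steps 1-4 in order; step 4 only for machine destinations."""
    text = unicodedata.normalize("NFC", text)   # step 1
    text = text.translate(ZERO_WIDTH)           # step 2
    text = SPACES.sub(" ", text)                # step 3: unify spaces
    text = re.sub(r" {2,}", " ", text)          # step 3: collapse runs
    if ascii_punctuation:
        text = text.translate(PUNCTUATION)      # step 4
    return text

raw = "\ufeff\u201cHello\u201d\u00a0\u2014 wor\u200bld\u2026"
print(purify(raw))                           # keeps curly quotes, dash, ellipsis
print(purify(raw, ascii_punctuation=True))   # "Hello" -- world...
```

Making step 4 opt-in mirrors the rule above: text headed for human readers keeps its typography, while text headed for a parser does not.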
5. Detect homoglyphs
Homoglyph detection requires checking each character's Unicode script property and flagging tokens that mix scripts. A regex over fixed character ranges is not enough. You need a regex engine with Unicode script support (for example \p{Script=Cyrillic} in JavaScript with the u flag), a lookup table of known homoglyphs, or a Unicode script-aware library.
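Python's standard library does not expose the script property directly, so the sketch below approximates it with the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK, and so on). This is a heuristic chosen for illustration, not a substitute for a real script table, and both function names are hypothetical.

```python
import unicodedata

def scripts_in(token):
    """Approximate the set of scripts used by the letters in a token.

    Heuristic: the first word of the Unicode character name.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    }

def mixed_script_tokens(text):
    """Return whitespace-separated tokens whose letters span several scripts."""
    return [tok for tok in text.split() if len(scripts_in(tok)) > 1]

# "p\u0430ypal" hides a Cyrillic U+0430; the last word is entirely Cyrillic.
sample = "Log in to p\u0430ypal today, or visit \u041c\u043e\u0441\u043a\u0432\u0430"

flagged = mixed_script_tokens(sample)
print([tok.encode("unicode_escape").decode() for tok in flagged])
```

Only the mixed token is flagged; a word written wholly in Cyrillic passes. The heuristic will produce false positives for text that legitimately mixes scripts, such as Japanese, so treat its output as a list of candidates to review rather than a verdict.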
What You Should Not Do
Do not strip all non-ASCII characters. That breaks legitimate multilingual content, proper names, and any text containing diacritics or non-Latin scripts.
Do not rely on spell-checkers. Spell-checkers compare whole words against dictionaries. They cannot see zero-width characters, and they typically don't detect homoglyphs unless specifically built for it.
Do not use strip() or trim() alone. These only remove leading and trailing whitespace. They do nothing about embedded artifacts.
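Two of these pitfalls are easy to demonstrate. The sample strings are invented for illustration.

```python
# Pitfall 1: dropping every non-ASCII character damages legitimate text.
name = "Zo\u00eb M\u00fcller"  # "Zoë Müller"
print(name.encode("ascii", "ignore").decode("ascii"))  # Zo Mller

# Pitfall 2: strip() only trims leading and trailing whitespace.
text = "  total:\u00a0100\u200b  "
trimmed = text.strip()
# The embedded NBSP survives, and so does the trailing zero-width space,
# because U+200B is not classified as whitespace.
print("\u00a0" in trimmed, "\u200b" in trimmed)  # True True
```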
When Purification Matters Most
Text purification is critical in these workflows:
- SEO content pipelines: search engines may treat contaminated tokens as different words
- LLM fine-tuning datasets: zero-width characters in training data can resurface in model output
- Code generation: AI-generated code that uses curly quotes as string delimiters will not compile
- Database storage: homoglyphs cause exact-match queries to fail silently
- API integrations: curly quotes used as JSON string delimiters break downstream parsers
TextPurify scans text at the code point level and removes all four categories of contamination in real time — zero-width characters, non-standard spaces, typographic punctuation, and homoglyphs — directly in your browser.