How to Purify Text: Removing Hidden Characters from AI and Web Content

When people talk about purifying text, they mean one thing: taking content that looks clean on the surface and removing everything that shouldn't be there — invisible characters, non-standard whitespace, encoding artifacts, and typographic substitutions that break downstream systems.

This is a different problem from spell-checking or grammar correction. The words are fine. The characters encoding them are not.

What Does It Mean to Purify Text?

Purifying text is the process of normalizing a string to contain only the characters you expect — no hidden control characters, no Unicode lookalikes, no smart quotes where straight quotes are required.

The need arises most often when text has passed through:

An AI language model (GPT, Claude, Gemini, Mistral)
A word processor (Microsoft Word, Google Docs)
A web page (copy-pasted from a browser)
A PDF extraction tool
A multilingual CMS or translation workflow

Each of these sources introduces different classes of artifacts.

The Four Categories of Text Contamination

Zero-width characters

Zero-width characters have no visible representation but occupy positions in the string. The most common:

Character	Code point	Source
Zero-width space	U+200B	Word processors, AI models
Zero-width non-joiner	U+200C	Arabic/Persian text tools
Zero-width joiner	U+200D	Emoji sequences, right-to-left scripts
Soft hyphen	U+00AD	Typesetting systems
Word joiner	U+2060	Microsoft Office
Zero-width no-break space	U+FEFF	UTF-8 BOM, legacy encodings

These characters cause hard-to-diagnose bugs: two strings look identical on screen, yet their lengths differ, regex matches fail, and database lookups return no results.
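A minimal Python sketch of the symptom. The strings `clean` and `dirty` are made-up examples that differ only by one embedded U+200B:

```python
# Two strings that render identically; `dirty` hides a zero-width space.
clean = "payment"
dirty = "pay\u200bment"

print(clean == dirty)          # False: the code points differ
print(len(clean), len(dirty))  # 7 8
print("payment" in dirty)      # False: substring search misses it
print(ascii(dirty))            # 'pay\u200bment' makes the hidden character visible
```

Printing with `ascii()` (or `repr()`) is a quick way to confirm a suspicion before reaching for a full cleaning step.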

Non-standard whitespace

Plain text uses U+0020 (regular space). Purified text also needs to be checked for:

U+00A0 — non-breaking space (very common in AI output, appears after numbers)
U+202F — narrow no-break space (French typography: before ;, !, ?)
U+2009 — thin space (mathematical typesetting)
U+3000 — ideographic space (CJK content)

These look identical to regular spaces in most editors. They break tokenizers, search indexes, and CSV parsers.
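As an illustration, the Python sketch below uses a made-up value, `10 kg` written with a no-break space, and shows how a naive split on U+0020 misses it:

```python
import unicodedata

# "10 kg" typed with a no-break space (U+00A0) instead of U+0020.
value = "10\u00a0kg"

print(value == "10 kg")   # False
print(value.split(" "))   # ['10\xa0kg']: splitting on U+0020 finds nothing

# Name every non-ASCII character so the culprit shows up.
for ch in value:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")  # U+00A0 NO-BREAK SPACE
```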

Typographic punctuation

AI models tend to emit typographic punctuation because their training data (books, articles, well-edited web content) uses it consistently:

“ ” (U+201C, U+201D) instead of "
‘ ’ (U+2018, U+2019) instead of '
— (em dash) instead of -
… (ellipsis, U+2026) instead of ...

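For human reading, these are correct. For machine-readable syntax (JSON, YAML, code, SQL, CSV) they cause failures wherever a parser expects the ASCII character. JSON whose strings are delimited by curly quotes will not parse. A SQL statement whose string literal is wrapped in smart quotes is a syntax error, and a smart apostrophe inside stored data no longer matches the same text typed with a straight one.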
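The JSON case is easy to reproduce. In this Python sketch the payload is a hypothetical example whose string delimiters were turned into curly quotes somewhere upstream:

```python
import json

# A payload whose string delimiters are curly quotes (U+201C / U+201D).
payload = "{\u201cname\u201d: \u201cAda\u201d}"

try:
    json.loads(payload)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)  # False

# Mapping the curly delimiters back to U+0022 makes the payload valid.
repaired = payload.replace("\u201c", '"').replace("\u201d", '"')
print(json.loads(repaired))  # {'name': 'Ada'}
```

A blanket replacement like this is only safe when the curly quotes really are delimiters. If they can also occur inside values, the repair needs more care.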

Homoglyphs

Homoglyphs are characters from different scripts that look identical to Latin letters:

Cyrillic а (U+0430) looks like Latin a (U+0061)
Cyrillic е (U+0435) looks like Latin e (U+0065)
Cyrillic о (U+043E) looks like Latin o (U+006F)

AI models trained on multilingual data occasionally mix scripts within a word. The result is a string that displays correctly but fails dictionary lookup, spell-checking, and exact-match search. Homoglyph substitution is also used deliberately in phishing and SEO spam — purifying text catches these cases.
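A small Python sketch of the problem, using a made-up spoofed word in which both "a" letters are Cyrillic:

```python
import unicodedata

latin = "paypal"
mixed = "p\u0430yp\u0430l"  # both "a" letters are Cyrillic U+0430

print(latin == mixed)  # False, although both render as "paypal"

# List the distinct characters with their Unicode names.
for ch in sorted(set(mixed)):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The Unicode names make the mixed scripts obvious even though the rendered text does not.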

How to Purify Text Correctly

A correct text purification pipeline does the following in order:

1. Unicode normalization

Run NFC (Canonical Decomposition, followed by Canonical Composition) normalization first. This collapses equivalent representations of the same character into a single canonical form. For example, é can be represented as one code point (U+00E9) or as e + combining accent (U+0065 + U+0301). NFC picks one consistently.

Most languages provide this as a built-in or standard-library call:

// JavaScript
const normalized = text.normalize('NFC')

# Python
import unicodedata
normalized = unicodedata.normalize('NFC', text)
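To see the effect, this Python sketch compares the two representations of é described above, using a made-up sample word:

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                 # True
```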

2. Strip zero-width characters

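After normalization, remove the zero-width category. In plain English prose these characters have no legitimate use. Emoji sequences rely on the zero-width joiner, and Persian and several Indic scripts rely on the joiner and non-joiner, so drop U+200C and U+200D from the pattern when the text may contain them: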

// JavaScript — remove common zero-width characters
const cleaned = text.replace(
  /[\u200B\u200C\u200D\u00AD\u2060\uFEFF]/g,
  ''
)
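A Python counterpart might look like the sketch below. The function name `strip_zero_width` is an illustrative choice, not part of any library:

```python
import re

# The six characters from the table of zero-width characters.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u00ad\u2060\ufeff]")

def strip_zero_width(text: str) -> str:
    """Remove zero-width characters, leaving everything else untouched."""
    return ZERO_WIDTH.sub("", text)

print(strip_zero_width("\ufeffpay\u200bment"))  # payment
```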

3. Normalize whitespace

Replace all non-standard space variants with regular spaces, then collapse multiple spaces:

// JavaScript: map Unicode space variants (including U+200A hair space) to U+0020
const cleaned = text
  .replace(/[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/g, ' ')
  .replace(/ {2,}/g, ' ')
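The same step in Python, as a sketch. It covers every non-ASCII character in Unicode's space-separator category (Zs), which is an assumption about how wide the net should be. The name `normalize_spaces` is illustrative:

```python
import re

# U+00A0, U+1680, U+2000 to U+200A, U+202F, U+205F and U+3000:
# the non-ASCII members of the space-separator category (Zs).
SPACE_VARIANTS = re.compile("[\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000]")

def normalize_spaces(text: str) -> str:
    """Map space variants to U+0020, then collapse runs of spaces."""
    text = SPACE_VARIANTS.sub(" ", text)
    return re.sub(" {2,}", " ", text)

print(normalize_spaces("10\u00a0kg"))  # 10 kg
```

Collapsing runs of spaces suits prose. For source code or other indentation-sensitive text, the second substitution is probably best skipped.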

4. Handle typographic punctuation

Whether to replace typographic punctuation depends on the destination:

Publishing to a CMS or for human reading — leave typographic punctuation alone
Feeding into an API, database, or code — replace with ASCII equivalents

const ascii = text
  .replace(/[\u2018\u2019]/g, "'")   // curly single quotes
  .replace(/[\u201C\u201D]/g, '"')   // curly double quotes
  .replace(/\u2014/g, '--')          // em dash
  .replace(/\u2026/g, '...')         // ellipsis
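In Python, one way to express the same mapping is a translation table. This is a sketch, and `to_ascii_punctuation` is a hypothetical helper name:

```python
# Translation table mirroring the replacements above.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
})

def to_ascii_punctuation(text: str) -> str:
    """Replace typographic punctuation with ASCII equivalents."""
    return text.translate(ASCII_PUNCTUATION)

print(to_ascii_punctuation("\u201cIt\u2019s fine\u2026\u201d"))  # "It's fine..."
```

A single `translate` call makes one pass over the string, and the table is easy to extend with other characters such as the en dash.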

5. Detect homoglyphs

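Homoglyph detection requires checking each character's Unicode script property and flagging tokens that mix scripts. A plain character-class regex is not enough. You need Unicode script data, whether through script-aware regex property escapes (such as \p{Script=Cyrillic} in JavaScript with the u flag), a lookup table of known homoglyphs, or a Unicode script-aware library.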
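A rough Python sketch of mixed-script flagging follows. Python's standard `unicodedata` module does not expose the script property, so this sketch assumes the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK) is a good enough stand-in. That is a heuristic, not the real property, and the function names are illustrative:

```python
import re
import unicodedata

def script_of(ch: str) -> str:
    """Crude script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def mixed_script_tokens(text: str) -> list[str]:
    """Return tokens whose letters come from more than one script."""
    flagged = []
    for token in re.findall(r"\w+", text):
        scripts = {script_of(ch) for ch in token if ch.isalpha()}
        if len(scripts) > 1:
            flagged.append(token)
    return flagged

# The token with a Cyrillic "a" is the only one flagged.
print(mixed_script_tokens("Log in to p\u0430ypal today"))
```

The heuristic will misfire on text that legitimately mixes scripts within a word, such as Japanese written with kanji and hiragana. A production check would more likely rely on real script data and a confusables table.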

What You Should Not Do

Do not strip all non-ASCII characters. That breaks legitimate multilingual content, proper names, and any text containing diacritics or non-Latin scripts.
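The damage is easy to demonstrate. In this Python sketch, `strip_non_ascii` is a hypothetical helper that shows the approach to avoid:

```python
def strip_non_ascii(text: str) -> str:
    """The approach to avoid: silently drop every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")

print(strip_non_ascii("Jos\u00e9 Mu\u00f1oz"))  # Jos Muoz
print(repr(strip_non_ascii("\u6771\u4eac")))    # '' (the whole word is gone)
```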

Do not rely on spell-checkers. Spell-checkers compare whole words against dictionaries. They cannot see zero-width characters, and they typically don't detect homoglyphs unless specifically built for it.

Do not use strip() or trim() alone. These only remove leading and trailing whitespace. They do nothing about embedded artifacts.
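For example, in Python (the sample string is made up):

```python
raw = "  pay\u200bment\u00a0due  "
trimmed = raw.strip()

print(ascii(trimmed))       # 'pay\u200bment\xa0due'
print("\u200b" in trimmed)  # True: the zero-width space survives
print("\u00a0" in trimmed)  # True: so does the no-break space
```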

When Purification Matters Most

Text purification is critical in these workflows:

SEO content pipelines: search engines may treat contaminated tokens as different words
LLM fine-tuning datasets: zero-width characters in training data can be reproduced at inference time
Code generation: AI-generated code that uses curly quotes as string delimiters will not compile
Database storage: homoglyphs cause exact-match queries to fail silently
API integrations: JSON payloads that use smart quotes as string delimiters break downstream parsers

TextPurify scans text at the code point level and removes all four categories of contamination in real time — zero-width characters, non-standard spaces, typographic punctuation, and homoglyphs — directly in your browser.