When you ask a language model to write something, you expect words. What you actually get is a sequence of tokens — and those tokens can encode Unicode characters that have no visible representation, characters that look like one thing but are coded as another, and punctuation that behaves differently from what you'd type on a keyboard.
This isn't a bug. It's a direct consequence of how language models are trained.
How LLMs Learn Text Representation
Large language models learn from text scraped from the internet, books, and other sources. That training data is not sanitized ASCII. It contains:
- Wikipedia articles with Unicode formatting
- Web pages with non-breaking spaces and smart quotes
- Academic papers with special mathematical and typographic characters
- Multilingual content mixing scripts
- HTML-extracted text with residual entities and artifacts
The model learns the statistical distribution of these characters in context. When it generates text, it samples from learned distributions — including distributions over unusual Unicode characters that appear in its training data.
The Tokenization Problem
Modern LLMs use subword tokenization (BPE or SentencePiece). The tokenizer breaks text into chunks, and the model generates tokens rather than characters directly.
Here's where hidden characters enter:
Token boundaries don't align with word boundaries. A word like "don't" might be tokenized as don, ’, t — using a right single quotation mark (U+2019, the typographic apostrophe) rather than a straight ASCII apostrophe (U+0027), because that's what appeared most often in the training data.
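The difference is invisible in most fonts but real at the byte level. A minimal Python check (the strings are chosen purely for illustration):

```python
# Two visually near-identical words built with different apostrophes.
s_ascii = "don't"        # U+0027 APOSTROPHE
s_typo = "don\u2019t"    # U+2019 RIGHT SINGLE QUOTATION MARK

print(s_ascii == s_typo)        # False: different code points
print(s_ascii.encode("utf-8"))  # 5 bytes
print(s_typo.encode("utf-8"))   # 7 bytes: U+2019 takes 3 bytes in UTF-8
```

Any exact-match lookup — a database query, a dictionary check, a string comparison in code — treats these as two different words.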
Whitespace tokens vary. The model learned that certain positions call for U+00A0 (non-breaking space) rather than U+0020 (regular space) — because in its training data, that character appeared after numbers, between abbreviations, or in specific formatting contexts.
Zero-width characters are part of the vocabulary. U+200B (zero-width space), U+200C (zero-width non-joiner), and U+200D (zero-width joiner) appear in training data as formatting tools — for example, to prevent ligatures in certain scripts, or as copy-paste artifacts from word processors. The model has assigned them probabilities in certain contexts.
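These invisible characters can be made visible by asking Python's standard library for their official Unicode names. A short sketch (the sample string is illustrative):

```python
import unicodedata

# A string containing a non-breaking space and a zero-width space.
text = "price:\u00a0100\u200bUSD"

for ch in text:
    # Every assigned code point has an official name in the Unicode database.
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}")
```

Running this reveals NO-BREAK SPACE and ZERO WIDTH SPACE sitting between otherwise ordinary characters.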
The Typographic Defaults
Language models produce "typographic" punctuation by default because typographic punctuation is more common in their training data than ASCII punctuation.
Books and articles use:
- “ and ” (curly double quotes) more than "
- ‘ and ’ (curly single quotes) more than '
- — (em dash) more than -
- … (ellipsis character) more than ...
The model learned that professional writing uses these forms. When it generates text meant to sound professional, it reaches for the forms it saw most often in that register.
This is fine for human reading. It's a problem for:
- JSON and APIs — smart quotes and em dashes inside JSON strings break parsers
- Databases — search queries on plain quote characters won't match stored smart quotes
- Code — if model-generated code contains curly quotes inside string literals, it won't compile
- SEO pipelines — some indexing tools treat typographic punctuation as separate tokens
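The JSON failure mode is easy to reproduce with the standard library. A sketch:

```python
import json

good = '{"name": "value"}'                      # straight ASCII quotes
bad = "{\u201cname\u201d: \u201cvalue\u201d}"   # curly quotes U+201C/U+201D

print(json.loads(good))  # parses to {'name': 'value'}

try:
    json.loads(bad)
except json.JSONDecodeError as e:
    # Curly quotes are not valid JSON string delimiters.
    print("parse failed:", e)
```

The curly-quoted version looks identical at a glance but is rejected by every spec-compliant JSON parser.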
Homoglyphs: The Cross-Script Problem
This one is subtle. Cyrillic and Latin alphabets share many letter shapes. Cyrillic а, е, о, р, с, х look identical to Latin a, e, o, p, c, x — but they're different Unicode code points.
LLMs trained on multilingual data — which includes Russian, Ukrainian, Bulgarian, Serbian, and other Cyrillic-script languages alongside English — sometimes generate words that mix scripts. A word might start with a Latin letter and contain a Cyrillic letter in the middle, or vice versa.
The result:
- The word looks correct to a human reader
- The word fails string matching against a dictionary of correctly encoded terms
- Spell checkers often miss it
- Search engines may index it as a different word
This is one of the hardest artifacts to find without character-level scanning.
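Character-level scanning for this case means checking which script each letter belongs to. A minimal mixed-script detector using `unicodedata` (the second test word is a deliberately constructed example):

```python
import unicodedata

def scripts_used(word):
    """Return the set of scripts (e.g. LATIN, CYRILLIC) used in a word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # Official names start with the script, e.g.
            # "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch).split(" ")[0])
    return scripts

print(scripts_used("paper"))        # {'LATIN'}
print(scripts_used("p\u0430per"))   # {'LATIN', 'CYRILLIC'} - Cyrillic а inside
```

A word whose letters span more than one script is a strong homoglyph signal, though legitimate multilingual text needs to be whitelisted.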
Why Standard Spell Checkers Don't Help
Spell checkers work at the word level. They compare words against a dictionary. If a word contains a zero-width character or a Cyrillic homoglyph, the spell checker sees a word not in its dictionary — but it doesn't know why it's not matching. It may suggest corrections, or it may just flag the word as unknown and move on.
Detecting Unicode artifacts requires working at the code point level — scanning every character individually and checking it against category sets of known problematic characters.
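A code-point scan can be as simple as a loop over the string against a set of known-problematic characters. The set below is a small illustrative sample, not an exhaustive list:

```python
# Sample of problematic code points; a real scanner would use a larger set.
SUSPECT = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u00a0": "NO-BREAK SPACE",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE (BOM)",
}

def scan(text):
    """Yield (index, code point, name) for each suspect character."""
    for i, ch in enumerate(text):
        if ch in SUSPECT:
            yield i, f"U+{ord(ch):04X}", SUSPECT[ch]

for hit in scan("Hello\u200b world\u00a0!"):
    print(hit)  # reports the zero-width space and the non-breaking space
```

Reporting the index alongside the code point matters: it tells an editor exactly where in the text the invisible character sits.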
The Practical Fix
For any workflow that programmatically processes AI-generated text:
- Normalize before storing — run NFC normalization on all input text
- Scan for artifacts — check for zero-width characters, non-standard spaces, and homoglyphs
- Replace typographic punctuation — convert to ASCII equivalents if the destination requires it
- Verify encoding — ensure the storage layer handles UTF-8 correctly end to end
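The first three steps above can be sketched as a single cleaning function. The replacement table is a minimal illustration, not a complete mapping:

```python
import unicodedata

# Illustrative ASCII replacements for common typographic characters.
ASCII_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",              # curly single quotes
    "\u201c": '"', "\u201d": '"',              # curly double quotes
    "\u2013": "-", "\u2014": "-",              # en dash, em dash
    "\u2026": "...",                           # ellipsis character
    "\u00a0": " ",                             # non-breaking space
    "\u200b": "", "\u200c": "", "\u200d": "",  # zero-width characters
})

def clean(text):
    """Normalize to NFC, then map typographic characters to ASCII."""
    return unicodedata.normalize("NFC", text).translate(ASCII_MAP)

print(clean("\u201cdon\u2019t\u201d\u2026"))  # "don't"...
```

Whether to apply the ASCII mapping depends on the destination — keep typographic punctuation for published prose, strip it for JSON, code, and database fields.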
For human editors reviewing AI-generated copy:
- Use a character-level inspector — not just a spell checker
- Pay attention to smart quotes — if you're pasting into a CMS that applies its own smart-quote conversion, quotes that are already curly can get converted a second time
- Check around punctuation — zero-width characters most commonly appear next to punctuation marks
TextPurify scans text at the code point level, detecting all four artifact categories in real time — directly in your browser, with no data transmitted to a server.