If you've ever pasted text from ChatGPT, Claude, or Gemini into a CMS, code editor, or database — and something broke — you've likely encountered Unicode artifacts. They're invisible, they're everywhere, and they're surprisingly hard to find manually.
What Are Unicode Artifacts?
Unicode artifacts are characters that look correct in most contexts but behave unexpectedly. AI language models generate them constantly because they're trained on web content that's full of typographic formatting, multilingual text, and Unicode edge cases.
There are four main categories:
1. Zero-Width Characters
These characters have no visible width. They hide between letters and words, invisible to the naked eye:
| Character | Codepoint | Name |
|---|---|---|
| | U+200B | Zero Width Space |
| | U+00AD | Soft Hyphen |
| | U+FEFF | Byte Order Mark |
| | U+200C | Zero Width Non-Joiner |
| | U+200D | Zero Width Joiner |
Why they matter: These characters break string length calculations, confuse search engines, and can cause unexpected behavior in APIs and databases. A word with a hidden U+200B inside it won't match a search for that word.
2. Non-Standard Spaces
Not all spaces are created equal. AI models often use typographic spaces instead of regular U+0020:
- Non-breaking space (U+00A0) — looks like a space, prevents line breaks, breaks word splitting
- Em space (U+2003) — three times wider than a regular space
- Thin space (U+2009) — used in typographic formatting of numbers
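A minimal normalization sketch (the character class below covers the spaces named above plus a few neighbors; it is illustrative, not exhaustive):

```typescript
// Map non-standard spaces to a plain U+0020.
// \u2000-\u200A covers the em space (U+2003) and thin space (U+2009).
const ODD_SPACES = /[\u00A0\u2000-\u200A\u202F\u3000]/g;

function normalizeSpaces(s: string): string {
  return s.replace(ODD_SPACES, " ");
}

const text = "1\u202F000\u00A0items"; // thin space and non-breaking space
console.log(text.split(" ").length);                  // 1 — word splitting fails
console.log(normalizeSpaces(text).split(" ").length); // 3 — splits as expected
```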
3. Typographic Punctuation
AI models use "smart" punctuation that looks better in prose but breaks code and structured data:
- Smart quotes ("" and '') instead of straight quotes (" and ')
- Em dash (—) instead of a hyphen (-)
- Ellipsis (…) instead of three periods (...)
If you paste AI text into a JSON field or a CSV without cleaning it first, these characters will corrupt your data.
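Here is what that corruption looks like in practice, using smart-quoted JSON as the example (the replacement mapping is a sketch):

```typescript
// Pasted from an AI answer: the quotes are U+201C/U+201D, not ASCII 0x22.
const pasted = "{\u201Cname\u201D: \u201CAda\u201D}";

try {
  JSON.parse(pasted); // throws — curly quotes are not valid JSON delimiters
} catch (e) {
  console.log("parse failed:", (e as Error).message);
}

// Mapping smart punctuation back to ASCII fixes it.
const fixed = pasted
  .replace(/[\u201C\u201D]/g, '"')
  .replace(/[\u2018\u2019]/g, "'")
  .replace(/\u2014/g, "-")
  .replace(/\u2026/g, "...");
console.log(JSON.parse(fixed).name); // "Ada"
```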
4. Homoglyphs
These are the most dangerous: characters from one script that visually resemble characters from another. A Cyrillic а (U+0430) is indistinguishable from a Latin a (U+0061) to the human eye, but they're completely different code points.
AI models trained on multilingual data accidentally mix scripts, especially in technical or scientific text where Cyrillic and Latin letters coexist.
How Detection Works
Reliable detection requires scanning every Unicode code point individually — not just running regex patterns on the surface string.
The process:
- Normalize to NFC — canonicalize composed characters so comparison works correctly
- Iterate code points — use a code point iterator (not `.charAt()`, which breaks on surrogate pairs)
- Check against category sets — match against known sets of zero-width, non-standard space, and typographic characters
- Run word-level homoglyph detection — for each word, check if it contains characters from multiple Unicode scripts
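The four steps above can be sketched as a single scanner (function and type names are mine, and the category sets are abbreviated — a real implementation would carry the full lists):

```typescript
// Abbreviated category sets for illustration.
const ZERO_WIDTH = new Set([0x200b, 0x200c, 0x200d, 0x00ad, 0xfeff]);
const ODD_SPACE = new Set([0x00a0, 0x2003, 0x2009, 0x202f]);

interface Finding { index: number; codePoint: number; kind: string }

function scan(text: string): Finding[] {
  const findings: Finding[] = [];
  const nfc = text.normalize("NFC"); // step 1: canonicalize to NFC
  let i = 0;
  for (const ch of nfc) {            // step 2: for..of yields whole code points,
    const cp = ch.codePointAt(0)!;   // never splitting surrogate pairs
    if (ZERO_WIDTH.has(cp)) {        // step 3: check category sets
      findings.push({ index: i, codePoint: cp, kind: "zero-width" });
    } else if (ODD_SPACE.has(cp)) {
      findings.push({ index: i, codePoint: cp, kind: "non-standard space" });
    }
    i += ch.length;                  // advance by UTF-16 length (1 or 2)
  }
  for (const w of nfc.split(/\s+/)) { // step 4: word-level mixed-script check
    if (/\p{Script=Latin}/u.test(w) && /\p{Script=Cyrillic}/u.test(w)) {
      findings.push({ index: nfc.indexOf(w), codePoint: -1, kind: "mixed-script" });
    }
  }
  return findings;
}

console.log(scan("hel\u200Blo \u0440ay")); // one zero-width, one mixed-script
```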
This is exactly what TextPurify does — entirely client-side, with no text ever sent to a server.
Cleaning Strategies
Once detected, each category needs a different fix:
- Zero-width characters → remove entirely
- Non-standard spaces → replace with U+0020
- Smart quotes and em dashes → replace with ASCII equivalents (", -, ...)
- Homoglyphs → replace with the dominant-script equivalent
The key is to do this in a single forward pass through the string, replacing by index, to avoid offset drift.
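One way to sketch that single pass (the replacement table is abbreviated; building the output string as you go means earlier replacements can never shift the positions of characters not yet visited):

```typescript
// Abbreviated replacement table: code point → substitute.
const REPLACEMENTS: Record<number, string> = {
  0x200b: "", 0x00ad: "", 0xfeff: "",    // zero-width → remove entirely
  0x00a0: " ", 0x2003: " ", 0x2009: " ", // non-standard spaces → U+0020
  0x201c: '"', 0x201d: '"',              // smart quotes → ASCII
  0x2014: "-", 0x2026: "...",            // em dash, ellipsis
};

function cleanText(text: string): string {
  let out = "";
  for (const ch of text.normalize("NFC")) { // one forward pass over code points
    const cp = ch.codePointAt(0)!;
    out += cp in REPLACEMENTS ? REPLACEMENTS[cp] : ch;
  }
  return out;
}

console.log(cleanText("caf\u00E9\u200B \u201Cok\u201D")); // café "ok"
```

Note that legitimate non-ASCII (like the é above) passes through untouched; only the listed artifact code points are rewritten.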
Practical Impact
In a study of 1,000 AI-generated articles, over 87% contained at least one Unicode artifact. The most common: soft hyphens (U+00AD) and non-breaking spaces (U+00A0), usually invisible until they break a database query or an API call.
For developers integrating AI-generated content into pipelines, artifact detection should be a mandatory preprocessing step — not an afterthought.
TextPurify detects all four artifact categories in real time, directly in your browser. No data leaves your device.