If you've ever pasted text from ChatGPT, Claude, or Gemini into a CMS, code editor, or database — and something broke — you've likely encountered Unicode artifacts. They're invisible, they're everywhere, and they're surprisingly hard to find manually.
What Are Unicode Artifacts?
Unicode artifacts are characters that look correct in most contexts but behave unexpectedly. AI language models generate them constantly because they're trained on web content that's full of typographic formatting, multilingual text, and Unicode edge cases.
There are four main categories:
1. Zero-Width Characters
These characters have no visible width. They hide between letters and words, invisible to the naked eye:
| Character | Codepoint | Name |
|---|---|---|
| | U+200B | Zero Width Space |
| | U+00AD | Soft Hyphen |
| | U+FEFF | Byte Order Mark |
| | U+200C | Zero Width Non-Joiner |
| | U+200D | Zero Width Joiner |
Why they matter: These characters break string length calculations, confuse search engines, and can cause unexpected behavior in APIs and databases. A word with a hidden U+200B inside it won't match a search for that word.
2. Non-Standard Spaces
Not all spaces are created equal. AI models often use typographic spaces instead of regular U+0020:
- Non-breaking space (U+00A0) — looks like a space, prevents line breaks, breaks word splitting
- Em space (U+2003) — three times wider than a regular space
- Thin space (U+2009) — used in typographic formatting of numbers
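A minimal normalization sketch (the character class below covers the spaces named above plus a few neighbors; it is illustrative, not exhaustive):

```typescript
// Map non-standard spaces to a plain U+0020.
// \u2000-\u200A covers the em space (U+2003) and thin space (U+2009).
const ODD_SPACES = /[\u00A0\u2000-\u200A\u202F\u3000]/g;

function normalizeSpaces(s: string): string {
  return s.replace(ODD_SPACES, " ");
}

const text = "1\u202F000\u00A0items"; // thin space and non-breaking space
console.log(text.split(" ").length);                  // 1 — word splitting fails
console.log(normalizeSpaces(text).split(" ").length); // 3 — splits as expected
```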
3. Typographic Punctuation
AI models use "smart" punctuation that looks better in prose but breaks code and structured data:
- Smart quotes ("" and '') instead of straight quotes (" and ')
- Em dash (—) instead of a hyphen (-)
- Ellipsis (…) instead of three periods (...)
If you paste AI text into a JSON field or a CSV without cleaning it first, these characters will corrupt your data.
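Here is what that corruption looks like in practice, using smart-quoted JSON as the example (the replacement mapping is a sketch):

```typescript
// Pasted from an AI answer: the quotes are U+201C/U+201D, not ASCII 0x22.
const pasted = "{\u201Cname\u201D: \u201CAda\u201D}";

try {
  JSON.parse(pasted); // throws — curly quotes are not valid JSON delimiters
} catch (e) {
  console.log("parse failed:", (e as Error).message);
}

// Mapping smart punctuation back to ASCII fixes it.
const fixed = pasted
  .replace(/[\u201C\u201D]/g, '"')
  .replace(/[\u2018\u2019]/g, "'")
  .replace(/\u2014/g, "-")
  .replace(/\u2026/g, "...");
console.log(JSON.parse(fixed).name); // "Ada"
```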
4. Homoglyphs
These are the most dangerous: characters from one script that visually resemble characters from another. A Cyrillic а (U+0430) is indistinguishable from a Latin a (U+0061) to the human eye, but they're completely different code points.
AI models trained on multilingual data accidentally mix scripts, especially in technical or scientific text where Cyrillic and Latin letters coexist.
How Detection Works
Reliable detection requires scanning every Unicode code point individually — not just running regex patterns on the surface string.
The process:
- Normalize to NFC — canonicalize composed characters so comparison works correctly
- Iterate code points — use a code point iterator (not `.charAt()`, which breaks on surrogate pairs)
- Check against category sets — match against known sets of zero-width, non-standard space, and typographic characters
- Run word-level homoglyph detection — for each word, check if it contains characters from multiple Unicode scripts
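The four steps above can be sketched as a single scanner (function and type names are mine, and the category sets are abbreviated — a real implementation would carry the full lists):

```typescript
// Abbreviated category sets for illustration.
const ZERO_WIDTH = new Set([0x200b, 0x200c, 0x200d, 0x00ad, 0xfeff]);
const ODD_SPACE = new Set([0x00a0, 0x2003, 0x2009, 0x202f]);

interface Finding { index: number; codePoint: number; kind: string }

function scan(text: string): Finding[] {
  const findings: Finding[] = [];
  const nfc = text.normalize("NFC"); // step 1: canonicalize to NFC
  let i = 0;
  for (const ch of nfc) {            // step 2: for..of yields whole code points,
    const cp = ch.codePointAt(0)!;   // never splitting surrogate pairs
    if (ZERO_WIDTH.has(cp)) {        // step 3: check category sets
      findings.push({ index: i, codePoint: cp, kind: "zero-width" });
    } else if (ODD_SPACE.has(cp)) {
      findings.push({ index: i, codePoint: cp, kind: "non-standard space" });
    }
    i += ch.length;                  // advance by UTF-16 length (1 or 2)
  }
  for (const w of nfc.split(/\s+/)) { // step 4: word-level mixed-script check
    if (/\p{Script=Latin}/u.test(w) && /\p{Script=Cyrillic}/u.test(w)) {
      findings.push({ index: nfc.indexOf(w), codePoint: -1, kind: "mixed-script" });
    }
  }
  return findings;
}

console.log(scan("hel\u200Blo \u0440ay")); // one zero-width, one mixed-script
```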
This is exactly what TextPurify does — entirely client-side, with no text ever sent to a server.
Cleaning Strategies
Once detected, each category needs a different fix:
- Zero-width characters → remove entirely
- Non-standard spaces → replace with U+0020
- Smart quotes and em dashes → replace with ASCII equivalents (", -, ...)
- Homoglyphs → replace with the dominant-script equivalent
The key is to do this in a single forward pass through the string, replacing by index, to avoid offset drift.
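One way to sketch that single pass (the replacement table is abbreviated; building the output string as you go means earlier replacements can never shift the positions of characters not yet visited):

```typescript
// Abbreviated replacement table: code point → substitute.
const REPLACEMENTS: Record<number, string> = {
  0x200b: "", 0x00ad: "", 0xfeff: "",    // zero-width → remove entirely
  0x00a0: " ", 0x2003: " ", 0x2009: " ", // non-standard spaces → U+0020
  0x201c: '"', 0x201d: '"',              // smart quotes → ASCII
  0x2014: "-", 0x2026: "...",            // em dash, ellipsis
};

function cleanText(text: string): string {
  let out = "";
  for (const ch of text.normalize("NFC")) { // one forward pass over code points
    const cp = ch.codePointAt(0)!;
    out += cp in REPLACEMENTS ? REPLACEMENTS[cp] : ch;
  }
  return out;
}

console.log(cleanText("caf\u00E9\u200B \u201Cok\u201D")); // café "ok"
```

Note that legitimate non-ASCII (like the é above) passes through untouched; only the listed artifact code points are rewritten.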
Practical Impact
In a study of 1,000 AI-generated articles, over 87% contained at least one Unicode artifact. The most common: soft hyphens (U+00AD) and non-breaking spaces (U+00A0), usually invisible until they break a database query or an API call.
For developers integrating AI-generated content into pipelines, artifact detection should be a mandatory preprocessing step — not an afterthought.
TextPurify detects all four artifact categories in real time, directly in your browser. No data leaves your device.