
Frequently Asked Questions

How every TextPurify tool works, explained.

Cleaner

What does the Cleaner actually remove?

The Cleaner detects and removes four categories of Unicode artifacts: zero-width characters (U+200B, U+200C, U+200D, U+FEFF, and others), non-standard whitespace (non-breaking spaces, thin spaces, narrow no-break spaces, etc.), typographic punctuation (curly quotes, em dashes, ellipsis characters), and homoglyphs (Cyrillic or Greek letters that look identical to Latin characters). You choose which categories to remove using the Clean Mode selector.

What are zero-width characters and why are they a problem?

Zero-width characters are Unicode code points that have no visible representation but still occupy a position in the string. They are commonly inserted by word processors, AI language models, and copy-paste operations. They cause subtle bugs: string length checks pass, but regex matches fail; database lookups return no results for strings that look identical on screen; JSON parsers can reject strings that appear valid.
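The failure mode is easy to reproduce in a few lines of JavaScript. The regex below covers only the most common zero-width code points, a simplified sketch of what the Cleaner targets rather than its full character set:

```javascript
// Two strings that render identically; the second hides a zero-width space.
const clean = "password";
const tainted = "pass\u200Bword";

console.log(clean === tainted); // false: the exact match fails
console.log(tainted.length);    // 9, not 8: the invisible character counts

// Stripping the most common zero-width code points (a simplified sketch,
// not the complete set the Cleaner handles):
const stripped = tainted.replace(/[\u200B\u200C\u200D\uFEFF]/g, "");
console.log(stripped === clean); // true
```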

What is a homoglyph?

A homoglyph is a character from one script that looks visually identical to a character from another script. For example, Cyrillic "а" (U+0430) is indistinguishable from Latin "a" (U+0061). AI models trained on multilingual data occasionally mix scripts within a single word. The result looks correct to a human reader but fails dictionary lookup, spell-checking, and exact-match search.
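A quick JavaScript demonstration. Note that Unicode normalization (NFC/NFKC) does not fold homoglyphs together, so they must be mapped explicitly; the one-character replacement below is illustrative, not the Cleaner's full mapping table:

```javascript
const latin = "apple";      // Latin "a" (U+0061)
const mixed = "\u0430pple"; // Cyrillic "а" (U+0430), visually identical

console.log(latin === mixed);                   // false: exact match fails
console.log(mixed.codePointAt(0).toString(16)); // "430", the Cyrillic code point

// NFKC normalization does not fold homoglyphs:
console.log(mixed.normalize("NFKC") === latin); // still false

// They must be replaced via an explicit mapping:
const deconfused = mixed.replace(/\u0430/g, "a");
console.log(deconfused === latin);              // true
```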

Does cleaning change the meaning of my text?

No. The Cleaner only removes or replaces characters that are invisible or functionally equivalent to their ASCII counterparts. It does not reorder words, correct grammar, or modify visible content. If you are unsure, use the Inspector tool first to see exactly what will be removed before cleaning.

What is the difference between Clean Mode options?

Standard Unicode removes zero-width characters and non-standard spaces — safe for all text. Normalize Typography additionally converts curly quotes, em dashes, and ellipsis characters to their ASCII equivalents — useful when the destination is an API, database, or code. Full Clean applies all categories including homoglyph replacement.
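The three modes can be pictured as cumulative replacement passes. This sketch uses only a handful of code points per category; the character sets TextPurify actually covers are broader, and Full Clean's homoglyph mapping is omitted here:

```javascript
const PASSES = {
  zeroWidth: s => s.replace(/[\u200B\u200C\u200D\uFEFF]/g, ""), // remove entirely
  spaces: s => s.replace(/[\u00A0\u2009\u202F]/g, " "),         // normalize to ASCII space
  typography: s => s
    .replace(/[\u2018\u2019]/g, "'")  // curly single quotes
    .replace(/[\u201C\u201D]/g, '"')  // curly double quotes
    .replace(/\u2014/g, "--")         // em dash
    .replace(/\u2026/g, "..."),       // ellipsis character
};

function clean(text, mode) {
  let out = PASSES.spaces(PASSES.zeroWidth(text));        // Standard Unicode
  if (mode !== "standard") out = PASSES.typography(out);  // Normalize Typography and up
  return out;
}

console.log(clean("\u201Chi\u201D\u200B", "normalize")); // prints "hi" in straight ASCII quotes
```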

Inspector

What does the Inspector show?

The Inspector lists every detected artifact in your text with its exact character, Unicode code point (e.g. U+200B), Unicode category, script origin, position in the string, and a plain-language description of what it is. It lets you see precisely what the Cleaner would remove before you commit to the change.

Why would I use the Inspector instead of just cleaning?

Sometimes you want to understand what is in your text rather than blindly remove it. The Inspector is useful for auditing AI-generated content, debugging encoding issues in a pipeline, or verifying that a specific artifact (like a Cyrillic homoglyph) is actually present before reporting it.

Can the Inspector detect artifacts that the Cleaner misses?

The Inspector and Cleaner use the same underlying detection engine, so they see the same artifacts. The Inspector shows you the full list; the Cleaner acts on it. The difference is read vs. write.

Text Stats

What is TTR?

TTR (Type-Token Ratio) is the number of unique words divided by the total number of words. A TTR of 0.72 means 72% of the words in your text are unique. Higher TTR indicates richer vocabulary. TTR is sensitive to text length — short texts always score high — so for longer texts, MTLD is a more reliable measure.
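As a formula, TTR is simple to compute. The sketch below tokenizes naively on letters; TextPurify's tokenizer may differ in detail:

```javascript
// Type-Token Ratio: unique words divided by total words, case-insensitive.
function ttr(text) {
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  if (tokens.length === 0) return 0;
  return new Set(tokens).size / tokens.length;
}

console.log(ttr("the cat sat on the mat")); // 5 unique / 6 total ≈ 0.833
```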

What is MTLD and why is it better than TTR?

MTLD (Measure of Textual Lexical Diversity) corrects the length bias of TTR by measuring how far you can read before the running TTR drops below a threshold of 0.720, then averaging that segment length across forward and backward passes. A score above 70 indicates diverse vocabulary; below 40 suggests repetition or limited word choice.
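A forward-pass-only sketch shows the mechanics; the full measure averages a forward and a backward pass, as described above:

```javascript
// Simplified forward-pass MTLD. Each time the running TTR of the current
// segment drops below the threshold, one "factor" is counted and the
// segment resets. MTLD = total tokens / number of factors.
function mtldForward(tokens, threshold = 0.72) {
  let factors = 0;
  let types = new Set();
  let count = 0;
  for (const tok of tokens) {
    count += 1;
    types.add(tok);
    if (types.size / count < threshold) {
      factors += 1;
      types = new Set();
      count = 0;
    }
  }
  // Credit the leftover segment as a partial factor:
  if (count > 0) factors += (1 - types.size / count) / (1 - threshold);
  return factors > 0 ? tokens.length / factors : tokens.length;
}

console.log(mtldForward(["a", "a", "a", "a"])); // 2: heavy repetition, low score
console.log(mtldForward(["a", "b", "c", "d"])); // 4: all unique, score = length
```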

What does Shannon Entropy measure?

Shannon Entropy measures the predictability of word distribution in your text. A high entropy (≥ 7) means words are spread across many unique terms — the text is unpredictable and varied. A low entropy (< 5) means a few words dominate the text, which signals repetition, keyword stuffing, or template-heavy writing.
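In formula terms this is H = −Σ p·log₂(p) over the word frequency distribution. A minimal sketch, assuming the text is already tokenized:

```javascript
// Shannon entropy of a token list, in bits.
function shannonEntropy(tokens) {
  const freq = new Map();
  for (const t of tokens) freq.set(t, (freq.get(t) || 0) + 1);
  let h = 0;
  for (const count of freq.values()) {
    const p = count / tokens.length;
    h -= p * Math.log2(p);
  }
  return h;
}

console.log(shannonEntropy(["a", "a", "b", "b"])); // 1 bit: two equally likely words
console.log(shannonEntropy(["a", "a", "a", "a"])); // 0 bits: fully predictable
```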

What is Burstiness and what does it reveal about AI text?

Burstiness measures the variance in sentence length relative to the mean (coefficient of variation). Human writers naturally vary sentence length — mixing short punchy sentences with longer complex ones — producing a burstiness score above 0.6. AI language models are trained to produce statistically average sentences, resulting in flat burstiness below 0.3. A low burstiness score is one of the signals associated with AI-generated text.
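As the coefficient of variation of sentence lengths, burstiness is a short computation over per-sentence word counts; sentence splitting itself is omitted from this sketch:

```javascript
// Coefficient of variation: standard deviation / mean of sentence lengths.
function burstiness(sentenceLengths) {
  const n = sentenceLengths.length;
  const mean = sentenceLengths.reduce((a, b) => a + b, 0) / n;
  const variance =
    sentenceLengths.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return Math.sqrt(variance) / mean;
}

console.log(burstiness([12, 12, 12, 12])); // 0: perfectly uniform, "flat" text
console.log(burstiness([3, 20, 4, 25]));   // ≈ 0.74: human-like variation
```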

How is the Gunning Fog index calculated?

Gunning Fog = 0.4 × (average sentence length + percentage of words with three or more syllables). It estimates the number of years of formal education a reader needs to understand the text on first reading. A score of 10 is accessible to most adults; above 14 is considered difficult. Academic and legal writing typically scores 15–20.
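Given those two inputs the formula is one line. This sketch takes the averages as arguments; the real tool must also count sentences and syllables, and syllable counting is itself a heuristic:

```javascript
// Gunning Fog from pre-computed inputs:
// average words per sentence, and percent of words with 3+ syllables.
function gunningFog(avgSentenceLength, pctComplexWords) {
  return 0.4 * (avgSentenceLength + pctComplexWords);
}

// 15 words per sentence, 10% three-syllable words:
console.log(gunningFog(15, 10)); // 10, accessible to most adults
```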

What are stop words and why are they excluded from the word frequency chart?

Stop words are common function words (the, is, at, which, on…) that appear frequently in all texts and carry little semantic meaning. The word frequency chart excludes them to highlight the content words that define the topic and tone of your specific text.

Linguistic

How does the POS distribution work?

The Linguistic tool uses compromise.js, a lightweight NLP library (~250 KB) that runs entirely in your browser. It tags every word in your text with its part of speech — Noun, Verb, Adjective, Adverb, or Other — using a combination of dictionary lookup and pattern matching. The stacked bar shows the percentage of each category. High noun ratios are typical of formal or academic writing; high verb ratios appear in active, narrative prose.

How is passive voice detected?

Passive voice is detected by identifying sentences where the POS tagger assigns the "Passive" tag to a verb. This typically matches constructions like "was written", "is being processed", "were found". Each flagged sentence is shown so you can decide whether to rewrite it as an active construction.

What are nominalizations and why do they matter?

Nominalizations are verbs or adjectives converted into abstract nouns by adding suffixes like -tion, -ness, -ity, -ment, -ance. For example, "implement" becomes "implementation", "aware" becomes "awareness". Heavy nominalization is a marker of bureaucratic, academic, or AI-generated writing. Replacing nominalizations with their root verb forms makes text more direct and readable: "the implementation of the solution" → "implementing the solution".
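A suffix matcher captures the core of this check. The suffix list and length cutoff below are illustrative; a production detector also filters false positives such as "nation" or "station":

```javascript
// Naive nominalization spotter: long words ending in typical noun suffixes.
const NOMINAL_SUFFIXES = /(tion|ness|ity|ment|ance|ence)s?$/;

function findNominalizations(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter(w => w.length > 7 && NOMINAL_SUFFIXES.test(w));
}

console.log(findNominalizations("The implementation raised awareness of the requirement."));
// ["implementation", "awareness", "requirement"]
```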

What does the sentence complexity score measure?

Sentence complexity counts subordinating conjunctions and relative pronouns (because, although, which, who, that, since, unless…) per sentence and divides by sentence count. A score below 0.6 is Simple — short, direct sentences. Between 0.6 and 1.5 is Moderate — appropriate for most audiences. Above 1.5 is Complex — dense clause structure that may be hard to follow.
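The counting itself is straightforward. This sketch uses a small marker list and assumes sentences are already split:

```javascript
// Subordinating conjunctions and relative pronouns (illustrative subset).
const MARKERS = /\b(because|although|which|who|that|since|unless|while|whereas)\b/gi;

function complexityScore(sentences) {
  const hits = sentences.reduce(
    (sum, s) => sum + (s.match(MARKERS) || []).length,
    0
  );
  return hits / sentences.length;
}

console.log(complexityScore([
  "The dog barked.",
  "It barked because the cat, which it hated, ran.",
])); // 2 markers / 2 sentences = 1.0: Moderate
```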

Does the Linguistic tool work for languages other than English?

Compromise.js is an English-only library. The POS tagger, passive voice detection, and complexity score are designed for English text. Nominalization detection (suffix matching) will partially work in other Latin-script languages, but results will be less reliable. For non-English text, the SEO & Health tab uses LIX readability, which is language-agnostic.

SEO & Health

What is the Health Score?

The Health Score is a single 0–100 composite metric combining four dimensions: artifact cleanliness (are there Unicode artifacts?), readability (Flesch-Kincaid or LIX score), water density (ratio of filler words), and passage structure (AEO score). A score above 80 means the text is clean, readable, and well-structured. Below 50 signals significant issues in one or more dimensions.

What is the difference between Flesch-Kincaid and LIX?

Flesch-Kincaid measures readability using average sentence length and average syllables per word. It is calibrated for English; the 0 (very hard) to 100 (very easy) scale described here is, strictly speaking, the Flesch Reading Ease variant of the formula. LIX (Läsbarhetsindex) uses sentence length and the percentage of long words (more than 6 characters), so it works consistently across languages because character count is language-agnostic. TextPurify uses Flesch-Kincaid for English content and LIX for Russian and Spanish.
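LIX is simple enough to show in full: average sentence length plus the percentage of words longer than six characters. The tokenizer and sentence splitter here are naive sketches:

```javascript
// LIX = words/sentences + 100 * longWords/words, long word = more than 6 chars.
function lix(text) {
  const words = text.match(/[^\s.!?]+/g) || [];
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const longWords = words.filter(w => w.length > 6).length;
  return words.length / sentences.length + (100 * longWords) / words.length;
}

console.log(lix("One two three."));
// 3 words / 1 sentence + 0% long words = 3
```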

What are "water words"?

Water words (also called filler words) are phrases that add length without adding meaning: "it is important to note that", "in order to", "at the end of the day", "needless to say". AI-generated text tends to have a high water density because language models are trained to produce fluent, human-sounding prose and often default to padding. A water density above 15% is flagged as high.

What is the Passage & AEO Score?

AEO (Answer Engine Optimization) is the practice of structuring content so AI systems — Google AI Overviews, voice search, LLMs with web access — can reliably extract direct answers. The Passage Score evaluates four factors: whether the text has headings, whether headings are phrased as questions, whether paragraphs lead with a direct answer, and whether paragraphs are under 100 words. Text without any headings scores below 70 regardless of writing quality.
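For Markdown input, three of the four checks reduce to a few pattern matches; the "leads with a direct answer" check needs NLP and is omitted here. The sketch below reports raw counts rather than TextPurify's weighted 0–100 score, whose exact weights are not documented in this FAQ:

```javascript
// Structural passage checks on a Markdown string.
function passageChecks(markdown) {
  const headings = markdown.match(/^#{1,6} .+$/gm) || [];
  const paragraphs = markdown
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p && !p.startsWith("#"));
  return {
    hasHeadings: headings.length > 0,
    questionHeadings: headings.filter(h => h.endsWith("?")).length,
    shortParagraphs: paragraphs.filter(p => p.split(/\s+/).length < 100).length,
  };
}

const r = passageChecks("# What is AEO?\n\nAEO structures content for answer engines.");
console.log(r); // { hasHeadings: true, questionHeadings: 1, shortParagraphs: 1 }
```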

How does Sentiment Analysis work?

Sentiment is analyzed using a BERT-based model (DistilBERT fine-tuned on SST-2) that runs entirely in your browser via WebAssembly. The model classifies text as Positive, Negative, or Neutral with a confidence score. Because the underlying model is binary (positive/negative), results with confidence between 55% and 72% are mapped to Neutral. The model is optimized for English. On first use, it downloads ~67 MB of model weights and caches them in your browser.

What is Semantic Relevance?

Semantic Relevance uses sentence embeddings (all-MiniLM-L6-v2, ~23 MB) to measure how closely the content of your text matches a topic or query you enter. The model converts your query and each paragraph into high-dimensional vectors and computes cosine similarity. This goes beyond keyword matching — paragraphs that discuss the same concept using different words will score high. Scores above 0.5 are considered highly relevant.
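The comparison step is plain cosine similarity between two vectors. The embedding vectors themselves come from the model; this sketch shows only the similarity math:

```javascript
// Cosine similarity: dot product divided by the product of vector norms.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1: identical direction
console.log(cosineSimilarity([1, 0], [0, 1])); // 0: unrelated (orthogonal)
```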

Bulk Upload

What file formats does Bulk Upload support?

Bulk Upload currently supports .txt and .md (Markdown) files. Each file is read and processed through the same Unicode detection engine as the Cleaner. The output is a cleaned version of each file plus a per-file artifact report.

Is there a limit on the number of files or total size?

Free accounts can process up to 10 files per batch. Pro accounts have no batch limit. Individual file size is capped at 1 MB. Processing happens client-side in a Web Worker — no files are uploaded to a server.

How do I download the results?

After processing, each file shows a download button for the cleaned version. You can also download all results as a ZIP archive using the "Download all" button that appears after the batch completes.

API & Keys

What does the TextPurify API do?

The REST API exposes two endpoints: POST /api/v1/analyze returns a JSON report of all detected artifacts in a text (counts by category, positions, code points). POST /api/v1/clean returns the cleaned version of the text with a summary of what was removed. Both endpoints accept plain text in the request body and return JSON.

How do I authenticate API requests?

Include your API key in the Authorization header: Authorization: Bearer YOUR_API_KEY. Generate keys in the API & Keys tab. Each key can be revoked individually. Keys are shown only once at creation — store them securely.
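Putting the header format together in JavaScript. The base URL below is a placeholder for wherever your TextPurify instance is hosted, and the plain-text content type follows the endpoint description above:

```javascript
// Build a request object for POST /api/v1/analyze with Bearer auth.
function buildAnalyzeRequest(text, apiKey) {
  return {
    url: "https://textpurify.example/api/v1/analyze", // placeholder host
    options: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiKey}`,
        "Content-Type": "text/plain",
      },
      body: text,
    },
  };
}

// Usage: fetch(req.url, req.options).then(res => res.json())
const req = buildAnalyzeRequest("Some text to audit", "YOUR_API_KEY");
console.log(req.options.headers.Authorization); // "Bearer YOUR_API_KEY"
```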

What are the rate limits?

Free accounts: 100 API requests per day, up to 5,000 characters per request. Pro accounts: 10,000 requests per day, up to 100,000 characters per request. Rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining) are included in every response.

Is there an OpenAPI / Swagger spec?

Yes. The full OpenAPI 3.0 specification is available at /api/v1/openapi.json. You can import it into Postman, Insomnia, or any OpenAPI-compatible tool.

History

Where is my History stored?

History snapshots are stored in your browser's localStorage on your device. Nothing is sent to our servers. This means your history is private by design — but it also means it is tied to the browser and device you are using. Clearing your browser storage will erase your history.

How many records does History keep?

History stores the last 50 snapshots. When the limit is reached, the oldest record is automatically removed to make room for the new one.
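The cap behaves like a fixed-size queue: append the new snapshot, drop the oldest when over the limit. A sketch of that policy (TextPurify persists the resulting array to localStorage, as noted above):

```javascript
const MAX_SNAPSHOTS = 50;

// Append a record, trimming the oldest entries once the cap is exceeded.
function addSnapshot(history, record) {
  const next = [...history, record];
  return next.length > MAX_SNAPSHOTS
    ? next.slice(next.length - MAX_SNAPSHOTS)
    : next;
}

const full = Array.from({ length: 50 }, (_, i) => ({ id: i }));
const updated = addSnapshot(full, { id: 50 });
console.log(updated.length); // 50: still at the cap
console.log(updated[0].id);  // 1: the oldest record (id 0) was dropped
```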

Can I restore a previous version of my text?

Yes. Click the Restore button on any History entry to load that text back into the editor. This switches you to the Cleaner tool automatically so you can continue working on the restored text.

When is a History record created?

A record is created automatically when you have typed at least 50 characters and paused for 5 seconds (detect snapshot), and immediately when you click the Clean button (clean snapshot). Each record stores the full text, a short preview, the character count, the operation type, and the timestamp.
