
Frequently Asked Questions

How every TextPurify tool works, explained.

Cleaner

What does the Cleaner actually remove?

The Cleaner detects and removes four categories of Unicode artifacts: zero-width characters (U+200B, U+200C, U+200D, U+FEFF, and others), non-standard whitespace (non-breaking spaces, thin spaces, narrow no-break spaces, etc.), typographic punctuation (curly quotes, em dashes, ellipsis characters), and homoglyphs (Cyrillic or Greek letters that look identical to Latin characters). You choose which categories to remove using the Clean Mode selector.

What are zero-width characters and why are they a problem?

Zero-width characters are Unicode code points that have no visible representation but still occupy a position in the string. They are commonly inserted by word processors, AI language models, and copy-paste operations. They cause subtle bugs: string length checks pass, but regex matches fail; database lookups return no results for strings that look identical on screen; JSON parsers can reject strings that appear valid.
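The failure mode is easy to reproduce in a few lines of JavaScript. The regex below covers only the most common zero-width code points, a simplified sketch of what the Cleaner targets rather than its full character set:

```javascript
// Two strings that render identically; the second hides a zero-width space.
const clean = "password";
const tainted = "pass\u200Bword";

console.log(clean === tainted); // false: the exact match fails
console.log(tainted.length);    // 9, not 8: the invisible character counts

// Stripping the most common zero-width code points (a simplified sketch,
// not the complete set the Cleaner handles):
const stripped = tainted.replace(/[\u200B\u200C\u200D\uFEFF]/g, "");
console.log(stripped === clean); // true
```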

What is a homoglyph?

A homoglyph is a character from one script that looks visually identical to a character from another script. For example, Cyrillic "а" (U+0430) is indistinguishable from Latin "a" (U+0061). AI models trained on multilingual data occasionally mix scripts within a single word. The result looks correct to a human reader but fails dictionary lookup, spell-checking, and exact-match search.
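A quick JavaScript demonstration. Note that Unicode normalization (NFC/NFKC) does not fold homoglyphs together, so they must be mapped explicitly; the one-character replacement below is illustrative, not the Cleaner's full mapping table:

```javascript
const latin = "apple";      // Latin "a" (U+0061)
const mixed = "\u0430pple"; // Cyrillic "а" (U+0430), visually identical

console.log(latin === mixed);                   // false: exact match fails
console.log(mixed.codePointAt(0).toString(16)); // "430", the Cyrillic code point

// NFKC normalization does not fold homoglyphs:
console.log(mixed.normalize("NFKC") === latin); // still false

// They must be replaced via an explicit mapping:
const deconfused = mixed.replace(/\u0430/g, "a");
console.log(deconfused === latin);              // true
```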

Does cleaning change the meaning of my text?

No. The Cleaner only removes or replaces characters that are invisible or functionally equivalent to their ASCII counterparts. It does not reorder words, correct grammar, or modify visible content. If you are unsure, use the Inspector tool first to see exactly what will be removed before cleaning.

What is the difference between Clean Mode options?

Standard Unicode removes zero-width characters and non-standard spaces — safe for all text. Normalize Typography additionally converts curly quotes, em dashes, and ellipsis characters to their ASCII equivalents — useful when the destination is an API, database, or code. Full Clean applies all categories including homoglyph replacement.
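The three modes can be pictured as cumulative replacement passes. This sketch uses only a handful of code points per category; the character sets TextPurify actually covers are broader, and Full Clean's homoglyph mapping is omitted here:

```javascript
const PASSES = {
  zeroWidth: s => s.replace(/[\u200B\u200C\u200D\uFEFF]/g, ""), // remove entirely
  spaces: s => s.replace(/[\u00A0\u2009\u202F]/g, " "),         // normalize to ASCII space
  typography: s => s
    .replace(/[\u2018\u2019]/g, "'")  // curly single quotes
    .replace(/[\u201C\u201D]/g, '"')  // curly double quotes
    .replace(/\u2014/g, "--")         // em dash
    .replace(/\u2026/g, "..."),       // ellipsis character
};

function clean(text, mode) {
  let out = PASSES.spaces(PASSES.zeroWidth(text));        // Standard Unicode
  if (mode !== "standard") out = PASSES.typography(out);  // Normalize Typography and up
  return out;
}

console.log(clean("\u201Chi\u201D\u200B", "normalize")); // prints "hi" in straight ASCII quotes
```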

Inspector

What does the Inspector show?

The Inspector lists every detected artifact in your text with its exact character, Unicode code point (e.g. U+200B), Unicode category, script origin, position in the string, and a plain-language description of what it is. It lets you see precisely what the Cleaner would remove before you commit to the change.

Why would I use the Inspector instead of just cleaning?

Sometimes you want to understand what is in your text rather than blindly remove it. The Inspector is useful for auditing AI-generated content, debugging encoding issues in a pipeline, or verifying that a specific artifact (like a Cyrillic homoglyph) is actually present before reporting it.

Can the Inspector detect artifacts that the Cleaner misses?

The Inspector and Cleaner use the same underlying detection engine, so they see the same artifacts. The Inspector shows you the full list; the Cleaner acts on it. The difference is read vs. write.

Text Stats

What is TTR?

TTR (Type-Token Ratio) is the number of unique words divided by the total number of words. A TTR of 0.72 means 72% of the words in your text are unique. Higher TTR indicates richer vocabulary. TTR is sensitive to text length — short texts always score high — so for longer texts, MTLD is a more reliable measure.
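As a formula, TTR is simple to compute. The sketch below tokenizes naively on letters; TextPurify's tokenizer may differ in detail:

```javascript
// Type-Token Ratio: unique words divided by total words, case-insensitive.
function ttr(text) {
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  if (tokens.length === 0) return 0;
  return new Set(tokens).size / tokens.length;
}

console.log(ttr("the cat sat on the mat")); // 5 unique / 6 total ≈ 0.833
```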

What is MTLD and why is it better than TTR?

MTLD (Measure of Textual Lexical Diversity) corrects the length bias of TTR by measuring how far you can read before the running TTR drops below a threshold of 0.720, then averaging that segment length across forward and backward passes. A score above 70 indicates diverse vocabulary; below 40 suggests repetition or limited word choice.
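A forward-pass-only sketch shows the mechanics; the full measure averages a forward and a backward pass, as described above:

```javascript
// Simplified forward-pass MTLD. Each time the running TTR of the current
// segment drops below the threshold, one "factor" is counted and the
// segment resets. MTLD = total tokens / number of factors.
function mtldForward(tokens, threshold = 0.72) {
  let factors = 0;
  let types = new Set();
  let count = 0;
  for (const tok of tokens) {
    count += 1;
    types.add(tok);
    if (types.size / count < threshold) {
      factors += 1;
      types = new Set();
      count = 0;
    }
  }
  // Credit the leftover segment as a partial factor:
  if (count > 0) factors += (1 - types.size / count) / (1 - threshold);
  return factors > 0 ? tokens.length / factors : tokens.length;
}

console.log(mtldForward(["a", "a", "a", "a"])); // 2: heavy repetition, low score
console.log(mtldForward(["a", "b", "c", "d"])); // 4: all unique, score = length
```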

What does Shannon Entropy measure?

Shannon Entropy measures the predictability of word distribution in your text. A high entropy (≥ 7) means words are spread across many unique terms — the text is unpredictable and varied. A low entropy (< 5) means a few words dominate the text, which signals repetition, keyword stuffing, or template-heavy writing.
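In formula terms this is H = −Σ p·log₂(p) over the word frequency distribution. A minimal sketch, assuming the text is already tokenized:

```javascript
// Shannon entropy of a token list, in bits.
function shannonEntropy(tokens) {
  const freq = new Map();
  for (const t of tokens) freq.set(t, (freq.get(t) || 0) + 1);
  let h = 0;
  for (const count of freq.values()) {
    const p = count / tokens.length;
    h -= p * Math.log2(p);
  }
  return h;
}

console.log(shannonEntropy(["a", "a", "b", "b"])); // 1 bit: two equally likely words
console.log(shannonEntropy(["a", "a", "a", "a"])); // 0 bits: fully predictable
```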

What is Burstiness and what does it reveal about AI text?

Burstiness measures the variance in sentence length relative to the mean (coefficient of variation). Human writers naturally vary sentence length — mixing short punchy sentences with longer complex ones — producing a burstiness score above 0.6. AI language models are trained to produce statistically average sentences, resulting in flat burstiness below 0.3. A low burstiness score is one of the signals associated with AI-generated text.
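As the coefficient of variation of sentence lengths, burstiness is a short computation over per-sentence word counts; sentence splitting itself is omitted from this sketch:

```javascript
// Coefficient of variation: standard deviation / mean of sentence lengths.
function burstiness(sentenceLengths) {
  const n = sentenceLengths.length;
  const mean = sentenceLengths.reduce((a, b) => a + b, 0) / n;
  const variance =
    sentenceLengths.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return Math.sqrt(variance) / mean;
}

console.log(burstiness([12, 12, 12, 12])); // 0: perfectly uniform, "flat" text
console.log(burstiness([3, 20, 4, 25]));   // ≈ 0.74: human-like variation
```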

How is the Gunning Fog index calculated?

Gunning Fog = 0.4 × (average sentence length + percentage of words with three or more syllables). It estimates the number of years of formal education a reader needs to understand the text on first reading. A score of 10 is accessible to most adults; above 14 is considered difficult. Academic and legal writing typically scores 15–20.
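Given those two inputs the formula is one line. This sketch takes the averages as arguments; the real tool must also count sentences and syllables, and syllable counting is itself a heuristic:

```javascript
// Gunning Fog from pre-computed inputs:
// average words per sentence, and percent of words with 3+ syllables.
function gunningFog(avgSentenceLength, pctComplexWords) {
  return 0.4 * (avgSentenceLength + pctComplexWords);
}

// 15 words per sentence, 10% three-syllable words:
console.log(gunningFog(15, 10)); // 10, accessible to most adults
```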

What are stop words and why are they excluded from the word frequency chart?

Stop words are common function words (the, is, at, which, on…) that appear frequently in all texts and carry little semantic meaning. The word frequency chart excludes them to highlight the content words that define the topic and tone of your specific text.

Linguistic

How does the POS distribution work?

The Linguistic tool uses compromise.js, a lightweight NLP library (~250 KB) that runs entirely in your browser. It tags every word in your text with its part of speech — Noun, Verb, Adjective, Adverb, or Other — using a combination of dictionary lookup and pattern matching. The stacked bar shows the percentage of each category. High noun ratios are typical of formal or academic writing; high verb ratios appear in active, narrative prose.

How is passive voice detected?

Passive voice is detected by identifying sentences where the POS tagger assigns the "Passive" tag to a verb. This typically matches constructions like "was written", "is being processed", "were found". Each flagged sentence is shown so you can decide whether to rewrite it as an active construction.

What are nominalizations and why do they matter?

Nominalizations are verbs or adjectives converted into abstract nouns by adding suffixes like -tion, -ness, -ity, -ment, -ance. For example, "implement" becomes "implementation", "aware" becomes "awareness". Heavy nominalization is a marker of bureaucratic, academic, or AI-generated writing. Replacing nominalizations with their root verb forms makes text more direct and readable: "the implementation of the solution" → "implementing the solution".
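A suffix matcher captures the core of this check. The suffix list and length cutoff below are illustrative; a production detector also filters false positives such as "nation" or "station":

```javascript
// Naive nominalization spotter: long words ending in typical noun suffixes.
const NOMINAL_SUFFIXES = /(tion|ness|ity|ment|ance|ence)s?$/;

function findNominalizations(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter(w => w.length > 7 && NOMINAL_SUFFIXES.test(w));
}

console.log(findNominalizations("The implementation raised awareness of the requirement."));
// ["implementation", "awareness", "requirement"]
```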

What does the sentence complexity score measure?

Sentence complexity counts subordinating conjunctions and relative pronouns (because, although, which, who, that, since, unless…) per sentence and divides by sentence count. A score below 0.6 is Simple — short, direct sentences. Between 0.6 and 1.5 is Moderate — appropriate for most audiences. Above 1.5 is Complex — dense clause structure that may be hard to follow.
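The counting itself is straightforward. This sketch uses a small marker list and assumes sentences are already split:

```javascript
// Subordinating conjunctions and relative pronouns (illustrative subset).
const MARKERS = /\b(because|although|which|who|that|since|unless|while|whereas)\b/gi;

function complexityScore(sentences) {
  const hits = sentences.reduce(
    (sum, s) => sum + (s.match(MARKERS) || []).length,
    0
  );
  return hits / sentences.length;
}

console.log(complexityScore([
  "The dog barked.",
  "It barked because the cat, which it hated, ran.",
])); // 2 markers / 2 sentences = 1.0: Moderate
```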

Does the Linguistic tool work for languages other than English?

Compromise.js is an English-only library. The POS tagger, passive voice detection, and complexity score are designed for English text. Nominalization detection (suffix matching) will partially work in other Latin-script languages, but results will be less reliable. For non-English text, the SEO & Health tab uses LIX readability, which is language-agnostic.

SEO & Health

What is the Health Score?

The Health Score is a single 0–100 composite metric combining four dimensions: artifact cleanliness (are there Unicode artifacts?), readability (Flesch-Kincaid or LIX score), water density (ratio of filler words), and passage structure (AEO score). A score above 80 means the text is clean, readable, and well-structured. Below 50 signals significant issues in one or more dimensions.

What is the difference between Flesch-Kincaid and LIX?

Flesch-Kincaid measures readability using average sentence length and average syllables per word. It is calibrated for English; the 0 (very hard) to 100 (very easy) scale described here is, strictly speaking, the Flesch Reading Ease variant of the formula. LIX (Läsbarhetsindex) uses sentence length and the percentage of long words (more than 6 characters), so it works consistently across languages because character count is language-agnostic. TextPurify uses Flesch-Kincaid for English content and LIX for Russian and Spanish.
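LIX is simple enough to show in full: average sentence length plus the percentage of words longer than six characters. The tokenizer and sentence splitter here are naive sketches:

```javascript
// LIX = words/sentences + 100 * longWords/words, long word = more than 6 chars.
function lix(text) {
  const words = text.match(/[^\s.!?]+/g) || [];
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const longWords = words.filter(w => w.length > 6).length;
  return words.length / sentences.length + (100 * longWords) / words.length;
}

console.log(lix("One two three."));
// 3 words / 1 sentence + 0% long words = 3
```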

What are "water words"?

Water words (also called filler words) are phrases that add length without adding meaning: "it is important to note that", "in order to", "at the end of the day", "needless to say". AI-generated text tends to have a high water density because language models are trained to produce fluent, human-sounding prose and often default to padding. A water density above 15% is flagged as high.

What is the Passage & AEO Score?

AEO (Answer Engine Optimization) is the practice of structuring content so AI systems — Google AI Overviews, voice search, LLMs with web access — can reliably extract direct answers. The Passage Score evaluates four factors: whether the text has headings, whether headings are phrased as questions, whether paragraphs lead with a direct answer, and whether paragraphs are under 100 words. Text without any headings scores below 70 regardless of writing quality.
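For Markdown input, three of the four checks reduce to a few pattern matches; the "leads with a direct answer" check needs NLP and is omitted here. The sketch below reports raw counts rather than TextPurify's weighted 0–100 score, whose exact weights are not documented in this FAQ:

```javascript
// Structural passage checks on a Markdown string.
function passageChecks(markdown) {
  const headings = markdown.match(/^#{1,6} .+$/gm) || [];
  const paragraphs = markdown
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p && !p.startsWith("#"));
  return {
    hasHeadings: headings.length > 0,
    questionHeadings: headings.filter(h => h.endsWith("?")).length,
    shortParagraphs: paragraphs.filter(p => p.split(/\s+/).length < 100).length,
  };
}

const r = passageChecks("# What is AEO?\n\nAEO structures content for answer engines.");
console.log(r); // { hasHeadings: true, questionHeadings: 1, shortParagraphs: 1 }
```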

How does Sentiment Analysis work?

Sentiment is analyzed using a BERT-based model (DistilBERT fine-tuned on SST-2) that runs entirely in your browser via WebAssembly. The model classifies text as Positive, Negative, or Neutral with a confidence score. Because the underlying model is binary (positive/negative), results with confidence between 55% and 72% are mapped to Neutral. The model is optimized for English. On first use, it downloads ~67 MB of model weights and caches them in your browser.

What is Semantic Relevance?

Semantic Relevance uses sentence embeddings (all-MiniLM-L6-v2, ~23 MB) to measure how closely the content of your text matches a topic or query you enter. The model converts your query and each paragraph into high-dimensional vectors and computes cosine similarity. This goes beyond keyword matching — paragraphs that discuss the same concept using different words will score high. Scores above 0.5 are considered highly relevant.
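The comparison step is plain cosine similarity between two vectors. The embedding vectors themselves come from the model; this sketch shows only the similarity math:

```javascript
// Cosine similarity: dot product divided by the product of vector norms.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1: identical direction
console.log(cosineSimilarity([1, 0], [0, 1])); // 0: unrelated (orthogonal)
```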

Bulk Upload

What file formats does Bulk Upload support?

Bulk Upload currently supports .txt and .md (Markdown) files. Each file is read and processed through the same Unicode detection engine as the Cleaner. The output is a cleaned version of each file plus a per-file artifact report.

Is there a limit on the number of files or total size?

Free accounts can process up to 10 files per batch. Pro accounts have no batch limit. Individual file size is capped at 1 MB. Processing happens client-side in a Web Worker — no files are uploaded to a server.

How do I download the results?

After processing, each file shows a download button for the cleaned version. You can also download all results as a ZIP archive using the "Download all" button that appears after the batch completes.

API & Keys

What does the TextPurify API do?

The REST API exposes two endpoints: POST /api/v1/analyze returns a JSON report of all detected artifacts in a text (counts by category, positions, code points). POST /api/v1/clean returns the cleaned version of the text with a summary of what was removed. Both endpoints accept plain text in the request body and return JSON.

How do I authenticate API requests?

Include your API key in the Authorization header: Authorization: Bearer YOUR_API_KEY. Generate keys in the API & Keys tab. Each key can be revoked individually. Keys are shown only once at creation — store them securely.
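Putting the header format together in JavaScript. The base URL below is a placeholder for wherever your TextPurify instance is hosted, and the plain-text content type follows the endpoint description above:

```javascript
// Build a request object for POST /api/v1/analyze with Bearer auth.
function buildAnalyzeRequest(text, apiKey) {
  return {
    url: "https://textpurify.example/api/v1/analyze", // placeholder host
    options: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiKey}`,
        "Content-Type": "text/plain",
      },
      body: text,
    },
  };
}

// Usage: fetch(req.url, req.options).then(res => res.json())
const req = buildAnalyzeRequest("Some text to audit", "YOUR_API_KEY");
console.log(req.options.headers.Authorization); // "Bearer YOUR_API_KEY"
```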

What are the rate limits?

Free accounts: 100 API requests per day, up to 5,000 characters per request. Pro accounts: 10,000 requests per day, up to 100,000 characters per request. Rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining) are included in every response.

Is there an OpenAPI / Swagger spec?

Yes. The full OpenAPI 3.0 specification is available at /api/v1/openapi.json. You can import it into Postman, Insomnia, or any OpenAPI-compatible tool.

History

Where is my History stored?

History snapshots are stored in your browser's localStorage on your device. Nothing is sent to our servers. This means your history is private by design — but it also means it is tied to the browser and device you are using. Clearing your browser storage will erase your history.

How many records does History keep?

History stores the last 50 snapshots. When the limit is reached, the oldest record is automatically removed to make room for the new one.
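The cap behaves like a fixed-size queue: append the new snapshot, drop the oldest when over the limit. A sketch of that policy (TextPurify persists the resulting array to localStorage, as noted above):

```javascript
const MAX_SNAPSHOTS = 50;

// Append a record, trimming the oldest entries once the cap is exceeded.
function addSnapshot(history, record) {
  const next = [...history, record];
  return next.length > MAX_SNAPSHOTS
    ? next.slice(next.length - MAX_SNAPSHOTS)
    : next;
}

const full = Array.from({ length: 50 }, (_, i) => ({ id: i }));
const updated = addSnapshot(full, { id: 50 });
console.log(updated.length); // 50: still at the cap
console.log(updated[0].id);  // 1: the oldest record (id 0) was dropped
```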

Can I restore a previous version of my text?

Yes. Click the Restore button on any History entry to load that text back into the editor. This switches you to the Cleaner tool automatically so you can continue working on the restored text.

When is a History record created?

A record is created automatically when you have typed at least 50 characters and paused for 5 seconds (detect snapshot), and immediately when you click the Clean button (clean snapshot). Each record stores the full text, a short preview, the character count, the operation type, and the timestamp.
