privacy · architecture · WebAssembly

Why TextPurify Processes Everything in Your Browser

March 6, 2026 · 5 min read

Most text analysis tools work the same way: you paste your text, it travels to a server, the server runs the model, the result comes back. Fast, convenient, and invisible — which is exactly the problem.

For writers working on embargoed content, lawyers reviewing contracts under NDA, journalists protecting sources, and developers handling production data, sending text to an external server is not an acceptable trade-off. TextPurify was built to eliminate it.

Here is a precise technical account of how every feature in TextPurify works without transmitting your text.

Unicode Detection and Cleaning

The artifact detection engine is a pure TypeScript module (~8 KB minified). When you paste text into the editor, it runs synchronously in the main JavaScript thread and scans every character against a set of Unicode category tables:

  • Zero-width characters: U+200B, U+200C, U+200D, U+00AD, U+2060, U+FEFF, and others
  • Non-standard whitespace: U+00A0, U+202F, U+2009, U+3000, and the full range of Unicode space separators
  • Typographic punctuation: curly quotes (U+2018–U+201D), em dash (U+2014), ellipsis (U+2026)
  • Homoglyphs: Cyrillic and Greek characters that are visually identical to Latin letters

The engine produces results in under 5 milliseconds for texts up to 100,000 characters. There is no network request. You can verify this by opening DevTools → Network and filtering for requests while typing — nothing is sent.
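A minimal sketch of what such a scan looks like, using only the zero-width code points listed above (illustrative names and structure, not TextPurify's actual source):

```typescript
// One regex built from the zero-width code points listed above.
const ZERO_WIDTH = /[\u200B\u200C\u200D\u00AD\u2060\uFEFF]/g;

interface Artifact {
  index: number;      // offset of the artifact in the input string
  codePoint: string;  // e.g. "U+200B"
}

// Scan the text and report each hidden character with its position.
function findZeroWidth(text: string): Artifact[] {
  const hits: Artifact[] = [];
  for (const match of text.matchAll(ZERO_WIDTH)) {
    hits.push({
      index: match.index!,
      codePoint:
        "U+" + match[0].codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0"),
    });
  }
  return hits;
}

// Cleaning is a single replace over the same character class.
function stripZeroWidth(text: string): string {
  return text.replace(ZERO_WIDTH, "");
}
```

Because the whole engine is string scanning of this kind, there is simply no step where a network call could occur.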

Readability, SEO, and Text Quality Metrics

All scoring formulas are implemented in plain TypeScript and computed in the main thread:

Flesch-Kincaid requires word count, sentence count, and syllable count. Syllable counting uses a rule-based algorithm that works offline without a phonetic dictionary.

LIX (for Russian and Spanish) counts sentence length and the percentage of long words (more than 6 characters) — no language model required.
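The LIX computation is small enough to show in full; this sketch uses a Unicode-aware word split so Cyrillic text tokenizes correctly (illustrative, not the app's source):

```typescript
// LIX = (words / sentences) + 100 · (long words / words),
// where a "long" word has more than 6 characters.
function lix(text: string): number {
  const words = text.match(/[\p{L}\p{M}]+/gu) ?? [];
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  if (words.length === 0 || sentences.length === 0) return 0;
  const longWords = words.filter(w => w.length > 6).length;
  return words.length / sentences.length + (100 * longWords) / words.length;
}
```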

Gunning Fog applies the same approach: average sentence length plus percentage of words with three or more syllables, multiplied by 0.4.
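Written out, that formula looks like this (the syllable rule is a simplified vowel-group count, not necessarily TextPurify's):

```typescript
// Simplified syllable estimate used to flag "complex" (3+ syllable) words.
const vowelGroups = (w: string): number =>
  Math.max(1, (w.toLowerCase().replace(/e$/, "").match(/[aeiouy]+/g) ?? []).length);

// Gunning Fog = 0.4 · (average sentence length + % of complex words)
function gunningFog(text: string): number {
  const words = text.match(/[A-Za-z']+/g) ?? [];
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  if (words.length === 0 || sentences.length === 0) return 0;
  const complex = words.filter(w => vowelGroups(w) >= 3).length;
  return 0.4 * (words.length / sentences.length + (100 * complex) / words.length);
}
```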

TTR and MTLD (lexical diversity) are computed by iterating over the tokenized word list and maintaining running counts.
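TTR is the simpler of the two; MTLD additionally measures how long token runs stay above a TTR threshold, so only TTR is sketched here:

```typescript
// Type-token ratio: unique word forms (types) divided by total words (tokens).
function typeTokenRatio(tokens: string[]): number {
  if (tokens.length === 0) return 0;
  const types = new Set(tokens.map(t => t.toLowerCase()));
  return types.size / tokens.length;
}
```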

Shannon Entropy runs a single pass over the word frequency distribution.
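The single-pass structure is easy to see in code (a sketch, assuming word-level tokens):

```typescript
// Build a word frequency map, then sum −p·log2(p) over the distribution.
function shannonEntropy(tokens: string[]): number {
  const freq = new Map<string, number>();
  for (const t of tokens) freq.set(t, (freq.get(t) ?? 0) + 1);
  let h = 0;
  for (const count of freq.values()) {
    const p = count / tokens.length;
    h -= p * Math.log2(p);
  }
  return h;
}
```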

Passage & AEO Score detects headings via regex, counts question-phrased headings, measures paragraph word counts, and checks for answer-first patterns.

Water word detection compares tokens against a static dictionary of filler phrases bundled with the app. No external lookup.
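That check reduces to set membership over the bundled list; the words below are stand-ins, not TextPurify's actual dictionary:

```typescript
// Illustrative filler-word dictionary, shipped with the app bundle.
const WATER_WORDS = new Set(["basically", "actually", "really", "very", "just"]);

// Fraction of tokens that are filler words.
function waterWordRatio(tokens: string[]): number {
  if (tokens.length === 0) return 0;
  const hits = tokens.filter(t => WATER_WORDS.has(t.toLowerCase())).length;
  return hits / tokens.length;
}
```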

Linguistic Analysis (compromise.js)

POS tagging, passive voice detection, and nominalization analysis use compromise.js, a ~250 KB (gzipped) NLP library that runs entirely in the browser. It uses a combination of a built-in lexicon and pattern-matching rules — no server, no API, no model download.

The library is loaded lazily via dynamic import the first time you open the Linguistic tool, then cached by the browser.

Sentiment Analysis (BERT via WebAssembly)

Sentiment analysis uses distilbert-base-uncased-finetuned-sst-2-english, a 67 MB quantized model from Hugging Face.

The inference stack:

  1. ONNX Runtime Web — a WebAssembly build of the ONNX inference engine. It runs entirely in your browser with no server-side component.
  2. Model weights — downloaded once from Hugging Face's CDN (huggingface.co) and cached in your browser's IndexedDB via the Transformers.js caching layer. After the first load, the model is available offline.
  3. Web Worker — the model runs in a dedicated background thread so inference never blocks the UI. The Worker receives your text, runs the tokenizer and forward pass locally, and posts the result back to the main thread.

The text you enter is passed only to the local WASM runtime. The only external request is the one-time model weight download, which contains no user data.
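To make the Worker's role concrete, here is a sketch of its post-processing step: the WASM forward pass emits two logits ([negative, positive] for SST-2), softmax turns them into probabilities, and the larger one wins. The type and function names are illustrative, not TextPurify's actual protocol:

```typescript
type SentimentResult = { label: "NEGATIVE" | "POSITIVE"; score: number };

// Numerically stable softmax over the raw model logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Convert SST-2 logits into the message the Worker posts back.
function toSentiment(logits: [number, number]): SentimentResult {
  const [neg, pos] = softmax(logits);
  return pos >= neg
    ? { label: "POSITIVE", score: pos }
    : { label: "NEGATIVE", score: neg };
}
```

Everything in this step is arithmetic on numbers already sitting in the Worker's memory; the text never leaves the page.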

Semantic Relevance (Sentence Embeddings via WebAssembly)

Semantic relevance uses all-MiniLM-L6-v2, a 23 MB sentence embedding model.

The process:

  1. Your text is split into paragraphs (≥ 15 words each).
  2. The model embeds your query and each paragraph into 384-dimensional vectors.
  3. Cosine similarity is computed between the query vector and each paragraph vector.
  4. The top-scoring paragraphs are returned as the most semantically relevant.

Everything runs in a Web Worker using the same ONNX Runtime Web stack. The Worker is terminated when you navigate away from the Semantic Relevance tool, freeing the memory.
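Steps 2–4 boil down to vector arithmetic. This sketch elides the embedding step itself (the MiniLM forward pass inside the WASM runtime) and shows only the similarity and ranking logic:

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the indices of the topK paragraphs most similar to the query vector.
function rankBySimilarity(query: number[], paragraphs: number[][], topK = 3): number[] {
  return paragraphs
    .map((vec, i) => ({ i, score: cosineSimilarity(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(r => r.i);
}
```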

History

Text snapshots in the History tool are stored in your browser's localStorage. They are never synchronized to a server. Clearing your browser storage removes them permanently. No backup exists on our end because we never receive them.
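A sketch of such a store, written against a minimal Storage-like interface so the same logic works with the real `window.localStorage`; the key name and record shape are assumptions, not TextPurify's actual schema:

```typescript
// Subset of the Web Storage API that the store needs.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface Snapshot { text: string; savedAt: number }

const KEY = "textpurify.history"; // hypothetical key name

// Append a snapshot to the serialized array under KEY.
function saveSnapshot(store: KVStore, text: string): void {
  const all: Snapshot[] = JSON.parse(store.getItem(KEY) ?? "[]");
  all.push({ text, savedAt: Date.now() });
  store.setItem(KEY, JSON.stringify(all));
}

function loadSnapshots(store: KVStore): Snapshot[] {
  return JSON.parse(store.getItem(KEY) ?? "[]");
}
```

Because the store's only dependency is the browser's synchronous key-value storage, the snapshots live and die with your browser profile.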

What Does Reach a Server

For completeness, here is exactly what does reach our server.

  • Authentication requests — if you create an account, your email and hashed credentials are stored in our database.
  • API requests — if you use the REST API (/api/v1/clean, /api/v1/analyze), the text in the request body is processed server-side and logged for rate-limiting purposes (character count only, not the text content).
  • Usage tracking — character counts processed per session are stored to enforce tier limits. The text itself is not stored.

If you use TextPurify without creating an account and without the API, no text you enter is ever transmitted anywhere.

How to Verify This Yourself

  1. Open TextPurify in your browser.
  2. Open DevTools (F12) → Network tab.
  3. Clear existing requests and check "Preserve log".
  4. Paste a text containing a unique string (e.g., a made-up word).
  5. Watch the network tab while the analysis runs.

You will see requests for static assets (JS chunks, fonts) and potentially the model weight files on first load. You will not see any request containing your text.


Privacy-first architecture is a design constraint, not a marketing claim. It limits what we can build — server-side models are more powerful — but it removes an entire category of trust problems for users who work with sensitive content.

Try TextPurify — detect and remove hidden Unicode artifacts from any text.
