Unicode Invisible Characters
The Complete Developer’s Reference
Everything developers, designers, and technical users need to know about invisible Unicode characters — zero-width space, non-breaking space, BOM, directional marks, and more. What each one does, where it causes bugs, how to detect it, how to sanitise it, and when to use it intentionally. With working code examples.
What Are Invisible Characters?
Invisible characters are Unicode code points that render with no visible glyph, either as nothing at all (the zero-width characters) or as blank space that looks like an ordinary space, but still exist as characters in a string. They take up space in memory, affect string length, influence text rendering, and can break string comparisons that don’t account for them.
They appear in text because: (1) they were intentionally inserted for formatting purposes, (2) they were copied from web pages or PDFs that contained them, (3) they were generated by word processors, (4) they exist in AI-generated text, or (5) they were added by attackers deliberately to bypass content filters.
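The invisible characters that most often cause real problems in production systems are the zero-width space (U+200B), the non-breaking space (U+00A0), the byte order mark (U+FEFF), and the directional marks and overrides. Each has its own section below.
"hello" !== "hell\u200Bo" even though the two strings look identical on screen. Database lookups, password validation, username matching, URL routing, and API key validation are all vulnerable to invisible character bugs.
Complete Invisible Character Reference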
Every Unicode invisible character developers need to know about.
| Character | Code Point | HTML Entity | Primary use |
|---|---|---|---|
| Zero-Width Space | U+200B | `&#8203;` | Soft line break opportunity. Word wrapping in languages without spaces (Thai, Japanese). Breaking URLs. |
| Non-Breaking Space | U+00A0 | `&nbsp;` | Prevents line break between words. Keeps units with numbers (100 km). Standard typography tool. |
| Hangul Filler | U+3164 | `&#12644;` | Korean orthography placeholder. Widely used to create blank usernames in games and social platforms. |
| Zero-Width Non-Joiner | U+200C | `&zwnj;` | Prevents ligature formation in Persian and Arabic scripts. Sometimes misused to bypass word filters. |
| Zero-Width Joiner | U+200D | `&zwj;` | Joins characters for ligatures. Critical for emoji sequences: the family emoji is 👨 + ZWJ + 👩 + ZWJ + 👧, three emoji joined by two ZWJ characters. |
| Word Joiner | U+2060 | `&#8288;` | Prevents line breaks without creating visible space. The modern replacement for U+FEFF when used for line-break control. |
| Soft Hyphen | U+00AD | `&shy;` | Invisible, but appears as a hyphen when a line breaks at that point. Used for professional text layout in HTML. |
| Byte Order Mark (BOM) | U+FEFF | `&#65279;` | Appears at the start of some text files (as the bytes EF BB BF in UTF-8). Causes “extra character” bugs in parsers that don’t strip it. |
| Left-to-Right Mark | U+200E | `&lrm;` | Invisible character with strong left-to-right direction. Used to fix ordering in mixed-direction content (Arabic/Hebrew with Latin). |
| Right-to-Left Mark | U+200F | `&rlm;` | Invisible character with strong right-to-left direction. Can reorder displayed text when injected maliciously. |
| Right-to-Left Override | U+202E | `&#8238;` | Forces all following text to display right-to-left. Used in file extension spoofing attacks (showing .pdf when the file is .exe). |
| Braille Blank | U+2800 | `&#10240;` | Braille empty cell. Visually renders as blank space on most systems. Used to create blank Discord usernames. |
| Thin Space | U+2009 | `&thinsp;` | Narrow space, ⅕ of an em. Used in professional typography between numbers and units (5 kg). |
| Ideographic Space | U+3000 | `&#12288;` | Full-width space matching CJK character width. Used in Japanese and Chinese layout. |
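The table can be cross-checked against Python's built-in Unicode database. The sketch below is illustrative only (the names `TABLE_CODE_POINTS` and `describe` are made up for this example); it prints the official name and general category of each code point listed above. One practical takeaway: a filter that only removes format characters (category Cf) will miss the Hangul Filler (Lo), the Braille Blank (So), and the space variants (Zs).

```python
import unicodedata

# Code points from the reference table (hypothetical constant name).
TABLE_CODE_POINTS = [
    0x200B, 0x00A0, 0x3164, 0x200C, 0x200D, 0x2060, 0x00AD,
    0xFEFF, 0x200E, 0x200F, 0x202E, 0x2800, 0x2009, 0x3000,
]

def describe(cp: int) -> tuple:
    """Return (U+XXXX label, Unicode name, general category) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch, "<no name>"), unicodedata.category(ch))

for cp in TABLE_CODE_POINTS:
    print(*describe(cp), sep="  ")
```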
Zero-Width Space (U+200B) — Deep Dive
The Zero-Width Space (ZWSP) is the most commonly encountered invisible character in web development. It has zero visual width but exists as a real character in the string, which means:
String equality fails silently. "hello" and "hell\u200Bo" (a ZWSP between the second l and the o, written here as an escape so it is visible) look identical on screen but are different strings. This breaks database lookups, login systems, and search.
String length is wrong. "hello" has length 5. "hell\u200Bo" has length 6. If your code validates input length, it may reject or accept the wrong things.
URLs break. A ZWSP is not a valid URL character. Browsers typically percent-encode it as %E2%80%8B, so the request usually ends in a 404 or a failed redirect even though the URL looks correct visually.
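The three effects above are easy to reproduce. This is a minimal Python sketch with the ZWSP written as the escape `\u200b` so it stays visible in source; the variable names are illustrative.

```python
clean = "hello"
dirty = "hell\u200bo"  # ZWSP between the second "l" and the "o"

print(clean == dirty)          # False
print(len(clean), len(dirty))  # 5 6
print(dirty.encode("utf-8"))   # b'hell\xe2\x80\x8bo'
print(ascii(dirty))            # 'hell\u200bo'
```

Printing with `ascii()` is a quick way to make any non-ASCII character, invisible or not, show up as an escape.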
When does ZWSP appear legitimately?
ZWSP is used intentionally to allow line breaks in long strings without adding visible spaces. Common legitimate uses: breaking long URLs in HTML for readability, word wrapping in Thai/Khmer text (which has no natural word boundaries), and allowing compound words to break at appropriate points in German.
Detecting and removing ZWSP in JavaScript
Detecting ZWSP in Python
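A minimal sketch of detection and removal, assuming you only care about U+200B itself. The helper names `find_zwsp` and `remove_zwsp` are hypothetical, not part of any library.

```python
ZWSP = "\u200b"

def find_zwsp(text: str) -> list:
    """Return the index of every zero-width space in text."""
    return [i for i, ch in enumerate(text) if ch == ZWSP]

def remove_zwsp(text: str) -> str:
    """Return text with every zero-width space removed."""
    return text.replace(ZWSP, "")

sample = "user\u200bname"
print(find_zwsp(sample))    # [4]
print(remove_zwsp(sample))  # username
```

Reporting the positions before removing anything is useful when you need to find out where the character came from rather than just clean it up.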
Non-Breaking Space (U+00A0) — The Most Common Bug Source
The non-breaking space (NBSP) is the invisible character most likely to be copied accidentally from web pages, Word documents, and email clients. It looks exactly like a regular space but behaves differently in many contexts.
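Where it comes from: Some word processors and autocorrect rules insert NBSP automatically (e.g., after abbreviations like “Mr.” or “Dr.”). Web pages use &nbsp; for spacing. Copying text from these sources brings the NBSP along.
Why it causes bugs: NBSP is a different code point from the regular space (U+0020), so any comparison, lookup, or split(" ") against a literal space misses it. Whitespace helpers are inconsistent across languages: in JavaScript and Python 3, \s and trim()/strip() treat NBSP as whitespace (as does Python's argument-less split()), while Java's String.trim() and default \s do not. A username with NBSP instead of a space may appear valid visually but fail lookups.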
NBSP vs regular space in code
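The sketch below shows how NBSP behaves in Python 3 specifically; other languages differ, so treat the printed results as Python behaviour only. `normalise_spaces` is a hypothetical helper name.

```python
NBSP = "\u00a0"
text = f"100{NBSP}km"

print(text == "100 km")  # False: different code points
print(text.split(" "))   # ['100\xa0km']  (literal-space split misses NBSP)
print(text.split())      # ['100', 'km']  (whitespace split treats NBSP as a space)
print(NBSP.isspace())    # True

def normalise_spaces(value: str) -> str:
    """Replace each NBSP with a regular space (U+0020)."""
    return value.replace(NBSP, " ")

print(normalise_spaces(text) == "100 km")  # True
```

Normalising to U+0020 before comparison or storage avoids depending on how each individual helper classifies NBSP.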
Byte Order Mark (U+FEFF) — The Parser Killer
The Byte Order Mark (BOM) is a special Unicode character that appears at the very start of some text files to indicate encoding and byte order. In UTF-8 files, the BOM is unnecessary but some tools (notably Microsoft Notepad until 2019, and some older Windows applications) add it automatically.
The UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF at the file’s start. When read as UTF-8, it decodes to U+FEFF. If your parser doesn’t strip the BOM, it appears as a mysterious invisible character at the beginning of your data, breaking:
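JSON parsing: A JSON text must begin with a JSON value (optionally preceded by whitespace), and U+FEFF is not JSON whitespace. A leading BOM therefore causes a parse error in many parsers, even though the file “looks” correct in text editors that hide the BOM.
CSV imports: A BOM at the start of a CSV file makes the first column header read as \uFEFFName instead of Name, so any code that looks up that column by name silently fails.
XML/HTML doctype: XML parsers generally accept a BOM when they decode the raw bytes themselves. The failure usually appears when the file has already been decoded to a string with the U+FEFF left in place, so the parser sees a character before the <?xml?> declaration and rejects the document.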
Detecting and stripping BOM
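A minimal Python sketch of both approaches: decoding with the standard `utf-8-sig` codec, which drops a leading BOM if one is present, and stripping U+FEFF from a string that has already been decoded. The sample data and the `strip_bom` helper are made up for illustration.

```python
import codecs
import csv
import io
import json

raw = codecs.BOM_UTF8 + b'{"name": "Ada"}'  # bytes EF BB BF followed by JSON

# Plain UTF-8 decoding keeps the BOM as a leading U+FEFF character;
# json.loads(as_utf8) would raise json.JSONDecodeError.
as_utf8 = raw.decode("utf-8")
print(as_utf8.startswith("\ufeff"))  # True

# The "utf-8-sig" codec removes one leading BOM if present.
as_sig = raw.decode("utf-8-sig")
print(json.loads(as_sig))  # {'name': 'Ada'}

def strip_bom(text: str) -> str:
    """Remove a single leading U+FEFF from an already-decoded string."""
    return text[1:] if text.startswith("\ufeff") else text

# CSV case: the BOM sticks to the first header unless it is stripped.
csv_text = "\ufeffName,Age\nAda,36\n"
bad = next(csv.DictReader(io.StringIO(csv_text)))
good = next(csv.DictReader(io.StringIO(strip_bom(csv_text))))
print(list(bad))   # ['\ufeffName', 'Age']
print(list(good))  # ['Name', 'Age']
```

When reading files directly, `open(path, encoding="utf-8-sig")` applies the same codec and works for files with or without a BOM.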
Directional Unicode Characters — The Security Risk
Unicode includes characters that control text direction — invisible characters that tell the rendering engine to switch from left-to-right (Latin, etc.) to right-to-left (Arabic, Hebrew) or vice versa. These have legitimate uses in multilingual content but are also exploited for attacks.
Sanitising directional characters
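One possible approach, sketched in Python: remove the bidirectional control characters from untrusted input and flag any override. The exact set to strip is an assumption made for this example (LRM, RLM, the Arabic letter mark, the embedding and override controls U+202A to U+202E, and the isolates U+2066 to U+2069); the function names are hypothetical.

```python
import re

# Assumed set of bidi controls to remove from untrusted input.
BIDI_CONTROLS = re.compile("[\u200e\u200f\u061c\u202a-\u202e\u2066-\u2069]")

def strip_bidi_controls(text: str) -> str:
    """Return text with all bidi control characters removed."""
    return BIDI_CONTROLS.sub("", text)

def has_bidi_override(text: str) -> bool:
    """True if text contains LRO (U+202D) or RLO (U+202E)."""
    return "\u202d" in text or "\u202e" in text

# After U+202E the remaining characters are displayed right-to-left,
# so this name appears on screen to end in ".pdf".
spoofed = "invoice\u202efdp.exe"

print(has_bidi_override(spoofed))    # True
print(strip_bidi_controls(spoofed))  # invoicefdp.exe
```

For filenames and identifiers, rejecting the input outright when `has_bidi_override` returns True may be a better choice than silently cleaning it.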
How to Detect Invisible Characters
In VS Code
Open VS Code settings (Ctrl+,) → search “renderWhitespace” → set to “all”. VS Code will now show dots for spaces and arrows for tabs. Zero-width and other invisible characters are flagged separately by the Unicode Highlight feature: check that editor.unicodeHighlight.invisibleCharacters is enabled, and turn on editor.renderControlCharacters to display control characters.
In the browser devtools console
Using SymbolNow’s Unicode Inspector
For a quick visual check without writing code, paste any text into the SymbolNow Unicode Inspector. It reveals every character in the string including invisible ones, showing their code point, name, and category — ideal for debugging content that’s “mysteriously” not matching or failing validation.
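If you prefer a scriptable check, the same kind of per-character report can be approximated with Python's `unicodedata` module. This is a rough sketch, not a replacement for a dedicated tool, and `inspect_text` is a hypothetical name.

```python
import unicodedata

def inspect_text(text: str) -> list:
    """Return one line per character: index, code point, category, name."""
    lines = []
    for i, ch in enumerate(text):
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"{i:>3}  U+{ord(ch):04X}  {unicodedata.category(ch)}  {name}")
    return lines

for line in inspect_text("a\u200bb\u00a0c"):
    print(line)
```

Characters in categories Cf (format) or Zs (space separator) other than the ordinary space are the ones worth a second look in the output.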
Sanitisation Strategy — Production Reference
The right sanitisation approach depends on what you’re cleaning for. Here is a tiered strategy covering the most common cases.
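As an illustration of what a tiered approach might look like, the Python sketch below defines two hypothetical tiers. The tier boundaries, the character sets, and the function names are all assumptions for this example and should be adapted to your own data.

```python
import re

# Assumed character groups (adjust to your needs).
ZERO_WIDTH = re.compile("[\u200b\u2060\ufeff]")      # ZWSP, word joiner, BOM
JOINERS = re.compile("[\u200c\u200d]")               # ZWNJ, ZWJ
BIDI = re.compile("[\u200e\u200f\u061c\u202a-\u202e\u2066-\u2069]")
BLANK_FILLERS = re.compile("[\u3164\u2800\u00ad]")   # Hangul filler, Braille blank, soft hyphen
SPACE_VARIANTS = re.compile("[\u00a0\u2009\u3000]")  # NBSP, thin space, ideographic space

def sanitise_display_text(text: str) -> str:
    """Tier 1, free text such as comments: drop zero-width characters and
    bidi controls, but keep joiners so emoji and Persian/Arabic text survive."""
    return BIDI.sub("", ZERO_WIDTH.sub("", text))

def sanitise_identifier(text: str) -> str:
    """Tier 2, usernames and lookup keys: also drop joiners and blank
    fillers, fold space variants to U+0020, then trim."""
    text = sanitise_display_text(text)
    text = JOINERS.sub("", text)
    text = BLANK_FILLERS.sub("", text)
    text = SPACE_VARIANTS.sub(" ", text)
    return text.strip()

print(sanitise_identifier("\ufeffal\u200bice\u00a0"))  # alice
print(repr(sanitise_identifier("\u3164")))              # ''
```

An identifier that sanitises to an empty string, as in the second call, is a good candidate for outright rejection.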
Legitimate Uses — When Invisible Characters Are Correct
Not every invisible character in your data is a bug or attack. Here are the legitimate use cases where these characters belong:
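U+200B (ZWSP) in HTML: A long URL in body text can carry ZWSP at sensible break points so that it wraps without a visible space. The ZWSP is part of the text, so it is copied along with the URL and will break it if pasted into an address bar; reserve this for display-only URLs. Example: https://example.com/very-long-path/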
U+00A0 (NBSP) in copy: Keep units with their numbers — “100 km”, “£5 million”, “iOS 26” — to prevent orphaned units at line breaks. Standard in professional typography.
U+200D (ZWJ) in emoji: Multi-person emoji are created by joining simpler emoji with ZWJ. 👨👩👧👦 is a sequence of four emoji joined by three ZWJ characters. Never strip ZWJ from emoji sequences.
U+00AD (soft hyphen) in text: HTML &shy; marks where a word can hyphenate at line breaks. Used in professional typesetting for German compound words and other long terms.
U+200E/U+200F (LRM/RLM) in multilingual text: When mixing Arabic/Hebrew with Latin in the same paragraph, directional marks are necessary for correct rendering. Strip them from user input but preserve them in your own content.
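Two of the uses above can be shown in a few lines of Python. The sketch builds the four-person family emoji from its parts and a number-plus-unit string joined by NBSP; the variable names are illustrative.

```python
ZWJ = "\u200d"
NBSP = "\u00a0"

# Man, woman, girl, boy joined by ZWJ form a single family emoji.
parts = ["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"]
family = ZWJ.join(parts)

print(len(family))        # 7 code points: 4 emoji + 3 ZWJ
print(family.count(ZWJ))  # 3

# Stripping ZWJ leaves four separate emoji.
print(family.replace(ZWJ, "") == "".join(parts))  # True

# NBSP keeps the unit on the same line as its number.
distance = f"100{NBSP}km"
print(len(distance))  # 6
```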
Security Implications — What to Watch For
Rule of thumb for input validation: On any user input that goes into a database, search index, authentication check, or AI prompt, run sanitisation to remove zero-width characters and directional overrides before processing. Log instances where these are found — repeated occurrences may indicate intentional attacks.
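A minimal sketch of that rule in Python, using the standard `logging` module. The character set, the logger name, and the function name `sanitise_and_log` are assumptions for illustration. Note that this set includes ZWJ, so it is only suitable for fields where emoji are not expected.

```python
import logging
import re

logger = logging.getLogger("input_sanitiser")  # hypothetical logger name

# Assumed set: U+200B to U+200F, word joiner, BOM, bidi embeddings,
# overrides, and isolates.
SUSPICIOUS = re.compile("[\u200b-\u200f\u2060\ufeff\u202a-\u202e\u2066-\u2069]")

def sanitise_and_log(value: str, field: str) -> str:
    """Remove suspicious characters from value and log which ones were found."""
    found = SUSPICIOUS.findall(value)
    if found:
        code_points = sorted({f"U+{ord(ch):04X}" for ch in found})
        logger.warning("invisible characters in %s: %s", field, ", ".join(code_points))
    return SUSPICIOUS.sub("", value)

print(sanitise_and_log("ad\u200bmin\u202e", "username"))  # admin
```

Logging the code points rather than the raw value keeps the log readable and avoids writing the invisible characters into it.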
FAQ
Why does my string comparison fail even though the strings look identical? Almost certainly an invisible character — most likely a zero-width space (U+200B) or a non-breaking space (U+00A0). Use the inspection code above to find it, then sanitise your input before comparison.
My JSON file is failing to parse with an “unexpected token” error on the very first character — why? Your file almost certainly has a UTF-8 BOM (U+FEFF) at the start. Open the file in a hex editor or use a BOM-stripping function before parsing. Save future files with UTF-8 without BOM.
Can I use Unicode normalisation (NFC/NFD) to remove invisible characters? No — Unicode normalisation handles composed vs decomposed character forms (é as one character vs e + combining accent). It does not remove zero-width or invisible characters. You need explicit replacement for those.
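This can be checked with Python's `unicodedata.normalize`. The sketch also shows a nuance: the compatibility form NFKC does fold NBSP to a regular space, but it still leaves the zero-width space in place, so explicit removal is needed either way.

```python
import unicodedata

s = "a\u200bb\u00a0c"  # ZWSP after "a", NBSP after "b"

nfc = unicodedata.normalize("NFC", s)
nfkc = unicodedata.normalize("NFKC", s)

print(ascii(nfc))   # 'a\u200bb\xa0c'  (unchanged)
print(ascii(nfkc))  # 'a\u200bb c'     (NBSP folded, ZWSP still present)
```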
Does removing invisible characters affect emoji? Be careful with U+200D (Zero-Width Joiner). ZWJ is used to construct multi-component emoji like 👨👩👧 and 🏳️🌈. If you strip ZWJ from emoji sequences, you’ll break them into their component parts. Either exempt ZWJ from removal, or limit removal to contexts where emoji aren’t expected.
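One way to exempt emoji is a heuristic that keeps ZWJ only when it sits between emoji-like characters. The Python sketch below uses Unicode general categories as a rough stand-in for "emoji-like", which is an assumption and not a full emoji parser; it will also remove ZWJ used legitimately in Arabic or Indic text. The function name is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"
# Categories covering emoji symbols (So), skin-tone modifiers (Sk),
# and variation selectors (Mn). A heuristic, not an emoji definition.
EMOJI_LIKE = {"So", "Sk", "Mn"}

def strip_zwj_outside_emoji(text: str) -> str:
    """Drop ZWJ unless the previous character is emoji-like and the next is a symbol."""
    out = []
    for i, ch in enumerate(text):
        if ch == ZWJ:
            prev_ok = i > 0 and unicodedata.category(text[i - 1]) in EMOJI_LIKE
            next_ok = i + 1 < len(text) and unicodedata.category(text[i + 1]) == "So"
            if not (prev_ok and next_ok):
                continue  # ZWJ is not inside an emoji sequence: remove it
        out.append(ch)
    return "".join(out)

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(strip_zwj_outside_emoji(family) == family)  # True: sequence preserved
print(strip_zwj_outside_emoji("he\u200dllo"))     # hello
```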