Developer Guide

Unicode Invisible Characters
The Complete Developer’s Reference

Updated April 2026

Everything developers, designers, and technical users need to know about invisible Unicode characters — zero-width space, non-breaking space, BOM, directional marks, and more. What each one does, where it causes bugs, how to detect it, how to sanitise it, and when to use it intentionally. With working code examples.

📚 In this guide

What are invisible characters?
Complete character reference
Zero-width space (ZWSP)
Non-breaking space (NBSP)
Byte Order Mark (BOM)
Directional marks
How to detect & find them
How to sanitise / remove them
Legitimate uses
Security implications

What Are Invisible Characters?

Invisible characters are Unicode code points that render without a visible glyph: either nothing at all (the zero-width characters) or blank space that looks like an ordinary space. They still exist as characters in a string: they take up memory, count towards string length, influence text rendering, and break string comparisons that don’t account for them.

They appear in text because: (1) they were intentionally inserted for formatting purposes, (2) they were copied from web pages or PDFs that contained them, (3) they were generated by word processors, (4) they exist in AI-generated text, or (5) they were added by attackers deliberately to bypass content filters.

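The invisible characters that most often cause real problems in production systems are the zero-width space, the non-breaking space, the BOM, and the directional marks. Each has its own section in this guide, and the full list is in the reference table.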

⚠️

These characters cause real bugs

A zero-width space (U+200B) inside a string makes equality checks fail: "hel\u200Blo" !== "hello", even though both render identically. Database lookups, password validation, username matching, URL routing, and API key validation are all vulnerable to invisible character bugs.

Complete Invisible Character Reference

The invisible Unicode characters developers are most likely to run into.

Character	Code Point	HTML Entity	Primary use
Zero-Width Space	U+200B	`&#8203;`	Soft line break opportunity. Word wrapping in languages without spaces between words (Thai, Khmer). Breaking URLs.
Non-Breaking Space	U+00A0	`&nbsp;`	Prevents line break between words. Keeps units with numbers (100 km). Standard typography tool.
Hangul Filler	U+3164	`&#12644;`	Korean orthography placeholder. Widely used to create blank usernames in games and social platforms.
Zero-Width Non-Joiner	U+200C	`&zwnj;`	Prevents joining and ligature formation in Persian and Arabic scripts. Sometimes misused to bypass word filters.
Zero-Width Joiner	U+200D	`&zwj;`	Joins characters for ligatures. Critical for emoji sequences: 👨‍👩‍👧 is three emoji joined by two ZWJ characters.
Word Joiner	U+2060	`&#8288;`	Prevents line breaks without creating visible space. The modern replacement for U+FEFF when used for line-break control.
Soft Hyphen	U+00AD	`&shy;`	Invisible until a line breaks at that point, where it appears as a hyphen. Used for professional text layout in HTML.
Byte Order Mark (BOM)	U+FEFF	`&#65279;`	Appears at the start of some text files (bytes EF BB BF in UTF-8). Causes “extra character” bugs in parsers that don’t strip it.
Left-to-Right Mark	U+200E	`&lrm;`	Zero-width character with strong left-to-right direction; makes neighbouring neutral characters (spaces, punctuation) resolve as LTR. Used in mixed-direction content (Arabic/Hebrew with Latin).
Right-to-Left Mark	U+200F	`&rlm;`	Zero-width character with strong right-to-left direction. Can reorder displayed text when injected maliciously.
Right-to-Left Override	U+202E	`&#8238;`	Forces the text that follows to display right-to-left, until the end of the paragraph or a Pop Directional Formatting character (U+202C). Used in file extension spoofing attacks (showing .pdf when file is .exe).
Braille Blank	U+2800	`&#10240;`	Braille empty cell. Visually renders as blank space on most systems. Used to create blank Discord usernames.
Thin Space	U+2009	`&thinsp;`	Narrow space, about ⅕ of an em. Used in professional typography between numbers and units (5 kg).
Ideographic Space	U+3000	`&#12288;`	Full-width space matching CJK character width. Used in Japanese and Chinese layout.

Zero-Width Space (U+200B) — Deep Dive

The Zero-Width Space (ZWSP) is the most commonly encountered invisible character in web development. It has zero visual width but exists as a real character in the string, which means:

String equality fails silently. "hello" and "hell\u200Bo" (with a ZWSP between the second l and the o) look identical but are different strings. This breaks database lookups, login systems, and search.

String length is wrong. "hello" has length 5. "hell\u200Bo" has length 6. If your code validates input length, it may reject or accept the wrong things.

URLs break. A ZWSP pasted into a URL is sent percent-encoded as %E2%80%8B, so the request targets a different path or host than the one you see, and it typically fails (often with a 404) even though the URL looks correct visually.

When does ZWSP appear legitimately?

ZWSP is used intentionally to allow line breaks in long strings without adding visible spaces. Common legitimate uses: breaking long URLs in HTML for readability, word wrapping in Thai/Khmer text (which has no natural word boundaries), and allowing compound words to break at appropriate points in German.

Detecting and removing ZWSP in JavaScript
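
A minimal sketch of detection and removal, assuming the input is an identifier-like string where emoji are not expected. The names `ZERO_WIDTH`, `hasZwsp`, and `removeZeroWidth` are illustrative, not from any library.

```javascript
// Zero-width characters targeted here: ZWSP, ZWNJ, ZWJ, word joiner, BOM.
// Assumption: emoji are not expected in the input, because removing
// U+200D (ZWJ) splits multi-part emoji into their components.
const ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/g;

// True if the string contains a zero-width space specifically.
function hasZwsp(str) {
  return str.includes('\u200B');
}

// Remove every character matched by ZERO_WIDTH.
function removeZeroWidth(str) {
  return str.replace(ZERO_WIDTH, '');
}

const clean = 'hello';
const dirty = 'hel\u200Blo'; // renders the same as "hello"

console.log(dirty === clean);                  // false
console.log(dirty.length);                     // 6
console.log(hasZwsp(dirty));                   // true
console.log(removeZeroWidth(dirty) === clean); // true
```

Writing the characters as `\u` escapes rather than pasting them keeps the source file itself free of invisible characters.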

Detecting ZWSP in Python

```python
import re
import unicodedata

# Detect zero-width space
def has_zwsp(s):
    return '\u200b' in s

# Remove all zero-width characters
# (this also strips U+200D, which breaks ZWJ emoji sequences)
ZERO_WIDTH_CHARS = re.compile(
    r'[\u200b\u200c\u200d\u200e\u200f\ufeff\u2060]'
)

def remove_zero_width(s):
    return ZERO_WIDTH_CHARS.sub('', s)

# Normalise and clean for database storage
def clean_for_storage(s):
    s = remove_zero_width(s)
    s = unicodedata.normalize('NFC', s)
    s = s.strip()
    return s
```

Non-Breaking Space (U+00A0) — The Most Common Bug Source

The non-breaking space (NBSP) is the invisible character most likely to be copied accidentally from web pages, Word documents, and email clients. It looks exactly like a regular space but behaves differently in many contexts.

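Where it comes from: Some word processors insert NBSP automatically in certain contexts (e.g., after abbreviations like “Mr.” or “Dr.”, or before certain punctuation in French text). Web pages use &nbsp; for spacing. Copying text from these sources brings the NBSP along.

Why it causes bugs: NBSP is a different code point from a regular space (U+0020), so equality checks, splits on a literal space such as split(" "), and ASCII-only whitespace handling all miss it. Whether regex \s and trim() catch it depends on the language: JavaScript and Python 3 treat NBSP as whitespace, while Java’s default \s and String.trim() do not. A username with NBSP instead of a space may look valid but fail lookups.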

NBSP vs regular space in code
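
The sketch below shows how NBSP behaves in JavaScript specifically; other languages differ. The helper `normaliseSpaces` is a hypothetical name, and which space lookalikes it folds into U+0020 is an assumption you may want to narrow.

```javascript
const NBSP = '\u00A0';
const withNbsp = `John${NBSP}Smith`; // looks like "John Smith"
const withSpace = 'John Smith';

console.log(withNbsp === withSpace);       // false
console.log(withNbsp.split(' ').length);   // 1, a literal-space split misses NBSP
console.log(withNbsp.split(/\s+/).length); // 2, JavaScript's \s matches NBSP
console.log(`${NBSP}hi${NBSP}`.trim());    // "hi", trim() strips NBSP in JavaScript

// Replace NBSP and (as an assumption about what callers want) the other
// Unicode space separators with a plain U+0020 space.
function normaliseSpaces(str) {
  return str.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, ' ');
}

console.log(normaliseSpaces(withNbsp) === withSpace); // true
```

Running a normalisation like this before comparisons or lookups makes "John Smith" match however the space was typed.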

Byte Order Mark (U+FEFF) — The Parser Killer

The Byte Order Mark (BOM) is a special Unicode character that appears at the very start of some text files to indicate encoding and byte order. In UTF-8 files, the BOM is unnecessary but some tools (notably Microsoft Notepad until 2019, and some older Windows applications) add it automatically.

The UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF at the file’s start. When read as UTF-8, it decodes to U+FEFF. If your parser doesn’t strip the BOM, it appears as a mysterious invisible character at the beginning of your data, breaking:

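JSON parsing: A BOM is not valid JSON whitespace, so strict parsers such as JavaScript’s JSON.parse fail on the very first character, even though the file “looks” correct in text editors that hide the BOM.

CSV imports: A BOM at the start of a CSV file makes the first column header read as "\uFEFFName" instead of "Name", so code that references that column by name silently fails.

XML/HTML: The XML specification allows a BOM at byte 0, but once the file has been decoded to a string the BOM becomes an ordinary U+FEFF character ahead of the <?xml?> declaration, and parsers commonly reject it with errors such as “content is not allowed in prolog”.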

Detecting and stripping BOM
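
A small sketch covering both cases: a string that was already decoded, and raw bytes. `stripBom` and `hasUtf8Bom` are illustrative helper names, not built-ins.

```javascript
// Strip a leading U+FEFF from an already-decoded string.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Check raw bytes for the UTF-8 BOM signature EF BB BF.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

const raw = '\uFEFF{"name":"Ada"}';

let failed = false;
try {
  JSON.parse(raw); // the BOM is not JSON whitespace
} catch (err) {
  failed = err instanceof SyntaxError;
}
console.log(failed);                         // true
console.log(JSON.parse(stripBom(raw)).name); // "Ada"

const bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x7B, 0x7D]); // BOM + "{}"
console.log(hasUtf8Bom(bytes));              // true
```

If you decode bytes with the standard `TextDecoder`, a leading BOM is dropped by default unless the `ignoreBOM` option is set, so the string-level check mainly matters for text that arrives through other paths.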

Directional Unicode Characters — The Security Risk

Unicode includes characters that control text direction — invisible characters that tell the rendering engine to switch from left-to-right (Latin, etc.) to right-to-left (Arabic, Hebrew) or vice versa. These have legitimate uses in multilingual content but are also exploited for attacks.

🚨

Right-to-Left Override (U+202E) — File Extension Spoofing

Attackers use U+202E to make a filename appear as something benign while hiding its real extension. A file named invoice_\u202Efdp.exe (the override is written here as an escape) displays as invoice_exe.pdf in systems that apply directional overrides. This technique has been used in phishing attacks. Always sanitise U+202E from filenames and user input.

Sanitising directional characters
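
One possible approach, sketched under the assumption that user-supplied filenames and form input never need directional controls. The set below covers the marks, embeddings, overrides, and isolates; `BIDI_CONTROLS` and both function names are illustrative.

```javascript
// Bidirectional controls: ALM (U+061C), LRM/RLM (U+200E/U+200F),
// embeddings and overrides (U+202A-U+202E), isolates (U+2066-U+2069).
const BIDI_CONTROLS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;

function hasBidiControls(str) {
  return str.search(BIDI_CONTROLS) !== -1;
}

function stripBidiControls(str) {
  return str.replace(BIDI_CONTROLS, '');
}

const spoofed = 'invoice_\u202Efdp.exe'; // may display as "invoice_exe.pdf"
console.log(hasBidiControls(spoofed));   // true
console.log(stripBidiControls(spoofed)); // "invoice_fdp.exe"
```

After stripping, the real extension is visible again, so an upload check on the cleaned name sees `.exe`.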

How to Detect Invisible Characters

In VS Code

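Open VS Code settings (Ctrl+,) and search for “renderWhitespace”; setting it to “all” shows dots for spaces and arrows for tabs. Zero-width and other invisible characters are flagged by a separate feature, Unicode Highlight: check that “editor.unicodeHighlight.invisibleCharacters” is enabled (and “editor.unicodeHighlight.ambiguousCharacters” for lookalike letters) so that suspicious characters are outlined in the editor. Enabling “editor.renderControlCharacters” also makes control characters visible.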

In the browser devtools console

```javascript
// Inspect every character in a string
const inspectString = str => {
  // Spreading iterates by code point, so `index` counts code points
  return [...str].map((char, i) => ({
    index: i,
    char: char,
    hex: 'U+' + char.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    isInvisible: [0x200B, 0x200C, 0x200D, 0xFEFF, 0x3164]
      .includes(char.codePointAt(0))
  }));
};

inspectString('hello\u200Bworld'); // spots the ZWSP at index 5
```

Using SymbolNow’s Unicode Inspector

For a quick visual check without writing code, paste any text into the SymbolNow Unicode Inspector. It reveals every character in the string including invisible ones, showing their code point, name, and category — ideal for debugging content that’s “mysteriously” not matching or failing validation.

Sanitisation Strategy — Production Reference

The right sanitisation approach depends on what you’re cleaning for. Here is a tiered strategy covering the most common cases.
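
As a rough sketch of what such tiers could look like: the three tier names (`display`, `search`, `identifier`) and the character sets assigned to each are assumptions made for illustration, not a standard. Adjust them to your own data and threat model.

```javascript
// Hypothetical tiers; each one removes everything the previous tier does.
const BOM_AND_BIDI = /[\uFEFF\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;
const BREAK_HINTS = /[\u200B\u2060\u00AD]/g;    // ZWSP, word joiner, soft hyphen
const SPACE_LIKE = /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g; // NBSP, thin space, ideographic space, ...
const JOINERS_AND_BLANKS = /[\u200C\u200D\u3164\u2800]/g; // ZWNJ, ZWJ, Hangul filler, braille blank

function sanitise(input, tier) {
  // Tier 1, "display": drop BOM and directional controls, keep the rest
  // (ZWJ stays, so emoji sequences survive).
  let out = input.replace(BOM_AND_BIDI, '');
  if (tier === 'display') return out;

  // Tier 2, "search": also drop break hints and map space lookalikes to U+0020.
  out = out.replace(BREAK_HINTS, '').replace(SPACE_LIKE, ' ');
  if (tier === 'search') return out.trim();

  // Tier 3, "identifier": also drop joiners and blank glyphs, then NFC-normalise.
  if (tier === 'identifier') {
    return out.replace(JOINERS_AND_BLANKS, '').normalize('NFC').trim();
  }
  throw new Error(`Unknown tier: ${tier}`);
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // family emoji built with ZWJ
console.log(sanitise(`hi\u202E ${family}`, 'display') === `hi ${family}`); // true
console.log(sanitise('100\u00A0km\u200B', 'search'));     // "100 km"
console.log(sanitise('ad\u200Dmin\u3164', 'identifier')); // "admin"
```

The strictest tier is meant for values that are compared exactly, such as usernames and API keys; free text that users will read back keeps its joiners so emoji and Persian or Arabic text still render correctly.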

Legitimate Uses — When Invisible Characters Are Correct

Not every invisible character in your data is a bug or attack. Here are the legitimate use cases where these characters belong:

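U+200B (ZWSP) in HTML: Long URLs shown in body text can include ZWSP so that the URL wraps at sensible points without a visible space. Put it only in the displayed text, never in the href, because a ZWSP inside the actual address breaks the link. Example (HTML source): https://example.com/&#8203;very-long-path/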

U+00A0 (NBSP) in copy: Keep units with their numbers — “100 km”, “£5 million”, “iOS 26” — to prevent orphaned units at line breaks. Standard in professional typography.

U+200D (ZWJ) in emoji: Multi-person emoji are created by joining simpler emoji with ZWJ. 👨‍👩‍👧‍👦 is a sequence of four emoji joined by three ZWJ characters. Never strip ZWJ from emoji sequences.

U+00AD (soft hyphen) in text: HTML &shy; marks where a word can hyphenate at line breaks. Used in professional typesetting for German compound words and other long terms.

U+200E/U+200F (LRM/RLM) in multilingual text: When mixing Arabic/Hebrew with Latin in the same paragraph, directional marks are necessary for correct rendering. Strip them from user input but preserve them in your own content.

Security Implications — What to Watch For

🚨

Homograph attacks using lookalike Unicode characters

Invisible characters are just one vector. A related attack uses visually similar Unicode characters — for example, Cyrillic “а” (U+0430) instead of Latin “a” (U+0061). The word “аpple.com” looks identical to “apple.com” but is a different domain. Browser address bars now display Punycode for suspicious domains to combat this.

🚨

Prompt injection via invisible characters

Attackers can embed invisible characters between words in text that will be fed to an LLM — for example, hidden instructions in what appears to be normal user-submitted content. If your application passes user text directly to an AI API, always sanitise invisible characters from that input first.

Rule of thumb for input validation: On any user input that goes into a database, search index, authentication check, or AI prompt, run sanitisation to remove zero-width characters and directional overrides before processing. Log instances where these are found — repeated occurrences may indicate intentional attacks.
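
The rule of thumb can be sketched as a function that both cleans the input and reports what it removed, so the caller can log it. The `SUSPICIOUS` list is a deliberately short, assumed subset, and `console.warn` stands in for whatever logging your application uses.

```javascript
// Code points to flag; the names are only there to make logs readable.
const SUSPICIOUS = new Map([
  [0x200B, 'ZERO WIDTH SPACE'],
  [0x200C, 'ZERO WIDTH NON-JOINER'],
  [0x200D, 'ZERO WIDTH JOINER'],
  [0x2060, 'WORD JOINER'],
  [0xFEFF, 'ZERO WIDTH NO-BREAK SPACE (BOM)'],
  [0x202E, 'RIGHT-TO-LEFT OVERRIDE'],
]);

// Returns the cleaned text plus a list of what was removed and where.
function sanitiseAndReport(input) {
  const findings = [];
  let cleaned = '';
  let index = 0;
  for (const char of input) { // iterates by code point
    const cp = char.codePointAt(0);
    if (SUSPICIOUS.has(cp)) {
      findings.push({
        index, // UTF-16 offset in the original string
        codePoint: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
        name: SUSPICIOUS.get(cp),
      });
    } else {
      cleaned += char;
    }
    index += char.length;
  }
  return { cleaned, findings };
}

const { cleaned, findings } = sanitiseAndReport('ignore\u200B previous\u202E text');
if (findings.length > 0) {
  console.warn('invisible characters removed', findings); // hypothetical logging hook
}
console.log(cleaned); // "ignore previous text"
```

Counting findings per user or per IP over time is one way to tell accidental copy-paste artefacts from deliberate probing.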


FAQ

Why does my string comparison fail even though the strings look identical?

Almost certainly an invisible character, most likely a zero-width space (U+200B) or a non-breaking space (U+00A0). Use the inspection code above to find it, then sanitise your input before comparison.

My JSON file fails to parse with an “unexpected token” error on the very first character. Why?

Your file almost certainly has a UTF-8 BOM (U+FEFF) at the start. Confirm it in a hex editor (the file begins with the bytes EF BB BF), or run the contents through a BOM-stripping function before parsing. Save future files as UTF-8 without BOM.

Can I use Unicode normalisation (NFC/NFD) to remove invisible characters?

No. NFC and NFD only handle composed vs decomposed character forms (é as one character vs e + combining accent). The compatibility forms NFKC and NFKD additionally fold some lookalikes, such as NBSP, into a regular space, but none of the four forms removes zero-width characters. You need explicit replacement for those.
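
A quick way to check this in JavaScript, using the built-in `String.prototype.normalize`; the variable names are illustrative.

```javascript
const decomposed = 'e\u0301'; // "e" + combining acute accent
const composed = '\u00E9';    // precomposed "é"
console.log(decomposed.normalize('NFC') === composed); // true

const withZwsp = 'a\u200Bb';
console.log(withZwsp.normalize('NFC').length);  // 3, the ZWSP survives
console.log(withZwsp.normalize('NFKC').length); // 3, still there

console.log('\u00A0'.normalize('NFKC') === ' '); // true, NBSP becomes a space

// Normalisation plus explicit removal
console.log(withZwsp.normalize('NFC').replace(/\u200B/g, '')); // "ab"
```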

Does removing invisible characters affect emoji?

Be careful with U+200D (Zero-Width Joiner). ZWJ is used to construct multi-component emoji like 👨‍👩‍👧 and 🏳️‍🌈. If you strip ZWJ from emoji sequences, you’ll break them into their component parts. Either exempt ZWJ from removal, or limit removal to contexts where emoji aren’t expected.
