My Favorite Bugs: Invalid Surrogate Pairs

TL;DR

A rare bug in a collaborative editing tool caused silent data loss when an edit split the UTF-16 surrogate pair of certain emojis. The cause was traced to how JavaScript strings represent characters above U+FFFF. The bug shows how easily ordinary string operations can produce malformed Unicode.

Developers identified a bug in a collaborative editing platform where inserting certain emojis, specifically those above U+FFFF, caused silent data loss during synchronization. This issue was confirmed to stem from how JavaScript handles Unicode surrogate pairs, affecting real-time content syncing.

The bug surfaced while migrating a legacy editor to a real-time collaborative system built on TipTap, ProseMirror, and Yjs. The editor silently stopped saving changes when users inserted or replaced emojis such as 🤠 or 👩‍🚀. In UTF-16, 🤠 is a single surrogate pair, while 👩‍🚀 is two surrogate pairs joined by a zero-width joiner (U+200D).

Investigation revealed that inserting an emoji adjacent to another could split a surrogate pair at a specific code unit offset, leaving an orphaned surrogate. When the system attempted to encode this malformed string, it threw an uncaught URIError, and the sync process halted without notifying the user.
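The failure can be reproduced in a few lines. This is a minimal sketch, assuming the encoding step behaves like `encodeURIComponent`; the source reports a URIError but does not name the call that threw it, and the variable names are illustrative.

```javascript
// U+1F920 (🤠) is stored as two UTF-16 code units: \uD83E \uDD20.
const cowboy = "\uD83E\uDD20";

// Cutting between the two units leaves an orphaned high surrogate.
const orphan = cowboy.slice(0, 1);

// The intact pair encodes to its four UTF-8 bytes.
const encodedWhole = encodeURIComponent(cowboy);

// The orphaned surrogate cannot be encoded and throws instead.
let errorName = null;
try {
  encodeURIComponent(orphan);
} catch (err) {
  errorName = err.name;
}

console.log(encodedWhole, errorName);
```

If nothing catches that exception, whatever work follows the encoding call never runs, which matches the silent halt described above.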

The core of the problem lies in JavaScript’s internal string representation, where emojis above U+FFFF are stored as two code units (a high surrogate and a low surrogate). When operations like .slice() split these, they produce invalid fragments that cause errors downstream.
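A short sketch of that representation, using only standard string methods. The sample strings are illustrative and not taken from the editor's code.

```javascript
const text = "a\u{1F920}b"; // "a🤠b"

// length counts UTF-16 code units, so the emoji contributes 2.
const unitCount = text.length;

// slice() works on code units and can cut the pair in half.
const cut = text.slice(0, 2); // "a" followed by a lone high surrogate

// Array.from iterates by code point, so each pair stays together.
const codePoints = Array.from(text);
const safeCut = codePoints.slice(0, 2).join(""); // "a🤠"

// A ZWJ sequence such as 👩‍🚀 is two pairs plus U+200D.
const astronaut = "\u{1F469}\u200D\u{1F680}";
const astronautUnits = astronaut.length;
const astronautCodePoints = Array.from(astronaut).length;

console.log(unitCount, astronautUnits, astronautCodePoints);
```

Slicing by code point avoids orphaned surrogates, but it can still separate the parts of a ZWJ sequence. Keeping whole user-perceived characters together needs grapheme segmentation, for example with `Intl.Segmenter` where the runtime provides it.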

Why It Matters

This bug highlights the challenges developers face when handling Unicode characters, especially emojis, in web applications. It caused silent data loss in a critical feature, emphasizing the need for robust Unicode handling and error management in collaborative tools. Understanding such issues is vital as digital communication increasingly relies on emojis and complex characters.

Background

Unicode characters above U+FFFF require surrogate pairs in UTF-16 encoding, which JavaScript uses internally. Previous versions of the involved libraries did not account for splitting these pairs, leading to invalid strings. The issue was first noticed during early testing phases of a new collaborative editor, with the bug only manifesting under specific editing operations involving emoji insertion or replacement.
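The mapping from a code point to its surrogate pair is simple arithmetic defined by UTF-16. The sketch below works through it for U+1F920 (🤠); `toSurrogatePair` is a hypothetical helper name used only for illustration.

```javascript
// Split a supplementary code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f920);
console.log(high.toString(16), low.toString(16)); // d83e dd20
```

Neither half is a valid character on its own, which is why a string that ends up holding only one of them is malformed.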

“The core problem was that inserting or replacing emojis with surrogate pairs could split the pair, creating invalid strings that the system couldn’t handle, resulting in silent sync failures.”

— Lead Developer

“We realized that certain Unicode characters, especially emojis above U+FFFF, could break our synchronization process if not handled carefully.”

— Product Manager

What Remains Unclear

It remains unclear whether similar issues exist in other parts of the application or in other libraries handling Unicode. The full scope of affected characters and the potential for similar bugs in different contexts are still being assessed.

What’s Next

The development team plans to implement stricter Unicode validation and surrogate pair handling in their string operations. They are also reviewing other parts of the codebase for similar vulnerabilities and preparing a patch to prevent future occurrences.
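One possible shape for that validation is sketched below. It is not the team's patch; `findLoneSurrogate` and `snapToBoundary` are hypothetical helpers that check a string for orphaned surrogates and move a cut position off the middle of a pair.

```javascript
const isHigh = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLow = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// Return the index of the first orphaned surrogate, or -1 if none.
function findLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (isHigh(unit)) {
      // charCodeAt returns NaN past the end, which fails isLow.
      if (isLow(str.charCodeAt(i + 1))) {
        i++; // skip the matching low surrogate
        continue;
      }
      return i;
    }
    if (isLow(unit)) return i;
  }
  return -1;
}

// If index falls between the halves of a pair, step back one unit.
function snapToBoundary(str, index) {
  const insidePair =
    index > 0 &&
    index < str.length &&
    isHigh(str.charCodeAt(index - 1)) &&
    isLow(str.charCodeAt(index));
  return insidePair ? index - 1 : index;
}

const sample = "a\u{1F920}b";
console.log(findLoneSurrogate(sample), snapToBoundary(sample, 2));
```

Runtimes that implement ES2024 also provide `String.prototype.isWellFormed()` and `toWellFormed()`, which cover the detection half without a hand-written loop.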

Key Questions

Why do emojis sometimes cause issues in JavaScript strings?

Because emojis above U+FFFF are stored as surrogate pairs in UTF-16, splitting these pairs can produce invalid strings, leading to errors or silent failures in processing.

How was the bug detected?

The bug was identified when a product manager noticed sync failures after inserting specific emojis, and further debugging revealed surrogate pair splits as the cause.

Will this bug affect all emojis?

No. Only emojis that require surrogate pairs (above U+FFFF) are affected, and only when an operation splits a pair at a code unit offset that falls between its two halves.

What are the implications for developers working with Unicode?

Developers should carefully handle string operations involving surrogate pairs and implement error handling for malformed strings to prevent silent failures.
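As an illustration of the error-handling half of that advice, the sketch below encodes a batch of pending updates and reports any malformed one instead of letting the exception stop the whole batch. `encodeUpdates` and its `onError` callback are hypothetical, and `encodeURIComponent` stands in for whatever encoding the real sync path uses.

```javascript
// Encode each update; report malformed ones rather than halting.
function encodeUpdates(updates, onError) {
  const encoded = [];
  for (const [index, text] of updates.entries()) {
    try {
      encoded.push(encodeURIComponent(text));
    } catch (err) {
      // Only handle the malformed-string case; rethrow anything else.
      if (!(err instanceof URIError)) throw err;
      onError({ index, message: err.message });
    }
  }
  return encoded;
}

const failures = [];
const result = encodeUpdates(
  ["ok", "\uD83E", "\u{1F920}"], // the middle entry is a lone surrogate
  (failure) => failures.push(failure)
);
console.log(result, failures);
```

Whether to skip, repair, or reject a malformed update is a product decision. The point of the sketch is only that the failure reaches a handler that can log it or notify the user.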

Is this issue specific to JavaScript?

While this bug is specific to JavaScript’s UTF-16 string handling, similar issues can occur in any system that processes surrogate pairs without proper validation.

Author

Best Of Culinary Team
