My Favorite Bugs: Invalid Surrogate Pairs

TL;DR

A rare bug in a collaborative editing tool occurred when edits around certain emojis split a UTF-16 surrogate pair, leaving an invalid string that made synchronization fail silently. The root cause was that JavaScript strings are indexed by UTF-16 code units rather than by whole characters.

Developers of a real-time collaborative editor found that inserting specific emojis could silently break content synchronization. The cause was mishandling of Unicode surrogate pairs.

The bug was traced to the way JavaScript handles Unicode characters beyond U+FFFF, which require a surrogate pair (two 16-bit code units) in UTF-16. When users inserted emojis like 🤠 (U+1F920) adjacent to each other, the underlying CRDT library, Yjs, could split a surrogate pair and produce an invalid string. During synchronization, encodeURIComponent threw an error on these invalid strings, so content silently failed to save.
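The failure mode can be reproduced in a few lines. This is a minimal sketch, not the editor's code; the variable names are illustrative, and the slice at index 3 is simply one way to cut through a pair.

```javascript
// U+1F920 (🤠) lies above U+FFFF, so UTF-16 stores it as two code units.
const cowboy = "\u{1F920}";
const twoCowboys = cowboy + cowboy;

console.log(cowboy.length);     // 2 code units for one visible character
console.log(twoCowboys.length); // 4

// A well-formed string encodes without trouble.
const encoded = encodeURIComponent(cowboy);
console.log(encoded); // %F0%9F%A4%A0

// Slicing at an odd code unit index cuts the second pair in half,
// leaving one whole emoji followed by a lone high surrogate.
const broken = twoCowboys.slice(0, 3);

let errorName = null;
try {
  encodeURIComponent(broken);
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // URIError
```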

The problem was confirmed by testing specific emoji combinations, notably those that require surrogate pairs. It was linked to the splice method in lib0, used internally by Yjs, which relied on JavaScript's .slice() method. Because .slice() indexes by UTF-16 code unit, a splice that landed inside a surrogate pair produced a string with an orphaned surrogate, which then caused errors during URI encoding.
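To make the mechanism concrete, here is a sketch of a splice built on .slice() and a boundary-aware variant. Both functions are hypothetical illustrations and are not the lib0 or Yjs source. Widening the removed range, as safeSplice does, is only one possible policy; in a CRDT, where all clients must agree on positions, a real fix may need a different strategy.

```javascript
// Hypothetical naive splice built on .slice(); NOT the actual lib0 code.
function naiveSplice(text, index, removeCount, insert = "") {
  return text.slice(0, index) + insert + text.slice(index + removeCount);
}

const isHighSurrogate = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// True when `index` falls between the two halves of a surrogate pair.
function splitsPair(text, index) {
  return (
    index > 0 &&
    index < text.length &&
    isHighSurrogate(text.charCodeAt(index - 1)) &&
    isLowSurrogate(text.charCodeAt(index))
  );
}

// Illustrative variant: widen the removed range to whole code points.
function safeSplice(text, index, removeCount, insert = "") {
  let start = index;
  let end = index + removeCount;
  if (splitsPair(text, start)) start -= 1; // pull back to the high half
  if (splitsPair(text, end)) end += 1;     // push past the low half
  return text.slice(0, start) + insert + text.slice(end);
}

const sample = "a\u{1F920}b"; // code units: a, D83E, DD20, b

// Removing only the low half leaves an orphaned high surrogate behind.
const damaged = naiveSplice(sample, 2, 1, "x");
console.log(isHighSurrogate(damaged.charCodeAt(1))); // true

// The boundary-aware variant removes the whole emoji instead.
console.log(safeSplice(sample, 2, 1, "x")); // axb
```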

Why It Matters

This bug highlights the complexities of Unicode handling in web applications, especially those involving real-time collaboration and text encoding. It underscores the importance of understanding character encoding at the code unit and code point levels, particularly for emojis and other extended Unicode characters. For developers, it serves as a reminder to handle surrogate pairs carefully to prevent silent data corruption or loss in collaborative tools.

Background

Prior to this discovery, the team had observed sporadic sync failures without clear cause, often in scenarios involving emoji editing. The issue was elusive because it only manifested during specific operations, such as inserting or replacing emojis that involve surrogate pairs. Unicode’s complexity, especially with characters outside the Basic Multilingual Plane (BMP), has long been a source of subtle bugs in web development.

“This bug was caused by how JavaScript handles surrogate pairs in UTF-16, which led to invalid strings during certain edits. Understanding these nuances is critical for building reliable collaborative tools.”

— Lead Developer

“Inserting specific emojis triggered the issue, revealing the hidden complexity of Unicode encoding that many developers overlook.”

— Product Manager

What Remains Unclear

While the bug has been identified and a fix implemented, it is not yet clear whether all instances of similar issues have been fully resolved or if other edge cases involving Unicode characters might cause similar silent failures.

What’s Next

The development team plans to release a patch that improves Unicode handling, including better validation of surrogate pairs during editing operations. Further testing will be conducted to ensure robustness against similar encoding issues.
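One way such validation could look is a scan for unpaired surrogates before a string is handed to the sync layer. The function name findLoneSurrogate is an assumption for illustration, not something from the team's patch. Newer JavaScript runtimes also provide String.prototype.isWellFormed() and toWellFormed() for the same purpose; the manual loop below avoids depending on them.

```javascript
// Returns the index of the first unpaired surrogate, or -1 if the
// string is well-formed UTF-16. Illustrative helper, not the team's code.
function findLoneSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = text.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return i;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      return i; // low surrogate with no preceding high surrogate
    }
  }
  return -1;
}

console.log(findLoneSurrogate("a\u{1F920}b")); // -1
console.log(findLoneSurrogate("a\uD83E"));     // 1
console.log(findLoneSurrogate("\uDD20a"));     // 0
```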

Key Questions

What are surrogate pairs in Unicode?

Surrogate pairs are two 16-bit code units used in UTF-16 encoding to represent characters outside the Basic Multilingual Plane, such as many emojis and historic scripts.
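The arithmetic behind a surrogate pair is short enough to write out. The sketch below follows the standard UTF-16 encoding rule; the function name toSurrogatePair is chosen here for illustration.

```javascript
// Splits a code point above U+FFFF into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f920); // 🤠
console.log(high.toString(16), low.toString(16)); // d83e dd20
console.log(String.fromCharCode(high, low) === "\u{1F920}"); // true
```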

Why did this bug cause silent sync failures?

The bug produced strings containing orphaned surrogate halves. encodeURIComponent throws a URIError on such strings, and that error stopped the data sync without alerting users.
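A failure like this stays silent only if the thrown error is never reported. The sketch below shows one way to turn it into an explicit result that a caller can show to the user. The function buildSyncPayload, the payload format, and the result shape are all hypothetical; the source does not describe how the editor's sync layer is structured.

```javascript
// Hypothetical payload builder: reports malformed text instead of letting
// the URIError escape unnoticed. Names and payload format are assumptions.
function buildSyncPayload(docId, text) {
  try {
    const body =
      "doc=" + encodeURIComponent(docId) +
      "&text=" + encodeURIComponent(text);
    return { ok: true, body };
  } catch (err) {
    if (err instanceof URIError) {
      // Lone surrogate in the text: surface it rather than drop the save.
      return { ok: false, reason: "malformed-text" };
    }
    throw err; // anything else is unexpected and should propagate
  }
}

console.log(buildSyncPayload("doc-1", "a\u{1F920}"));
// { ok: true, body: 'doc=doc-1&text=a%F0%9F%A4%A0' }

console.log(buildSyncPayload("doc-1", "a\uD83E"));
// { ok: false, reason: 'malformed-text' }
```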

Could this issue affect other applications?

Yes, any application that relies on JavaScript’s UTF-16 string handling and performs operations like slicing or encoding on surrogate pairs may be susceptible to similar issues.
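Truncating text for a preview or title is a common place where the same mistake appears outside collaborative editors. The sketch below contrasts truncation by code unit with truncation by code point; truncateByCodePoints is an illustrative name. It keeps surrogate pairs intact but does not account for grapheme clusters, such as emoji joined by zero-width joiners, which span several code points.

```javascript
// Truncates by code points rather than UTF-16 code units.
// Array.from iterates a string by code point, so pairs stay together.
function truncateByCodePoints(text, maxCodePoints) {
  return Array.from(text).slice(0, maxCodePoints).join("");
}

const label = "hi\u{1F920}\u{1F920}"; // 4 code points, 6 code units

// Code unit truncation cuts the first emoji in half.
const byUnits = label.slice(0, 3);
console.log(byUnits.charCodeAt(2).toString(16)); // d83e (lone high surrogate)

// Code point truncation keeps the emoji whole.
const byPoints = truncateByCodePoints(label, 3);
console.log(byPoints === "hi\u{1F920}"); // true
```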

How was the bug fixed?

The development team enhanced handling of surrogate pairs, ensuring that operations like .slice() do not split surrogate pairs or produce invalid strings, and added validation during editing.

Will this affect future emoji use in collaborative tools?

Proper handling of surrogate pairs will improve robustness for all Unicode characters, including emojis, preventing similar silent failures in future updates.

Author

Best CAD Papers Team
