What is a “surrogate pair”? – Explains characters represented by two 16-bit code units in UTF-16

What is a “surrogate pair”?

In Unicode, a surrogate pair is a pair of 16-bit code units that together represent one character in UTF-16 when that character cannot be encoded in a single 16-bit code unit. Surrogate pairs are used specifically for characters outside the Basic Multilingual Plane (BMP), that is, code points U+10000 through U+10FFFF, which include most emoji, many historic scripts, and less common CJK ideographs.

Understanding Unicode

Unicode is a character encoding standard that assigns a unique numerical value, called a code point, to each character. The first version, Unicode 1.0, used a fixed 16-bit encoding, which allowed for a total of 65,536 code points. As more scripts and symbols needed to be covered, this limit proved too small.

To accommodate a larger number of characters, Unicode 2.0 introduced surrogate pairs and extended the code space up to U+10FFFF. A surrogate pair is a combination of two 16-bit code units: a high surrogate followed by a low surrogate.

Working with Surrogate Pairs

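To represent a character outside the BMP, UTF-16 uses two code units instead of one. The high (leading) surrogate is a code unit from the range U+D800 to U+DBFF, while the low (trailing) surrogate is a code unit from the range U+DC00 to U+DFFF. These 2,048 values are reserved for this purpose and are never assigned to characters on their own. To build the pair, the encoder subtracts 0x10000 from the code point, which leaves a 20-bit value; the upper 10 bits are added to 0xD800 to give the high surrogate, and the lower 10 bits are added to 0xDC00 to give the low surrogate.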
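The split into a high and a low surrogate can be sketched in a few lines of Python. This is a minimal illustration of the UTF-16 arithmetic, not a replacement for a real encoder; the function name to_surrogate_pair is a hypothetical helper introduced here.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F602)
print(f"{high:04X} {low:04X}")  # D83D DE02
```

Code points inside the BMP are rejected because they fit in one code unit and never need a pair.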

For example, the character U+1F602 (FACE WITH TEARS OF JOY emoji) is written in UTF-16 as “\uD83D\uDE02”. Its code point is split into two code units: “\uD83D” (high surrogate) and “\uDE02” (low surrogate). Together, these code units form the surrogate pair representation of the character.
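Going the other way, the code units D83D and DE02 from the example can be combined back into a single code point. The sketch below assumes Python 3; from_surrogate_pair is a hypothetical helper name, and the result is cross-checked against Python's built-in UTF-16 encoder.

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = from_surrogate_pair(0xD83D, 0xDE02)
print(hex(code_point))  # 0x1f602

# Cross-check: encode the same character as big-endian UTF-16.
encoded = chr(code_point).encode("utf-16-be")
print(encoded.hex())  # d83dde02
```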

Software systems must support surrogate pairs to correctly handle and interpret characters beyond the BMP. Without proper support, such characters may be rendered as garbled or missing glyphs, and operations that work on individual code units, such as measuring length, truncating, or reversing a string, can split a pair and leave an invalid lone surrogate. Text processing algorithms, fonts, and rendering engines should therefore handle surrogate pairs as the Unicode Standard specifies.
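One common source of bugs is the difference between counting characters and counting 16-bit code units. The following sketch assumes Python 3, where a str is a sequence of code points rather than UTF-16 code units; utf16_length and contains_surrogates are hypothetical helper names chosen for illustration.

```python
def utf16_length(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16.

    Raises UnicodeEncodeError if text contains a surrogate code point.
    """
    return len(text.encode("utf-16-le")) // 2


def contains_surrogates(text: str) -> bool:
    """True if text holds surrogate code points, which are not valid characters."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


sample = "a\U0001F602"                  # "a" followed by U+1F602
print(len(sample))                      # 2 code points
print(utf16_length(sample))             # 3 code units: one for "a", two for the pair
print(contains_surrogates(sample))      # False
print(contains_surrogates("\ud83d"))    # True: a high surrogate without its partner
```

In environments whose strings are indexed by UTF-16 code units, the length reported for the same text would match the second number rather than the first.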

Conclusion

Surrogate pairs play a vital role in UTF-16 by allowing the representation of characters outside the Basic Multilingual Plane. They make it possible to support a vast range of additional characters, including further scripts, symbols, and emoji, while keeping 16-bit code units. Proper handling of surrogate pairs is essential for maintaining the integrity and readability of text in software systems and applications.
