Unicode string length can mean different things in different languages

Rmag Breaking News

I was working on a text processing example across several different programming languages, including C++, Java, Rust, and Scala, and noticed some discrepancies in the results.

It turned out that these are due to Unicode string length meaning different things in different languages:

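In Java, Scala, and other JVM languages, the length() method returns the number of UTF-16 code units in the string. For a word like “résumé” written with precomposed accented letters, that matches the number of characters a human reader would count, but it is not a glyph count: characters outside the Basic Multilingual Plane, such as most emoji, count as two.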

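By contrast, in C++, Go, and Rust, the equivalent functions and methods return the number of bytes used to store the string. Rust strings are always UTF-8 and Go strings conventionally are, so each “é” counts as two bytes. Counting characters (code points) takes an extra step, such as chars().count() in Rust or a conversion to []rune in Go.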

Java (jshell):

jshell> "résumé".length()
$1 ==> 6

Rust (evcxr):

evcxr
Welcome to evcxr. For help, type :help
>> "résumé".len()
8
>> "résumé".chars().count()
6

Go:

len("résumé")         // returns 8
len([]rune("résumé")) // returns 6
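To compare the three notions of length side by side, here is a minimal Rust sketch using only the standard library. The `Lengths` struct and `lengths` function are hypothetical names made up for this illustration, and the strings are written with escapes so that it is unambiguous which code points they contain. It reports UTF-8 bytes (what Rust's len() and Go's len() return), code points (chars().count(), or []rune in Go), and UTF-16 code units (what Java's length() returns).

```rust
/// Three different answers to "how long is this string?".
/// (Hypothetical helper type, for illustration only.)
#[derive(Debug, PartialEq)]
struct Lengths {
    utf8_bytes: usize,  // what str::len() returns in Rust
    code_points: usize, // Unicode scalar values, via chars()
    utf16_units: usize, // what String.length() returns in Java
}

fn lengths(s: &str) -> Lengths {
    Lengths {
        utf8_bytes: s.len(),
        code_points: s.chars().count(),
        utf16_units: s.encode_utf16().count(),
    }
}

fn main() {
    let samples = [
        // "résumé" with precomposed é (U+00E9)
        "r\u{e9}sum\u{e9}",
        // "résumé" with plain e followed by a combining acute accent (U+0301)
        "re\u{301}sume\u{301}",
        // a single emoji outside the Basic Multilingual Plane (U+1F600)
        "\u{1F600}",
    ];
    for s in samples {
        println!("{:?} -> {:?}", s, lengths(s));
    }
}
```

The second sample looks identical to the first on screen, yet every count differs, which suggests that none of these three numbers is really a count of what a reader would call characters. Counting those (grapheme clusters) needs Unicode segmentation rules that Rust's standard library does not provide.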

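It’s a bit more complicated in C++, where std::string::size() counts char elements (bytes), so the result depends on how the string is encoded, while other string types such as std::u16string and std::u32string count their own 16-bit and 32-bit units instead.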
