Unicode string length can mean different things in different languages

I was working on a text processing example across several different programming languages, including C++, Java, Rust, and Scala, and noticed some discrepancies in the results.

It turned out that these are due to Unicode string length meaning different things in different languages:

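In Java, Scala, etc., the length() method returns the number of UTF-16 code units. For most everyday text this matches the number of characters a human reader would count, although characters outside the Basic Multilingual Plane (many emoji, for example) count as two.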

By contrast, in C++, Go, and Rust, the equivalent functions and methods return the number of bytes required to store the string. In UTF-8, each “é” takes two bytes, so “résumé” comes out at 8 rather than 6.

jshell> "résumé".length()

$1 ==> 6

❯ evcxr

Welcome to evcxr. For help, type :help

>> "résumé".len()

8

>> "résumé".chars().count()

6

len([]rune("résumé")) // returns 6
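To see the different notions of length side by side, here is a minimal Rust sketch. The helper name `lengths` is made up for illustration, and the second and third test strings (a decomposed spelling of the same word and a single emoji) are my own additions rather than part of the original experiment. It assumes that counting UTF-16 code units is a fair stand-in for what Java's length() reports.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values) for `s`.
/// `lengths` is a hypothetical helper written for this post.
fn lengths(s: &str) -> (usize, usize, usize) {
    let utf8_bytes = s.len(); // what Rust's len() reports
    let utf16_units = s.encode_utf16().count(); // assumed equivalent of Java's length()
    let code_points = s.chars().count(); // comparable to Go's len([]rune(s))
    (utf8_bytes, utf16_units, code_points)
}

fn main() {
    // Precomposed "é" (U+00E9), as in the REPL examples above.
    println!("{:?}", lengths("r\u{e9}sum\u{e9}")); // (8, 6, 6)

    // The same word spelled with "e" followed by a combining acute accent (U+0301).
    println!("{:?}", lengths("re\u{301}sume\u{301}")); // (10, 8, 8)

    // A single character outside the Basic Multilingual Plane.
    println!("{:?}", lengths("\u{1F600}")); // (4, 2, 1)
}
```

Note that none of the three counts is guaranteed to match what a reader perceives as one character: the decomposed spelling still looks like six letters on screen, yet it reports 8 code points.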

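Apparently it’s a bit more complicated in C++: std::string::length() returns the number of bytes, and counting code points or characters generally means decoding the UTF-8 yourself or reaching for a library such as ICU.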
