Comments for https://horstmann.com/unblog/2023-10-03/index.html
- Con Cunningham @Greycon
Hi Cay - just took delivery of the 13th Edition, Volume 2. Question: what exactly is the meaning of the regex used in split("\\b{g}")? I know the b is a boundary, but I can't find the {g} anyplace. Braces always seem to be used for a numerical repeat value. Thanks for all the work! Con
- Cay Horstmann @cayhorstmann
Hi, that's a grapheme cluster boundary. See the last group in Table 2.12.
Splitting along grapheme cluster boundaries breaks a string into what humans perceive as the constituent characters:
"Ciao 🇮🇹".split("\\b{g}") // An array with the six elements "C", "i", "a", "o", " ", "🇮🇹"
(The Italian flag actually uses two Unicode characters.)
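For anyone who wants to try this, here is a minimal sketch of the split described above. The class and method names (GraphemeSplit, clusters) are made up for illustration, and the sketch assumes Java 9 or later, where the regex engine accepts \b{g}. In Java source the backslash has to be doubled, and the flag is written with Unicode escapes so that the source file encoding does not matter.

```java
import java.util.Arrays;

// Hypothetical demo class, not taken from the book.
final class GraphemeSplit {
    // Splits a string at grapheme cluster boundaries (regex syntax available since Java 9).
    static String[] clusters(String s) {
        return s.split("\\b{g}");
    }

    public static void main(String[] args) {
        // "Ciao " followed by the Italian flag, U+1F1EE U+1F1F9
        String s = "Ciao \uD83C\uDDEE\uD83C\uDDF9";
        String[] parts = clusters(s);
        System.out.println(s.length());              // 9 char values
        System.out.println(parts.length);            // 6 grapheme clusters, as in the reply above
        System.out.println(Arrays.toString(parts));
    }
}
```

The flag alone accounts for four of the nine char values, which is why counting with length() overstates what a reader sees.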
- In reply to system: @holger
Logical characters, or nowadays “grapheme clusters”, were never guaranteed to consist of a single char or code point. Take, for example, c̆ n̂ a̅ e̊ n̂ c̜ or e̾, which are not as exotic as a pirate flag emoji and existed even in Unicode 1.0. Dealing with individual char values should have been reserved for special cases since day one. By the way, s.codePoints().toArray() works in Java 8 too; it’s just not immediately visible in the documentation because it is inherited from CharSequence. In Java 9, String overrides the method with an optimized version; that’s why it appears as if it were a new method.
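To make the difference between char values and code points concrete, a small sketch follows. The class name CodePointDemo, the helper codePointsOf, and the sample string are assumptions chosen for illustration; only codePoints() itself is the method discussed above.

```java
// Hypothetical demo class; the sample string is chosen for illustration only.
final class CodePointDemo {
    // Returns the code points of s via the codePoints method declared in CharSequence.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'c' followed by U+0306 COMBINING BREVE, then the Italian flag (U+1F1EE U+1F1F9)
        String s = "c\u0306\uD83C\uDDEE\uD83C\uDDF9";
        System.out.println(s.length());               // 6 char values
        System.out.println(codePointsOf(s).length);   // 4 code points
    }
}
```

A reader would perceive this string as two characters, so neither the char count nor the code point count matches the grapheme cluster count.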
- Cay Horstmann @cayhorstmann
Thanks for the correction about the introduction of codePoints. I looked into the source. In Java 8, it was a default method of CharSequence, but in Java 9, compact strings were introduced (https://openjdk.org/jeps/254), and the codePoints method was reimplemented for java.lang.String.
Ideally, the Java designers would have realized much earlier that a single char isn't all that meaningful. But in the spirit of the time, accented letters could be dealt with through Unicode normalization (https://www.unicode.org/reports/tr15/), where there always was a normalized variant expressible with a single char.
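As a rough illustration of the normalization idea, the following sketch composes a base letter and a combining accent into a single char using java.text.Normalizer. The class name NormalizeDemo, the helper nfc, and the choice of e with an acute accent are assumptions made for this example, not something taken from the post.

```java
import java.text.Normalizer;

// Hypothetical demo class; the sample input is chosen for illustration only.
final class NormalizeDemo {
    // Converts s to Unicode Normalization Form C (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String decomposed = "e\u0301";
        String composed = nfc(decomposed);
        System.out.println(decomposed.length());          // 2
        System.out.println(composed.length());            // 1
        System.out.println(composed.equals("\u00E9"));    // true: U+00E9, e with acute
    }
}
```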