
Comments for https://horstmann.com/unblog/2023-10-03/index.html

By System @system

2024-09-30 18:27:29.661Z



6 comments

Con Cunningham @Greycon
2024-09-30 18:27:29.805Z
Hi Cay - just took delivery of the 13th Edition, Volume 2. Question: what exactly is the meaning of the regex used in split("\\b{g}")? I know the \b is a boundary, but I can't find the {g} anyplace. Braces always seem to be used for a numerical repeat value. Thanks for all the work! Con
1. Cay Horstmann @cayhorstmann
  2024-09-30 19:20:53.416Z
  Hi, that's a grapheme cluster boundary. See the last group in Table 2.12.
  
  Splitting along grapheme cluster boundaries breaks a string into what humans perceive as the constituent characters:
  
  "Ciao 🇮🇹".split("\\b{g}") // An array with the six elements "C", "i", "a", "o", " ", "🇮🇹"
  
  (The Italian flag actually uses two Unicode code points, the regional indicator symbols for I and T.)
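
To make the difference between char values, code points, and grapheme clusters concrete, here is a minimal sketch. The class name `GraphemeSplit` and the helper `clusters` are made up for illustration, and the `\b{g}` boundary matcher needs Java 9 or later.

```java
class GraphemeSplit {
    // Hypothetical helper: splits a string at grapheme cluster boundaries.
    // The \b{g} boundary matcher is available in java.util.regex since Java 9.
    static String[] clusters(String s) {
        return s.split("\\b{g}");
    }

    public static void main(String[] args) {
        // "Ciao " followed by the Italian flag, written as two regional
        // indicator symbols (U+1F1EE, U+1F1F9), each a surrogate pair.
        String s = "Ciao \uD83C\uDDEE\uD83C\uDDF9";

        System.out.println(s.length());                      // 9 char values
        System.out.println(s.codePointCount(0, s.length())); // 7 code points
        System.out.println(clusters(s).length);              // 6 clusters, as described above
    }
}
```

The three counts differ because the flag occupies four char values and two code points, but is perceived as one character.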
In reply to system:
@holger
2025-01-23 09:56:46.816Z
Logical characters, or nowadays “grapheme clusters”, were never guaranteed to consist of a single char or code point. Take, for example, c̆ n̂ a̅ e̊ n̂ c̜ or e̾, which are not as exotic as a pirate flag emoji and exist even in Unicode 1.0. Dealing with individual char values should have been reserved for special cases since day one. By the way, s.codePoints().toArray() works in Java 8 too; it’s just not immediately visible in the documentation because it is inherited from CharSequence. In Java 9, String overrides the method with an optimized version; that’s why it appears as if it were a new method.
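
As a small illustration of the codePoints point, the sketch below calls the method through the CharSequence interface, so it works for a String and a StringBuilder alike. The class name `CodePointsDemo` and the helper `codePointsOf` are hypothetical, and the sample assumes that the first example above is a "c" followed by U+0306 COMBINING BREVE.

```java
import java.util.Arrays;

class CodePointsDemo {
    // Hypothetical helper: codePoints() is declared on CharSequence (Java 8+),
    // so any implementation can be passed in.
    static int[] codePointsOf(CharSequence text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // Assumed spelling of the first example: 'c' plus U+0306 COMBINING BREVE.
        String s = "c\u0306";
        System.out.println(s.length());                       // 2
        System.out.println(Arrays.toString(codePointsOf(s))); // [99, 774]

        // A StringBuilder holding one supplementary code point (U+1F1EE).
        StringBuilder sb = new StringBuilder("\uD83C\uDDEE");
        System.out.println(Arrays.toString(codePointsOf(sb))); // [127470]
    }
}
```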
1. Cay Horstmann @cayhorstmann
  2025-01-23 20:42:11.338Z
  Thanks for the correction about the introduction of codePoints. I looked into the source. In Java 8, it was a default method of CharSequence, but in Java 9, compact strings were introduced (https://openjdk.org/jeps/254), and the codePoints method was reimplemented for java.lang.String.
  
  Ideally, the Java designers would have realized much earlier that a single char isn't all that meaningful. But in the spirit of the time, accented letters could be dealt with through Unicode normalization (https://www.unicode.org/reports/tr15/), on the assumption that there always was a normalized variant expressible with a single char.
  @holger
  2025-02-03 15:50:51.398Z
  Well, I specifically picked examples that have no precomposed form, so even after normalization, the developer has to deal with characters consisting of multiple char values.
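
The effect of a missing precomposed form can be checked with java.text.Normalizer. This is a sketch only: the class name `NormalizationDemo` and the helper `nfc` are invented for illustration, and it assumes that n̂ is spelled as "n" followed by U+0302 COMBINING CIRCUMFLEX ACCENT.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Hypothetical helper: normalizes to NFC (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' + U+0301 COMBINING ACUTE ACCENT has a precomposed form, U+00E9.
        String eAcute = "e\u0301";
        System.out.println(nfc(eAcute).length());      // 1

        // Assumed spelling of the example: 'n' + U+0302 COMBINING CIRCUMFLEX ACCENT.
        // There is no precomposed code point, so NFC leaves two char values.
        String nCircumflex = "n\u0302";
        System.out.println(nfc(nCircumflex).length()); // 2
    }
}
```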
  
  Cay Horstmann @cayhorstmann
  2025-03-02 14:46:45.330Z
  Thanks, I had no idea that there are letters that require composition. I'd like to learn more. Are all of these actually used for some language? (E.g. I could not find a match for e̾). Is there a comprehensive list?
  