Comments for https://horstmann.com/unblog/2023-10-03/index.html

By System @system
    2024-09-30 18:27:29.661Z
    • 5 comments
    1. Con Cunningham @Greycon
        2024-09-30 18:27:29.805Z

        Hi Cay - just took delivery of the 13th Edition, Volume 2. Question - what exactly is the meaning of the regex used in split("\\b{g}")? I know the b is a boundary, but I can't find the {g} anyplace. Braces always seem to be used for a numerical repeat value. Thanks for all the work! Con

        1. Cay Horstmann @cayhorstmann
            2024-09-30 19:20:53.416Z

            Hi, that's a grapheme cluster boundary. See the last group in Table 2.12.

            Splitting along grapheme cluster boundaries breaks a string into what humans perceive as the constituent characters:

            "Ciao 🇮🇹".split("\\b{g}") // An array with the six elements "C", "i", "a", "o", " ", "🇮🇹"

            (The Italian flag actually consists of two Unicode code points.)
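A minimal sketch of the split described above, assuming Java 9 or later (where `\b{g}` is available). The class and method names are made up for illustration, and the flag is written with surrogate-pair escapes only so that the source file stays ASCII.

```java
import java.util.Arrays;

public class GraphemeSplit {
    // Splits a string at extended grapheme cluster boundaries.
    // The backslash is doubled in Java source so that the regex engine sees \b{g}.
    static String[] graphemes(String s) {
        return s.split("\\b{g}");
    }

    public static void main(String[] args) {
        // "Ciao " followed by the Italian flag (code points U+1F1EE and U+1F1F9)
        String s = "Ciao \uD83C\uDDEE\uD83C\uDDF9";
        String[] parts = graphemes(s);
        System.out.println(s.length());     // number of char values
        System.out.println(parts.length);   // number of grapheme clusters
        System.out.println(Arrays.toString(parts));
    }
}
```

The flag alone occupies four char values, which is why the char count and the grapheme cluster count differ.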

          • In reply to system:
            @holger
              2025-01-23 09:56:46.816Z

              Logical characters or nowadays “grapheme clusters” were never guaranteed to consist of a single char or codepoint. Take, for example c̆ n̂ a̅ e̊ n̂ c̜ or e̾, which are not as exotic as a pirate flag emoji and exist even in Unicode 1.0. Dealing with individual char values should have been reserved to special cases since day one. By the way, s.codePoints().toArray() works in Java 8 too; it’s just not immediately visible in the documentation because it is inherited from CharSequence. In Java 9, String overrides the method with an optimized version; that’s why it appears as if it were a new method.
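To make the difference between char values and code points concrete, here is a small sketch using `codePoints()`, which is available from Java 8 on as noted above. The class and helper names are hypothetical.

```java
public class CodePointDemo {
    // Returns the code points of a string; codePoints() comes from CharSequence.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "Ciao " followed by the Italian flag (code points U+1F1EE and U+1F1F9)
        String s = "Ciao \uD83C\uDDEE\uD83C\uDDF9";
        int[] cps = codePoints(s);
        System.out.println("char values: " + s.length());
        System.out.println("code points: " + cps.length);
        for (int cp : cps) {
            // Print each code point in the usual U+XXXX notation
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

Each of the two regional indicator symbols needs a surrogate pair, so the string has more char values than code points, and more code points than perceived characters.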

              1. Cay Horstmann @cayhorstmann
                  2025-01-23 20:42:11.338Z

                  Thanks for the correction about the introduction of codePoints. I looked into the source. In Java 8, it was a default method of CharSequence, but in Java 9, compact strings were introduced (https://openjdk.org/jeps/254), and the codePoints method was reimplemented for java.lang.String.

                  Ideally, the Java designers would have realized much earlier that a single char isn't all that meaningful. But in the spirit of the time, accented letters could be dealt with through Unicode normalization (https://www.unicode.org/reports/tr15/), where there was always a normalized variant expressible with a single char.

                  1. @holger
                      2025-02-03 15:50:51.398Z

                      Well, I specifically picked examples which have no precomposed form, so even after normalization, the developer still has to deal with characters consisting of multiple char values.
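The two positions in this exchange can be sketched with `java.text.Normalizer`. The sample strings are illustrative choices: "e" plus a combining acute accent has a precomposed form, while "c" plus a combining breve (the first of the examples above) does not. The class and helper names are hypothetical, and the grapheme count assumes Java 9 or later.

```java
import java.text.Normalizer;

public class NormalizationDemo {
    // Applies canonical composition (NFC).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Counts grapheme clusters by splitting at cluster boundaries.
    static int graphemeCount(String s) {
        return s.split("\\b{g}").length;
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301"; // e + combining acute accent
        String cBreve = "c\u0306"; // c + combining breve

        // NFC folds e + acute into the single precomposed char U+00E9
        System.out.println(nfc(eAcute).length());
        // No precomposed c-with-breve exists, so NFC leaves two char values
        System.out.println(nfc(cBreve).length());
        // Still a single perceived character
        System.out.println(graphemeCount(nfc(cBreve)));
    }
}
```

Normalization shortens the first string to one char, but the second stays at two char values while still forming a single grapheme cluster.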