[isabelle-dev] Script characters in words

Makarius makarius at sketis.net
Wed Mar 6 14:51:16 CET 2024

On 06/03/2024 13:19, Lawrence Paulson wrote:
> I have noticed something unusual, which is true both on Isabelle2023 and in the current version: most script letters seem to be regarded as word characters (as we would expect), but not all of them.
> To see this, it is enough to type
> 	thm 𝒜_def ℬ_def 𝒞_def 𝒟_def ℰ_def ℱ_def ℛ_def 𝒮_def
> and double click on the letter f. It's pretty hard to explain why A should be treated differently from B here, or R from S.

You have rediscovered a long-standing problem of "Unicode". It started out as 
a universal and uniform 16-bit character sets many decades ago (UCS-2), but is 
now a conglomerate of many add-ons and adhoc complexity, in order to 
approximate text representation for every language on the planet (including 

The problem here is that the script "𝒜" was introduced much later than "ℬ": 
thus it requires 2 16-bit characters instead of just 1. Many Java (or 
JavaScript) programs will usually get it wrong, because they work with type 
Char ("character") instead of Int (codepoint).

This complexity (and failure) of Unicode is avoided in Isabelle by using a 
more stable and uniform notion of "Isabelle symbols". The Isabelle/jEdit 
front-end uses "poor-man's rendering of Isabelle symbols in Unicode". Thus the 
underlying jEdit text editor may get things wrong occasionally: it would be 
better to work natively with Isabelle symbols everywhere.

Here are some snippets to illustrate the situation:

(*Isabelle/ML symbols: good*)
ML ‹Symbol.explode "𝒜_def ℬ_def" |> map (fn s => (s, Symbol.is_letter s))›

(*Isabelle/Scala symbols or UTF-16 codepoints: good*)
Symbol.explode("𝒜_def ℬ_def").map(s => (s, Symbol.is_letter(s)))

(*Java UTF-16 codepoints: good*)
Codepoint.iterator("𝒜_def ℬ_def").toList.map(c => (c, Character.isLetter(c)))

(*Java UCS-2 characters: bad*)
"𝒜_def ℬ_def".toList.map(c => (c, Character.isLetter(c)))

The first one is for the Isabelle/jEdit document buffer, within a formal 
theory context. The other ones are for the Console/Scala REPL: here you may 
notice that the Java GUI component cannot display the "𝒜" anymore, after it 
got chopped up into 2 characters.

I am already used to post tickets to the average Java project, in order to 
make such fine-points more correct. But for jEdit we have a certain status-quo 
that won't move much.


More information about the isabelle-dev mailing list