.. include:: ../../global.rst

**********************
Characters and Strings
**********************

*We're not in Kansas any more, Toto!*

.. aside::

   Actually, I did the easy :lname:`C`-style strings and characters
   initially but went back and re-wrote it all and can now pretend I
   did it properly in the first place.  No-one will ever know!

There's no beating about the bush, we need to handle proper
multi-character set strings from the get-go.  We don't want a
:lname:`Python` 2 vs 3 debacle.  Not for strings, anyway.

.. _`idio unicode`:

Unicode
=======

.. aside::

   Plus the disadvantage of being a native English speaker is that in
   the further flung regions of the world people want to practice
   their English and certainly don't want to listen to me insist that
   my `hovercraft is full of eels`_.

I'm not a multi-lingual expert, indeed barely literate in one
language, so some of the nuance of multi-character set handling may
be lost on me.

We're choosing Unicode_ (arguably ISO10646_) because of its name
familiarity, even if the actual implementation is less familiar to
everyone.

The broad thrust of Unicode is to allocate a *code point*, an
integer, to the most common characters and support several
*combining* code points to create the rest.  From a Western European
viewpoint, we might have an "e acute" character, é, but also an
"acute accent", ´, which can be combined with a regular "e".

.. aside::

   Alarmingly, I see there *is* U+1E2F (LATIN SMALL LETTER I WITH
   DIAERESIS AND ACUTE), ḯ.  Looks like I guessed lucky as there's a
   whole range of (predefined) mark combinations in the Latin
   Extended Additional block.

Clearly, *we* don't need to combine the "acute accent" with a
regular "e" as we already have a specific "e acute" but it does
allow us to combine it with any other character in some rare
combination not specifically covered elsewhere.  There must be rules
about how combining characters are allowed to, er, combine, to
prevent an "e acute diaeresis" (unless that *is* allowed in which
case I need to pick a better example).

These combinations are known as *grapheme clusters* and edge towards
but do not become "characters" *per se*.  It's a grey area and you
can find plenty of discussion online as to what is and isn't a
"character".  Here's an example rebuttal of the naïve interpretation
of "character" from :ref-author:`torstenvl` in
https://news.ycombinator.com/item?id=30384223:

    A 'character' meaning what?  The first code point?  The first
    non-combining code point?  The first non-combining code point
    along with all associated combining code points?  The first
    non-combining code point along with all associated combining
    code points modified to look like it would look in conjunction
    with surrounding non-combining code points along with their
    associated combining code points?  The first displayable
    component of a code point?

    What are the following?

    - the second character of sœur (o or the ligature?)

    - the second character of حبيبي (the canonical form ب or the
      contextual form ﺒ ?)

    - the third character of есть (Cyrillic t with or without soft
      sign, which is always a separate code point and always
      displayed to the right but changes the sound?)

    - the first character of 실례합니다 (Korean phoneme or syllabic
      grapheme?)

    - the first character of ﷺ or ﷽ ?

    The main issue isn't programming language support, it's
    ambiguity in the concept of "character" and conventions about
    how languages treat them.  Imagine how the letter i would be
    treated if Unicode were invented by a Turk.
    The fundamental issue here is that human communication is deeply
    nuanced in a way that does not lend itself well to systematic
    encoding or fast/naive algorithms on that encoding.  Even in the
    plain ASCII range it's impossible to know how to render a word
    like "ANSCHLUSS" in lower case (or how many 'characters' such a
    word would have) without knowledge of the language, country of
    origin, and time period in which the word was written.

And there's plenty of tales of how Unicode doesn't (or can't or
won't) do the right thing as we step away from digitized documents
(which, by definition, must be using a known character set) into the
human world.  Starting with this from :ref-author:`jake_morrison` in
https://news.ycombinator.com/item?id=32095502:

    In the 90s I worked on a project to digitize land registration
    in Taiwan.  In order to record deeds and property transfers, we
    needed to enter people's names and official registered addresses
    into the computer system.  The problem was that some people used
    non-traditional writing variants for their names, and some of
    their birthplaces were tiny places in China with weird names.
    Someone might write their name with a two-dot water radical
    instead of three-dot radical.  We would print it out in the
    normal font, and the people would lose their minds, saying that
    it was wrong.  Chinese people can be superstitious about the
    number of strokes in their name, so adding a stroke might make
    it unlucky, so they would not buy the property.

    The customer went to the agency responsible for managing the big
    character set, https://en.wikipedia.org/wiki/CNS_11643

    Despite having more characters than anything else on earth, it
    didn't have those variants.  The agency said they would not
    encode them, because they were not real characters, just
    printing differences.

    The solution was for the staff in the office to use a "font
    maker" program to create a custom font with these characters.
    Then they could print out the deeds using a Chinese variant of
    Adobe Acrobat, and everyone was happy.

.. rst-class:: center

---

Most texts fall back to calling code points characters in much the
same way we call all 128 ASCII characters, er, characters even
though most of the characters below 0x20 make no sense whatsoever as
characters that you or I might draw with a pen.  0x03 is ``ETX``,
*end of text*.  *Eh?*

``ETX`` is, of course, one of the `C0 control codes`_ used for
message transmission.  Few of these retain any meaning or function
and certainly never corresponded with a "character" as, in this
case, by definition, it marked the end of characters.

I nearly used 0x04, ``EOT``, *end of transmission*, as my example
before realising that the *caret notation* for it is ``^D`` which
might be confused with the usual keyboard generated ``EOF`` with
:kbd:`Ctrl-D`, *end of file*, which is clearly a very similar
concept.  They are completely unrelated, of course, as the terminal
*line driver* determines what keystrokes generate what terminal
events:

.. code-block:: console

   $ stty -a
   ...
   intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = M-^?;
   eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z;
   rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O;

Here, ``VEOF`` is :kbd:`Ctrl-D` -- see :manpage:`termios(3)` for
more than you wanted to know.

Unicode isn't concerned with *glyphs*, the pictorial representation
of characters, either.  Even within the same *font* I can see three
different glyphs for U+0061 (LATIN SMALL LETTER A) -- even within
the constraints of ReStructuredText:
.. csv-table:: The same code point in different fonts
   :widths: 25, 25
   :align: left

   a, regular
   *a*, italic
   **a**, bold

as I pick out different visual styles.  They are all U+0061, though.

.. sidebox::

   Interestingly, both :program:`emacs` and :program:`vi` are
   showing the underlined variant of the ordinal indicator in the
   source where, at least, you can easily see that the glyphs and
   therefore the code points are different.  YMMV.

The glyph in your font might also cause some entertainment for the
ages were you to mistake 20º with 20°.  Here, we have (*foolishly!*)
crossed U+00BA (MASCULINE ORDINAL INDICATOR) with U+00B0 (DEGREE
SIGN) -- and that's `not the only confusion possible`_ with a small
circle glyph.

This is not just an issue with squirrelly superscripted characters
but also where `Punycode`_ is used in domain names to use non-ASCII
characters with similar glyphs to ASCII characters to masquerade one
domain as another.  The example in the Wikipedia `Homoglyph`_ page
is the near identical expressions of a, U+0061 (LATIN SMALL LETTER
A), and а, U+0430 (CYRILLIC SMALL LETTER A).

.. aside::

   Interestingly, technology comes to our aid as the more web sites
   enforce :abbr:`MFA (Multi-Factor Authentication)` the likes of
   WebAuthn_ will not be so casually fooled.

Browsers, hopefully, have gotten better at alerting users to the
duplicitous ``bаnk.com``.

For more of the same *[pun intended]*, try the Unicode
`confusables`_ page.

Your choice of font introduces another issue.  There are around 150
thousand code points defined (of the 1,114,112 possible code points)
but the font you are using might only cover a small fraction of
those.  If a glyph for a code point is missing the result isn't
clearly defined.  The rendering system may substitute a glyph
indicating the code point's number in a box or you may get a blank
box.  The following is U+1FBF7 (SEGMENTED DIGIT SEVEN), 🯷.  (I see
a little box with a barely legible ``01F`` on one row and ``BF7`` on
another.)

There's a much better description of some of the differences between
characters and glyphs -- and, indeed, characters and code points --
in Unicode's `Character Encoding Model`_.

Unicode differs from ISO10646 in that although they maintain the
same set of code points the latter is effectively an extended
ISO-8859, meaning a simple list of characters except covering a lot
more character sets.  Unicode goes further and associates with each
code point any number of categories and properties and provides
rules on line-breaking, grapheme cluster boundaries, mark rendering
and all sorts of other things you didn't realise were an issue.

Don't let the simplistic nature of the Unicode_ home page concern
you, go straight to the `Unicode reports`_ and get stuck in.

.. sidebox::

   Do read the `history of UTF-8`_ with :ref-author:`Rob Pike`,
   :ref-author:`Ken Thompson` and a New Jersey diner placemat.

   Actually, don't.

Here, in :lname:`Idio`-land, we **do not** "support" Unicode.  We
use the `Unicode Character Database`_ (UCD) and some categories and
properties related to that and UTF-8 encoding.  We will use the
"simple" lowercase and uppercase properties from the UCD to help
with corresponding character mapping functions, for example.

However, :lname:`Idio` is not concerned with correct, legal,
security or any other Unicode consideration.  :lname:`Idio` simply
uses whatever is passed to it and actions whatever string
manipulation the user invokes.  If the result is non-conformant then
so be it.  *User error.*
.. rst-class:: center

---

We *might* have to consider matters such as `Collation`_ of strings
-- as we may not be using any system-provided collation library
(which you would hope would have considered it).  But we really
don't want to.  That document is 29 thousand words long!

For those, like me, who often wonder about the Rest of the World
there are non-obvious examples such as:

* the set [ "Ähnlich", "Äpfel", "Bären", "Käfer", "küssen" ] would
  have the strings beginning with Ä sorted to the end as Ä is one of
  three additional characters and comes after the regular Latin Z in
  the `Swedish alphabet`_.

* in Danish, Aalborg sorts after Zaragoza because aa in personal and
  geographical names is å (borrowed from the Swedish) and sorts
  after z (a decision made 7 years after re-introducing the letter
  in 1948).

  This also has the side-effect that the regular expression
  ``[a-z][a-z]`` does not match å in a Danish locale even though it
  can be expressed as a digraph (``aa``).

* German allows for a distinct collation order for telephone
  listings.

Of course it would be naïve to believe that there were not a soupçon
of names like Cæsar and Zoë roaming around in `English text`_, I
wasn't né stupid.  I have no idea what the collation algorithm for
them is, though.

.. aside::

   `Ethnologue`_ estimate there are some 4000 writing systems
   associated with the, roughly, 7000 living languages.  Wikipedia
   has a `list of writing systems`_ which tags most of Europe as
   Latin where we've already noted that Europe has some funny old
   language rules.

   Interestingly, Omniglot claim that there are only 38 `writing
   systems in current use`_ tagging the majority as Latin.

I'm willing to bet that almost any decision you make on the subject
of human languages and therefore their translation into digital form
will be wrong because of some societal convention you were unaware
was even a possibility.  I sense it will be a while before humanity
adopts a common language with a normal form and until then seeking
correctness will be a full-time job.

.. rst-class:: center

---

We can almost certainly ignore Unicode's view on `Regular
Expressions`_ as their view has been swayed by `Perl Regular
Expressions`_.  In particular, their view on what constitutes an
identifier isn't going to help us where we can include many
punctuation characters and, hopefully, regular Unicode identifier
names.

.. rst-class:: center

---

As noted previously, the code point range is 2\ :sup:`21` integers
but not all of those are valid "characters".  A range of values in
the first 65,536 is excluded as a side-effect of handling the
UTF-16_ encoding when Unicode finally recognised that 65,536 code
points was not, in fact, more than enough (there's a sketch of the
resulting validity check below).

That's a slightly unfair comment as 16 bits was more than enough for
the original Unicode premise of handling scripts and characters *in
modern use*.  That premise has changed as Unicode now handles any
number of ancient scripts as well as :abbr:`CJK (Chinese, Japanese
and Korean)` ideographs.

.. aside::

   By sheer coincidence I recently watched `The Secret History of
   Writing`_ in which Chinese junior school children recognised
   Oracle bone script as they learned to write ideographs.

   So whilst I was originally going to jest about the weird and
   wonderful that Unicode now encompasses I am in fact chided by the
   superior knowledge of a six-year old on the other side of the
   world.

   Who else is looking forward to `Oracle bone script`_?
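As for that excluded range, a minimal validity check falls out of
it.  This is a sketch, not :lname:`Idio`'s actual code, and the
function name is made up; it simply rejects the UTF-16 surrogate
range and anything beyond plane 16:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* hypothetical helper: is cp a Unicode scalar value, ie. a code
    * point that can legally appear in any encoding? */
   bool valid_scalar_value (uint32_t cp)
   {
       if (cp > 0x10FFFF) {
           return false;        /* beyond plane 16 */
       }
       if (cp >= 0xD800 && cp <= 0xDFFF) {
           return false;        /* the UTF-16 surrogate range */
       }
       return true;
   }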
It doesn't affect treatment of code points but it is worth
understanding that Unicode is (now) defined as 17 *planes* with each
plane being 16 bits, ie. potentially 65,536 code points.  Note that
there are several code points which are in some senses invalid in
any encoding including the last two code points in every plane.

Each plane is chunked up into varying sized blocks which are
allocated to various character set purposes.  Some very well known
character sets are historically fixed, for example ISO8859-1_, and
the block is full.  Other character sets have slots reserved within
the block for future refinements.

The first plane, plane 0, is called the `Basic Multilingual Plane`_
(BMP) and is pretty much full and covers most modern languages.
Planes 1, 2 and 3 are supplementary to the BMP and are filled to
various degrees.  Planes 4 through 13 are *unassigned*\ !  Plane 14
contains a small number of special-purpose "characters".

Planes 15 and 16 are designated for private use.  Unlike, say,
RFC1918_ Private Networks which are (usually) prevented from being
routed on the Internet, these Private Use planes are ripe for
cross-organisational conflict.  Anyone wanting to publish `Klingon
plqaD`_, Tolkien's runic `Cirth`_ or Medieval texts (see `MUFI`_) on
the same page needs to coordinate block usage.  See `Private Use
Areas`_ for some information on likely coordinating publishers.

Unicode isn't a clean room setup either.  They started by saying the
first block of 256 code points would be a straight copy of
ISO-8859-1 (Latin-1) which is handy for users of such but it doesn't
really follow that it was the best choice in the round.  There's all
sorts of compromises floating about such as the continued use of
`Japanese fullwidth forms`_ -- effectively duplicating ASCII.

.. rst-class:: center

---

The issue with handling Unicode is, um, everything.  We have an
issue about the *encoding* in use when we read Unicode in --
commonly, UTF-8, UTF-16 or UTF-32.  We have an issue about *storing*
code points and strings (ie. arrays of code points) internally.  And
we have to decide which encoding to use when we emit code points and
strings.

There's also a subtlety relating to the *meaning* of code points.
For example, most of us are familiar with ASCII (and therefore
Unicode) decimal digits, 0-9 (U+0030 through to U+0039).  Unicode
has *lots* of decimal digits, though, some 650 code points have the
``Nd`` category (meaning decimal number) alone.  In addition Unicode
supports other numeric code points such as Roman numerals.

In principle, then, we ought to support the use of any of those code
points as numeric inputs -- ie. there are 65 zeroes, 65 ones, 65
twos, etc. -- because we can use a Unicode attribute,
*Numeric_Value*, associated with the code point to get its decimal
value.

However, we then have to consider what it means to mix those Numeric
code points across groups in the same word:
1\ :raw-html:`٢`\ :raw-html:`۳`\ :raw-html:`߄`\ :raw-html:`५` is
12345 with a code point from each of the first five of those groups
(Latin-1, Arabic-Indic, Extended Arabic-Indic, NKO, Devanagari).
Does it make any sense to mix these character sets *in the same
expression*\ ?

It becomes increasingly complex to reason about these things and the
inter-relationship between character sets at which point we start
laying down the law.

Or would do.  At the moment the code invokes the likes of
:manpage:`isdigit(3)` and friends which, in turn, use
locale-specific lookup tables.
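A tiny :lname:`C` demonstration of why that falls short
(illustrative only, not :lname:`Idio` source): :manpage:`isdigit(3)`
is shown one byte at a time, so a multi-byte UTF-8 digit can never
look like a digit to it:

.. code-block:: c

   #include <ctype.h>
   #include <stdio.h>

   int main (void)
   {
       /* ASCII '7' is a digit as far as isdigit(3) is concerned */
       printf ("'7':  %d\n", isdigit ('7') != 0);                  /* 1 */

       /* U+0667 (ARABIC-INDIC DIGIT SEVEN) is the two byte UTF-8
        * sequence 0xD9 0xA7 -- neither byte is a digit by itself */
       printf ("0xD9: %d\n", isdigit ((unsigned char) 0xD9) != 0); /* 0 */
       printf ("0xA7: %d\n", isdigit ((unsigned char) 0xA7) != 0); /* 0 */

       return 0;
   }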
Of interest, the only valid input values to those functions are an
``unsigned char`` or ``EOF`` which rules out most CJK character sets
and, indeed, everything except Latin-1 in the above example.

.. rst-class:: center

---

In some ways I think we could be quite pleased that the language
allows you to create variables using Unicode code points (outside of
Latin-1) and assign values to them using non-ASCII digits.  Many
people might then bemoan the unreadability of the resultant program
forgetting, presumably, that, say, novels are published in foreign
languages without much issue.  English *appears* to be the *lingua
franca* of computing, for good or ill, and I can't see how being
flexible enough to support non-ASCII programming changes that.

More work (and someone less Western-European-centric and not a
monoglot) required to sort this out.

In the meanwhile, let's try to break Unicode down.

Code Points
-----------

Code points are (very) distinct from strings (in any encoding).  For
a code point we want to indicate which of the 2\ :sup:`21`-ish
integers we mean.  We've previously said that the reader will use
something quite close to the Unicode consortium's own stylised
version: ``#U+hhhh``.  ``hhhh`` represents a hexadecimal number,
though any number of ``h``\ s which return a suitable number are
good.

As discussed in :ref:`constants`, we can then stuff that code point
into a specific constant type, here, ``ccc`` is ``100`` giving us:

.. code-block:: c

   int uc = 0x0127;		/* LATIN SMALL LETTER H WITH STROKE */
   IDIO cp = (uc << 5) | 0x10010;

obviously, there's a couple of macros to help with that:

.. code-block:: c

   int uc1 = 0x0127;
   IDIO cp = IDIO_UNICODE (uc1);

   int uc2 = IDIO_UNICODE_VAL (cp);

Code points in and of themselves don't do very much.  We can do some
comparisons between them as the ``IDIO`` value is a "pointer" so you
can perform pointer comparison.  But not much else.

One thing to be aware of is that there are no computational
operations you can perform on a code point.  You can't "add one" and
hope/expect to get a viable code point.  Well, you *can* hope/expect
but good luck with that.

We have the "simple" upper/lower-case mappings from the :term:`UCD`
although you should remember that they constitute a directed graph:

.. graphviz::

   digraph lower {
       node [ shape=box ]
       "0130;LATIN CAPITAL LETTER I WITH DOT ABOVE" -> "0069;LATIN SMALL LETTER I" [ label=" lower " ];
       "0069;LATIN SMALL LETTER I" -> "0049;LATIN CAPITAL LETTER I" [ label=" upper " ];
       "0049;LATIN CAPITAL LETTER I" -> "0069;LATIN SMALL LETTER I" [ label=" lower " ];
   }

and

.. graphviz::

   digraph upper {
       node [ shape=box ]
       "01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J" -> "01C7;LATIN CAPITAL LETTER LJ" [ label=" upper " ];
       "01C7;LATIN CAPITAL LETTER LJ" -> "01C9;LATIN SMALL LETTER LJ" [ label=" lower " ];
       "01C9;LATIN SMALL LETTER LJ" -> "01C7;LATIN CAPITAL LETTER LJ" [ label=" upper " ];
   }

in both cases no mapping will return you to the starting code point.

Reading
^^^^^^^

As noted, there are three reader forms:

#. ``#\X`` for some UTF-8 encoding of ``X``

   ``#\ħ`` is U+0127 (LATIN SMALL LETTER H WITH STROKE)

#. ``#\{...}`` for some (very small set of) named characters

#. ``#U+hhhh`` for any code point

Writing
^^^^^^^

There are two forms of output depending on whether the character is
being printed (ie. a reader-ready representation is being generated)
or displayed (as part of a body of other output).
If printed, Unicode code points will be printed out in the standard
``#U+hhhh`` format with the exception of code points under 0x80
which are :manpage:`isgraph(3)` which will take the ``#\X`` form.

If displayed, the Unicode code point is encoded in UTF-8.

.. _strings:

:lname:`Idio` Strings
=====================

By which we mean arrays of code points... *-ish*.

History
-------

If we start with, as I did, :lname:`C` strings we can get a handle
on what I was doing.  In :lname:`C` a string is an array of single
byte characters terminated by an ASCII NUL (0x0).  That's pretty
reasonable and we can do business with that.

However, I was thinking that a lot of what we might be doing in a
shell is splitting lines of text up into words (or fields if we have
an :program:`awk` hat on).  We don't *really* want to be
re-allocating memory for all these words when we are unlikely to be
modifying them, we really just want a substring of the original.  To
make that happen we'd need:

* a pointer to the original string -- so we can maintain a reference
  so that the parent isn't garbage collected under our feet

* an offset into the parent (or just a pointer to the start within
  the parent string)

* a length

Which seems fine.  You can imagine there's a pathological case where
you might have a substring representing one byte of a 2GB monster
you read in from a file and you can't free the space up because of
your reference to the one byte substring.  I suspect that if that is
really a problem then maybe we can have the GC do some
*re-imagining* under the hood next time round.

I then thought that, partly for consistency and partly for any weird
cases where we didn't have a NUL-terminated string to start with,
real strings should have a length parameter as well.  If that meant
we stored an extra byte (with a NUL in it) then so be it (we've just
casually added 4 or 8 bytes for a ``size_t`` for the length so an
extra byte for a NUL is but a sliver of a nothingness) and it's
quite handy for printing the value out.

That all worked a peach.

Current
-------

Along comes Unicode (primarily driven by the need to port some
regular expression handling as I hadn't mustered the enthusiasm to
re-write strings beforehand).  The moral equivalent of the 8-bit
characters in :lname:`C` strings are Unicode's code points.

However, we *don't actually want* an array of code points because
that's a bit dumb -- even we can spot that.  Code points are stored
in an ``IDIO`` "pointer" and so consume 4 or 8 bytes *each* which is
way more than most strings require.  We store code points in an
``IDIO`` pointer because they can then be handled like any other
:lname:`Idio` type, not because it is efficient.

Looking at most of the text that *I* type, I struggle to use all the
ASCII characters, let alone any of the exotic delights from cultures
further afield.  I'm going to throw this out there that most of the
text that *you* type, dear reader, fits in the Unicode Basic
Multilingual Plane and is therefore encodable in 2 bytes.  I
apologise to my Chinese, Japanese and Korean friends who throw 4
byte code points around with abandon.  At least you're covered and
not forgotten.

That blind use of an array of code points will chew up space
viciously -- even the four byte code points with 64-bit ``IDIO``
"pointers".  What we should be doing, as we read the string in, is
to perform some analysis as to which is the largest (widest?) code
point in the string and then construct an array where *all* the
elements are that wide.
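That analysis is a single pass over the decoded code points.  A
minimal sketch -- the function name is made up and this is not the
actual :lname:`Idio` code:

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* hypothetical helper: the narrowest element width (1, 2 or 4
    * bytes) that can hold every code point in the string */
   size_t string_width (const uint32_t *cps, size_t len)
   {
       size_t width = 1;

       for (size_t i = 0; i < len; i++) {
           if (cps[i] >= 0x10000) {
               return 4;        /* beyond the BMP: it can't get wider */
           } else if (cps[i] >= 0x100) {
               width = 2;       /* outside Latin-1 but within the BMP */
           }
       }

       return width;
   }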
We already support the notion of a length so there's no need for
trailing NULs -- which don't make any sense in the context of 4 byte
wide "characters" anyway.

.. code-block:: idio

   str = "hello"

should only require five bytes of storage as it is only using ASCII
characters.

.. code-block:: idio

   str = "ħello"

where the first character is U+0127 (LATIN SMALL LETTER H WITH
STROKE), will now be ten bytes long as the first code point requires
two bytes and therefore so will all the rest, even though they are
the same ASCII characters as before.  *Dems da breaks.*

By and large, though, I sense that most strings are going to be
internally very consistent and be:

* ASCII/Latin-1 and therefore 1 byte code points *only*

* mostly BMP (2 byte) and some 1 byte code points

* using 4 byte code points regularly

If we join two strings together we can upgrade/widen the one or the
other as required.

.. aside::

   You *monster*\ !  Why are you trying to modify a string *at
   all*\ ?

The only real problem is that anyone wanting to *modify* an element
in a string array might get caught out by trying to stuff a 4 byte
code point into a one byte string.

Feeling rather pleased with my thinking I then discovered that
:lname:`Python` had already encapsulated this idea in PEP393_ and I
can't believe others haven't done something similar.  I felt good
for a bit, anyway.

So that's the deal.  Strings are arrays of elements with widths of
1, 2 or 4 bytes.  The string has a length.  We can have substrings
of it.

I've no particular fix for the string modification issue.  In
principle it requires reworking the string under the feet of the
caller but we now have to ensure that all the associated substrings
are kept in sync.  A rotten workaround would be to prefix any string
with a 4 byte letter then only use indexes 1 and beyond.  A better
workaround would be to allow the forcible creation of, say, 4 byte
strings rather than using analysis of the constituent code points.

Implementation
--------------

We can encode the string array's element width in type-specific
flags and then we need an array length and a pointer to the
allocated memory for it.

.. code-block:: c
   :caption: gc.h

   #define IDIO_STRING_FLAG_NONE		0
   #define IDIO_STRING_FLAG_1BYTE		(1<<0)
   #define IDIO_STRING_FLAG_2BYTE		(1<<1)
   #define IDIO_STRING_FLAG_4BYTE		(1<<2)

   struct idio_string_s {
       size_t len;			/* code points */
       char *s;
   };

   #define IDIO_STRING_LEN(S)	((S)->u.string.len)
   #define IDIO_STRING_S(S)	((S)->u.string.s)
   #define IDIO_STRING_FLAGS(S)	((S)->tflags)

``s`` is a ``char *`` (and called ``s``) for the obvious historic
reasons although ``s`` is never used to access the elements of the
array directly.  We will figure out the element width from the flags
and then use a ``uint8_t *s8``, ``uint16_t *s16`` or ``uint32_t
*s32`` cast from ``s`` as appropriate.  The array elements are then
casually accessed with ``s8[i]``, ``s16[i]`` or ``s32[i]``.  Easy
(there's a sketch of this below).

There's something very similar for a substring which requires:

* the reference back to the parent string

* ``s`` becomes a pointer directly into the parent string for the
  start of the substring -- but is otherwise cast to something else
  in the same manner as above

* ``len`` is the substring's length

A substring can figure out the width of elements from its parent.

A substring of a substring is flattened to just being another
substring of the original parent.
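Pulling that casual access pattern together, here is a sketch of an
accessor dispatching on those flags.  The function name is made up,
it assumes the :file:`gc.h` definitions above and, unlike the real
code, it ignores substrings:

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* hypothetical accessor: recover the code point at index i by
    * casting s to the width recorded in the string's flags */
   uint32_t string_code_point_at (IDIO s, size_t i)
   {
       char *p = IDIO_STRING_S (s);

       switch (IDIO_STRING_FLAGS (s)) {
       case IDIO_STRING_FLAG_1BYTE:
           return ((uint8_t *) p)[i];
       case IDIO_STRING_FLAG_2BYTE:
           return ((uint16_t *) p)[i];
       default:
           return ((uint32_t *) p)[i];
       }
   }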
.. rst-class:: center

\*

Amongst other possible reworks, I notice many other implementations
allocate the equivalent of the ``IDIO`` object and the memory
required for the string storage in one block.  It would save the two
pointers used by :manpage:`malloc(3)` for its accounting and the
extra calls to :manpage:`malloc(3)`/:manpage:`free(3)`.

.. rst-class:: center

\*

I did have an idea for a "short" string.  The nominal ``IDIO``
``union`` is three pointers worth -- to accommodate a pair, the most
common type -- could we re-work that as a container for short
strings?  Three pointers worth is 12 or 24 bytes.  If we used an
``unsigned char`` for the length then we could handle strings up to
11 or 23 bytes.

I think you'd need to do some "field analysis" to see if such short
strings occur often enough to make it worth the effort.

Encoding
--------

One thing hidden from the above discourse is the thorny matter of
*encoding*.  All mechanisms to move data between entities need to
agree on a protocol to encode the data over the transport medium.

Unicode used to use UCS-2 and UCS-4 which have been deprecated in
favour of UTF-16 and UTF-32 which are 2 and 4 byte encodings.  I get
the impression they are popular in the Windows world and they
*might* appear as the "wide character" interfaces in Unix-land, see
:manpage:`fgetwc(3)`, for example.

However, almost everything I see is geared up for UTF-8_ so we'll
not buck any trends.  Therefore, :lname:`Idio` expects its inputs to
be encoded in UTF-8 and it will generate UTF-8 on output.

To read in UTF-8 we use :ref-author:`Bjoern Hoehrmann`'s
:ref-title:`Flexible and Economical UTF-8 Decoder` :cite:`BH-UTF-8`,
a DFA-based decoder.

Categories and Properties
-------------------------

Whilst looking for a :ref:`JSON5 extension` library I stumbled
across Simon Schoenenberger's `Unicode character lookup table
work`_ which I have re-imagined as :ref:`USI`.

Reading
-------

The input form for a string is quite straightforward: ``"..."``,
that is a U+0022 (QUOTATION MARK) delimited value.

The reader is, in one sense, quite naive and is strictly looking for
a non-escaped closing ``"`` to terminate the string, see
``idio_read_string()`` in :file:`src/read.c`.  Subsequently the
collected bytes are assumed to be part of a valid UTF-8 sequence.
If the byte sequence is invalid UTF-8 you will get the (standard)
U+FFFD (REPLACEMENT CHARACTER) and the decoding will resume *with
the next byte*.  This may result in several replacement characters
being generated.

There are a couple of notes:

#. ``\``, U+005C (REVERSE SOLIDUS -- backslash) is the escape
   character.  The obvious character to escape is ``"`` itself
   allowing you to embed a double-quote symbol in a double-quoted
   string: ``"hello\"world"``.

   In the spirit of `C escape sequences`_ :lname:`Idio` also allows:

   .. csv-table:: Supported escape sequences in strings
      :header: sequence, (hex) ASCII, description
      :align: left
      :widths: auto

      ``\a``, 07, alert / bell
      ``\b``, 08, backspace
      ``\e``, 1B, escape character
      ``\f``, 0C, form feed
      ``\n``, 0A, newline
      ``\r``, 0D, carriage return
      ``\t``, 09, horizontal tab
      ``\v``, 0B, vertical tab
      ``\\``, 5C, backslash
      ``\x...``, , up to 2 hex digits representing any byte
      ``\u...``, , up to 4 hex digits representing a Unicode code point
      ``\U...``, , up to 8 hex digits representing a Unicode code point

   Any other escaped character results in that character.
   For ``\x``, ``\u`` and ``\U`` the code will stop consuming code
   points if it sees one of the usual delimiters or a code point
   that is not a hex digit: ``"\Ua9 2021"`` silently stops at the
   SPACE character giving ``"© 2021"`` and, correspondingly,
   ``"\u00a92021"`` gives ``"©2021"`` as a maximum of 4 hex digits
   are consumed by ``\u``.

   ``\x`` is unrestricted (other than between 0x0 and 0xff) and
   ``\u`` and ``\U`` will have the hex digits converted into UTF-8.
   Adding ``\x`` bytes into a string is an exercise in due
   diligence.

#. :lname:`Idio` allows multi-line strings:

   .. code-block:: idio

      str1 := "Hello
      World"
      str2 := "Hello\nWorld"

   The string constructors for ``str1`` and ``str2`` are equivalent.

Pathnames
^^^^^^^^^

.. aside::

   I'm not sure there is a proper nomenclature for pathnames and
   filenames.  I suspect most people will use them interchangeably
   most of the time (I do) though there is a suggestion of pathnames
   being filenames joined with ``/``.

The reason the ``\x`` escape exists is to allow more convenient
creation of "awkward" pathnames.

As noted previously, \*nix pathnames do not have any *encoding*
associated with them.  Now, that's not to say you can't *use* an
encoding and, indeed, we are probably *encouraged* to use UTF-8 as
an encoding for pathnames.  However, the problem with the filesystem
having no encoding is that both you and I have to agree on what the
encoding we have used is.  You say potato, I say `solanum
tuberosum`_.

In general :lname:`Idio` uses the (UTF-8) encoded strings in the
source code as pathnames and, as the :lname:`Idio` string to
:lname:`C` string converter uses a UTF-8 generator, by and large,
every pathname you create will be UTF-8 encoded.

However, any pathname already in the filesystem is of an unknown
encoding of which the only thing we know is that it won't contain an
ASCII NUL and it won't have U+002F (SOLIDUS -- forward slash) in a
directory entry.  It's a :lname:`C` string.

.. aside::

   Which is the vast majority of humans.  Ever.  *Who knew?*

Of course we "get away with" an implicit UTF-8 encoding because the
vast majority of \*nix filenames are regular ASCII ones.  Only those
users outside of North America and islands off the coast of North
Western Europe have suffered.

So what we really need is to handle such pathnames correctly which
means *not* interpreting them.  Technically, then, the :lname:`Idio`
string ``"hello"`` and the filename :file:`hello` are *different*.
Which is going to be a bit annoying from time to time.

.. note::

   Of interest, it is possible to manipulate strings such that you
   have a 1-byte width encoding for ``"hello"`` and a 2 or 4-byte
   encoding for ``"hello"``.

   However, those strings will be considered ``equal?`` because they
   have the same length and the same code point at each of their
   indices.

As it so happens, for general file access and creation, the
:lname:`Idio` strings, converted into UTF-8 encoded :lname:`C`
strings, will do the right thing.  However, if you consume filenames
from the filesystem they will be treated as pathnames and will not
be ``equal?`` to the nominally equivalent :lname:`Idio` string.

So we need to be able to create \*nix pathnames ourselves for these
sorts of comparisons and we have a formatted string style to do
that: ``%P"..."`` (or ``%P(...)`` or ``%P{...}`` or ``%P[...]``)
where the ``...`` is a regular string as above.  That's where the
``\x`` escape for strings comes into its own.
If we know that a filename starts with ISO8859-1_'s 0xA9 (the same
"character" as ©, U+00A9 (COPYRIGHT SIGN)), as in a literal byte,
0xA9, and not the UTF-8 sequence 0xC2 0xA9, then we can create such
a string: ``%P"\xa9..."``.

The ``%P`` formatted string is fine if we hand-craft our pathname
for comparison but if we already have an :lname:`Idio` string in our
hands we need a converter, ``string->pathname``, which will return a
pathname from the UTF-8 encoding of your string.  Which sounds
slightly pointless but gets us round the problem of matching against
pathnames in the filesystem which have no encoding.

.. code-block:: idio-console

   Idio> string->pathname "hello"
   %P"hello"

Notice the leading ``%P`` indicating it is a pathname.

Pathname Expansion
""""""""""""""""""

As if pathnames as an unencoded string aren't complicated enough we
want *wildcards*\ !

:lname:`Bash` has a reasonably pragmatic approach to wildcards.
From :manpage:`bash(1)`, **Pathname Expansion**:

    After word splitting, unless the **-f** option has been set,
    **bash** scans each word for the characters **\***, **?**, and
    **[**.  If one of these characters appears, and is not quoted,
    then the word is regarded as a pattern, and replaced with an
    alphabetically sorted list of filenames matching the pattern.

Unfortunately, our free hand with the code points allowed in symbols
means that ``*`` and ``?`` are not just possible but entirely
probable and, certainly, to be expected.  That makes wildcards a bit
tricky.  Hmm.

I've noted before that Murex_ constrains globbing to
:samp:`@\\{g {*.c}}` and extends the mechanism to regexps,
:samp:`@\\{rx {\\.c$}}`, and file modes, :samp:`@\\{f {+d}}`.
Although I think we want a mechanism to do globbing which is
distinct from any sorting and filtering you might perform on the
list.

We've mentioned :ref:`pathname templates` before although they are a
little bit magical.

.. sidebox::

   Obviously, ``%P...`` and ``#P...`` are ripe for confusion.

   I try to think of the ``%`` suggesting a :manpage:`printf(3)`
   *format* whereas the ``#`` is suggesting the construction of a
   weird thing.

I want ``#P{...}`` (or ``#P(...)`` or ``#P[...]`` or ``#P"..."``) to
create a pathname template but not *exercise* it yet.  This is akin
to creating a regular expression in advance, the :ref:`regcomp`
before the :ref:`regexec` of POSIX regexes.  Only when it is *used*
do we :manpage:`glob(3)` the expression.

This does lead to a little confusion:

.. code-block:: idio-console

   Idio> p := #P"x.*"
   #
   Idio> p
   (%P"x.idio")
   Idio> printf "%s\n" p
   (%P"x.idio")
   #

which feels OK but

.. code-block:: idio-console

   Idio> printf "%s\n" #P"x.*"
   #
   #
   Idio> ls #P"x.*"
   x.idio
   #t

Feels wrong.  Why does the expansion occur for :program:`ls` and not
for ``printf``?

Technically, for both, the arguments are constructed as you would
expect giving both of them the *struct instance* of a ``~path``
(mnemonically, a dynamic path).  However, because we need to convert
all of the arguments to strings for :manpage:`execve(2)`, the *use*
of a ``~path`` struct instance is expanded into a list of pathnames.

It's not great.

In the meanwhile, we can do the right sorts of things creating and
matching files with non-UTF-8 characters in them:

.. code-block:: idio-console

   Idio> close-handle (open-output-file %P"\xa9 2021")
   #unspec
   Idio> p := #P"\xa9*"
   #
   Idio> p
   (%P"\xA9 2021")
   Idio> ls -1f
   ...
   ''$'\251'' 2021'
   ...

.. aside::

   Rummaging around in the :program:`info` pages I see this is the
   default "shell-escape" quoting style using the POSIX proposed
   ``$''`` syntax.
   *Times have changed since SunOS 4!*

I see, here, that :program:`ls` is using a :lname:`Bash`-style
``$'...'`` quoting such that ``''$'\251'' 2021'`` is the
concatenation of ``''``, ``$'\251'`` and ``' 2021'`` with ``\251``
being the octal for 0xA9.

For :lname:`Idio`, specifically when printing strings in a "use
escapes" context, here, at the REPL, a pathname string will have
non-:manpage:`isprint(3)` characters converted to a ``\x`` form,
hence, ``%P"\xA9 2021"``.  Similarly, a pathname with a newline in
it would be, say, ``%P"hello\nworld"``.

When *not* printing strings in a "use escapes" context, notably when
preparing arguments for :manpage:`execve(2)`, then we just get the
raw \*nix pathname characters meaning something like :program:`ls`
won't get in a tizzy:

.. code-block:: idio-console

   Idio> close-handle (open-output-file %P"\xa9 2021")
   #
   Idio> ls %P"\xa9 2021"
   ''$'\251'' 2021'
   #t

compare with the variations for when 0xA9 is an invalid UTF-8
encoding in a regular string and when the Unicode code point U+00A9
is used:

.. code-block:: idio-console

   Idio> ls "\xa9 2021"
   /usr/bin/ls: cannot access ''$'\357\277\275'' 2021': No such file or directory
   #f
   job 327830: (ls "� 2021"): completed: (exit 2)
   Idio> ls "\ua9 2021"
   /usr/bin/ls: cannot access ''$'\302\251'' 2021': No such file or directory
   #f
   job 327834: (ls "© 2021"): completed: (exit 2)

The first is completely garbled and you can see the copyright sign
in the notification about the command failure in the second with
:program:`ls` complaining that it can't access something beginning
with (*\*quickly translates\**) 0xC2 0xA9, the UTF-8 encoding of
U+00A9.

.. rst-class:: center

\*

That said, regex and, by extension, pattern matching are not
affected by this distinction as they aren't concerned about the
*equality* of the strings and/or pathnames so much as whether they
conform to a (regular expression) pattern:

.. code-block:: idio

   pt := #P{\xA9*}

   map (function (p) {
     printf "%s: " p
     (pattern-case p
       ("*2021" {
         printf "2021!\n"
       })
       (else {
         printf "nope!\n"
       }))
   }) pt

.. _`interpolated strings`:

Interpolated Strings
^^^^^^^^^^^^^^^^^^^^

Sometimes we want to embed references to variables, usually, or
expressions in a string and have something figure out the current
value of the variable or result of the expression and replace the
reference with that result, something along the lines of:

.. code-block:: idio

   name := "Bob"
   "Hi, ${name}!"

I want double-quoted strings, the usual ``"..."``, to remain as
fixed entities, much like single-quoted strings in, say,
:lname:`Bash` are.  That means we need another format, another ``#``
format!  Let's go for ``#S{ ... }`` and everything between the
matching braces is our putative string.

The ``${name}``\ -style format seems good to me, the only issue
being whether we want to change ``$`` for another sigil.  In the
usual way, we'd pass that in between the ``S`` and ``{``:

.. code-block:: idio

   #S{Hi, ${name}!}
   #S^{Hi, ^{name}!}

The second interpolation character is the escape character,
defaulting to ``\``, as usual.

You can use the usual matching bracketing characters, ``{}``, ``[]``
and ``()`` to delimit the main block:

.. code-block:: idio

   #S[${name} is ${string-length name} letters long.]

but only braces, ``{}``, for the references.

There is a subtlety here as the results of the expressions are not
necessarily themselves strings.  The ``${string-length name}``
expression, for example, will result in an integer and the implied
``append-string`` constructing the result of the interpolated string
will get upset.
So, in practice, all of the elements of the putative string, the
non-expression strings and the results of the expressions, are
``map``\ ed against ``->string`` which leaves strings alone and runs
the "display" variant of the printer for the results of the
expressions.

.. aside::

   It might be what you want!

``->string`` does not perform any splicing so if your expression
returns a list then you'll get a list in your string.

Writing
-------

:lname:`Idio` strings will be UTF-8 encoded on output, see
``idio_utf8_string()`` in :file:`src/unicode.c` for the details.

There's a couple of qualifications to that:

#. We can ask for the reader's :lname:`C` escape sequences to be
   reproduced in their ``\X`` format, eg. ``\a`` for alert / bell.

#. We can ask for the printed string to be quoted with double
   quotes.

This latter option is a consequence of how we visualise printed
entities.  The REPL will *print* values in a reader-ready format, so
including leading and trailing ``"``\ s.

.. code-block:: idio-console

   Idio> str := "Hello\nWorld"
   "Hello\nWorld"
   Idio> str := "Hello
   World"
   "Hello\nWorld"

By and large, though, most things will *display* a string value as
part of a larger output:

.. code-block:: idio-console

   Idio> printf "'%s'\n" str
   'Hello
   World'

(Actually, a trailing ``#`` will also be printed which is the value
that ``printf`` returned.)

Operations
==========

.. _`Unicode code point operations`:

Characters
----------

.. idio:function:: unicode? value

   is `value` a Unicode code point

.. idio:function:: unicode->plane cp

   return the Unicode plane of code point `cp`

   The result is a fixnum.

.. idio:function:: unicode->plane-cp cp

   return the lower 16 bits of the code point `cp`

.. idio:function:: unicode->integer cp

   convert code point `cp` to a fixnum

.. idio:function:: unicode=? cp1 cp2 [...]

   compare code points for equality

   A minimum of two code points are required.

Strings
-------

.. idio:function:: string? value

   is `value` a string (or substring)

.. idio:function:: make-string size [fill]

   create a string of length `size` filled with `fill` characters or
   U+0020 (SPACE)

.. _`string->list`:

.. idio:function:: string->list string

   return a list of the Unicode code points in `string`

   See also :ref:`list->string <list->string>`.

.. _`string->symbol`:

.. idio:function:: string->symbol string

   return a symbol constructed from the UTF-8 Unicode code points in
   `string`

   See also :ref:`symbol->string <symbol->string>`.

.. _`append-string`:

.. idio:function:: append-string [string ...]

   return a string constructed by appending the string arguments
   together

   If no strings are supplied the result is a zero-length string,
   cf. ``""``.

.. idio:function:: concatenate-string list

   return a string constructed by appending the strings in `list`
   together

   If no strings are supplied the result is a zero-length string,
   cf. ``""``.

.. idio:function:: copy-string string

   return a copy of string `string`

.. idio:function:: string-length string

   return the length of string `string`

.. idio:function:: string-ref string index

   return the Unicode code point at index `index` of string `string`

   Indexes start at zero.

.. idio:function:: string-set! string index cp

   set the Unicode code point at index `index` of string `string` to
   be the Unicode code point `cp`

   Indexes start at zero.

   If the number of bytes required to store `cp` is greater than the
   per-code point width of `string` a ``^string-error`` condition
   will be raised.

.. idio:function:: string-fill! string fill
   set all indexes of string `string` to be the Unicode code point
   `fill`

   If the number of bytes required to store `fill` is greater than
   the per-code point width of `string` a ``^string-error``
   condition will be raised.

.. idio:function:: substring string pos-first pos-next

   return a substring of string `string` starting at index
   `pos-first` and ending *before* `pos-next`

   Indexes start at zero.

   If `pos-first` and `pos-next` are inconsistent a
   ``^string-error`` condition will be raised.

.. idio:function:: string<=? s1 s2 [...]

.. idio:function:: string=? s1 s2 [...]

.. idio:function:: string>? s1 s2 [...]

   .. warning::

      Historic code for ASCII/Latin-1 :lname:`Scheme` strings
      badgered into working at short notice.  These need to be
      replaced with something more Unicode-aware.

   Perform :manpage:`strncmp(3)` comparisons of the UTF-8
   representations of the string arguments

.. idio:function:: string-ci<=? s1 s2 [...]

.. idio:function:: string-ci=? s1 s2 [...]

.. idio:function:: string-ci>? s1 s2 [...]

   .. warning::

      Historic code for ASCII/Latin-1 :lname:`Scheme` strings
      badgered into working at short notice.  These need to be
      replaced with something more Unicode-aware.

   Perform :manpage:`strncasecmp(3)` comparisons of the UTF-8
   representations of the string arguments

.. _`split-string`:

.. idio:function:: split-string string [delim]

   split string `string` into a list of strings delimited by the
   code points in the string `delim` which itself defaults to
   :var:`IFS`

   ``split-string`` acts like the shell's or :program:`awk`'s
   word-splitting by ``IFS`` in that multiple adjacent instances of
   delimiter characters only provoke one "split".

.. idio:function:: split-string-exactly string [delim]

   split string `string` into a list of strings delimited by the
   code points in the string `delim` which itself defaults to
   :var:`IFS`

   ``split-string-exactly`` is meant to act more like a regular
   expression matching system.  It was originally required to split
   the contents of the Unicode Character Database file
   :file:`utils/Unicode/UnicodeData.txt` -- which has multiple
   ``;``-separated fields, often with no value in a field -- to help
   generate the code base for regular expression handling.

.. _`fields`:

.. idio:function:: fields string

   A variation on :ref:`split-string` with a view to more
   :program:`awk`-like line splitting functionality, ``fields``
   splits string `string` into an *array* of strings delimited by
   the code points in :var:`IFS` with the first element of the array
   being the original string.

   Indexing of the array gives you :program:`awk`-like field numbers
   as in :program:`awk`'s ``$0``, ``$1``, ..., here, array index
   ``0`` for the original string, index ``1`` for the first
   :var:`IFS` delimited field, ``2`` for the second :var:`IFS`
   delimited field etc..

   As a function taking a single argument it can be used with the
   :ref:`value-index` operator, ``.``:

   .. code-block:: idio-console

      Idio> fs := "hello world".fields
      #[ "hello world" "hello" "world" ]
      Idio> fs.0
      "hello world"
      Idio> fs.1
      "hello"

.. idio:function:: join-string delim list

   construct a string from the strings in `list` with the string
   `delim` placed in between each pair of strings

   `list` is a, uh, list, here, unlike, say, :ref:`append-string`
   as it follows the :lname:`Scheme` form (albeit with arguments
   shifted about) which takes another parameter indicating the style
   in which the delimiter should be applied, such as: before or
   after every argument, infix (the default) and a strict infix for
   complaining about no arguments.
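   For instance, a hypothetical REPL exchange (the exact echo is
   illustrative):

   .. code-block:: idio-console

      Idio> join-string ":" (list "a" "b" "c")
      "a:b:c"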
.. _`strip-string`:

.. idio:function:: strip-string str discard [ends]

   return a string where the characters in `discard` have been
   removed from the `ends` of `str`

   `ends` can be one of ``'left``, ``'right``, ``'both`` or
   ``'none`` and defaults to ``'right``.

.. include:: ../../commit.rst