Characters and Strings

We’re not in Kansas any more, Toto!

There’s no beating about the bush, we need to handle proper multi-character set strings from the get-go. We don’t want a Python 2 vs 3 debacle.

Not for strings, anyway.

Unicode

I’m not a multi-lingual expert, indeed barely literate in one language, so some of the nuance of multi-character set handling may be lost on me. We’re choosing Unicode (arguably ISO10646) because of its name familiarity, even if the actual implementation is less familiar to everyone.

The broad thrust of Unicode is to allocate a code point, an integer, to the most common characters and support several combining code points to create the rest. From a Western Europe viewpoint, we might have an “e acute” character, é, but also an “acute accent”, ´, which can be combined with a regular “e”.

Clearly, we don’t need to combine the “acute accent” with a regular “e” as we already have a specific “e acute” but it does allow us to combine it with any other character in some rare combination not specifically covered elsewhere. There must be rules about how combining characters are allowed to, er, combine, to prevent an “e acute diaeresis” (unless that is allowed in which case I need to pick a better example).

These combinations are known as grapheme clusters and edge towards but do not become “characters” per se. It’s a grey area and you can find plenty of discussion online as to what is and isn’t a “character”.

Here’s an example rebuttal of the naïve interpretation of “character” from torstenvl in https://news.ycombinator.com/item?id=30384223:

A ‘character’ meaning what? The first code point? The first non-combining code point? The first non-combining code point along with all associated combining code points? The first non-combining code point along with all associated combining code points modified to look like it would look in conjunction with surrounding non-combining code points along with their associated combining code points? The first displayable component of a code point? What are the following?

  • the second character of sœur (o or the ligature?)

  • the second character of حبيبي (the canonical form ب or the contextual form ﺒ ?)

  • the third character of есть (Cyrillic t with or without soft sign, which is always a separate code point and always displayed to the right but changes the sound?)

  • the first character of 실례합니다 (Korean phoneme or syllabic grapheme?)

  • the first character of ﷺ or ﷽ ?

The main issue isn’t programming language support, it’s ambiguity in the concept of “character” and conventions about how languages treat them. Imagine how the letter i would be treated if Unicode were invented by a Turk. The fundamental issue here is that human communication is deeply nuanced in a way that does not lend itself well to systematic encoding or fast/naive algorithms on that encoding. Even in the plain ASCII range it’s impossible to know how to render a word like “ANSCHLUSS” in lower case (or how many ‘characters’ such a word would have) without knowledge of the language, country of origin, and time period in which the word was written.

And there’s plenty of tales of how Unicode doesn’t (or can’t or won’t) do the right thing as we step away from digitized documents (which, by definition must be using a known character set) into the human world. Starting with this from jake_morrison in https://news.ycombinator.com/item?id=32095502:

In the 90s I worked on a project to digitize land registration in Taiwan.

In order to record deeds and property transfers, we needed to enter people’s names and official registered addresses into the computer system. The problem was that some people used non-traditional writing variants for their names, and some of their birthplaces were tiny places in China with weird names.

Someone might write their name with a two-dot water radical instead of three-dot radical. We would print it out in the normal font, and the people would lose their minds, saying that it was wrong. Chinese people can be superstitious about the number of strokes in their name, so adding a stroke might make it unlucky, so they would not buy the property.

The customer went to the agency responsible for managing the big character set, https://en.wikipedia.org/wiki/CNS_11643. Despite having more characters than anything else on earth, it didn’t have those variants. The agency said they would not encode them, because they were not real characters, just printing differences.

The solution was for the staff in the office to use a “font maker” program to create a custom font with these characters. Then they could print out the deeds using a Chinese variant of Adobe Acrobat, and everyone was happy.

Most texts fall back to calling code points characters in much the same way we call all 128 ASCII characters, er, characters even though most of the characters below 0x20 make no sense whatsoever as characters that you or I might draw with a pen.

0x03 is ETX, end of text. Eh? ETX is, of course, one of the C0 control codes used for message transmission. Few of these retain any meaning or function and certainly never corresponded with a “character” as, in this case, by definition, it marked the end of characters.

I nearly used 0x04, EOT, end of transmission, as my example before realising that its caret notation, ^D, might be confused with the usual keyboard-generated EOF, Ctrl-D, end of file, which is clearly a very similar concept.

They are completely unrelated, of course, as the terminal line driver determines what keystrokes generate what terminal events:

$ stty -a
...
intr = ^C; quit = ^\; erase = ^?; kill = ^U;
eof = ^D; eol = M-^?; eol2 = <undef>;
swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z;
rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O;

Here, VEOF is Ctrl-D – see termios(3) for more than you wanted to know.

Unicode isn’t concerned with glyphs, the pictorial representation of characters, either. Even within the same font I can see three different glyphs for U+0061 (LATIN SMALL LETTER A) – even within the constraints of ReStructuredText:

The same code point in different fonts

  a    regular
  a    italic
  a    bold

as I pick out different visual styles. They are all U+0061, though.

The glyph in your font might also cause some entertainment for the ages were you to mistake 20º with 20°. Here, we have (foolishly!) crossed U+00BA (MASCULINE ORDINAL INDICATOR) with U+00B0 (DEGREE SIGN) – and that’s not the only confusion possible with a small circle glyph.

This is not just an issue with squirrelly superscripted characters but also where Punycode is used in domain names to use non-ASCII characters with similar glyphs to ASCII characters to masquerade one domain as another. The example in the Wikipedia Homoglyph page is the near identical expressions of a, U+0061 (LATIN SMALL LETTER A), and а, U+0430 (CYRILLIC SMALL LETTER A).

Browsers, hopefully, have gotten better at alerting users to the duplicitous bаnk.com. For more of the same [pun intended], try the Unicode confusables page.

Your choice of font introduces another issue. There are around 150 thousand code points defined (of the 1,114,112 possible code points) but the font you are using might only cover a small fraction of those. If a glyph for a code point is missing the result isn’t clearly defined. The rendering system may substitute a glyph indicating the code point’s number in a box or you may get a blank box. The following is U+01FBF7 (SEGMENTED DIGIT SEVEN), 🯷. (I see a little box with a barely legible 01F on one row and BF7 on another.)

There’s a much better description of some of the differences between characters and glyphs – and, indeed, characters and code points – in Unicode’s Character Encoding Model.

Unicode differs from ISO10646 in that, although they maintain the same set of code points, the latter is effectively an extended ISO-8859: a simple list of characters, just covering a lot more character sets. Unicode goes further and associates with each code point any number of categories and properties and provides rules on line-breaking, grapheme cluster boundaries, mark rendering and all sorts of other things you didn’t realise were an issue.

Don’t let the simplistic nature of the Unicode home page concern you, go straight to the Unicode reports and get stuck in.

Actually, don’t. Here, in Idio-land, we do not “support” Unicode. We use the Unicode Character Database (UCD) and some categories and properties related to that and UTF-8 encoding. We will use the “simple” lowercase and uppercase properties from the UCD to help with corresponding character mapping functions, for example.

However, Idio is not concerned with correctness, legality, security or any other Unicode consideration. Idio simply uses whatever is passed to it and actions whatever string manipulation the user invokes. If the result is non-conformant then so be it. User error.

We might have to consider matters such as Collation of strings – as we may not be using any system-provided collation library (which you would hope would have considered it). But we really don’t want to. That document is 29 thousand words long!

For those, like me, who often wonder about the Rest of the World there are non-obvious examples such as:

  • the set [ “Ähnlich”, “Äpfel”, “Bären”, “Käfer”, “küssen” ] would have the strings beginning with Ä sorted to the end as Ä is one of three additional characters and comes after the regular Latin Z in the Swedish alphabet.

  • in Danish, Aalborg sorts after Zaragoza because aa in personal and geographical names is å (borrowed from the Swedish) and sorts after z (a decision made 7 years after re-introducing the letter in 1948).

    This also has the side-effect that the regular expression [a-z][a-z] does not match å in a Danish locale even though it can be expressed as a digraph (aa).

  • German allows for a distinct collation order for telephone listings.

Of course it would be naïve to believe that there were not a soupçon of names like Cæsar and Zoë roaming around in English text, I wasn’t né stupid. I have no idea what the collation algorithm for them is, though.

I’m willing to bet that almost any decision you make on the subject of human languages and therefore their translation into digital form will be wrong because of some societal convention you were unaware was even a possibility.

I sense it will be a while before humanity adopts a common language with a normal form and until then seeking correctness will be a full-time job.

We can almost certainly ignore Unicode’s view on Regular Expressions as their view has been swayed by Perl Regular Expressions. In particular, their view on what constitutes an identifier isn’t going to help us where we can include many punctuation characters and, hopefully, regular Unicode identifier names.

As noted previously, the code point range is 2^21 integers but not all of those are valid “characters”. A range of values in the first 65,536 is excluded as a side-effect of handling the UTF-16 encoding when Unicode finally recognised that 65,536 code points was not, in fact, more than enough.

That’s a slightly unfair comment as 16 bits was more than enough for the original Unicode premise of handling scripts and characters in modern use. That premise has changed as Unicode now handles any number of ancient scripts as well as CJK ideographs.

Who else is looking forward to Oracle bone script?

It doesn’t affect treatment of code points but it is worth understanding that Unicode is (now) defined as 17 planes with each plane being 16 bits, ie. potentially 65,536 code points. Note that there are several code points which are in some senses invalid in any encoding including the last two code points in every plane.

Each plane is chunked up into varying sized blocks which are allocated to various character set purposes. Some very well known character sets are historically fixed, for example ISO8859-1, and the block is full. Other character sets have slots reserved within the block for future refinements.

The first plane, plane 0, is called the Basic Multilingual Plane (BMP) and is pretty much full and covers most modern languages.

Planes 1, 2 and 3 are supplementary to BMP and are filled to various degrees.

Planes 4 through 13 are unassigned!

Plane 14 contains a small number of special-purpose “characters”.

Planes 15 and 16 are designated for private use. Unlike, say, RFC1918 Private Networks which are (usually) prevented from being routed on the Internet, these Private Use planes are ripe for cross-organisational conflict. Anyone wanting to publish Klingon plqaD, Tolkien’s runic Cirth or Medieval texts (see MUFI) on the same page need to coordinate block usage. See Private Use Areas for some information on likely coordinating publishers.
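
As a throwaway illustration of that structure (not Idio source, just the arithmetic), the plane of a code point is whatever sits above the 16-bit plane offset:

#include <stdint.h>

/* illustrative only: plane number and within-plane offset */
static int cp_plane (uint32_t cp)
{
    return (int) (cp >> 16);		/* 0 (BMP) through 16 */
}

static uint32_t cp_plane_offset (uint32_t cp)
{
    return cp & 0xFFFF;			/* position within its plane */
}

So U+01FBF7, from earlier, is in plane 1 at offset 0xFBF7.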

Unicode isn’t a clean room setup either. They started by saying the first 256 block would be a straight copy of ISO-8859-1 (Latin-1) which is handy for users of such but it doesn’t really follow that it was the best choice in the round. There’s all sorts of compromises floating about such as the continued use of Japanese fullwidth forms – effectively duplicating ASCII.

The issue with handling Unicode is, um, everything. We have an issue about the encoding in use when we read Unicode in – commonly, UTF-8, UTF-16 or UTF-32. We have an issue about storing code points and strings (ie. arrays of code points) internally. And we have to decide which encoding to use when we emit code points and strings.

There’s also a subtlety relating to the meaning of code points. For example, most of us are familiar with ASCII (and therefore Unicode) decimal digits, 0-9 (U+0030 through to U+0039). Unicode has lots of decimal digits, though, some 650 code points have the Nd category (meaning decimal number) alone. In addition Unicode supports other numeric code points such as Roman numerals.

In principle, then, we ought to support the use of any of those code points as numeric inputs – ie. there are 65 zeroes, 65 ones, 65 twos, etc. – because we can use a Unicode attribute, Numeric_Value, associated with the code point to get its decimal value.

However, we then have to consider what it means to mix those Numeric code points across groups in the same word: 1٢۳߄५ is 12345 with a code point from each of the first five of those groups (Latin-1, Arabic-Indic, Extended Arabic-Indic, NKO, Devanagari). Does it make any sense to mix these character sets in the same expression?
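
To make that concrete, here is a hand-rolled sketch covering just the five digit groups used above; a real implementation would consult the UCD’s Numeric_Value (or Nd) data rather than hard-coded ranges:

#include <stdint.h>

/* illustrative only: the decimal value of a few Nd code points */
static int digit_value (uint32_t cp)
{
    if (cp >= 0x0030 && cp <= 0x0039) return (int) (cp - 0x0030);	/* ASCII / Latin-1 */
    if (cp >= 0x0660 && cp <= 0x0669) return (int) (cp - 0x0660);	/* Arabic-Indic */
    if (cp >= 0x06F0 && cp <= 0x06F9) return (int) (cp - 0x06F0);	/* Extended Arabic-Indic */
    if (cp >= 0x07C0 && cp <= 0x07C9) return (int) (cp - 0x07C0);	/* NKO */
    if (cp >= 0x0966 && cp <= 0x096F) return (int) (cp - 0x0966);	/* Devanagari */

    return -1;			/* not one of these blocks */
}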

It becomes increasingly complex to reason about these things and the inter-relationship between character sets at which point we start laying down the law.

Or would do. At the moment the code invokes the likes of isdigit(3) and friends which, in turn, use locale-specific lookup tables. Of interest, the only valid input values to those functions are an unsigned char or EOF which rules out most CJK character sets and, indeed, everything except Latin-1 in the above example.

In some ways I think we could be quite pleased that the language allows you to create variables using Unicode code points (outside of Latin-1) and assign values to them using non-ASCII digits. Many people might then bemoan the unreadability of the resultant program forgetting, presumably, that, say, novels are published in foreign languages without much issue.

English appears to be the lingua franca of computing, for good or ill, and I can’t see how being flexible enough to support non-ASCII programming changes that.

More work (and someone less Western-European-centric and not a monoglot) required to sort this out.

In the meanwhile, let’s try to break Unicode down.

Code Points

Code points are (very) distinct from strings (in any encoding). For a code point we want to indicate which of the 2^21-ish integers we mean. We’ve previously said that the reader will use something quite close to the Unicode consortium’s own stylised version: #U+hhhh.

hhhh represents a hexadecimal number, so any number of hs which yield a suitable number will do.

As discussed in Constants, we can then stuff that code point into a specific constant type, here, ccc is 100 giving us:

int uc = 0x0127;     /* LATIN SMALL LETTER H WITH STROKE */
IDIO cp = (uc << 5) | 0x10010;

obviously, there’s a couple of macros to help with that:

int uc1 = 0x0127;
IDIO cp = IDIO_UNICODE (uc1);

int uc2 = IDIO_UNICODE_VAL (cp);

Code points in and of themselves don’t do very much. We can do some comparisons between them as the IDIO value is a “pointer” so you can perform pointer comparison. But not much else.

One thing to be aware of is that there are no computational operations you can perform on a code point. You can’t “add one” and hope/expect to get a viable code point. Well, you can hope/expect but good luck with that.

We have the “simple” upper/lower-case mappings from the UCD although you should remember that they constitute a directed acyclic graph:

digraph lower {
    node [ shape=box ]
    "0130;LATIN CAPITAL LETTER I WITH DOT ABOVE" -> "0069;LATIN SMALL LETTER I" [ label=" lower " ];
    "0069;LATIN SMALL LETTER I" -> "0049;LATIN CAPITAL LETTER I" [ label=" upper " ];
    "0049;LATIN CAPITAL LETTER I" -> "0069;LATIN SMALL LETTER I" [ label=" lower " ];
}

and

digraph upper {
    node [ shape=box ]
    "01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J" -> "01C7;LATIN CAPITAL LETTER LJ" [ label=" upper " ];
    "01C7;LATIN CAPITAL LETTER LJ" -> "01C9;LATIN SMALL LETTER LJ" [ label=" lower " ];
    "01C9;LATIN SMALL LETTER LJ" -> "01C7;LATIN CAPITAL LETTER LJ" [ label=" upper " ];
}

in both cases no mapping will return you to the starting code point.

Reading

As noted, there are three reader forms:

  1. #\X for some UTF-8 encoding of X

    #\ħ is U+0127 (LATIN SMALL LETTER H WITH STROKE)

  2. #\{...} for some (very small set of) named characters

  3. #U+hhhh for any code point

Writing

There are two forms of output depending on whether the character is being printed (ie. a reader-ready representation is being generated) or displayed (as part of a body of other output).

If printed, Unicode code points will be printed out in the standard #U+hhhh format with the exception of code points under 0x80 which satisfy isgraph(3), which will take the #\X form.

If displayed, the Unicode code point is encoded in UTF-8.
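
A hedged sketch of that printing decision (names invented for illustration, and glossing over the UTF-8 generation for the displayed case):

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* illustrative only: the reader-ready form of a code point */
static void print_code_point (uint32_t cp)
{
    if (cp < 0x80 && isgraph ((int) cp)) {
        printf ("#\\%c", (char) cp);		/* eg. #\h */
    } else {
        printf ("#U+%04X", (unsigned int) cp);	/* eg. #U+0127 */
    }
}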

Idio Strings

By which we mean arrays of code points… -ish.

History

If we start with, as I did, C strings we can get a handle on what I was doing.

In C a string is an array of single byte characters terminated by an ASCII NUL (0x0). That’s pretty reasonable and we can do business with that.

However, I was thinking that a lot of what we might be doing in a shell is splitting lines of text up into words (or fields if we have an awk hat on). We don’t really want to be re-allocating memory for all these words when we are unlikely to be modifying them, we really just want a substring of the original.

To make that happen we’d need:

  • a pointer to the original string – so we can maintain a reference so that the parent isn’t garbage collected under our feet.

  • an offset into the parent (or just a pointer to the start within the parent string)

  • a length

Which seems fine. You can imagine there’s a pathological case where you might have a substring representing one byte of a 2GB monster you read in from a file and you can’t free the space up because of your reference to the one byte substring. I suspect that if that is really a problem then maybe we can have the GC do some re-imagining under the hood next time round.

I then thought that, partly for consistency and partly for any weird cases where we didn’t have a NUL-terminated string to start with, real strings should have a length parameter as well.

If that meant we stored an extra byte (with a NUL in it) then so be it (we’ve just casually added 4 or 8 bytes for a size_t for the length so an extra byte for a NUL is but a sliver of a nothingness) and it’s quite handy for printing the value out.

That all worked a peach.

Current

Along comes Unicode (primarily driven by the need to port some regular expression handling as I hadn’t mustered the enthusiasm to re-write strings beforehand).

The moral equivalent of the 8-bit characters in C strings are Unicode’s code points.

However, we don’t actually want an array of code points because that’s a bit dumb – even we can spot that. Code points are stored in an IDIO “pointer” and so consume 4 or 8 bytes each which is way more than most strings require. We store code points in an IDIO pointer because they can then be handled like any other Idio type, not because it is efficient.

Looking at most of the text that I type, I struggle to use all the ASCII characters, let alone any of the exotic delights from cultures further afield. I’m going to throw this out there: most of the text that you type, dear reader, fits in the Unicode Basic Multilingual Plane and is therefore encodable in 2 bytes.

I apologise to my Chinese, Japanese and Korean friends who throw 4 byte code points around with abandon. At least you’re covered and not forgotten.

That blind use of an array of code points will chew up space viciously – even the four byte code points with 64-bit IDIO “pointers”. What we should be doing, as we read the string in, is to perform some analysis as to which is the largest (widest?) code point in the string and then construct an array where all the elements are that wide. We already support the notion of a length so there’s no need for trailing NULs – which don’t make any sense in the context of 4 byte wide “characters” anyway.

str = "hello"

should only require five bytes of storage as it is only using ASCII characters.

str = "ħello"

Where the first character is U+0127 (LATIN SMALL LETTER H WITH STROKE), the string will now be ten bytes long as the first code point requires two bytes and therefore so will all the rest, even though they are the same ASCII characters as before. Dems da breaks.
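
A sketch of that width analysis, assuming we’ve already decoded the UTF-8 into code points:

#include <stddef.h>
#include <stdint.h>

/* illustrative only: 1 byte covers ASCII/Latin-1, 2 bytes the BMP, else 4 */
static size_t string_width (const uint32_t *cps, size_t n)
{
    size_t width = 1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (cps[i] > 0xFFFF) {
            return 4;			/* can't get any wider */
        } else if (cps[i] > 0xFF) {
            width = 2;
        }
    }

    return width;
}

For "hello" that returns 1 and for "ħello" it returns 2, matching the five and ten byte figures above.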

By and large, though, I sense that most strings are going to be internally very consistent and be:

  • ASCII/Latin-1 and therefore 1 byte code points only

  • mostly BMP (2 byte) and some 1 byte code points

  • using 4 byte code points regularly

If we join two strings together we can upgrade/widen the one or the other as required.

The only real problem is that anyone wanting to modify an element in a string array might get caught out by trying to stuff a 4 byte code point into a one byte string.

Feeling rather pleased with my thinking I then discovered that Python had already encapsulated this idea in PEP393 and I can’t believe others haven’t done something similar.

I felt good for a bit, anyway.

So that’s the deal. Strings are arrays of elements with widths of 1, 2 or 4 bytes. The string has a length. We can have substrings of it.

I’ve no particular fix for the string modification issue. In principle it requires reworking the string under the feet of the caller but we now have to ensure that all the associated substrings are kept in sync.

A rotten workaround would be to prefix any string with a 4 byte letter then only use indexes 1 and beyond.

A better workaround would be to allow the forcible creation of, say, 4 byte strings rather than using analysis of the constituent code points.

Implementation

We can encode the string array’s element width in type-specific flags and then we need an array length and a pointer to the allocated memory for it.

gc.h
#define IDIO_STRING_FLAG_NONE                0
#define IDIO_STRING_FLAG_1BYTE               (1<<0)
#define IDIO_STRING_FLAG_2BYTE               (1<<1)
#define IDIO_STRING_FLAG_4BYTE               (1<<2)

struct idio_string_s {
    size_t len;              /* code points */
    char *s;
};

#define IDIO_STRING_LEN(S)   ((S)->u.string.len)
#define IDIO_STRING_S(S)     ((S)->u.string.s)
#define IDIO_STRING_FLAGS(S) ((S)->tflags)

s is a char * (and called s) for the obvious historic reasons although s is never used to access the elements of the array directly. We will figure out the element width from the flags and then use a uint8_t *s8, uint16_t *s16 or uint32_t *s32 cast from s as appropriate. The array elements are then casually accessed with s8[i], s16[i] or s32[i]. Easy.
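
A hedged sketch of that access pattern using the flags above (the function name is made up for illustration; the real accessors differ):

#include <stddef.h>
#include <stdint.h>

/* illustrative only: fetch code point i using the width recorded in the flags */
static uint32_t string_element (char *s, int flags, size_t i)
{
    if (flags & IDIO_STRING_FLAG_1BYTE) {
        uint8_t *s8 = (uint8_t *) s;
        return s8[i];
    } else if (flags & IDIO_STRING_FLAG_2BYTE) {
        uint16_t *s16 = (uint16_t *) s;
        return s16[i];
    } else {
        uint32_t *s32 = (uint32_t *) s;
        return s32[i];
    }
}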

There’s something very similar for a substring which requires:

  • the reference back to the parent string

  • s becomes a pointer directly into the parent string for the start of the substring – but is otherwise cast to something else in the same manner as above

  • len is the substring’s length

A substring can figure out the width of elements from its parent.

A substring of a substring is flattened to just being another substring of the original parent.
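
Something of this shape, then (a sketch, not the actual definition in gc.h):

struct idio_substring_s {
    IDIO parent;		/* keeps the parent alive for the GC */
    size_t len;			/* code points in the substring */
    char *s;			/* points into the parent's storage */
};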

*

Amongst other possible reworks, I notice many other implementations allocate the equivalent of the IDIO object and the memory required for the string storage in one block.

It would save the two pointers used by malloc(3) for its accounting and the extra calls to malloc(3)/free(3).

*

I did have an idea for a “short” string. The nominal IDIO union is three pointers worth – to accommodate a pair, the most common type – could we re-work that as a container for short strings?

Three pointers worth is 12 or 24 bytes. If we used an unsigned char for the length then we could handle strings up to 11 or 23 bytes.

I think you’d need to do some “field analysis” to see if such short strings occur often enough to make it worth the effort.

Encoding

One thing hidden from the above discourse is the thorny matter of encoding. All mechanisms to move data between entities need to agree on a protocol to encode the data over the transport medium.

Unicode used to use UCS-2 and UCS-4 which have been deprecated in favour of UTF-16 and UTF-32, two and four byte encodings respectively. I get the impression they are popular in the Windows world and they might appear as the “wide character” interfaces in Unix-land, see fgetwc(3), for example.

However, almost everything I see is geared up for UTF-8 so we’ll not buckle any trends.

Therefore, Idio expects its inputs to be encoded in UTF-8 and it will generate UTF-8 on output.
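
The output side is mechanical enough to sketch here (surrogates and out-of-range values glossed over; the real generator is idio_utf8_string(), mentioned under Writing below):

#include <stddef.h>
#include <stdint.h>

/* illustrative only: encode one code point as 1-4 bytes of UTF-8,
 * returning the number of bytes written into buf */
static size_t utf8_encode (uint32_t cp, uint8_t *buf)
{
    if (cp < 0x80) {
        buf[0] = (uint8_t) cp;
        return 1;
    } else if (cp < 0x800) {
        buf[0] = (uint8_t) (0xC0 | (cp >> 6));
        buf[1] = (uint8_t) (0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        buf[0] = (uint8_t) (0xE0 | (cp >> 12));
        buf[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (uint8_t) (0x80 | (cp & 0x3F));
        return 3;
    } else {
        buf[0] = (uint8_t) (0xF0 | (cp >> 18));
        buf[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (uint8_t) (0x80 | (cp & 0x3F));
        return 4;
    }
}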

To read in UTF-8 we use Bjoern Hoehrmann’s Flexible and Economical UTF-8 Decoder [Hoe08], a DFA-based decoder.

Categories and Properties

Whilst looking for a JSON5 library I stumbled across Simon Schoenenberger’s Unicode character lookup table work which I have re-imagined as Unicode Summary Information.

Reading

The input form for a string is quite straightforward: "...", that is a U+0022 (QUOTATION MARK) delimited value.

The reader is, in one sense, quite naive and is strictly looking for a non-escaped closing " to terminate the string, see idio_read_string() in src/read.c.

Subsequently the collected bytes are assumed to be part of a valid UTF-8 sequence. If the byte sequence is invalid UTF-8 you will get the (standard) U+FFFD (REPLACEMENT CHARACTER) and the decoding will resume with the next byte. This may result in several replacement characters being generated.

There are a couple of notes:

  1. \, U+005C (REVERSE SOLIDUS – backslash) is the escape character. The obvious character to escape is " itself allowing you to embed a double-quote symbol in a double-quoted string: "hello\"world".

    In the spirit of C escape sequences Idio also allows:

    Supported escape sequences in strings

    sequence   (hex) ASCII   description
    \a         07            alert / bell
    \b         08            backspace
    \e         1B            escape character
    \f         0C            form feed
    \n         0A            newline
    \r         0D            carriage return
    \t         09            horizontal tab
    \v         0B            vertical tab
    \\         5C            backslash
    \x...                    up to 2 hex digits representing any byte
    \u...                    up to 4 hex digits representing a Unicode code point
    \U...                    up to 8 hex digits representing a Unicode code point

    Any other escaped character results in that character.

    For \x, \u and \U the code will stop consuming code points if it sees one of the usual delimiters or a code point that is not a hex digit: “\Ua9 2021” silently stops at the SPACE character giving “© 2021” and, correspondingly, “\u00a92021” gives “©2021” as a maximum of 4 hex digits are consumed by \u. (There’s a sketch of this consumption after these notes.)

    \x is unrestricted (other than between 0x0 and 0xff) and \u and \U will have the hex digits converted into UTF-8.

    Adding \x bytes into a string is an exercise in due diligence.

  2. Idio allows multi-line strings:

    str1 := "Hello
    World"
    
    str2 := "Hello\nWorld"
    

    The string constructors for str1 and str2 are equivalent.
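
Returning to the \x, \u and \U escapes in note 1, here’s a hedged sketch of the “up to N hex digits” consumption (illustrative names, not the reader’s actual code):

#include <stddef.h>
#include <stdint.h>

/* illustrative only: consume up to max_digits hex digits, stopping at
 * the first non-hex character; *endp is left at the first unconsumed
 * character -- max_digits is 2 for \x, 4 for \u and 8 for \U */
static uint32_t read_hex_escape (const char *s, size_t max_digits, const char **endp)
{
    uint32_t v = 0;
    size_t i;

    for (i = 0; i < max_digits; i++) {
        char c = s[i];
        int d;

        if (c >= '0' && c <= '9') {
            d = c - '0';
        } else if (c >= 'a' && c <= 'f') {
            d = c - 'a' + 10;
        } else if (c >= 'A' && c <= 'F') {
            d = c - 'A' + 10;
        } else {
            break;
        }

        v = (v << 4) | (uint32_t) d;
    }

    *endp = s + i;
    return v;
}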

Pathnames

The reason the \x escape exists is to allow more convenient creation of “awkward” pathnames. As noted previously *nix pathnames do not have any encoding associated with them.

Now, that’s not to say you can’t use an encoding and, indeed, we are probably encouraged to use UTF-8 as an encoding for pathnames. However, the problem with the filesystem having no encoding is that both you and I have to agree on what the encoding we have used is. You say potato, I say solanum tuberosum.

In general Idio uses the (UTF-8) encoded strings in the source code as pathnames and, as the Idio string to C string converter uses a UTF-8 generator, by and large, every pathname you create will be UTF-8 encoded.

However, any pathname already in the filesystem is of an unknown encoding of which the only thing we know is that it won’t contain an ASCII NUL and it won’t have U+002F (SOLIDUS – forward slash) in a directory entry. It’s a C string.

Of course we “get away with” an implicit UTF-8 encoding because the vast majority of *nix filenames are regular ASCII ones. Only those users outside of North America and islands off the coast of North Western Europe have suffered.

So what we really need is to handle such pathnames correctly which means not interpreting them.

Technically, then, the Idio string "hello" and the filename hello are different. Which is going to be a bit annoying from time to time.

Note

Of interest, it is possible to manipulate strings such that you have a 1-byte width encoding for "hello" and a 2 or 4-byte encoding for "hello". However, those strings will be considered equal? because they have the same length and the same code point at each of their indices.

As it so happens, for general file access and creation, the Idio strings, converted into UTF-8 encoded C strings will do the right thing. However, if you consume filenames from the filesystem they will be treated as pathnames and will not be equal? to the nominally equivalent Idio string.

So we need to be able to create *nix pathnames ourselves for these sorts of comparisons and we have a formatted string style to do that: %P"..." (or %P(...) or %P{...} or %P[...]) where the ... is a regular string as above.

That’s where the \x escape for strings comes into its own. If we know that a filename starts with ISO8859-1’s 0xA9 (the same “character” as ©, U+00A9 (COPYRIGHT SIGN)), as in a literal byte, 0xA9, and not the UTF-8 sequence 0xC2 0xA9, then we can create such a string: %P"\xa9...".

The %P formatted string is fine if we hand-craft our pathname for comparison but if we already have an Idio string in our hands we need a converter, string->pathname, which will return a pathname from the UTF-8 encoding of your string. Which sounds slightly pointless but gets us round the problem of matching against pathnames in the filesystem which have no encoding.

Idio> string->pathname "hello"
%P"hello"

Notice the leading %P indicating it is a pathname.

Pathname Expansion

As if pathnames as an unencoded string aren’t complicated enough we want wildcards!

Bash has a reasonably pragmatic approach to wildcards. From bash(1), Pathname Expansion:

After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, and is not quoted, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of filenames matching the pattern

Unfortunately, our free hand with the code points allowed in symbols means that * and ? are not just possible but entirely probable and, certainly, to be expected. That makes wildcards a bit tricky.

Hmm. I’ve noted before that Murex constrains globbing to @{g *.c} and extends the mechanism to regexps, @{rx \.c$}, and file modes, @{f +d}.

Although I think we want a mechanism to do globbing which is distinct from any sorting and filtering you might perform on the list.

We’ve mentioned Pathname Templates before although they are a little bit magical.

I want #P{...} (or #P(...) or #P[...] or #P"...") to create a pathname template but not exercise it yet. This is akin to creating a regular expression in advance, the regcomp before the regexec of POSIX regexes.

Only when it is used do we glob(3) the expression. This does lead to a little confusion:

Idio> p := #P"x.*"
#<SI ~path pattern:%P"x.*">
Idio> p
(%P"x.idio")
Idio> printf "%s\n" p
(%P"x.idio")
#<unspec>

which feels OK but

Idio> printf "%s\n" #P"x.*"
#<SI ~path pattern:%P"x.*">
#<unspec>
Idio> ls #P"x.*"
x.idio
#t

Feels wrong. Why does the expansion occur for ls and not for printf? Technically, for both, the arguments are constructed as you would expect giving both of them the struct instance of a ~path (mnemonically, a dynamic path). However, because we need to convert all of the arguments to strings for execve(2), the use of a ~path struct instance is expanded into a list of pathnames.

It’s not great.

In the meanwhile, we can do the right sorts of things creating and matching files with non-UTF-8 characters in them:

Idio> close-handle (open-output-file %P"\xa9 2021")
#unspec
Idio> p := #P"\xa9*"
#<SI ~path pattern:%P"\xA9*">
Idio> p
(%P"\xA9 2021")
Idio> ls -1f
 ...
''$'\251'' 2021'
...

I see, here, that ls is using a Bash-style $'...' quoting such that ''$'\251'' 2021' is the concatenation of '', $'\251' and ' 2021' with \251 being the octal for 0xA9.

For Idio, specifically when printing strings in a “use escapes” context, here, at the REPL, a pathname string will have non-isprint(3) characters converted to a \x form, hence, %P"\xA9 2021". Similarly, a pathname with a newline in it would be, say, %P"hello\nworld".

When not printing strings in a “use escapes” context, notably when preparing arguments for execve(2) then we just get the raw *nix pathname characters meaning something like ls won’t get in a tizzy:

Idio> close-handle (open-output-file %P"\xa9 2021")
#<unspec>
Idio> ls %P"\xa9 2021"
''$'\251'' 2021'
#t

Compare with the variations for when 0xA9 is an invalid UTF-8 encoding in a regular string and when the Unicode code point U+00A9 is used:

Idio> ls "\xa9 2021"
/usr/bin/ls: cannot access ''$'\357\277\275'' 2021': No such file or directory
#f
job 327830: (ls "� 2021"): completed: (exit 2)
Idio> ls "\ua9 2021"
/usr/bin/ls: cannot access ''$'\302\251'' 2021': No such file or directory
#f
job 327834: (ls "© 2021"): completed: (exit 2)

The first is completely garbled and you can see the copyright sign in the notification about the command failure in the second with ls complaining that it can’t access something beginning with (*quickly translates*) 0xC2 0xA9, the UTF-8 encoding of U+00A9.

*

That said, regex and, by extension, pattern matching, are not affected by this distinction as they aren’t concerned about the equality of the strings and/or pathnames so much as whether they conform to a (regular expression) pattern:

pt := #P{\xA9*}

map (function (p) {
       printf "%s: " p
       (pattern-case p
                     ("*2021" {
                       printf "2021!\n"
                     })
                     (else {
                       printf "nope!\n"
                     }))
}) pt

Interpolated Strings

Sometimes we want to embed references to variables, usually, or expressions in a string and have something figure out the current value of the variable or result of the expression and replace the reference with that result, something along the lines of:

name := "Bob"

"Hi, ${name}!"

I want double-quoted strings, the usual "...", to remain as fixed entities, much like single-quoted strings in, say, Bash are. That means we need another format, another # format! Let’s go for #S{ ... } and everything between the matching braces is our putative string.

The ${name}-style format seems good to me, the only issue being whether we want to change $ for another sigil. In the usual way, we’d pass that in between the S and {:

#S{Hi, ${name}!}

#S^{Hi, ^{name}!}

The second interpolation character is the escape character, defaulting to \, as usual.

You can use the usual matching bracketing characters, {}, [] and () to delimit the main block:

#S[${name} is ${string-length name} letters long.]

but only braces, {} for the references.

There is a subtlety here as the results of the expressions are not necessarily themselves strings. The ${string-length name} expression, for example, will result in an integer and the implied append-string constructing the result of the interpolated string will get upset.

So, in practice, all of the elements of the putative string, the non-expression strings and the results of the expressions, are mapped against ->string which leaves strings alone and runs the “display” variant of the printer for the results of the expressions.

->string does not perform any splicing so if your expression returns a list then you’ll get a list in your string.
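
Pulling that together, a hedged sketch of what I’d expect at the REPL (results written out by hand, not captured):

Idio> name := "Bob"
"Bob"
Idio> #S{Hi, ${name}!}
"Hi, Bob!"
Idio> #S^{Hi, ^{name}!}
"Hi, Bob!"
Idio> #S[${name} is ${string-length name} letters long.]
"Bob is 3 letters long."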

Writing

Idio strings will be UTF-8 encoded on output, see idio_utf8_string() in src/unicode.c for the details.

There’s a couple of qualifications to that:

  1. We can ask for the reader’s C escape sequences to be reproduced in their \X format, eg. \a for alert / bell.

  2. We can ask for the printed string to be quoted with double quotes.

    This latter option is a consequence of how we visualise printed entities. The REPL will print values in a reader-ready format, so including leading and trailing "s.

    Idio> str := "Hello\nWorld"
    "Hello\nWorld"
    
    Idio> str := "Hello
    World"
    "Hello\nWorld"
    

    By and large, though, most things will display a string value as part of a larger output:

    Idio> printf "'%s'\n" str
    'Hello
    World'
    

    (Actually, a trailing #<unspec> will also be printed which is the value that printf returned.)

Operations

Characters

function unicode? value

is value a Unicode code point

function unicode->plane cp

return the Unicode plane of code point cp

The result is a fixnum.

function unicode->plane-cp cp

return the lower 16-bits of the code point cp

function unicode->integer cp

convert code point cp to a fixnum

function unicode=? cp1 cp2 [...]

compare code points for equality

A minimum of two code points are required.
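
A hedged sketch of these at the REPL (results as I’d expect them, not captured output):

Idio> unicode? #\a
#t
Idio> unicode->plane #U+1FBF7
1
Idio> unicode->integer #U+0127
295
Idio> unicode=? #\a #U+0061
#t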

Strings

function string? value

is value a string (or substring)

function make-string size [fill]

create a string of length size filled with fill characters or U+0020 (SPACE)

function string->list string

return a list of the Unicode code points in string

See also list->string.

function string->symbol string

return a symbol constructed from the UTF-8 Unicode code points in string

See also symbol->string.

function append-string [string ...]

return a string constructed by appending the string arguments together

If no strings are supplied the result is a zero-length string, cf. "".

function concatenate-string list

return a string constructed by appending the strings in list together

If no strings are supplied the result is a zero-length string, cf. "".

function copy-string string

return a copy of string string

function string-length string

return the length of string string

function string-ref string index

return the Unicode code point at index index of string string

Indexes start at zero.

function string-set! string index cp

set the Unicode code point at index index of string string to be the Unicode code point cp

Indexes start at zero.

If the number of bytes required to store cp is greater than the per-code point width of string a ^string-error condition will be raised.

function string-fill! string fill

set all indexes of string string to be the Unicode code point fill

If the number of bytes required to store fill is greater than the per-code point width of string a ^string-error condition will be raised.

function substring string pos-first pos-next

return a substring of string string starting at index pos-first and ending before pos-next.

Indexes start at zero.

If pos-first and pos-next are inconsistent a ^string-error condition will be raised.
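
Again, a hedged sketch at the REPL (expected results, not captured output):

Idio> string-length "ħello"
5
Idio> string-ref "ħello" 0
#U+0127
Idio> substring "ħello" 1 3
"el"
Idio> append-string "ħ" "ello"
"ħello"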

function string<=? s1 s2 [...]
function string<? s1 s2 [...]
function string=? s1 s2 [...]
function string>=? s1 s2 [...]
function string>? s1 s2 [...]

Warning

Historic code for ASCII/Latin-1 Scheme strings badgered into working at short notice.

These need to be replaced with something more Unicode-aware.

Perform strncmp(3) comparisons of the UTF-8 representations of the string arguments

function string-ci<=? s1 s2 [...]
function string-ci<? s1 s2 [...]
function string-ci=? s1 s2 [...]
function string-ci>=? s1 s2 [...]
function string-ci>? s1 s2 [...]

Warning

Historic code for ASCII/Latin-1 Scheme strings badgered into working at short notice.

These need to be replaced with something more Unicode-aware.

Perform strncasecmp(3) comparisons of the UTF-8 representations of the string arguments

function split-string string [delim]

Split string string into a list of strings delimited by the code points in the string delim which itself defaults to IFS.

split-string acts like the shell’s or awk’s word-splitting by IFS in that multiple adjacent instances of delimiter characters only provoke one “split.”

function split-string-exactly string [delim]

Split string string into a list of strings delimited by the code points in the string delim which itself defaults to IFS.

split-string-exactly is meant to act more like a regular expression matching system.

It was originally required to split the contents of the Unicode Character Database file utils/Unicode/UnicodeData.txt – which has multiple ;-separated fields, often with no value in a field – to help generate the code base for regular expression handling.

function fields string

A variation on split-string with a view to more awk-like line splitting functionality, fields splits string string into an array of strings delimited by the code points in IFS with the first element of the array being the original string.

Indexing of the array gives you awk-like field numbers as in awk’s $0, $1, …, here, array index 0 for the original string, index 1 for the first IFS delimited field, 2 for the second IFS delimited field etc..

As a function taking a single argument it can be used with the value-index operator, .:

Idio> fs := "hello world".fields
#[ "hello world" "hello" "world" ]
Idio> fs.0
"hello world"
Idio> fs.1
"hello"

function join-string delim list

construct a string from the strings in list with the string delim placed in between each pair of strings

list is a, uh, list, here, unlike, say, append-string as it follows the Scheme form (albeit with arguments shifted about) which takes another parameter indicating the style in which the delimiter should be applied, such as: before or after every argument, infix (the default) and a strict infix for complaining about no arguments.
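
For example, a sketch (assuming list builds a list in the usual Scheme-ish way):

Idio> join-string ":" (list "a" "b" "c")
"a:b:c"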

function strip-string str discard [ends]

return a string where the characters in discard have been removed from the ends of str

ends can be one of 'left, 'right, 'both or 'none and defaults to 'right.
