Characters and Strings¶
We’re not in Kansas any more, Toto!
There’s no beating about the bush, we need to handle proper multi-character set strings from the get-go. We don’t want a Python 2 vs 3 debacle.
Not for strings, anyway.
Unicode¶
I’m not a multi-lingual expert, indeed barely literate in one language, so some of the nuance of multi-character set handling may be lost on me. We’re choosing Unicode (arguably ISO10646) because of its name familiarity, even if the actual implementation is less familiar to everyone.
The broad thrust of Unicode is to allocate a code point, an integer, to the most common characters and support several combining code points to create the rest. From a Western Europe viewpoint, we might have an “e acute” character, é, but also an “acute accent”, ´, which can be combined with a regular “e”.
Clearly, we don’t need to combine the “acute accent” with a regular “e” as we already have a specific “e acute” but it does allow us to combine it with any other character in some rare combination not specifically covered elsewhere. There must be rules about how combining characters are allowed to, er, combine, to prevent an “e acute diaeresis” (unless that is allowed in which case I need to pick a better example).
These combinations are known as grapheme clusters and edge towards but do not become “characters” per se. It’s a grey area and you can find plenty of discussion online as to what is and isn’t a “character”.
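To make the encoding difference concrete, here is a small C snippet contrasting the precomposed “e acute” with the combining sequence; the code point values and UTF-8 bytes are UCD facts, the snippet itself is merely illustrative:

#include <stdio.h>

int main (void)
{
    /* U+00E9 LATIN SMALL LETTER E WITH ACUTE: one code point */
    char precomposed[] = "\xc3\xa9";

    /* U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING
     * ACUTE ACCENT: two code points, one grapheme cluster */
    char combined[] = "e\xcc\x81";

    printf ("%s vs %s\n", precomposed, combined);

    /* 2 bytes vs 3 bytes, yet a user sees one "character" each */
    printf ("%zu vs %zu bytes\n", sizeof (precomposed) - 1, sizeof (combined) - 1);

    return 0;
}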
Here’s an example rebuttal of the naïve interpretation of “character” from https://news.ycombinator.com/item?id=30384223:
A ‘character’ meaning what? The first code point? The first non-combining code point? The first non-combining code point along with all associated combining code points? The first non-combining code point along with all associated combining code points modified to look like it would look in conjunction with surrounding non-combining code points along with their associated combining code points? The first displayable component of a code point? What are the following?
the second character of sœur (o or the ligature?)
the second character of حبيبي (the canonical form ب or the contextual form ﺒ ?)
the third character of есть (Cyrillic t with or without soft sign, which is always a separate code point and always displayed to the right but changes the sound?)
the first character of 실례합니다 (Korean phoneme or syllabic grapheme?)
the first character of ﷺ or ﷽ ?
The main issue isn’t programming language support, it’s ambiguity in the concept of “character” and conventions about how languages treat them. Imagine how the letter i would be treated if Unicode were invented by a Turk. The fundamental issue here is that human communication is deeply nuanced in a way that does not lend itself well to systematic encoding or fast/naive algorithms on that encoding. Even in the plain ASCII range it’s impossible to know how to render a word like “ANSCHLUSS” in lower case (or how many ‘characters’ such a word would have) without knowledge of the language, country of origin, and time period in which the word was written.
And there’s plenty of tales of how Unicode doesn’t (or can’t or won’t) do the right thing as we step away from digitized documents (which, by definition must be using a known character set) into the human world. Starting with this from https://news.ycombinator.com/item?id=32095502:
In the 90s I worked on a project to digitize land registration in Taiwan.
In order to record deeds and property transfers, we needed to enter people’s names and official registered addresses into the computer system. The problem was that some people used non-traditional writing variants for their names, and some of their birthplaces were tiny places in China with weird names.
Someone might write their name with a two-dot water radical instead of three-dot radical. We would print it out in the normal font, and the people would lose their minds, saying that it was wrong. Chinese people can be superstitious about the number of strokes in their name, so adding a stroke might make it unlucky, so they would not buy the property.
The customer went to the agency responsible for managing the big character set, https://en.wikipedia.org/wiki/CNS_11643 Despite having more characters than anything else on earth, it didn’t have those variants. The agency said they would not encode them, because they were not real characters, just printing differences.
The solution was for the staff in the office to use a “font maker” program to create a custom font with these characters. Then they could print out the deeds using a Chinese variant of Adobe Acrobat, and everyone was happy.
—
Most texts fall back to calling code points characters in much the same way we call all 128 ASCII characters, er, characters even though most of the characters below 0x20 make no sense whatsoever as characters that you or I might draw with a pen.
0x03 is ETX, end of text. Eh? ETX is, of course, one of the C0 control codes used for message transmission. Few of these retain any meaning or function and certainly never corresponded with a “character” as, in this case, by definition, it marked the end of characters.

I nearly used 0x04, EOT, end of transmission, as my example before realising that the caret notation for it is ^D which might be confused with the usual keyboard-generated EOF with Ctrl-D, end of file, which is clearly a very similar concept. They are completely unrelated, of course, as the terminal line driver determines what keystrokes generate what terminal events:

$ stty -a
...
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = M-^?;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z;
rprnt = ^R; werase = ^W; lnext = ^V; discard = ^O;

Here, VEOF is Ctrl-D – see termios(3) for more than you wanted to know.
Unicode isn’t concerned with glyphs, the pictorial representation of characters, either. Even within the same font I can see three different glyphs for U+0061 (LATIN SMALL LETTER A) – even within the constraints of ReStructuredText:
a – regular
a – italic
a – bold
as I pick out different visual styles. They are all U+0061, though.
Interestingly, both emacs and vi are showing the underlined variant of the ordinal indicator in the source where, at least, you can easily see that the glyphs and therefore the code points are different. YMMV.
The glyph in your font might also cause some entertainment for the ages were you to mistake 20º for 20°. Here, we have (foolishly!) crossed U+00BA (MASCULINE ORDINAL INDICATOR) with U+00B0 (DEGREE SIGN) – and that’s not the only confusion possible with a small circle glyph.
This is not just an issue with squirrelly superscripted characters but also where Punycode is used in domain names to use non-ASCII characters with similar glyphs to ASCII characters to masquerade one domain as another. The example in the Wikipedia Homoglyph page is the near identical expressions of a, U+0061 (LATIN SMALL LETTER A), and а, U+0430 (CYRILLIC SMALL LETTER A).
Browsers, hopefully, have gotten better at alerting users to the duplicitous bаnk.com. For more of the same [pun intended], try the Unicode confusables page.
Your choice of font introduces another issue. There are around 150 thousand code points defined (of the 1,114,112 possible code points) but the font you are using might only cover a small fraction of those. If a glyph for a code point is missing the result isn’t clearly defined. The rendering system may substitute a glyph indicating the code point’s number in a box or you may get a blank box. The following is U+01FBF7 (SEGMENTED DIGIT SEVEN), 🯷. (I see a little box with a barely legible 01F on one row and BF7 on another.)
There’s a much better description of some of the differences between characters and glyphs – and, indeed, characters and code points – in Unicode’s Character Encoding Model.
Unicode differs from ISO10646 in that although they maintain the same set of code points the latter is effectively an extended ISO-8859, meaning a simple list of characters except covering a lot more character sets. Unicode goes further and associates with each code point any number of categories and properties and provides rules on line-breaking, grapheme cluster boundaries, mark rendering and all sorts of other things you didn’t realise were an issue.
Don’t let the simplistic nature of the Unicode home page concern you, go straight to the Unicode reports and get stuck in.
Do read the history of UTF-8, with Ken Thompson, Rob Pike and a New Jersey diner placemat.
Actually, don’t. Here, in Idio-land, we do not “support” Unicode. We use the Unicode Character Database (UCD) and some categories and properties related to that and UTF-8 encoding. We will use the “simple” lowercase and uppercase properties from the UCD to help with corresponding character mapping functions, for example.
However, Idio is not concerned with correct, legal, security or any other Unicode consideration. Idio simply uses whatever is passed to it and actions whatever the string manipulation the user invokes. If the result is non-conformant then so be it. User error.
—
We might have to consider matters such as Collation of strings – as we may not be using any system-provided collation library (which you would hope would have considered it). But we really don’t want to. That document is 29 thousand words long!
For those, like me, who often wonder about the Rest of the World there are non-obvious examples such as:
- the set [ “Ähnlich”, “Äpfel”, “Bären”, “Käfer”, “küssen” ] would have the strings beginning with Ä sorted to the end as Ä is one of three additional characters and comes after the regular Latin Z in the Swedish alphabet.
- in Danish, Aalborg sorts after Zaragoza because aa in personal and geographical names is å (borrowed from the Swedish) and sorts after z (a decision made 7 years after re-introducing the letter in 1948). This also has the side-effect that the regular expression [a-z][a-z] does not match å in a Danish locale even though it can be expressed as a digraph (aa).
- German allows for a distinct collation order for telephone listings.
Of course it would be naïve to believe that there were not a soupçon of names like Cæsar and Zoë roaming around in English text, I wasn’t né stupid. I have no idea what the collation algorithm for them is, though.
I’m willing to bet that almost any decision you make on the subject of human languages and therefore their translation into digital form will be wrong because of some societal convention you were unaware was even a possibility.
I sense it will be a while before humanity adopts a common language with a normal form and until then seeking correctness will be a full-time job.
—
We can almost certainly ignore Unicode’s view on Regular Expressions as their view has been swayed by Perl Regular Expressions. In particular, their view on what constitutes an identifier isn’t going to help us where we can include many punctuation characters and, hopefully, regular Unicode identifier names.
—
As noted previously, the code point range is 2^21 integers but not all of those are valid “characters”. A range of values in the first 65,536 is excluded as a side-effect of handling the UTF-16 encoding when Unicode finally recognised that 65,536 code points was not, in fact, more than enough.
That’s a slightly unfair comment as 16 bits was more than enough for the original Unicode premise of handling scripts and characters in modern use. That premise has changed as Unicode now handles any number of ancient scripts as well as CJK ideographs.
Who else is looking forward to Oracle bone script?
It doesn’t affect treatment of code points but it is worth understanding that Unicode is (now) defined as 17 planes with each plane being 16 bits, ie. potentially 65,536 code points. Note that there are several code points which are in some senses invalid in any encoding including the last two code points in every plane.
Each plane is chunked up into varying sized blocks which are allocated to various character set purposes. Some very well known character sets are historically fixed, for example ISO8859-1, and the block is full. Other character sets have slots reserved within the block for future refinements.
The first plane, plane 0, is called the Basic Multilingual Plane (BMP) and is pretty much full and covers most modern languages.
Planes 1, 2 and 3 are supplementary to BMP and are filled to various degrees.
Planes 4 through 13 are unassigned!
Plane 14 contains a small number of special-purpose “characters”.
Planes 15 and 16 are designated for private use. Unlike, say, RFC1918 Private Networks which are (usually) prevented from being routed on the Internet, these Private Use planes are ripe for cross-organisational conflict. Anyone wanting to publish Klingon pIqaD, Tolkien’s runic Cirth or Medieval texts (see MUFI) on the same page needs to coordinate block usage. See Private Use Areas for some information on likely coordinating publishers.
Unicode isn’t a clean room setup either. They started by saying the first block of 256 code points would be a straight copy of ISO-8859-1 (Latin-1) which is handy for users of such but it doesn’t really follow that it was the best choice in the round. There’s all sorts of compromises floating about such as the continued use of Japanese fullwidth forms – effectively duplicating ASCII.
—
The issue with handling Unicode is, um, everything. We have an issue about the encoding in use when we read Unicode in – commonly, UTF-8, UTF-16 or UTF-32. We have an issue about storing code points and strings (ie. arrays of code points) internally. And we have to decide which encoding to use when we emit code points and strings.
There’s also a subtlety relating to the meaning of code points. For example, most of us are familiar with ASCII (and therefore Unicode) decimal digits, 0-9 (U+0030 through to U+0039). Unicode has lots of decimal digits, though, some 650 code points have the Nd category (meaning decimal number) alone. In addition Unicode supports other numeric code points such as Roman numerals.
In principle, then, we ought to support the use of any of those code points as numeric inputs – ie. there are 65 zeroes, 65 ones, 65 twos, etc. – because we can use a Unicode attribute, Numeric_Value, associated with the code point to get its decimal value.
However, we then have to consider what it means to mix those Numeric code points across groups in the same word: 1٢۳߄५ is 12345 with a code point from each of the first five of those groups (Latin-1, Arabic-Indic, Extended Arabic-Indic, NKO, Devanagari). Does it make any sense to mix these character sets in the same expression?
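If we were to entertain it, a sketch of using per-block base values – the moral equivalent of the UCD’s Numeric_Value property – for just the five Nd blocks in that example might look like the following. The ranges are UCD facts; the function and its name are merely illustrative:

#include <stddef.h>
#include <stdint.h>

/* decimal digit (Nd) blocks used in the 1٢۳߄५ example; each is a
 * contiguous run of ten code points from zero to nine */
static const uint32_t nd_bases[] = {
    0x0030,  /* ASCII/Latin-1 DIGIT ZERO */
    0x0660,  /* ARABIC-INDIC DIGIT ZERO */
    0x06F0,  /* EXTENDED ARABIC-INDIC DIGIT ZERO */
    0x07C0,  /* NKO DIGIT ZERO */
    0x0966,  /* DEVANAGARI DIGIT ZERO */
};

/* return 0-9 for a known Nd code point, -1 otherwise */
static int nd_value (uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof (nd_bases) / sizeof (nd_bases[0]); i++) {
        if (cp >= nd_bases[i] && cp <= nd_bases[i] + 9) {
            return cp - nd_bases[i];
        }
    }

    return -1;
}

nd_value() maps the code points of 1٢۳߄५ to 1, 2, 3, 4 and 5 respectively.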
It becomes increasingly complex to reason about these things and the inter-relationship between character sets at which point we start laying down the law.
Or would do. At the moment the code invokes the likes of isdigit(3) and friends which, in turn, use locale-specific lookup tables. Of interest, the only valid input values to those functions are an unsigned char or EOF which rules out most CJK character sets and, indeed, everything except Latin-1 in the above example.
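As a reminder of that contract, the classic defensive idiom is to cast to unsigned char first:

#include <ctype.h>

/* isdigit(3)'s argument must be an unsigned char value or EOF;
 * passing a plain (possibly negative) char is undefined behaviour */
int is_ascii_digit (char c)
{
    return isdigit ((unsigned char) c);
}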
—
In some ways I think we could be quite pleased that the language allows you to create variables using Unicode code points (outside of Latin-1) and assign values to them using non-ASCII digits. Many people might then bemoan the unreadability of the resultant program forgetting, presumably, that, say, novels are published in foreign languages without much issue.
English appears to be the lingua franca of computing, for good or ill, and I can’t see how being flexible enough to support non-ASCII programming changes that.
More work (and someone less Western-European-centric and not a monoglot) required to sort this out.
In the meanwhile, let’s try to break Unicode down.
Code Points¶
Code points are (very) distinct from strings (in any encoding). For a code point we want to indicate which of the 2^21-ish integers we mean. We’ve previously said that the reader will use something quite close to the Unicode consortium’s own stylised version: #U+hhhh. hhhh represents a hexadecimal number, though, so any number of h’s which produce a suitable number are good.
As discussed in Constants, we can then stuff that code point into a specific constant type, here, ccc is 100, giving us:
int uc = 0x0127; /* LATIN SMALL LETTER H WITH STROKE */
IDIO cp = (uc << 5) | 0x10010;
Obviously, there are a couple of macros to help with that:
int uc1 = 0x0127;
IDIO cp = IDIO_UNICODE (uc1);
int uc2 = IDIO_UNICODE_VAL (cp);
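I haven’t reproduced the real definitions here but, given the shift-and-tag arithmetic above, they presumably look something like this – a hypothetical reconstruction, not the actual source:

/* hypothetical reconstructions consistent with (uc << 5) | 0x10010 */
#define IDIO_UNICODE(cp)      ((IDIO) (intptr_t) (((cp) << 5) | 0x10010))
#define IDIO_UNICODE_VAL(x)   ((int) (((intptr_t) (x)) >> 5))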
Code points in and of themselves don’t do very much. We can do some comparisons between them as the IDIO value is a “pointer” so you can perform pointer comparison. But not much else.
One thing to be aware of is that there are no computational operations you can perform on a code point. You can’t “add one” and hope/expect to get a viable code point. Well, you can hope/expect but good luck with that.
We have the “simple” upper/lower-case mappings from the UCD although you should remember that they form a directed graph:

digraph lower {
    node [ shape=box ]
    "0130;LATIN CAPITAL LETTER I WITH DOT ABOVE" -> "0069;LATIN SMALL LETTER I" [ label=" lower " ];
    "0069;LATIN SMALL LETTER I" -> "0049;LATIN CAPITAL LETTER I" [ label=" upper " ];
    "0049;LATIN CAPITAL LETTER I" -> "0069;LATIN SMALL LETTER I" [ label=" lower " ];
}

and

digraph upper {
    node [ shape=box ]
    "01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J" -> "01C7;LATIN CAPITAL LETTER LJ" [ label=" upper " ];
    "01C7;LATIN CAPITAL LETTER LJ" -> "01C9;LATIN SMALL LETTER LJ" [ label=" lower " ];
    "01C9;LATIN SMALL LETTER LJ" -> "01C7;LATIN CAPITAL LETTER LJ" [ label=" upper " ];
}

In both cases no sequence of mappings will return you to the starting code point.
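Spelled out as data, the UCD simple mappings behind those two graphs are as follows; the values are straight from UnicodeData.txt, the struct is just for illustration:

#include <stdint.h>

/* UCD simple case mappings for the chains above; 0 means no mapping */
struct simple_case { uint32_t cp, lower, upper; };

static const struct simple_case chains[] = {
    { 0x0130, 0x0069, 0      },  /* LATIN CAPITAL LETTER I WITH DOT ABOVE */
    { 0x0069, 0,      0x0049 },  /* LATIN SMALL LETTER I */
    { 0x0049, 0x0069, 0      },  /* LATIN CAPITAL LETTER I */
    { 0x01C8, 0x01C9, 0x01C7 },  /* LATIN CAPITAL LETTER L WITH SMALL LETTER J */
    { 0x01C7, 0x01C9, 0      },  /* LATIN CAPITAL LETTER LJ */
    { 0x01C9, 0,      0x01C7 },  /* LATIN SMALL LETTER LJ */
};

Start at U+0130 or U+01C8 and every subsequent mapping bounces between i/I or LJ/lj; nothing maps back to the original.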
Reading¶
As noted, there are three reader forms:
- #\X for some UTF-8 encoding of X: #\ħ is U+0127 (LATIN SMALL LETTER H WITH STROKE)
- #\{...} for some (very small set of) named characters
- #U+hhhh for any code point
Writing¶
There are two forms of output depending on whether the character is being printed (ie. a reader-ready representation is being generated) or displayed (as part of a body of other output).
If printed, Unicode code points will be printed out in the standard #U+hhhh format with the exception of code points under 0x80 which are isgraph(3), which will take the #\X form.
If displayed, the Unicode code point is encoded in UTF-8.
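The printed rule fits in a few lines of C; a sketch rather than the actual implementation:

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* the printed, reader-ready form: #\X for printable ASCII,
 * #U+hhhh for everything else */
void print_code_point (uint32_t cp)
{
    if (cp < 0x80 && isgraph ((int) cp)) {
        printf ("#\\%c", (char) cp);
    } else {
        printf ("#U+%04X", (unsigned int) cp);
    }
}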
Idio Strings¶
By which we mean arrays of code points… -ish.
History¶
If we start with, as I did, C strings we can get a handle on what I was doing.
In C a string is an array of single byte characters terminated by an ASCII NUL (0x0). That’s pretty reasonable and we can do business with that.
However, I was thinking that a lot of what we might be doing in a shell is splitting lines of text up into words (or fields if we have an awk hat on). We don’t really want to be re-allocating memory for all these words when we are unlikely to be modifying them, we really just want a substring of the original.
To make that happen we’d need:
- a pointer to the original string – so we can maintain a reference so that the parent isn’t garbage collected under our feet.
- an offset into the parent (or just a pointer to the start within the parent string)
- a length
Which seems fine. You can imagine there’s a pathological case where you might have a substring representing one byte of a 2GB monster you read in from a file and you can’t free the space up because of your reference to the one byte substring. I suspect that if that is really a problem then maybe we can have the GC do some re-imagining under the hood next time round.
I then thought that, partly for consistency and partly for any weird cases where we didn’t have a NUL-terminated string to start with, real strings should have a length parameter as well.
If that meant we stored an extra byte (with a NUL in it) then so be it (we’ve just casually added 4 or 8 bytes for a size_t for the length so an extra byte for a NUL is but a sliver of a nothingness) and it’s quite handy for printing the value out.
That all worked a peach.
Current¶
Along comes Unicode (primarily driven by the need to port some regular expression handling as I hadn’t mustered the enthusiasm to re-write strings beforehand).
The moral equivalent of the 8-bit characters in C strings are Unicode’s code points.
However, we don’t actually want an array of code points because that’s a bit dumb – even we can spot that. Code points are stored in an IDIO “pointer” and so consume 4 or 8 bytes each which is way more than most strings require. We store code points in an IDIO pointer because they can then be handled like any other Idio type, not because it is efficient.
Looking at most of the text that I type, I struggle to use all the ASCII characters, let alone any of the exotic delights from cultures further a-field. I’m going to throw this out there that most of the text that you type, dear reader, fits in the Unicode Basic Multilingual Plane and is therefore encodable in 2 bytes.
I apologise to my Chinese, Japanese and Korean friends who throw 4 byte code points around with abandon. At least you’re covered and not forgotten.
That blind use of an array of code points will chew up space viciously – even the four byte code points with 64-bit IDIO “pointers”. What we should be doing, as we read the string in, is to perform some analysis as to which is the largest (widest?) code point in the string and then construct an array where all the elements are that wide. We already support the notion of a length so there’s no need for trailing NULs – which don’t make any sense in the context of 4 byte wide “characters” anyway.
str = "hello"
should only require five bytes of storage as it is only using ASCII characters.
str = "ħello"
where the first character is U+0127 (LATIN SMALL LETTER H WITH STROKE), will now be ten bytes long as the first code point requires two bytes and therefore so will all the rest, even though they are the same ASCII characters as before. Dems da breaks.
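A minimal sketch of that widest-code-point analysis, assuming the code points have already been decoded from the UTF-8 input (the function name is mine, not the Idio source’s):

#include <stddef.h>
#include <stdint.h>

/* return the per-element width, in bytes, needed to store all of the
 * code points in cps[]: Latin-1 fits in 1 byte, the BMP in 2, else 4 */
size_t string_width (const uint32_t *cps, size_t n)
{
    size_t width = 1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (cps[i] >= 0x10000) {
            return 4;           /* can't get any wider */
        } else if (cps[i] >= 0x100) {
            width = 2;          /* beyond Latin-1 */
        }
    }

    return width;
}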
By and large, though, I sense that most strings are going to be internally very consistent and be:
- ASCII/Latin-1 and therefore 1 byte code points only
- mostly BMP (2 byte) and some 1 byte code points
- using 4 byte code points regularly
If we join two strings together we can upgrade/widen the one or the other as required.
The only real problem is that anyone wanting to modify an element in a string array might get caught out by trying to stuff a 4 byte code point into a one byte string.
Feeling rather pleased with my thinking I then discovered that Python had already encapsulated this idea in PEP393 and I can’t believe others haven’t done something similar.
I felt good for a bit, anyway.
So that’s the deal. Strings are arrays of elements with widths of 1, 2 or 4 bytes. The string has a length. We can have substrings of it.
I’ve no particular fix for the string modification issue. In principle it requires reworking the string under the feet of the caller but we now have to ensure that all the associated substrings are kept in sync.
A rotten workaround would be to prefix any string with a 4 byte letter then only use indexes 1 and beyond.
A better workaround would be to allow the forcible creation of, say, 4 byte strings rather than using analysis of the constituent code points.
Implementation¶
We can encode the string array’s element width in type-specific flags and then we need an array length and a pointer to the allocated memory for it.
#define IDIO_STRING_FLAG_NONE 0
#define IDIO_STRING_FLAG_1BYTE (1<<0)
#define IDIO_STRING_FLAG_2BYTE (1<<1)
#define IDIO_STRING_FLAG_4BYTE (1<<2)
struct idio_string_s {
size_t len; /* code points */
char *s;
};
#define IDIO_STRING_LEN(S) ((S)->u.string.len)
#define IDIO_STRING_S(S) ((S)->u.string.s)
#define IDIO_STRING_FLAGS(S) ((S)->tflags)
s is a char * (and called s) for the obvious historic reasons although s is never used to access the elements of the array directly. We will figure out the element width from the flags and then use a uint8_t *s8, uint16_t *s16 or uint32_t *s32 cast from s as appropriate. The array elements are then casually accessed with s8[i], s16[i] or s32[i]. Easy.
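A sketch of that flag-driven access, reusing the flags defined above (illustrative; the real accessors live in the Idio source):

#include <stddef.h>
#include <stdint.h>

/* fetch code point i from the width-specific array behind s; flags
 * carries the IDIO_STRING_FLAG_* values defined above */
uint32_t string_ref (char *s, int flags, size_t i)
{
    if (flags & IDIO_STRING_FLAG_1BYTE) {
        uint8_t *s8 = (uint8_t *) s;
        return s8[i];
    } else if (flags & IDIO_STRING_FLAG_2BYTE) {
        uint16_t *s16 = (uint16_t *) s;
        return s16[i];
    } else {
        uint32_t *s32 = (uint32_t *) s;
        return s32[i];
    }
}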
There’s something very similar for a substring which requires:
- the reference back to the parent string
- s becomes a pointer directly into the parent string for the start of the substring – but is otherwise cast to something else in the same manner as above
- len is the substring’s length
A substring can figure out the width of elements from its parent.
A substring of a substring is flattened to just being another substring of the original parent.
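Pulling those requirements together, a substring structure presumably looks something like this – a guess at the shape, not the actual definition:

struct idio_substring_s {
    IDIO parent;   /* keeps the parent string alive for the GC */
    size_t len;    /* code points, as for a string */
    char *s;       /* points into the parent's storage and is cast to
                      s8/s16/s32 per the parent's width flags */
};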
*
Amongst other possible reworks, I notice many other implementations allocate the equivalent of the IDIO object and the memory required for the string storage in one block.
It would save the two pointers used by malloc(3) for its accounting and the extra calls to malloc(3)/free(3).
*
I did have an idea for a “short” string. The nominal IDIO union is three pointers worth – to accommodate a pair, the most common type – could we re-work that as a container for short strings? Three pointers worth is 12 or 24 bytes. If we used an unsigned char for the length then we could handle strings up to 11 or 23 bytes.
I think you’d need to do some “field analysis” to see if such short strings occur often enough to make it worth the effort.
Encoding¶
One thing hidden from the above discourse is the thorny matter of encoding. All mechanisms to move data between entities need to agree on a protocol to encode the data over the transport medium.
Unicode used to use UCS-2 and UCS-4 which have been deprecated in favour of UTF-16 and UTF-32 which are two and four byte encodings. I get the impression they are popular in the Windows world and they might appear as the “wide character” interfaces in Unix-land, see fgetwc(3), for example.
However, almost everything I see is geared up for UTF-8 so we’ll not buck any trends.
Therefore, Idio expects its inputs to be encoded in UTF-8 and it will generate UTF-8 on output.
To read in UTF-8 we use Bjoern Hoehrmann’s Flexible and Economical UTF-8 Decoder [Hoe08], a DFA-based decoder.
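The decoder is table-driven; written out longhand, the checks it performs amount to something like the following – an illustrative decoder of my own, not Hoehrmann’s DFA – including the U+FFFD replacement behaviour described under Reading below:

#include <stddef.h>
#include <stdint.h>

/* decode one UTF-8 sequence from s (n >= 1 bytes available), writing
 * the code point to *cp, U+FFFD on error, and returning the number of
 * bytes consumed */
size_t utf8_decode (const uint8_t *s, size_t n, uint32_t *cp)
{
    uint32_t c = s[0];
    uint32_t min;
    size_t len, i;

    if (c < 0x80) {
        *cp = c;
        return 1;
    } else if (c < 0xC2) {          /* stray continuation or overlong lead */
        *cp = 0xFFFD;
        return 1;
    } else if (c < 0xE0) {
        len = 2; c &= 0x1F; min = 0x80;
    } else if (c < 0xF0) {
        len = 3; c &= 0x0F; min = 0x800;
    } else if (c < 0xF5) {
        len = 4; c &= 0x07; min = 0x10000;
    } else {                        /* leads beyond U+10FFFF */
        *cp = 0xFFFD;
        return 1;
    }

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            *cp = 0xFFFD;
            return i;               /* resume with the offending byte */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* overlong forms, UTF-16 surrogates and > U+10FFFF are invalid */
    if (c < min || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) {
        c = 0xFFFD;
    }

    *cp = c;
    return len;
}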
Categories and Properties¶
Whilst looking for a JSON5 library I stumbled across Simon Schoenenberger’s Unicode character lookup table work which I have re-imagined as Unicode Summary Information.
Reading¶
The input form for a string is quite straightforward: "...", that is a U+0022 (QUOTATION MARK) delimited value.

The reader is, in one sense, quite naive and is strictly looking for a non-escaped closing " to terminate the string, see idio_read_string() in src/read.c.
Subsequently the collected bytes are assumed to be part of a valid UTF-8 sequence. If the byte sequence is invalid UTF-8 you will get the (standard) U+FFFD (REPLACEMENT CHARACTER) and the decoding will resume with the next byte. This may result in several replacement characters being generated.
There are a couple of notes:
\, U+005C (REVERSE SOLIDUS – backslash) is the escape character. The obvious character to escape is " itself, allowing you to embed a double-quote symbol in a double-quoted string: "hello\"world". In the spirit of C escape sequences Idio also allows:
sequence  (hex) ASCII  description
\a        07           alert / bell
\b        08           backspace
\e        1B           escape character
\f        0C           form feed
\n        0A           newline
\r        0D           carriage return
\t        09           horizontal tab
\v        0B           vertical tab
\\        5C           backslash
\x...                  up to 2 hex digits representing any byte
\u...                  up to 4 hex digits representing a Unicode code point
\U...                  up to 8 hex digits representing a Unicode code point
Any other escaped character results in that character.
For \x, \u and \U the code will stop consuming code points if it sees one of the usual delimiters or a code point that is not a hex digit: "\Ua9 2021" silently stops at the SPACE character giving "© 2021" and, correspondingly, "\u00a92021" gives "©2021" as a maximum of 4 hex digits are consumed by \u.

\x is unrestricted (other than between 0x0 and 0xff) and \u and \U will have the hex digits converted into UTF-8. Adding \x bytes into a string is an exercise in due diligence.

Idio allows multi-line strings:
str1 := "Hello World" str2 := "Hello\nWorld"
The string constructors for str1 and str2 are equivalent.
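The up-to-N hex digits rule for \x, \u and \U amounts to something like this (a sketch with names of my own; max would be 2, 4 or 8 respectively):

#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* consume at most max hex digits from s, stopping early at the first
 * non-hex character; *used reports how many digits were eaten */
uint32_t read_hex_escape (const char *s, size_t max, size_t *used)
{
    uint32_t v = 0;
    size_t i;

    for (i = 0; i < max && isxdigit ((unsigned char) s[i]); i++) {
        char c = s[i];
        int d;

        if (isdigit ((unsigned char) c)) {
            d = c - '0';
        } else {
            d = tolower ((unsigned char) c) - 'a' + 10;
        }

        v = (v << 4) | d;
    }

    *used = i;
    return v;
}

So "\Ua9 2021" consumes the a and the 9, stops at the SPACE and yields 0xA9.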
Pathnames¶
The reason the \x escape exists is to allow more convenient creation of “awkward” pathnames. As noted previously, *nix pathnames do not have any encoding associated with them.
Now, that’s not to say you can’t use an encoding and, indeed, we are probably encouraged to use UTF-8 as an encoding for pathnames. However, the problem with the filesystem having no encoding is that both you and I have to agree on what the encoding we have used is. You say potato, I say solanum tuberosum.
In general Idio uses the (UTF-8) encoded strings in the source code as pathnames and, as the Idio string to C string converter uses a UTF-8 generator, by and large, every pathname you create will be UTF-8 encoded.
However, any pathname already in the filesystem is of an unknown encoding of which the only thing we know is that it won’t contain an ASCII NUL and it won’t have U+002F (SOLIDUS – forward slash) in a directory entry. It’s a C string.
Of course we “get away with” an implicit UTF-8 encoding because the vast majority of *nix filenames are regular ASCII ones. Only those users outside of North America and islands off the coast of North Western Europe have suffered.
So what we really need is to handle such pathnames correctly which means not interpreting them.
Technically, then, the Idio string "hello" and the filename hello are different. Which is going to be a bit annoying from time to time.
Note
Of interest, it is possible to manipulate strings such that you have a 1-byte width encoding for "hello" and a 2 or 4-byte encoding for "hello". However, those strings will be considered equal? because they have the same length and the same code point at each of their indices.
As it so happens, for general file access and creation, the Idio strings, converted into UTF-8 encoded C strings, will do the right thing. However, if you consume filenames from the filesystem they will be treated as pathnames and will not be equal? to the nominally equivalent Idio string.
So we need to be able to create *nix pathnames ourselves for these sorts of comparisons and we have a formatted string style to do that: %P"..." (or %P(...) or %P{...} or %P[...]) where the ... is a regular string as above.
That’s where the \x escape for strings comes into its own. If we know that a filename starts with ISO8859-1’s 0xA9 (the same “character” as ©, U+00A9 (COPYRIGHT SIGN)), as in a literal byte, 0xA9, and not the UTF-8 sequence 0xC2 0xA9, then we can create such a string: %P"\xa9...".
The %P formatted string is fine if we hand-craft our pathname for comparison but if we already have an Idio string in our hands we need a converter, string->pathname, which will return a pathname from the UTF-8 encoding of your string. Which sounds slightly pointless but gets us round the problem of matching against pathnames in the filesystem which have no encoding.
Idio> string->pathname "hello"
%P"hello"
Notice the leading %P indicating it is a pathname.
Pathname Expansion¶
As if pathnames as an unencoded string aren’t complicated enough we want wildcards!
Bash has a reasonably pragmatic approach to wildcards. From bash(1), Pathname Expansion:
After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, and is not quoted, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of filenames matching the pattern
Unfortunately, our free hand with the code points allowed in symbols means that * and ? are not just possible but entirely probable and, certainly, to be expected. That makes wildcards a bit tricky.

Hmm. I’ve noted before that Murex constrains globbing to @{g *.c} and extends the mechanism to regexps, @{rx \.c$}, and file modes, @{f +d}.
Although I think we want a mechanism to do globbing which is distinct from any sorting and filtering you might perform on the list.
We’ve mentioned Pathname Templates before although they are a little bit magical.
Obviously, %P... and #P... are ripe for confusion. I try to think of the % suggesting a printf(3) format whereas the # is suggesting the construction of a weird thing.

I want #P{...} (or #P(...) or #P[...] or #P"...") to create a pathname template but not exercise it yet. This is akin to creating a regular expression in advance, the regcomp before the regexec of POSIX regexs.
Only when it is used do we glob(3) the expression. This does lead to a little confusion:
Idio> p := #P"x.*"
#<SI ~path pattern:%P"x.*">
Idio> p
(%P"x.idio")
Idio> printf "%s\n" p
(%P"x.idio")
#<unspec>
which feels OK but
Idio> printf "%s\n" #P"x.*"
#<SI ~path pattern:%P"x.*">
#<unspec>
Idio> ls #P"x.*"
x.idio
#t
Feels wrong. Why does the expansion occur for ls and not for printf? Technically, for both, the arguments are constructed as you would expect giving both of them the struct instance of a ~path (mnemonically, a dynamic path). However, because we need to convert all of the arguments to strings for execve(2), the use of a ~path struct instance is expanded into a list of pathnames.
It’s not great.
In the meanwhile, we can do the right sorts of things creating and matching files with non-UTF-8 characters in them:
Idio> close-handle (open-output-file %P"\xa9 2021")
#unspec
Idio> p := #P"\xa9*"
#<SI ~path pattern:%P"\xA9*">
Idio> p
(%P"\xA9 2021")
Idio> ls -1f
...
''$'\251'' 2021'
...
I see, here, that ls is using a Bash-style $'...' quoting such that ''$'\251'' 2021' is the concatenation of '', $'\251' and ' 2021' with \251 being the octal for 0xA9.
For Idio, specifically when printing strings in a “use escapes” context, here, at the REPL, a pathname string will have non-isprint(3) characters converted to a \x form, hence, %P"\xA9 2021". Similarly, a pathname with a newline in it would be, say, %P"hello\nworld".
When not printing strings in a “use escapes” context, notably when preparing arguments for execve(2), we just get the raw *nix pathname characters meaning something like ls won’t get in a tizzy:
Idio> close-handle (open-output-file %P"\xa9 2021")
#<unspec>
Idio> ls %P"\xa9 2021"
''$'\251'' 2021'
#t
compare with the variations for when 0xA9 is an invalid UTF-8 encoding in a regular string and when the Unicode code point U+00A9 is used:
Idio> ls "\xa9 2021"
/usr/bin/ls: cannot access ''$'\357\277\275'' 2021': No such file or directory
#f
job 327830: (ls "� 2021"): completed: (exit 2)
Idio> ls "\ua9 2021"
/usr/bin/ls: cannot access ''$'\302\251'' 2021': No such file or directory
#f
job 327834: (ls "© 2021"): completed: (exit 2)
The first is completely garbled and you can see the copyright sign in the notification about the command failure in the second with ls complaining that it can’t access something beginning with (*quickly translates*) 0xC2 0xA9, the UTF-8 encoding of 0xA9.
*
That said, regex and, by extension, pattern matching are not affected by this distinction as they aren’t concerned about the equality of the strings and/or pathnames so much as whether they conform to a (regular expression) pattern:
pt := #P{\xA9*}
map (function (p) {
printf "%s: " p
(pattern-case p
("*2021" {
printf "2021!\n"
})
(else {
printf "nope!\n"
}))
}) pt
Interpolated Strings¶
Sometimes we want to embed references to variables, usually, or expressions in a string and have something figure out the current value of the variable or result of the expression and replace the reference with that result, something along the lines of:
name := "Bob"
"Hi, ${name}!"
I want double-quoted strings, the usual "...", to remain as fixed entities, much like single-quoted strings in, say, Bash are.

That means we need another format, another # format! Let’s go for #S{ ... } and everything between the matching braces is our putative string.

The ${name}-style format seems good to me, the only issue being whether we want to change $ for another sigil. In the usual way, we’d pass that in between the S and {:
#S{Hi, ${name}!}
#S^{Hi, ^{name}!}
The second interpolation character is the escape character, defaulting to \, as usual.
You can use the usual matching bracketing characters, {}, [] and (), to delimit the main block:

#S[${name} is ${string-length name} letters long.]

but only braces, {}, for the references.
There is a subtlety here as the results of the expressions are not necessarily themselves strings. The ${string-length name} expression, for example, will result in an integer and the implied append-string constructing the result of the interpolated string will get upset.

So, in practice, all of the elements of the putative string, the non-expression strings and the results of the expressions, are map-ed against ->string which leaves strings alone and runs the “display” variant of the printer for the results of the expressions. ->string does not perform any splicing so if your expression returns a list then you’ll get a list in your string.
Writing¶
Idio strings will be UTF-8 encoded on output, see idio_utf8_string() in src/unicode.c for the details.
There are a couple of qualifications to that:
- We can ask for the reader’s C escape sequences to be reproduced in their \X format, eg. \a for alert / bell.
- We can ask for the printed string to be quoted with double quotes.

This latter option is a consequence of how we visualise printed entities. The REPL will print values in a reader-ready format, so including leading and trailing "s.

Idio> str := "Hello\nWorld"
"Hello\nWorld"
Idio> str := "Hello
World"
"Hello\nWorld"
By and large, though, most things will display a string value as part of a larger output:
Idio> printf "'%s'\n" str
'Hello
World'

(Actually, a trailing #<unspec> will also be printed which is the value that printf returned.)
Operations¶
Characters¶
- function unicode? value¶
is value a Unicode code point
- function unicode->plane cp¶
return the Unicode plane of code point cp
The result is a fixnum.
- function unicode->plane-cp cp¶
return the lower 16-bits of the code point cp
- function unicode->integer cp¶
convert code point cp to a fixnum
- function unicode=? cp1 cp2 [...]¶
compare code points for equality
A minimum of two code points are required.
Strings¶
- function string? value¶
is value a string (or substring)
- function make-string size [fill]¶
create a string of length size filled with fill characters or U+0020 (SPACE)
- function string->list string¶
return a list of the Unicode code points in string
See also list->string.
- function string->symbol string¶
return a symbol constructed from the UTF-8 Unicode code points in string
See also symbol->string.
- function append-string [string ...]¶
return a string constructed by appending the string arguments together
If no strings are supplied the result is a zero-length string, cf. "".
- function concatenate-string list¶
return a string constructed by appending the strings in list together
If no strings are supplied the result is a zero-length string, cf. "".
- function copy-string string¶
return a copy of string string
- function string-length string¶
return the length of string string
- function string-ref string index¶
return the Unicode code point at index index of string string
Indexes start at zero.
- function string-set! string index cp¶
set the Unicode code point at index index of string string to be the Unicode code point cp
Indexes start at zero.
If the number of bytes required to store cp is greater than the per-code point width of string, a ^string-error condition will be raised.
- function string-fill! string fill¶
set all indexes of string string to be the Unicode code point fill
If the number of bytes required to store fill is greater than the per-code point width of string, a ^string-error condition will be raised.
- function substring string pos-first pos-next¶
return a substring of string string starting at index pos-first and ending before pos-next.
Indexes start at zero.
If pos-first and pos-next are inconsistent, a ^string-error condition will be raised.
- function string<=? s1 s2 [...]¶
- function string<? s1 s2 [...]¶
- function string=? s1 s2 [...]¶
- function string>=? s1 s2 [...]¶
- function string>? s1 s2 [...]¶
Warning
Historic code for ASCII/Latin-1 Scheme strings badgered into working at short notice.
These need to be replaced with something more Unicode-aware.
Perform strncmp(3) comparisons of the UTF-8 representations of the string arguments
- function string-ci<=? s1 s2 [...]¶
- function string-ci<? s1 s2 [...]¶
- function string-ci=? s1 s2 [...]¶
- function string-ci>=? s1 s2 [...]¶
- function string-ci>? s1 s2 [...]¶
Warning
Historic code for ASCII/Latin-1 Scheme strings badgered into working at short notice.
These need to be replaced with something more Unicode-aware.
Perform strncasecmp(3) comparisons of the UTF-8 representations of the string arguments
- function split-string string [delim]¶
Split string string into a list of strings delimited by the code points in the string delim which itself defaults to IFS.

split-string acts like the shell’s or awk’s word-splitting by IFS in that multiple adjacent instances of delimiter characters only provoke one “split.”
- function split-string-exactly string [delim]¶
Split string string into a list of strings delimited by the code points in the string delim which itself defaults to IFS.

split-string-exactly is meant to act more like a regular expression matching system.

It was originally required to split the contents of the Unicode Character Database file utils/Unicode/UnicodeData.txt – which has multiple ;-separated fields, often with no value in a field – to help generate the code base for regular expression handling.
- function fields string¶
A variation on split-string with a view to more awk-like line splitting functionality, fields splits string string into an array of strings delimited by the code points in IFS with the first element of the array being the original string.

Indexing of the array gives you awk-like field numbers as in awk’s $0, $1, …, here, array index 0 for the original string, index 1 for the first IFS delimited field, 2 for the second IFS delimited field etc..

As a function taking a single argument it can be used with the value-index operator, .:

Idio> fs := "hello world".fields
#[ "hello world" "hello" "world" ]
Idio> fs.0
"hello world"
Idio> fs.1
"hello"
- function join-string delim list¶
construct a string from the strings in list with the string delim placed in between each pair of strings
list is a, uh, list here, unlike, say, append-string’s individual arguments. It follows the Scheme form (albeit with arguments shifted about) which takes another parameter indicating the style in which the delimiter should be applied, such as: before or after every argument, infix (the default) and a strict infix for complaining about no arguments.
- function strip-string str discard [ends]¶
return a string where the characters in discard have been removed from the ends of str
ends can be one of 'left, 'right, 'both or 'none and defaults to 'right.