String Type

Strings are arrays of Unicode code points efficiently packed into variable-width arrays.

Substrings are references into sections of Idio strings but are otherwise handled the same.

Pathnames are a subset of strings where the elements of the string are not treated as UTF-8. Any file name value returned from the operating system will be a pathname.

Consequently, you cannot directly compare a file name from the file system to a string from your source code. See string->pathname for a conversion function. There is no reverse function (pathname to string) because a file name carries no encoding: it is just a sequence of bytes.
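As a rough sketch of the conversion described above, the following turns a string from source code into a pathname so that it is in the same variant as names coming from the file system. The file name hello.txt is a hypothetical value, and printing the predicate results with printf and %s is an assumption about a convenient way to display them.

```
; "hello.txt" is a hypothetical file name used only for illustration
name := "hello.txt"

; convert the Unicode string into a pathname (its UTF-8 encoding as bytes)
pname := string->pathname name

printf "%s\n" (string? name)       ; name is a regular string
printf "%s\n" (pathname? pname)    ; pname is a pathname
```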

Reader Form

The input form for a string is the usual "...", that is a U+0022 (QUOTATION MARK) delimited value.

The collected bytes are assumed to be part of a valid UTF-8 sequence. If the byte sequence is invalid UTF-8 you will get the (standard) �, U+FFFD (REPLACEMENT CHARACTER) and the decoding will resume with the next byte. This may result in several replacement characters being generated.

There are a couple of notes:

\, U+005C (REVERSE SOLIDUS – backslash) is the escape character. The obvious character to escape is " itself allowing you to embed a double-quote symbol in a double-quoted string: "hello\"world".

In the spirit of C escape sequences Idio also allows:

Supported escape sequences in strings

| sequence | (hex) ASCII | description |
|----------|-------------|-------------|
| `\a` | 07 | alert / bell |
| `\b` | 08 | backspace |
| `\e` | 1B | escape character |
| `\f` | 0C | form feed |
| `\n` | 0A | newline |
| `\r` | 0D | carriage return |
| `\t` | 09 | horizontal tab |
| `\v` | 0B | vertical tab |
| `\\` | 5C | backslash |
| `\x...` | | up to 2 hex digits representing any byte |
| `\u...` | | up to 4 hex digits representing a Unicode code point |
| `\U...` | | up to 8 hex digits representing a Unicode code point |

Any other escaped character results in that character.

For \x, \u and \U the code will stop consuming code points if it sees one of the usual delimiters or a code point that is not a hex digit: "\Ua9 2021" silently stops at the SPACE character giving "© 2021" and, correspondingly, "\u00a92021" gives "©2021" as a maximum of 4 hex digits are consumed by \u.

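\x accepts any byte value from 0x00 to 0xff without further checks, whereas the hex digits given to \u and \U are converted into UTF-8.

Adding \x bytes into a string is an exercise in due diligence: it is up to you to ensure that the resulting byte sequence is valid UTF-8, otherwise the decoding described above will substitute replacement characters.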
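The two examples from the text can be tried directly. This is a minimal sketch; the variable names s1 and s2 are arbitrary and the use of printf to display the values is an assumption.

```
; \U consumes up to 8 hex digits but stops at the SPACE
s1 := "\Ua9 2021"

; \u consumes at most 4 hex digits, so 2021 is left as ordinary text
s2 := "\u00a92021"

printf "%s\n" s1    ; per the description above: © 2021
printf "%s\n" s2    ; per the description above: ©2021
```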

Idio allows multi-line strings:
```
str1 := "Hello
World"

str2 := "Hello\nWorld"
```
The string constructors for str1 and str2 are equivalent.

Pathnames

The reader form for a pathname is %P"..." (or matching brackets, %P(...) or %P{...} or %P[...] or, in general, %Pc...c) where the ... is a regular string as above.

That’s where the \xHH escape for strings comes into its own. If we know that a filename starts with ISO8859-1’s 0xA9 (the same “character” as ©, U+00A9 (COPYRIGHT SIGN)), as in a literal byte, 0xA9, and not the UTF-8 sequence 0xC2 0xA9, then we can create such a string: %P"\xa9...".

Pathnames, or strings being used as pathnames, that contain an ASCII NUL (\x00) will result in a format error when an attempt is made to use them. ASCII NUL is a perfectly valid code point in an Idio string but it cannot appear in a C string passed to the operating system’s API.
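To make the byte-versus-encoding distinction concrete, here is a small sketch contrasting a pathname built from a literal 0xA9 byte with one built from the code point U+00A9. The file name fragment -notice.txt is hypothetical, and the variable names are arbitrary.

```
; a pathname whose first byte is the literal byte 0xA9 (ISO8859-1 ©)
raw := %P"\xa9-notice.txt"

; a pathname built from the UTF-8 encoding of U+00A9,
; that is, the bytes 0xC2 0xA9 followed by the rest of the name
enc := string->pathname "\u00a9-notice.txt"

; both are pathnames but they describe different byte sequences
printf "%s %s\n" (pathname? raw) (pathname? enc)
```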

Octet Strings

The reader form for an octet string is %B"..." (or matching brackets, %B(...) or %B{...} or %B[...] or, in general, %Bc...c) where the ... is a regular string as above.

Note

The name, byte string, seems too overloaded, but the nominal reader form, %O, is too easily confused with a putative %0. So we have a mixed result: the name, octet string, with a reader form derived from byte string.

Mixing Strings

You can append-string strings together and join-string strings with a delimiter but be careful as mixing string variants will result in a gracefully degraded result: unicode to pathname to octet-string.
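A short sketch of the degradation described above, assuming that appending a regular string and a pathname yields a pathname. The directory and file names are hypothetical.

```
dir := "reports/"               ; regular (Unicode) string
leaf := %P"\xa9-2021.txt"       ; pathname containing a raw 0xA9 byte

; mixing the variants degrades the result: unicode to pathname
full := append-string dir leaf

; expected to report a pathname, per the degradation order above
printf "%s\n" (pathname? full)
```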

Interpolated Strings

From time to time it is convenient to want to expand references to variables inside a string. There is a special reader form for such interpolated strings:

#S{...${expr}...}

Here, everything between the outermost matching { and } is scanned for instances of the interpolation sigil, $. A matching set of { and } is read in and the expression therein is evaluated, the result being converted to a string (if required) and replacing the interpolated expression. The rest of the string is added in a similar way.

If you want to embed an actual interpolation sigil, $, you can escape it with the default escape character \:

#S{Your \$PATH will be '${(frob-path)}'!}

Whatever the call to frob-path returns will be converted to a string (if necessary) giving a string equivalent to:

"Your $PATH will be '...'!"

In this particular case, there’s little advantage over using sprintf etc. but in code generation it is much more convenient to see (pre-)constructed variable references in situ in the expected output.
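A minimal sketch of the default form with a plain variable reference instead of a function call. The variable name and the greeting text are made up for illustration.

```
name := "world"

; ${name} is evaluated and its value replaces the expression
greeting := #S{Hello ${name}!}

; should be equivalent to the string "Hello world!"
printf "%s\n" greeting
```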

There are two options you can pass, between the #S and opening brace: an alternative interpolation sigil and an alternative escape character.

In effect, normal behaviour is:

#S$\{...}

If you only want to change the escape character, use . in the interpolation sigil position, which implies that . itself can never be the interpolation sigil.
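The following sketch shows both options, assuming they are written in the order sigil then escape character, as the default form #S$\{...} suggests. The choice of % as the alternative sigil and ! as the alternative escape character is arbitrary, and the variable price is hypothetical.

```
price := 42

; % replaces $ as the interpolation sigil, so a literal $ needs no escaping
msg1 := #S%{Cost: $%{price}}

; . keeps the default sigil, ! becomes the escape character
msg2 := #S.!{Your !$PATH costs ${price}}

printf "%s\n%s\n" msg1 msg2
```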

If the use of braces, { and }, means you would need to escape braces within the interpolated string a lot, you can use parentheses or brackets as the delimiting pair:

; generate some C code
printf #S[
if ($condition) {
    doit(${c-name arg1}, ${c-name arg2});
}
]

although note that you can only use braces for the expression delimiters.

String Predicates

function string? o

test if o is a string

Param o:: object to test
Return:: #t if o is a string, #f otherwise

function pathname? o

test if o is a pathname

Param o:: object to test
Return:: #t if o is a pathname, #f otherwise

Note

type->string will report a pathname as a string.

function octet-string? o

test if o is an octet string

Param o:: object to test
Return:: #t if o is an octet string, #f otherwise

Note

type->string will report an octet-string as a string.
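The three predicates can be exercised against the three reader forms. This is only a sketch; the literal contents are arbitrary and no particular output is claimed for cross-variant tests such as string? applied to a pathname.

```
s := "hello"        ; regular string
p := %P"hello"      ; pathname
b := %B"hello"      ; octet string

printf "%s\n" (string? s)
printf "%s\n" (pathname? p)
printf "%s\n" (octet-string? b)
```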

String Constructors

function make-string size [fillc]

create a string with an initial length of size

Param size:: initial string size
Type size:: integer
Param fillc:: fill character value, defaults to #\{space}
Type fillc:: unicode, optional
Return:: the new string
Rtype:: string
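A brief sketch of both call forms. The sizes and the fill character are arbitrary choices.

```
; five code points, each the default fill character #\{space}
blank := make-string 5

; three code points, each the letter a
aaa := make-string 3 #\a

printf "[%s] [%s]\n" blank aaa
```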

function substring s p0 [pn]

return a substring of s from position p0 through to but excluding position pn

Param s:: string
Type s:: string
Param p0:: position
Type p0:: integer
Param pn:: position, defaults to string length
Type pn:: integer, optional
Return:: the substring
Rtype:: string

If p0 or pn are negative they are considered to be with respect to the end of the string. This can still result in a negative index.

Note

Technically, the return type is a substring but as substrings are indistinct from strings at a user level then a return type of string suffices.

type->string will reveal the difference.
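A small sketch of the positional arguments, using an arbitrary sample string. The negative-index call follows the description above (positions taken with respect to the end of the string) and its result is not asserted here.

```
s := "hello world"

; positions 0 through to, but excluding, 5
head := substring s 0 5

; pn defaults to the string length
tail := substring s 6

; a negative p0 is taken with respect to the end of the string
last := substring s -5

printf "%s\n%s\n%s\n" head tail last
```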

function list->string l

return a string from the list of the Unicode code points in l

Param l:: list of code points
Type l:: list
Return:: string
Rtype:: string
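For illustration, a list of three code points can be turned into a three code point string. The code points chosen are arbitrary.

```
; build a list of code points and convert it to a string
abc := list->string (list #\a #\b #\c)

printf "%s\n" abc
```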

function symbol->string s

convert symbol s into a string

Param s:: symbol to convert
Type s:: symbol
Return:: string
Rtype:: string

function keyword->string kw

convert keyword kw to a string

Param kw:: keyword to convert
Type kw:: keyword
Return:: string

function string->pathname s

return a pathname of the UTF-8 encoding of s

Param s:: string
Type s:: string
Return:: pathname
Rtype:: pathname

function string->octet-string s

return an octet string of the UTF-8 encoding of s

Param s:: string
Type s:: string
Return:: octet string
Rtype:: octet string

function octet-string->string s

return a string from the UTF-8 decoding of s

Param s:: octet string
Type s:: octet string
Return:: string
Rtype:: string

Warning

This is highly likely to generate #U+FFFD REPLACEMENT CHARACTER in the resultant string.

function ->string o

convert o to a string unless it already is a string

Param o:: object to convert
Return:: a string representation of o

->string differs from string in that it won’t stringify a string!

function string o

convert o to a string

Param o:: object to convert
Return:: a string representation of o
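One way to observe the difference between the two functions is to apply both to a value that is already a string and compare the results. This is a sketch only; the sample value is arbitrary and no specific lengths are asserted.

```
s := "abc"

a := ->string s     ; s is already a string so it is not stringified
b := string s       ; s is stringified regardless

; comparing the lengths shows whether the two results differ
printf "%d %d\n" (string-length a) (string-length b)
```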

String Attributes

function string-length s

return the number of code points in s

Param s:: string
Type s:: string
Return:: number of code points
Rtype:: integer

function string-ref s index

return code point at position index in s

positions start at 0

Param s:: string
Type s:: string
Param index:: position
Type index:: integer
Return:: code point
Rtype:: unicode

function string-set! s index c

set position index of s to c

positions start at 0

Param s:: string
Type s:: string
Param index:: position
Type index:: integer
Param c:: code point
Type c:: unicode
Return:: #<unspec>

string-set! will fail if c is wider than the existing storage allocation for s
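The width restriction can be sketched as follows, assuming that a string created from ASCII code points is given storage too narrow for code points outside that range. The failing call is left commented out.

```
; "aaa", created from a narrow (ASCII) code point
s := make-string 3 #\a

; b is the same width as the existing contents, so this should succeed
string-set! s 1 #\b

printf "%s\n" (string-ref s 1)

; a code point wider than the existing storage is expected to fail:
; string-set! s 0 #U+1F600
```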

function string-fill! s fill

set all positions of s to fill

Param s:: string
Type s:: string
Param fill:: code point
Type fill:: unicode
Return:: #<unspec>

string-fill! will fail if fill is wider than the existing storage allocation for s

String Functions

function append-string [args]

append strings

Param args:: strings to append together
Type args:: list, optional
Return:: string ("" if no args supplied)

append-string will gracefully degrade the string variant based on the arguments: unicode > pathname > octet-string

append-string takes multiple arguments each of which is a string.

See also

concatenate-string which takes a single argument which is a list of strings.

function concatenate-string ls

concatenate strings in list ls

Param ls:: list of strings to concatenate together
Type ls:: list, optional
Return:: string ("" if ls is #n)

concatenate-string takes a single argument, which is a list of strings. It is roughly comparable to

apply append-string ls

See also

append-string takes multiple arguments each of which is a string.
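The two calling conventions side by side, as a minimal sketch with arbitrary sample strings:

```
; append-string: each string is a separate argument
a := append-string "hello" " " "world"

; concatenate-string: a single argument which is a list of strings
c := concatenate-string '("hello" " " "world")

; both should produce the same text
printf "%s\n%s\n" a c
```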

function copy-string s

return a copy of s which is not eq? to s

Param s:: string
Type s:: string
Return:: string
Rtype:: string

function join-string delim args

return a string of args interspersed with delim

Param delim:: string
Type delim:: string
Param args:: string(s) to be joined
Type args:: list, optional
Return:: string ("" if args is #n)
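A short sketch, assuming args is passed as a list as the Type field indicates. The delimiter and elements are arbitrary.

```
; join three strings with a comma and a space between them
csv := join-string ", " '("a" "b" "c")

printf "%s\n" csv
```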

function string-index s c

return the index of c in s or #f

Param s:: string
Type s:: string
Param c:: code point
Type c:: unicode
Return:: index or #f
Rtype:: integer or #f

function string-rindex s c

return the rightmost index of c in s or #f

Param s:: string
Type s:: string
Param c:: code point
Type c:: unicode
Return:: index or #f
Rtype:: integer or #f
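A sketch contrasting the two searches on a string in which the code point occurs more than once. The sample string is arbitrary and positions start at 0.

```
s := "hello"

; first l, counting positions from 0: index 2
printf "%s\n" (string-index s #\l)

; rightmost l: index 3
printf "%s\n" (string-rindex s #\l)

; z does not occur in s so the result is #f
printf "%s\n" (string-index s #\z)
```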

function fields in

split string in using characters from IFS into an array whose first element is the original string

Param in:: string to split
Type in:: string
Return:: array (of strings)

Adjacent characters from IFS are considered a single delimiter.

See also

split-string which returns a list

function split-string in [delim]

split string in using characters from delim into a list of strings

Param in:: string to split
Type in:: string
Param delim:: delimiter characters, defaults to IFS
Type delim:: string, optional
Return:: list (of strings)

Adjacent characters from delim are considered a single delimiter.

See also

split-string-exactly which treats delim more fastidiously and fields which returns an array

function split-string-exactly in [delim]

split string in using characters from delim into a list of strings

Param in:: string to split
Type in:: string
Param delim:: delimiter characters, defaults to IFS
Type delim:: string, optional
Return:: list (of strings)

Adjacent characters from delim are considered separate delimiters.

See also

split-string
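The difference between the two splitters shows up when delimiters are adjacent. This is a sketch with an arbitrary colon-separated sample; the comments restate the descriptions above rather than captured output.

```
line := "a::b:c"

; adjacent delimiters are a single delimiter: three fields, a, b and c
printf "%s\n" (split-string line ":")

; adjacent delimiters are separate delimiters: an empty field
; appears between a and b
printf "%s\n" (split-string-exactly line ":")
```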

function string<? s1 s2 [...]

apply the less-than comparator to strings

Param s1:: string
Type s1:: string
Param s2:: string
Type s2:: string
Return:: the result of comparing the arguments
Rtype:: boolean

string<? takes a minimum of two arguments. With more than two, each subsequent argument is compared to the one on its left. The result is #t only if every one of those comparisons holds, otherwise the result is #f.

string<? converts each Idio string to a UTF-8 representation in a C string and then compares them with strncmp(3), using the length of the shorter string.
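A minimal sketch of the two-argument and multi-argument forms, using arbitrary sample strings. Because the comparison is made on the UTF-8 bytes, the ordering is a byte ordering rather than a locale-aware collation.

```
; two arguments: a single comparison
printf "%s\n" (string<? "apple" "banana")

; three arguments: each argument is compared with its left-hand neighbour
printf "%s\n" (string<? "apple" "banana" "cherry")
```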