String Type

Strings are arrays of Unicode code points efficiently packed into variable-width arrays.

Substrings are references into sections of Idio strings but are otherwise handled the same.

Pathnames are a subset of strings where the elements of the string are not treated as UTF-8. Any file name value returned from the operating system will be a pathname.

Consequently, you cannot directly compare a file name from the file system to a string from your source code. See string->pathname for a conversion function. There is no reverse function (pathname to string) because a file name has no encoding: it is just a sequence of bytes.

Reader Form

The input form for a string is the usual "...", that is a U+0022 (QUOTATION MARK) delimited value.

The collected bytes are assumed to be part of a valid UTF-8 sequence. If the byte sequence is invalid UTF-8 you will get the (standard) �, U+FFFD (REPLACEMENT CHARACTER) and the decoding will resume with the next byte. This may result in several replacement characters being generated.
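
As a rough illustration of the replacement behaviour described above, the following Python sketch decodes byte sequences with Python's own UTF-8 decoder in "replace" mode. It is only an analogy: Python is not Idio's reader, decode_lenient is a hypothetical helper name, and the two decoders may emit a different number of replacement characters for some truncated multi-byte sequences.

```python
# Python's UTF-8 decoder in "replace" mode, used here as a stand-in for the
# behaviour described above; it is not Idio's reader.
REPLACEMENT = "\ufffd"  # U+FFFD (REPLACEMENT CHARACTER)


def decode_lenient(raw: bytes) -> str:
    """Decode raw as UTF-8, substituting U+FFFD for invalid bytes."""
    return raw.decode("utf-8", errors="replace")


# Valid UTF-8: 0xC2 0xA9 is the encoding of U+00A9 (COPYRIGHT SIGN).
valid = decode_lenient(b"\xc2\xa92021")

# 0xFF and 0xFE can never appear in UTF-8, so each one is replaced in turn
# and decoding carries on with the byte that follows.
damaged = decode_lenient(b"\xff\xfeabc")
```

Here a two-byte run of invalid input produces two replacement characters, which is the "several replacement characters" case mentioned above.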

There are a couple of notes:

  1. \, U+005C (REVERSE SOLIDUS – backslash) is the escape character. The obvious character to escape is " itself allowing you to embed a double-quote symbol in a double-quoted string: "hello\"world".

    In the spirit of C escape sequences Idio also allows:

    Supported escape sequences in strings

    sequence  (hex) ASCII  description
    \a        07           alert / bell
    \b        08           backspace
    \e        1B           escape character
    \f        0C           form feed
    \n        0A           newline
    \r        0D           carriage return
    \t        09           horizontal tab
    \v        0B           vertical tab
    \\        5C           backslash
    \x...                  up to 2 hex digits representing any byte
    \u...                  up to 4 hex digits representing a Unicode code point
    \U...                  up to 8 hex digits representing a Unicode code point

    Any other escaped character results in that character.

    For \x, \u and \U the code will stop consuming code points if it sees one of the usual delimiters or a code point that is not a hex digit: "\Ua9 2021" silently stops at the SPACE character giving "© 2021" and, correspondingly, "\u00a92021" gives "©2021" as \u consumes a maximum of 4 hex digits.

    \x is unrestricted (other than between 0x0 and 0xff) and \u and \U will have the hex digits converted into UTF-8.

    Adding \x bytes into a string is an exercise in due diligence.

  2. Idio allows multi-line strings:

    str1 := "Hello
    World"
    
    str2 := "Hello\nWorld"
    

    The string constructors for str1 and str2 are equivalent.
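
The escape rules in the notes above can be modelled with a short Python sketch. This is not Idio's reader: expand_escapes is a hypothetical helper that works on the text between the quotes, it handles only the single-character escapes plus \u and \U, and it deliberately leaves out \x because that escape inserts a raw byte rather than a code point. Raising an error when no hex digits follow \u or \U is an assumption of the sketch, not documented behaviour.

```python
import string

# Upper limit on hex digits for the code point escapes in the table above.
MAX_HEX_DIGITS = {"u": 4, "U": 8}

# The single-character escapes from the table above.
SIMPLE_ESCAPES = {
    "a": "\x07", "b": "\x08", "e": "\x1b", "f": "\x0c",
    "n": "\n", "r": "\r", "t": "\t", "v": "\x0b", "\\": "\\",
}


def expand_escapes(body: str) -> str:
    """Expand backslash escapes in the text between the quotes (a model only)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 == len(body):
            out.append(ch)
            i += 1
            continue
        esc = body[i + 1]
        i += 2
        if esc == "x":
            raise NotImplementedError("\\x (raw bytes) is not modelled in this sketch")
        if esc in MAX_HEX_DIGITS:
            start = i
            # Consume hex digits up to the limit, stopping at the first
            # character that is not a hex digit.
            while (i < len(body) and i - start < MAX_HEX_DIGITS[esc]
                   and body[i] in string.hexdigits):
                i += 1
            if i == start:
                # Assumption: refuse to guess when no hex digits follow.
                raise ValueError("no hex digits after \\" + esc)
            out.append(chr(int(body[start:i], 16)))
        else:
            # Any other escaped character results in that character.
            out.append(SIMPLE_ESCAPES.get(esc, esc))
    return "".join(out)
```

With this model, the body \Ua9 2021 expands to "© 2021" because the SPACE ends the hex digits, while \u00a92021 expands to "©2021" because \u takes at most four digits. A body containing a literal newline and a body containing \n expand to the same text, mirroring str1 and str2.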

Pathnames

The reader form for a pathname is %P"..." (or matching brackets, %P(...) or %P{...} or %P[...] or, in general, %Pc...c) where the ... is a regular string as above.

That’s where the \xHH escape for strings comes into its own. If we know that a filename starts with ISO8859-1’s 0xA9 (the same “character” as ©, U+00A9 (COPYRIGHT SIGN)), as in a literal byte, 0xA9, and not the UTF-8 sequence 0xC2 0xA9, then we can create such a string: %P"\xa9...".

Pathnames, or strings being used as pathnames, that contain an ASCII NUL (\x00) will result in a format error when you attempt to use them. NUL is a perfectly valid code point in an Idio string but a C string passed to the operating system’s API cannot contain an ASCII NUL.
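
The difference between the byte 0xA9 and the UTF-8 sequence 0xC2 0xA9 is easy to see in a small Python sketch. Python bytes objects stand in for pathnames here, and same_name and usable_as_c_string are hypothetical helpers. The comparison encodes the text as UTF-8 first, which mirrors the description of string->pathname later in this document.

```python
# U+00A9 (COPYRIGHT SIGN) as text, and the two byte sequences discussed above.
copyright_sign = "\u00a9"
latin1_bytes = copyright_sign.encode("latin-1")  # the single byte 0xA9
utf8_bytes = copyright_sign.encode("utf-8")      # the two bytes 0xC2 0xA9


def same_name(file_name: bytes, text: str) -> bool:
    """Compare an OS file name (bytes) with text by encoding the text as UTF-8."""
    return file_name == text.encode("utf-8")


def usable_as_c_string(name: bytes) -> bool:
    """A C string ends at its first NUL, so a name with an embedded NUL is unusable."""
    return b"\x00" not in name
```

A file name stored on disk as the ISO8859-1 byte 0xA9 followed by ".txt" does not match the text "©.txt", whereas the UTF-8 encoded name does.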

Octet Strings

The reader form for an octet string is %B"..." (or matching brackets, %B(...) or %B{...} or %B[...] or, in general, %Bc...c) where the ... is a regular string as above.

Note

The name, byte string, seems too overloaded but the nominal reader form, %O is too easily confused with a putative %0. So we have a mixed result, the name, octet string, with a reader form derived from byte string.

Mixing Strings

You can append-string strings together and join-string strings with a delimiter but be careful as mixing string variants will result in a gracefully degraded result: unicode to pathname to octet-string.
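
One way to picture the degradation rule is as an ordering over the three variants, with the result taking the most degraded variant among the inputs. The Python sketch below is a model of that reading only. Variant and result_variant are hypothetical names, and treating an empty argument list as an ordinary string is an assumption.

```python
from enum import IntEnum


class Variant(IntEnum):
    # Ordered from most degraded to least degraded, following the text above.
    OCTET_STRING = 0
    PATHNAME = 1
    UNICODE = 2


def result_variant(*variants: Variant) -> Variant:
    """Variant of the combined string: the most degraded variant among the inputs."""
    # Assumption: with no arguments the result is an ordinary (unicode) string.
    return min(variants, default=Variant.UNICODE)
```

Under this model, mixing a unicode string with a pathname gives a pathname, and adding an octet string to either gives an octet string.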

Interpolated Strings

From time to time it is convenient to want to expand references to variables inside a string. There is a special reader form for such interpolated strings:

#S{...${expr}...}

Here, everything between the outermost matching { and } is scanned for instances of the interpolation sigil, $. A matching set of { and } is read in and the expression therein is evaluated, the result being converted to a string (if required) and replacing the interpolated expression. The rest of the string is added in a similar way.

If you want to embed an actual interpolation sigil, $, you can escape it with the default escape character \:

#S{Your \$PATH will be '${(frob-path)}'!}

Whatever the call to frob-path returns will be converted to a string (if necessary) giving a string equivalent to:

"Your $PATH will be '...'!"

In this particular case, there’s little advantage over using sprintf etc. but in code generation it is much more convenient to see (pre-)constructed variable references in situ in the expected output.

There are two options you can pass, between the #S and opening brace: an alternative interpolation sigil and an alternative escape character.

In effect, normal behaviour is:

#S$\{...}

If you only want to change the escape character, use . in place of the interpolation sigil, which implies that . itself cannot be used as the interpolation sigil.

If the use of braces, { and }, means you would need to escape braces within the interpolated string a lot you can use parentheses or brackets as the delimiting pair:

; generate some C code
printf #S[
if ($condition) {
    doit(${c-name arg1}, ${c-name arg2});
}
]

although note that you can only use braces for the expression delimiters.
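
The scanning process described in this section can be sketched in Python. This is a simplified model, not Idio's reader: interpolate and lookup are hypothetical names, each ${...} body is passed to a lookup function instead of being evaluated as an expression, and nested braces inside an expression are not handled.

```python
def interpolate(template, lookup, sigil="$", escape="\\"):
    """Replace each sigil{name} in template with str(lookup(name)) (a model only)."""
    out = []
    i = 0
    n = len(template)
    while i < n:
        ch = template[i]
        if ch == escape and i + 1 < n:
            # An escaped character is copied through verbatim.
            out.append(template[i + 1])
            i += 2
        elif ch == sigil and i + 1 < n and template[i + 1] == "{":
            # Only braces delimit the expression, whatever the outer delimiters.
            end = template.index("}", i + 2)
            out.append(str(lookup(template[i + 2:end])))
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# A stand-in environment; in Idio the expression would be evaluated instead.
values = {"frob-path": "/usr/bin", "name": "Idio"}
message = interpolate(r"Your \$PATH will be '${frob-path}'!", values.get)
alternate = interpolate("cost: $5, name: @{name}", values.get, sigil="@")
```

The second call shows the effect of an alternative interpolation sigil: with @ as the sigil, a plain $ needs no escaping.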

String Predicates

function string? o

test if o is a string

Param o:

object to test

Return:

#t if o is a string, #f otherwise

function pathname? o

test if o is a pathname

Param o:

object to test

Return:

#t if o is a pathname, #f otherwise

Note

type->string will report a pathname as a string.

function octet-string? o

test if o is an octet string

Param o:

object to test

Return:

#t if o is an octet string, #f otherwise

Note

type->string will report an octet-string as a string.

String Constructors

function make-string size [fillc]

create a string with an initial length of size

Param size:

initial string size

Type size:

integer

Param fillc:

fill character value, defaults to #\{space}

Type fillc:

unicode, optional

Return:

the new string

Rtype:

string

function substring s p0 [pn]

return a substring of s from position p0 through to but excluding position pn

Param s:

string

Type s:

string

Param p0:

position

Type p0:

integer

Param pn:

position, defaults to string length

Type pn:

integer, optional

Return:

the substring

Rtype:

string

If p0 or pn are negative they are considered to be with respect to the end of the string. This can still result in a negative index.

Note

Technically, the return type is a substring but as substrings are indistinct from strings at a user level then a return type of string suffices.

type->string will reveal the difference.
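
The handling of negative positions can be modelled in Python as follows. The sketch assumes that a negative position is resolved by adding the string length and that a position which is still negative afterwards is an error. The helper resolve and the choice of IndexError are illustrative, not Idio's actual condition type.

```python
from typing import Optional


def resolve(position: int, length: int) -> int:
    """Turn a possibly negative position into an offset from the start."""
    if position < 0:
        position += length  # negative positions count back from the end
    if position < 0:
        # "This can still result in a negative index."
        raise IndexError("position is before the start of the string")
    return position


def substring(s: str, p0: int, pn: Optional[int] = None) -> str:
    """Return s from p0 up to but excluding pn (pn defaults to the length)."""
    length = len(s)
    start = resolve(p0, length)
    end = length if pn is None else resolve(pn, length)
    return s[start:end]
```

For the 11 code point string "hello world", a p0 of -5 resolves to position 6, so the result is "world".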

function list->string l

return a string from the list of the Unicode code points in l

Param l:

list of code points

Type l:

list

Return:

string

Rtype:

string

function symbol->string s

convert symbol s into a string

Param s:

symbol to convert

Type s:

symbol

Return:

string

Rtype:

string

function keyword->string kw

convert keyword kw to a string

Param kw:

keyword to convert

Type kw:

keyword

Return:

string

function string->pathname s

return a pathname of the UTF-8 encoding of s

Param s:

string

Type s:

string

Return:

pathname

Rtype:

pathname

function string->octet-string s

return an octet string of the UTF-8 encoding of s

Param s:

string

Type s:

string

Return:

octet string

Rtype:

octet string

function octet-string->string s

return a string from the UTF-8 decoding of s

Param s:

octet string

Type s:

octet string

Return:

string

Rtype:

string

Warning

This is highly likely to generate #U+FFFD REPLACEMENT CHARACTER in the resultant string.

function ->string o

convert o to a string unless it already is a string

Param o:

object to convert

Return:

a string representation of o

->string differs from string in that it won’t stringify a string!

function string o

convert o to a string

Param o:

object to convert

Return:

a string representation of o

String Attributes

function string-length s

return the number of code points in s

Param s:

string

Type s:

string

Return:

number of code points

Rtype:

integer

function string-ref s index

return code point at position index in s

positions start at 0

Param s:

string

Type s:

string

Param index:

position

Type index:

integer

Return:

code point

Rtype:

unicode

function string-set! s index c

set position index of s to c

positions start at 0

Param s:

string

Type s:

string

Param index:

position

Type index:

integer

Param c:

code point

Type c:

unicode

Return:

#<unspec>

string-set! will fail if c is wider than the existing storage allocation for s
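
To see why a wider code point can fail to fit, consider this Python model of a packed string. It assumes element widths of 1, 2 or 4 bytes, chosen from the widest code point present when the string is created. Those widths, and the names width_of and PackedString, are assumptions made for illustration rather than a description of Idio's internals.

```python
def width_of(code_point: int) -> int:
    """Bytes per element needed to hold code_point (assumed widths: 1, 2 or 4)."""
    if code_point <= 0xFF:
        return 1
    if code_point <= 0xFFFF:
        return 2
    return 4


class PackedString:
    """A fixed-width array of code points, sized when the string is created."""

    def __init__(self, text: str):
        self.elements = [ord(c) for c in text]
        # The allocation is sized for the widest code point present at creation.
        self.width = max((width_of(cp) for cp in self.elements), default=1)

    def set(self, index: int, c: str) -> None:
        """Model of string-set!: refuse a code point wider than the allocation."""
        if width_of(ord(c)) > self.width:
            raise ValueError("code point is wider than the storage allocation")
        self.elements[index] = ord(c)

    def __str__(self) -> str:
        return "".join(chr(cp) for cp in self.elements)
```

In this model a string created from ASCII text has one-byte elements, so setting a position to U+263A (which needs two bytes) is rejected, while the same assignment succeeds on a string that already contained a two-byte code point.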

function string-fill! s fill

set all positions of s to fill

Param s:

string

Type s:

string

Param fill:

code point

Type fill:

unicode

Return:

#<unspec>

string-fill! will fail if fill is wider than the existing storage allocation for s

String Functions

function append-string [args]

append strings

Param args:

strings to append together

Type args:

list, optional

Return:

string ("" if no args supplied)

append-string will gracefully degrade the string variant based on the arguments: unicode > pathname > octet-string

append-string takes multiple arguments each of which is a string.

See also

concatenate-string which takes a single argument which is a list of strings.

function concatenate-string ls

concatenate strings in list ls

Param ls:

list of strings to concatenate together

Type ls:

list

Return:

string ("" if ls is #n)

concatenate-string takes a single argument, which is a list of strings. It is roughly comparable to

apply append-string ls

See also

append-string takes multiple arguments each of which is a string.
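
The relationship between the two calling conventions can be shown with a pair of Python functions. These are models of the argument handling only, they ignore the string variants, and the Python names are hypothetical.

```python
def append_string(*args: str) -> str:
    """Model of append-string: any number of string arguments."""
    return "".join(args)


def concatenate_string(ls: list) -> str:
    """Model of concatenate-string: a single argument which is a list of strings."""
    # Spreading the list over the arguments plays the role of apply.
    return append_string(*ls)
```

Both return the empty string when given nothing to join: no arguments in the first case and an empty list in the second.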

function copy-string s

return a copy of s which is not eq? to s

Param s:

string

Type s:

string

Return:

string

Rtype:

string

function join-string delim args

return a string of args interspersed with delim

Param delim:

string

Type delim:

string

Param args:

string(s) to be joined

Type args:

list, optional

Return:

string ("" if args is #n)

function string-index s c

return the index of c in s or #f

Param s:

string

Type s:

string

Param c:

code point

Type c:

unicode

Return:

index or #f

Rtype:

integer or #f

function string-rindex s c

return the rightmost index of c in s or #f

Param s:

string

Type s:

string

Param c:

code point

Type c:

unicode

Return:

index or #f

Rtype:

integer or #f
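
For comparison, a Python model of string-index and string-rindex is shown below, with None standing in for #f. The function names are hypothetical. Python's own find and rfind report -1 for a miss, so the model converts that value.

```python
from typing import Optional


def string_index(s: str, c: str) -> Optional[int]:
    """Index of the first occurrence of c in s, or None (standing in for #f)."""
    position = s.find(c)
    return None if position == -1 else position


def string_rindex(s: str, c: str) -> Optional[int]:
    """Index of the rightmost occurrence of c in s, or None (standing in for #f)."""
    position = s.rfind(c)
    return None if position == -1 else position
```

Index 0 is a valid result, so callers should test for the miss value explicitly and not rely on truthiness.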

function fields in

split string in using characters from IFS into an array with the first element the original string

Param in:

string to split

Type in:

string

Return:

array (of strings)

Adjacent characters from IFS are considered a single delimiter.

See also

split-string which returns a list

function split-string in [delim]

split string in using characters from delim into a list of strings

Param in:

string to split

Type in:

string

Param delim:

delimiter characters, defaults to IFS

Type delim:

string, optional

Return:

list (of strings)

Adjacent characters from delim are considered a single delimiter.

See also

split-string-exactly which treats delim more fastidiously and fields which returns an array

function split-string-exactly in [delim]

split string in using characters from delim into a list of strings

Param in:

string to split

Type in:

string

Param delim:

delimiter characters, defaults to IFS

Type delim:

string, optional

Return:

list (of strings)

Adjacent characters from delim are considered separate delimiters.

See also

split-string
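
The difference between the three splitting functions is easiest to see side by side. The Python sketch below is a model under stated assumptions: IFS is taken to be space, tab and newline, leading and trailing delimiters are assumed to produce no empty fields in split_string, and the Python names are hypothetical.

```python
IFS = " \t\n"  # assumed value of IFS for this sketch


def split_string_exactly(s: str, delim: str = IFS) -> list:
    """Every delimiter character ends a field, so adjacent delimiters give empty fields."""
    fields, current = [], []
    for ch in s:
        if ch in delim:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def split_string(s: str, delim: str = IFS) -> list:
    """Adjacent delimiter characters count as one delimiter, so no empty fields appear."""
    return [field for field in split_string_exactly(s, delim) if field]


def fields(s: str) -> list:
    """Split on IFS, with the original string as the first element."""
    return [s] + split_string(s, IFS)
```

Splitting "a::b:c" on ":" gives three fields with split_string and four fields, one of them empty, with split_string_exactly.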

function string<? s1 s2 [...]

apply the less-than comparator to strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string<? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is less than the argument to its right, otherwise the result is #f.

string<? converts the Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.
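
The comparison rule described above can be modelled in Python by comparing the UTF-8 bytes over the shorter length and then falling back to the lengths. The sketch reads the comparison in the conventional direction, with each argument less than the one to its right, which is an assumption. The names compare and string_lt are hypothetical, and the model does not reproduce the way strncmp(3) stops at a NUL byte.

```python
def compare(s1: str, s2: str) -> int:
    """Return a negative, zero or positive number, in the manner of strncmp(3)."""
    b1, b2 = s1.encode("utf-8"), s2.encode("utf-8")
    n = min(len(b1), len(b2))
    if b1[:n] != b2[:n]:
        # Python compares bytes as unsigned values, as strncmp does.
        return -1 if b1[:n] < b2[:n] else 1
    # Equal over the shorter length: the shorter string is the lesser one.
    return (len(b1) > len(b2)) - (len(b1) < len(b2))


def string_lt(*args: str) -> bool:
    """Model of string<? over two or more arguments."""
    return all(compare(a, b) < 0 for a, b in zip(args, args[1:]))
```

Because the comparison works on UTF-8 bytes, the ordering follows byte values and takes no account of any locale's collation rules.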

function string<=? s1 s2 [...]

apply the less-than-or-equal comparator to strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string<=? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is less than or equal to the argument to its right, otherwise the result is #f.

string<=? converts the Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string=? s1 s2 [...]

apply the equality comparator to strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string=? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is equal to the argument to its right, otherwise the result is #f.

string=? converts the Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string>=? s1 s2 [...]

apply the greater-than-or-equal comparator to strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string>=? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is greater than or equal to the argument to its right, otherwise the result is #f.

string>=? converts the Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string>? s1 s2 [...]

apply the greater-than comparator to strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string>? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is greater than the argument to its right, otherwise the result is #f.

string>? converts the Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string-ci<? s1 s2 [...]

apply the less-than comparator to case-insensitive strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string-ci<? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is less than the argument to its right, otherwise the result is #f.

string-ci<? converts the Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string-ci<=? s1 s2 [...]

apply the less-than-or-equal comparator to case-insensitive strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string-ci<=? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is less than or equal to the argument to its right, otherwise the result is #f.

string-ci<=? converts the Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string-ci=? s1 s2 [...]

apply the equality comparator to case-insensitive strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string-ci=? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is equal to the argument to its right, otherwise the result is #f.

string-ci=? converts the Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string-ci>=? s1 s2 [...]

apply the greater-than-or-equal comparator to case-insensitive strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string-ci>=? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is greater than or equal to the argument to its right, otherwise the result is #f.

string-ci>=? converts the Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function string-ci>? s1 s2 [...]

apply the greater-than comparator to case-insensitive strings

Param s1:

string

Type s1:

string

Param s2:

string

Type s2:

string

Return:

the result of comparing the arguments

Rtype:

boolean

string-ci>? takes a minimum of two arguments and compares each argument with the one to its right. The result is #t if every argument is greater than the argument to its right, otherwise the result is #f.

string-ci>? converts the Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare using the shorter length string.

If the strings are considered equal then the shorter string is considered less than the longer string.

function strip-string str discard [ends]

return a string which is str without leading, trailing (or both) discard characters

Param str:

string

Type str:

string

Param discard:

string

Type discard:

string

Param ends:

'left, 'right (default), 'both or 'none

Type ends:

symbol, optional

Return:

string

The returned value could be str or a substring of str
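
The effect of the ends argument can be modelled with Python's own strip methods, which also treat their argument as a set of characters to discard. The strip_string function below is a hypothetical stand-in, with Python strings in place of the symbols 'left, 'right, 'both and 'none.

```python
def strip_string(s: str, discard: str, ends: str = "right") -> str:
    """Remove leading and/or trailing characters found in discard (a model only)."""
    if ends == "left":
        return s.lstrip(discard)
    if ends == "right":
        return s.rstrip(discard)
    if ends == "both":
        return s.strip(discard)
    if ends == "none":
        return s
    raise ValueError("ends must be 'left', 'right', 'both' or 'none'")
```

Characters from discard that appear in the middle of the string are left alone; only runs at the chosen end or ends are removed.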

Last built at 2024-05-17T06:10:59Z+0000 from 62cca4c (dev) for Idio 0.3.b.6