String Type
Strings are arrays of Unicode code points efficiently packed into variable-width arrays.
Substrings are references into sections of Idio strings but are otherwise handled the same.
Pathnames are a subset of strings where the elements of the string are not treated as UTF-8. Any file name value returned from the operating system will be a pathname.
Consequently, you cannot directly compare a file name from the file system to a string from your source code. See string->pathname for a conversion function. There is no reverse function (pathname to string) because a file name carries no encoding: it is just a sequence of bytes.
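A minimal sketch of the conversion mentioned above, using only string->pathname and the pathname? predicate documented later in this section. The file name "notes.txt" and the variable names are hypothetical, chosen only for illustration.

```idio
; "notes.txt" is a hypothetical file name used only for illustration
fname := "notes.txt"

; a string literal from source code is a string, not a pathname
printf "%s\n" (pathname? fname)

; string->pathname returns a pathname of the UTF-8 encoding of the string,
; the same kind of value that file names from the operating system have
pname := string->pathname fname
printf "%s\n" (pathname? pname)
```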
Reader Form
The input form for a string is the usual "...", that is, a U+0022 (QUOTATION MARK) delimited value.
The collected bytes are assumed to be part of a valid UTF-8 sequence. If the byte sequence is invalid UTF-8 you will get the (standard) �, U+FFFD (REPLACEMENT CHARACTER) and the decoding will resume with the next byte. This may result in several replacement characters being generated.
There are a couple of notes:
\, U+005C (REVERSE SOLIDUS – backslash), is the escape character. The obvious character to escape is " itself, allowing you to embed a double-quote symbol in a double-quoted string: "hello\"world".
In the spirit of C escape sequences Idio also allows:
sequence  (hex) ASCII  description
\a        07           alert / bell
\b        08           backspace
\e        1B           escape character
\f        0C           form feed
\n        0A           newline
\r        0D           carriage return
\t        09           horizontal tab
\v        0B           vertical tab
\\        5C           backslash
\x...                  up to 2 hex digits representing any byte
\u...                  up to 4 hex digits representing a Unicode code point
\U...                  up to 8 hex digits representing a Unicode code point
Any other escaped character results in that character.
For \x, \u and \U the code will stop consuming code points if it sees one of the usual delimiters or a code point that is not a hex digit: "\Ua9 2021" silently stops at the SPACE character giving "© 2021" and, correspondingly, "\u00a92021" gives "©2021" as a maximum of 4 hex digits are consumed by \u.
\x is unrestricted (other than between 0x0 and 0xff) and \u and \U will have the hex digits converted into UTF-8.
Adding \x bytes into a string is an exercise in due diligence.
Idio allows multi-line strings:
str1 := "Hello
World"
str2 := "Hello\nWorld"
The string constructors for str1 and str2 are equivalent.
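The escape handling described above can be tried with a short sketch. The variable names are hypothetical; the string literals are the ones quoted in the text, so the first two lines printed should be the "©2021" and "© 2021" values given there.

```idio
; \u consumes at most 4 hex digits, so "2021" remains ordinary text
s1 := "\u00a92021"
printf "%s\n" s1

; \U stops at the first code point that is not a hex digit (the space here)
s2 := "\Ua9 2021"
printf "%s\n" s2

; an escaped double quote embedded in a double-quoted string
s3 := "hello\"world"
printf "%s\n" s3
```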
Pathnames
The reader form for a pathname is %P"..." (or matching brackets, %P(...) or %P{...} or %P[...] or, in general, %Pc...c) where the ... is a regular string as above.
That’s where the \xHH escape for strings comes into its own. If we know that a filename starts with ISO8859-1’s 0xA9 (the same “character” as ©, U+00A9 (COPYRIGHT SIGN)), as in a literal byte, 0xA9, and not the UTF-8 sequence 0xC2 0xA9, then we can create such a string: %P"\xa9...".
Pathnames, or strings being used as pathnames, that contain an ASCII NUL (\x00) will result in a format error when you attempt to use them. An ASCII NUL is a perfectly valid code point in an Idio string but it cannot appear in the C string that is passed to the operating system’s API.
Octet Strings
The reader form for an octet string is %B"..." (or matching brackets, %B(...) or %B{...} or %B[...] or, in general, %Bc...c) where the ... is a regular string as above.
Note
The name, byte string, seems too overloaded but the nominal reader form, %O, is too easily confused with a putative %0. So we have a mixed result: the name, octet string, with a reader form derived from byte string.
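The two reader forms above, %P and %B, can be sketched as follows. The file name is hypothetical and stands in for the elided part of the example in the text; the predicates are the ones documented under String Predicates.

```idio
; hypothetical file name whose first byte is the literal 0xA9,
; not the UTF-8 sequence 0xC2 0xA9
p := %P"\xa9-2021.txt"
printf "%s\n" (pathname? p)

; the same bytes read as an octet string
b := %B"\xa9-2021.txt"
printf "%s\n" (octet-string? b)
```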
Mixing Strings
You can append-string strings together and join-string strings with a delimiter but be careful as mixing string variants will result in a gracefully degraded result: unicode to pathname to octet-string.
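A small sketch of the degradation described above, assuming that the result of mixing a unicode string with a pathname can be checked with pathname?. The path components are hypothetical and no particular output is claimed here.

```idio
; both arguments are unicode strings, so the result should remain a string
s := append-string "src/" "main.idio"
printf "%s\n" (pathname? s)

; mixing in a pathname should degrade the result to a pathname,
; following the order given above: unicode to pathname to octet-string
p := append-string "src/" %P"main.idio"
printf "%s\n" (pathname? p)
```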
Interpolated Strings
From time to time it is convenient to expand references to variables inside a string. There is a special reader form for such interpolated strings:
#S{...${expr}...}
Here, everything between the outermost matching { and } is scanned for instances of the interpolation sigil, $. The matching set of { and } following the sigil is read in and the expression therein is evaluated, the result being converted to a string (if required) and replacing the interpolated expression. The rest of the string is added in a similar way.
If you want to embed an actual interpolation sigil, $, you can escape it with the default escape character \:
#S{Your \$PATH will be '${(frob-path)}'!}
Whatever the call to frob-path returns will be converted to a string (if necessary) giving a string equivalent to:
"Your $PATH will be '...'!"
In this particular case, there’s little advantage over using sprintf etc. but in code generation it is much more convenient to see (pre-)constructed variable references in situ in the expected output.
There are two options you can pass, between the #S and opening brace: an alternative interpolation sigil and an alternative escape character.
In effect, normal behaviour is:
#S$\{...}
If you only want to change the escape character, use . for the interpolation sigil, which implies that the interpolation sigil cannot be . itself.
If the use of braces, { and }, means you would need to escape braces within the interpolated string a lot, you can use parentheses or brackets as the delimiting pair:
; generate some C code
printf #S[
if ($condition) {
doit(${c-name arg1}, ${c-name arg2});
}
]
although note that you can only use braces for the expression delimiters.
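A minimal sketch of the default interpolation behaviour described in this section. The variable who is hypothetical; only the default sigil, $, and the default escape character, \, are used.

```idio
; who is a hypothetical variable used only for illustration
who := "world"

; ${who} is evaluated and its value replaces the expression
printf "%s\n" #S{Hello ${who}!}

; \$ embeds a literal interpolation sigil in the result
printf "%s\n" #S{Your \$PATH belongs to ${who}}
```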
String Predicates
- function string? o
test if o is a string
- Param o:
object to test
- Return:
#t if o is a string, #f otherwise
- function pathname? o
test if o is a pathname
- Param o:
object to test
- Return:
#t if o is a pathname, #f otherwise
Note
type->string will report a pathname as a string.
- function octet-string? o
test if o is an octet string
- Param o:
object to test
- Return:
#t if o is an octet string, #f otherwise
Note
type->string will report an octet-string as a string.
String Constructors
- function make-string size [fillc]
create a string with an initial length of size
- Param size:
initial string size
- Type size:
integer
- Param fillc:
fill character value, defaults to #\{space}
- Type fillc:
unicode, optional
- Return:
the new string
- Rtype:
string
- function substring s p0 [pn]
return a substring of s from position p0 through to but excluding position pn
- Param s:
string
- Type s:
string
- Param p0:
position
- Type p0:
integer
- Param pn:
position, defaults to string length
- Type pn:
integer, optional
- Return:
the substring
- Rtype:
string
If p0 or pn is negative it is taken relative to the end of the string. This can still result in a negative index.
Note
Technically, the return type is a substring but, as substrings are indistinguishable from strings at the user level, a return type of string suffices.
type->string will reveal the difference.
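A short sketch of make-string and substring as described above. The variable names are hypothetical; the comments restate the documented behaviour rather than captured output.

```idio
; a string of three code points, each the fill character #\a
s := make-string 3 #\a
printf "%s\n" s

; positions 0 through to, but excluding, 5: the first word
hw := "hello world"
printf "%s\n" (substring hw 0 5)

; pn defaults to the string length: everything from position 6 onwards
printf "%s\n" (substring hw 6)
```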
- function list->string l
return a string from the list of the Unicode code points in l
- Param l:
list of code points
- Type l:
list
- Return:
string
- Rtype:
string
- function symbol->string s
convert symbol s into a string
- Param s:
symbol to convert
- Type s:
symbol
- Return:
string
- Rtype:
string
- function keyword->string kw
convert keyword kw to a string
- Param kw:
keyword to convert
- Type kw:
keyword
- Return:
string
- function string->pathname s
return a pathname of the UTF-8 encoding of s
- Param s:
string
- Type s:
string
- Return:
pathname
- Rtype:
pathname
- function string->octet-string s
return an octet string of the UTF-8 encoding of s
- Param s:
string
- Type s:
string
- Return:
octet string
- Rtype:
octet string
- function octet-string->string s
return a string from the UTF-8 decoding of s
- Param s:
octet string
- Type s:
octet string
- Return:
string
- Rtype:
string
Warning
This is highly likely to generate #U+FFFD REPLACEMENT CHARACTER in the resultant string.
- function ->string o
convert o to a string unless it already is a string
- Param o:
object to convert
- Return:
a string representation of o
->string differs from string in that it won’t stringify a string!
- function string o
convert o to a string
- Param o:
object to convert
- Return:
a string representation of o
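The conversion functions above can be exercised with a sketch like the following. The values are hypothetical, and the last call is included only to let you compare ->string with string on an argument that is already a string; no specific output is claimed for it.

```idio
; a list of code points becomes a string
printf "%s\n" (list->string '(#\a #\b #\c))

; the name of a symbol as a string
printf "%s\n" (symbol->string 'hello)

; ->string converts non-strings and leaves strings as they are
printf "%s\n" (string-length (->string 42))
printf "%s\n" (string-length (->string "abc"))

; string converts its argument even when it is already a string,
; so this length need not match the previous one
printf "%s\n" (string-length (string "abc"))
```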
String Attributes
- function string-length s
return the number of code points in s
- Param s:
string
- Type s:
string
- Return:
number of code points
- Rtype:
integer
- function string-ref s index
return code point at position index in s
positions start at 0
- Param s:
string
- Type s:
string
- Param index:
position
- Type index:
integer
- Return:
code point
- Rtype:
unicode
- function string-set! s index c
set position index of s to c
positions start at 0
- Param s:
string
- Type s:
string
- Param index:
position
- Type index:
integer
- Param c:
code point
- Type c:
unicode
- Return:
#<unspec>
string-set! will fail if c is wider than the existing storage allocation for s
- function string-fill! s fill
set all positions of s to fill
- Param s:
string
- Type s:
string
- Param fill:
code point
- Type fill:
unicode
- Return:
#<unspec>
string-fill! will fail if fill is wider than the existing storage allocation for s
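A minimal sketch of the attribute functions above, assuming a string built with make-string so that it can be modified in place. Every code point used is ASCII, so none is wider than the existing storage allocation.

```idio
; start with five copies of #\a
s := make-string 5 #\a

; positions start at 0, so this changes the first code point
string-set! s 0 #\b
printf "%s\n" s
printf "%s\n" (string-length s)
printf "%s\n" (string-ref s 0)

; set every position to the same code point
string-fill! s #\z
printf "%s\n" s
```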
String Functions
- function append-string [args]
append strings
- Param args:
strings to append together
- Type args:
list, optional
- Return:
string ("" if no args supplied)
append-string will gracefully degrade the string variant based on the arguments: unicode > pathname > octet-string
append-string takes multiple arguments each of which is a string.
See also
concatenate-string which takes a single argument which is a list of strings.
- function concatenate-string ls
concatenate strings in list ls
- Param ls:
list of strings to concatenate together
- Type ls:
list, optional
- Return:
string ("" if ls is #n)
concatenate-string takes a single argument, which is a list of strings. It is roughly comparable to apply append-string ls
See also
append-string which takes multiple arguments each of which is a string.
- function copy-string s
return a copy of s which is not eq? to s
- Param s:
string
- Type s:
string
- Return:
string
- Rtype:
string
- function join-string delim args
return a string of args interspersed with delim
- Param delim:
string
- Type delim:
string
- Param args:
string(s) to be joined
- Type args:
list, optional
- Return:
string ("" if args is #n)
- function string-index s c
return the index of c in s or #f
- Param s:
string
- Type s:
string
- Param c:
code point
- Type c:
unicode
- Return:
index or #f
- Rtype:
integer or #f
- function string-rindex s c
return the rightmost index of c in s or #f
- Param s:
string
- Type s:
string
- Param c:
code point
- Type c:
unicode
- Return:
index or #f
- Rtype:
integer or #f
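A short sketch contrasting append-string with concatenate-string, followed by the two index functions. The strings are hypothetical; the comments restate the documented behaviour.

```idio
; append-string takes the strings as separate arguments
printf "%s\n" (append-string "foo" "bar")

; concatenate-string takes a single list of strings
printf "%s\n" (concatenate-string '("foo" "bar"))

; positions start at 0, so the two l's in "hello" are at 2 and 3
s := "hello"
printf "%s\n" (string-index s #\l)
printf "%s\n" (string-rindex s #\l)

; a code point that does not occur gives #f
printf "%s\n" (string-index s #\z)
```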
- function fields in
split string in using characters from IFS into an array with the first element the original string
- Param in:
string to split
- Type in:
string
- Return:
array (of strings)
Adjacent characters from IFS are considered a single delimiter.
See also
split-string which returns a list
- function split-string in [delim]
split string in using characters from delim into a list of strings
- Param in:
string to split
- Type in:
string
- Param delim:
delimiter characters, defaults to IFS
- Type delim:
string, optional
- Return:
list (of strings)
Adjacent characters from delim are considered a single delimiter.
See also
split-string-exactly which treats delim more fastidiously and fields which returns an array
- function split-string-exactly in [delim]
split string in using characters from delim into a list of strings
- Param in:
string to split
- Type in:
string
- Param delim:
delimiter characters, defaults to IFS
- Type delim:
string, optional
- Return:
list (of strings)
Adjacent characters from delim are considered separate delimiters.
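The difference between the three splitting functions can be sketched as below. The input strings are hypothetical, and the fields call assumes that IFS contains the space character.

```idio
line := "a:b::c"

; adjacent delimiters count as a single delimiter,
; so no empty field is expected between b and c
printf "%s\n" (split-string line ":")

; adjacent delimiters are separate delimiters,
; so an empty field is expected between b and c
printf "%s\n" (split-string-exactly line ":")

; fields splits on IFS and returns an array whose
; first element is the original string
printf "%s\n" (fields "a b  c")
```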
- function string<? s1 s2 [...]
apply the less-than comparator to strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string<? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is less-than the argument that follows it, otherwise the result is #f.
string<? converts each Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string<=? s1 s2 [...]
apply the less-than-or-equal comparator to strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string<=? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is less-than-or-equal to the argument that follows it, otherwise the result is #f.
string<=? converts each Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string=? s1 s2 [...]
apply the equality comparator to strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string=? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is equal to the argument that follows it, otherwise the result is #f.
string=? converts each Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string>=? s1 s2 [...]
apply the greater-than-or-equal comparator to strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string>=? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is greater-than-or-equal to the argument that follows it, otherwise the result is #f.
string>=? converts each Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string>? s1 s2 [...]
apply the greater-than comparator to strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string>? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is greater-than the argument that follows it, otherwise the result is #f.
string>? converts each Idio string to a UTF-8 representation in a C string then uses strncmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string-ci<? s1 s2 [...]
apply the less-than comparator to case-insensitive strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string-ci<? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is less-than the argument that follows it, otherwise the result is #f.
string-ci<? converts each Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string-ci<=? s1 s2 [...]
apply the less-than-or-equal comparator to case-insensitive strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string-ci<=? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is less-than-or-equal to the argument that follows it, otherwise the result is #f.
string-ci<=? converts each Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string-ci=? s1 s2 [...]
apply the equality comparator to case-insensitive strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string-ci=? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is equal to the argument that follows it, otherwise the result is #f.
string-ci=? converts each Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string-ci>=? s1 s2 [...]
apply the greater-than-or-equal comparator to case-insensitive strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string-ci>=? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is greater-than-or-equal to the argument that follows it, otherwise the result is #f.
string-ci>=? converts each Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
- function string-ci>? s1 s2 [...]
apply the greater-than comparator to case-insensitive strings
- Param s1:
string
- Type s1:
string
- Param s2:
string
- Type s2:
string
- Return:
the result of comparing the arguments
- Rtype:
boolean
string-ci>? takes a minimum of two arguments and compares each subsequent argument with the one to its left. The result is #t if each argument is greater-than the argument that follows it, otherwise the result is #f.
string-ci>? converts each Idio string to a UTF-8 representation in a C string then uses strncasecmp(3) to compare over the length of the shorter string.
If the strings are considered equal over that length then the shorter string is considered less than the longer string.
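A brief sketch of the comparison functions, assuming the conventional left-to-right reading in which string<? holds when each argument sorts before the one that follows it. Under that assumption each call below should give #t; the strings are hypothetical.

```idio
; simple ordering of two strings
printf "%s\n" (string<? "apple" "banana")

; equal over the length of the shorter string,
; so the shorter string is considered the lesser
printf "%s\n" (string<? "abc" "abcd")

; with more than two arguments every adjacent pair is compared
printf "%s\n" (string<? "a" "b" "c")

; the -ci variants compare without regard to case
printf "%s\n" (string-ci=? "Hello" "hELLO")
```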
- function strip-string str discard [ends]
return a string which is str without leading, trailing (or both) discard characters
- Param str:
string
- Type str:
string
- Param discard:
string
- Type discard:
string
- Param ends:
'left, 'right (default), 'both or 'none
- Type ends:
symbol, optional
- Return:
string
The returned value could be str or a substring of str
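A minimal sketch of strip-string with each choice of ends. The padded string is hypothetical, and the square brackets in the format string are there only to make any remaining spaces visible.

```idio
padded := "  hello  "

; remove the discard characters from both ends
printf "[%s]\n" (strip-string padded " " 'both)

; remove leading discard characters only
printf "[%s]\n" (strip-string padded " " 'left)

; ends defaults to 'right, so only trailing characters are removed
printf "[%s]\n" (strip-string padded " ")
```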
Last built at 2024-12-21T07:10:46Z+0000 from 62cca4c (dev) for Idio 0.3.b.6