.. include:: ../../global.rst

.. _`string type`:

String Type
===========

Strings are arrays of Unicode code points efficiently packed into
variable-width arrays.

Substrings are references into sections of :lname:`Idio` strings but
are otherwise handled the same.

Pathnames are a subset of strings where the elements of the string are
not treated as UTF-8.  Any file name value returned from the operating
system will be a pathname.

Consequently, you cannot directly compare a file name from the file
system to a string from your source code.  See :ref:`string->pathname
<string->pathname>` for a conversion function.  There is no reverse
function (pathname to string) as a file name carries no encoding: it
is just a sequence of bytes.

Reader Form
-----------

The input form for a string is the usual ``"..."``, that is a U+0022
(QUOTATION MARK) delimited value.

The collected bytes are assumed to be part of a valid UTF-8 sequence.
If the byte sequence is invalid UTF-8 you will get the (standard) �,
U+FFFD (REPLACEMENT CHARACTER) and the decoding will resume *with the
next byte*.  This may result in several replacement characters being
generated.

There are a couple of notes:

#. ``\``, U+005C (REVERSE SOLIDUS -- backslash) is the escape
   character.  The obvious character to escape is ``"`` itself,
   allowing you to embed a double-quote symbol in a double-quoted
   string: ``"hello\"world"``.

   In the spirit of C escape sequences :lname:`Idio` also allows:

   .. csv-table:: Supported escape sequences in strings
      :header: sequence, (hex) ASCII, description
      :align: left
      :widths: auto

      ``\a``, 07, alert / bell
      ``\b``, 08, backspace
      ``\e``, 1B, escape character
      ``\f``, 0C, form feed
      ``\n``, 0A, newline
      ``\r``, 0D, carriage return
      ``\t``, 09, horizontal tab
      ``\v``, 0B, vertical tab
      ``\\``, 5C, backslash
      ``\x...``, , up to 2 hex digits representing any byte
      ``\u...``, , up to 4 hex digits representing a Unicode code point
      ``\U...``, , up to 8 hex digits representing a Unicode code point

   Any other escaped character results in that character.

   For ``\x``, ``\u`` and ``\U`` the code will stop consuming code
   points if it sees one of the usual delimiters or a code point that
   is not a hex digit: ``"\Ua9 2021"`` silently stops at the SPACE
   character giving ``"© 2021"`` and, correspondingly,
   ``"\u00a92021"`` gives ``"©2021"`` as a maximum of 4 hex digits are
   consumed by ``\u``.

   ``\x`` accepts any byte value (0x00 to 0xff) whereas the hex digits
   given to ``\u`` and ``\U`` are converted into UTF-8.

   Adding ``\x`` bytes into a string is an exercise in due diligence.

#. :lname:`Idio` allows multi-line strings:

   .. code-block:: idio

      str1 := "Hello
      World"

      str2 := "Hello\nWorld"

   The string constructors for ``str1`` and ``str2`` are equivalent.

.. _`pathnames`:

Pathnames
^^^^^^^^^

The reader form for a pathname is ``%P"..."`` (or matching brackets,
``%P(...)`` or ``%P{...}`` or ``%P[...]`` or, in general,
:samp:`%P{c}...{c}`) where the ``...`` is a regular string as above.

That's where the :samp:`\\x{HH}` escape for strings comes into its
own.  If we know that a filename starts with ISO8859-1_'s 0xA9 (the
same "character" as ©, U+00A9 (COPYRIGHT SIGN)), as in a literal byte,
0xA9, and not the UTF-8 sequence 0xC2 0xA9, then we can create such a
string: ``%P"\xa9..."``.

Pathnames, or strings being used as pathnames, that contain an ASCII
NUL (``\x00``) will result in a format error when an attempt is made
to use them.  ASCII NUL is a perfectly valid code point in an
:lname:`Idio` string but it cannot appear in the :lname:`C` string
passed to the operating system's API.

.. _`octet string`:

Octet Strings
^^^^^^^^^^^^^

The reader form for an octet string is ``%B"..."`` (or matching
brackets, ``%B(...)`` or ``%B{...}`` or ``%B[...]`` or, in general,
:samp:`%B{c}...{c}`) where the ``...`` is a regular string as above.

.. note::

   The name, byte string, seems too overloaded but the nominal reader
   form, ``%O``, is too easily confused with a putative ``%0``.

   So we have a mixed result: the name, octet string, with a reader
   form derived from byte string.

Mixing Strings
^^^^^^^^^^^^^^

You can :ref:`append-string <append-string>` strings together and
:ref:`join-string <join-string>` strings with a delimiter but be
careful as mixing string variants will result in a gracefully degraded
result: `unicode` to `pathname` to `octet-string`.

.. _`string interpolation`:

Interpolated Strings
^^^^^^^^^^^^^^^^^^^^

From time to time it is convenient to expand references to variables
inside a string.  There is a special reader form for such interpolated
strings:

   ``#S{...${expr}...}``

Here, everything between the outermost matching ``{`` and ``}`` is
scanned for instances of the *interpolation sigil*, ``$``.  A matching
set of ``{`` and ``}`` is read in and the expression therein is
evaluated, the result being converted to a string (if required) and
replacing the interpolated expression.  The rest of the string is
added in a similar way.

If you want to embed an actual interpolation sigil, ``$``, you can
escape it with the default escape character ``\``:

   ``#S{Your \$PATH will be '${(frob-path)}'!}``

Whatever the call to ``frob-path`` returns will be converted to a
string (if necessary) giving a string equivalent to:

   ``"Your $PATH will be '...'!"``

In this particular case, there's little advantage over using
:ref:`sprintf <sprintf>` etc. but in code generation it is much more
convenient to see (pre-)constructed variable references *in situ* in
the expected output.

There are two options you can pass, between the ``#S`` and opening
brace: an alternative interpolation sigil and an alternative escape
character.  In effect, normal behaviour is:

   ``#S$\{...}``

If you only want to change the escape character, use ``.`` for the
interpolation sigil -- which implies that the interpolation sigil
cannot be ``.``.

If braces, ``{`` and ``}``, appear often enough in the interpolated
string that escaping them becomes tedious, you can use parentheses or
brackets as the delimiting pair:

.. code-block:: idio

   ; generate some C code
   printf #S[
   if ($condition) {
     doit(${c-name arg1}, ${c-name arg2});
   }
   ]

although note that you can only use braces for the expression
delimiters.

String Predicates
-----------------

.. _`string?`:

.. idio:function:: string? o

   test if `o` is a string

   :param o: object to test
   :return: ``#t`` if `o` is a string, ``#f`` otherwise

.. _`pathname?`:

.. idio:function:: pathname? o

   test if `o` is a pathname

   :param o: object to test
   :return: ``#t`` if `o` is a pathname, ``#f`` otherwise

   .. note::

      :ref:`type->string <type->string>` will report a pathname as a
      string.

.. _`octet-string?`:

.. idio:function:: octet-string? o

   test if `o` is an octet string

   :param o: object to test
   :return: ``#t`` if `o` is an octet string, ``#f`` otherwise

   .. note::

      :ref:`type->string <type->string>` will report an octet-string
      as a string.

String Constructors
-------------------

.. _`make-string`:

.. idio:function:: make-string size [fillc]

   create a string with an initial length of `size`

   :param size: initial string size
   :type size: integer
   :param fillc: fill character value, defaults to ``#\{space}``
   :type fillc: unicode, optional
   :return: the new string
   :rtype: string

.. _`substring`:

.. idio:function:: substring s p0 [pn]

   return a substring of `s` from position `p0` through to but
   excluding position `pn`

   :param s: string
   :type s: string
   :param p0: position
   :type p0: integer
   :param pn: position, defaults to string length
   :type pn: integer, optional
   :return: the substring
   :rtype: string

   If `p0` or `pn` is negative it is taken relative to the end of the
   string.  This can still result in a negative index.

   .. note::

      Technically, the return type is a substring but, as substrings
      are indistinguishable from strings at the user level, a return
      type of string suffices.

      :ref:`type->string <type->string>` will reveal the difference.

.. _`list->string`:

.. idio:function:: list->string l

   return a string from the list of the Unicode code points in `l`

   :param l: list of code points
   :type l: list
   :return: string
   :rtype: string

.. _`symbol->string`:

.. idio:function:: symbol->string s

   convert symbol `s` into a string

   :param s: symbol to convert
   :type s: symbol
   :return: string
   :rtype: string

.. _`keyword->string`:

.. idio:function:: keyword->string kw

   convert keyword `kw` to a string

   :param kw: keyword to convert
   :type kw: keyword
   :return: string

.. _`string->pathname`:

.. idio:function:: string->pathname s

   return a pathname of the UTF-8 encoding of `s`

   :param s: string
   :type s: string
   :return: pathname
   :rtype: pathname

.. _`string->octet-string`:

.. idio:function:: string->octet-string s

   return an octet string of the UTF-8 encoding of `s`

   :param s: string
   :type s: string
   :return: octet string
   :rtype: octet string

.. _`octet-string->string`:

.. idio:function:: octet-string->string s

   return a string from the UTF-8 decoding of `s`

   :param s: octet string
   :type s: octet string
   :return: string
   :rtype: string

   .. warning::

      This is highly likely to generate ``#U+FFFD`` (REPLACEMENT
      CHARACTER) in the resultant string.

.. _`->string`:

.. idio:function:: ->string o

   convert `o` to a string unless it already is a string

   :param o: object to convert
   :return: a string representation of `o`

   ``->string`` differs from :ref:`string <string>` in that it won't
   stringify a string!

.. _`string`:

.. idio:function:: string o

   convert `o` to a string

   :param o: object to convert
   :return: a string representation of `o`

String Attributes
-----------------

.. _`string-length`:

.. idio:function:: string-length s

   return the number of code points in `s`

   :param s: string
   :type s: string
   :return: number of code points
   :rtype: integer

.. _`string-ref`:

.. idio:function:: string-ref s index

   return code point at position `index` in `s`

   positions start at 0

   :param s: string
   :type s: string
   :param index: position
   :type index: integer
   :return: code point
   :rtype: unicode

.. _`string-set!`:

.. idio:function:: string-set! s index c

   set position `index` of `s` to `c`

   positions start at 0

   :param s: string
   :type s: string
   :param index: position
   :type index: integer
   :param c: code point
   :type c: unicode
   :return: ``#<unspec>``

   ``string-set!`` will fail if `c` is wider than the existing storage
   allocation for `s`.

.. _`string-fill!`:

.. idio:function:: string-fill! s fill

   set all positions of `s` to `fill`

   :param s: string
   :type s: string
   :param fill: code point
   :type fill: unicode
   :return: ``#<unspec>``

   ``string-fill!`` will fail if `fill` is wider than the existing
   storage allocation for `s`.

String Functions
----------------

.. _`append-string`:

.. idio:function:: append-string [args]

   append strings

   :param args: strings to append together
   :type args: list, optional
   :return: string ("" if no `args` supplied)

   ``append-string`` will gracefully degrade the string variant based
   on the arguments: `unicode` > `pathname` > `octet-string`

   ``append-string`` takes multiple arguments each of which is a
   string.

   .. seealso:: :ref:`concatenate-string <concatenate-string>` which
      takes a single argument which is a list of strings.

.. _`concatenate-string`:

.. idio:function:: concatenate-string ls

   concatenate strings in list `ls`

   :param ls: list of strings to concatenate together
   :type ls: list, optional
   :return: string ("" if `ls` is ``#n``)

   ``concatenate-string`` takes a single argument, which is a list of
   strings.  It is roughly comparable to

   .. code-block:: idio

      apply append-string ls

   .. seealso:: :ref:`append-string <append-string>` which takes
      multiple arguments each of which is a string.

.. _`copy-string`:

.. idio:function:: copy-string s

   return a copy of `s` which is not ``eq?`` to `s`

   :param s: string
   :type s: string
   :return: string
   :rtype: string

.. _`join-string`:

.. idio:function:: join-string delim args

   return a string of `args` interspersed with `delim`

   :param delim: string
   :type delim: string
   :param args: string(s) to be joined
   :type args: list, optional
   :return: string ("" if `args` is ``#n``)

.. _`string-index`:

.. idio:function:: string-index s c

   return the index of `c` in `s` or ``#f``

   :param s: string
   :type s: string
   :param c: code point
   :type c: unicode
   :return: index or ``#f``
   :rtype: integer or ``#f``

.. _`string-rindex`:

.. idio:function:: string-rindex s c

   return the rightmost index of `c` in `s` or ``#f``

   :param s: string
   :type s: string
   :param c: code point
   :type c: unicode
   :return: index or ``#f``
   :rtype: integer or ``#f``

.. _`fields`:

.. idio:function:: fields in

   split string `in` using characters from :ref:`IFS <IFS>` into an
   array with the first element the original string

   :param in: string to split
   :type in: string
   :return: array (of strings)

   Adjacent characters from :var:`IFS` are considered a single
   delimiter.

   .. seealso:: :ref:`split-string <split-string>` which returns a
      list

.. _`split-string`:

.. idio:function:: split-string in [delim]

   split string `in` using characters from `delim` into a list of
   strings

   :param in: string to split
   :type in: string
   :param delim: delimiter characters, defaults to :ref:`IFS <IFS>`
   :type delim: string, optional
   :return: list (of strings)

   Adjacent characters from `delim` are considered a single delimiter.

   .. seealso:: :ref:`split-string-exactly <split-string-exactly>`
      which treats `delim` more fastidiously and :ref:`fields
      <fields>` which returns an array

.. _`split-string-exactly`:

.. idio:function:: split-string-exactly in [delim]

   split string `in` using characters from `delim` into a list of
   strings

   :param in: string to split
   :type in: string
   :param delim: delimiter characters, defaults to :ref:`IFS <IFS>`
   :type delim: string, optional
   :return: list (of strings)

   Adjacent characters from `delim` are considered separate
   delimiters.

   .. seealso:: :ref:`split-string <split-string>`

.. _`string>=?`:

.. idio:function:: string>=? s1 s2 [...]

   apply the greater-than-or-equal comparator to strings

   :param s1: string
   :type s1: string
   :param s2: string
   :type s2: string
   :return: the result of comparing the arguments
   :rtype: boolean

   ``string>=?`` takes a minimum of two arguments and each subsequent
   argument is compared to the one to its left.  The result is ``#t``
   if all subsequent arguments are greater-than-or-equal to the
   argument to their left, otherwise the result is ``#f``.

   ``string>=?`` converts each :lname:`Idio` string to a UTF-8
   representation in a :lname:`C` string then uses
   :manpage:`strncmp(3)` to compare over the length of the shorter
   string.  If the strings are considered equal then the shorter
   string is considered less than the longer string.

.. _`string>?`:

.. idio:function:: string>? s1 s2 [...]

   apply the greater-than comparator to strings

   :param s1: string
   :type s1: string
   :param s2: string
   :type s2: string
   :return: the result of comparing the arguments
   :rtype: boolean

   ``string>?`` takes a minimum of two arguments and each subsequent
   argument is compared to the one to its left.  The result is ``#t``
   if all subsequent arguments are greater-than the argument to their
   left, otherwise the result is ``#f``.

   ``string>?`` converts each :lname:`Idio` string to a UTF-8
   representation in a :lname:`C` string then uses
   :manpage:`strncmp(3)` to compare over the length of the shorter
   string.  If the strings are considered equal then the shorter
   string is considered less than the longer string.

.. _`string-ci>=?`:

.. idio:function:: string-ci>=? s1 s2 [...]

   apply the greater-than-or-equal comparator to case-insensitive
   strings

   :param s1: string
   :type s1: string
   :param s2: string
   :type s2: string
   :return: the result of comparing the arguments
   :rtype: boolean

   ``string-ci>=?`` takes a minimum of two arguments and each
   subsequent argument is compared to the one to its left.  The result
   is ``#t`` if all subsequent arguments are greater-than-or-equal to
   the argument to their left, otherwise the result is ``#f``.

   ``string-ci>=?`` converts each :lname:`Idio` string to a UTF-8
   representation in a :lname:`C` string then uses
   :manpage:`strncasecmp(3)` to compare over the length of the shorter
   string.  If the strings are considered equal then the shorter
   string is considered less than the longer string.

.. _`string-ci>?`:

.. idio:function:: string-ci>? s1 s2 [...]

   apply the greater-than comparator to case-insensitive strings

   :param s1: string
   :type s1: string
   :param s2: string
   :type s2: string
   :return: the result of comparing the arguments
   :rtype: boolean

   ``string-ci>?`` takes a minimum of two arguments and each
   subsequent argument is compared to the one to its left.  The result
   is ``#t`` if all subsequent arguments are greater-than the argument
   to their left, otherwise the result is ``#f``.

   ``string-ci>?`` converts each :lname:`Idio` string to a UTF-8
   representation in a :lname:`C` string then uses
   :manpage:`strncasecmp(3)` to compare over the length of the shorter
   string.  If the strings are considered equal then the shorter
   string is considered less than the longer string.

.. _`strip-string`:

.. idio:function:: strip-string str discard [ends]

   return a string which is `str` without leading, trailing (or both)
   `discard` characters

   :param str: string
   :type str: string
   :param discard: string
   :type discard: string
   :param ends: ``'left``, ``'right`` (default), ``'both`` or ``'none``
   :type ends: symbol, optional
   :return: string

   The returned value could be `str` or a substring of `str`.

.. include:: ../../commit.rst