POSIX regex

Idio uses the POSIX regex(7) regular expression primitives regcomp and regexec. These are combined in the function regex-matches.

For use in loops, the template regex-case is slightly better: it works like a simplified cond except that the clause “conditions” are regular expressions to be matched.

regex-case then supplies the consequent block with the result of the call to regexec as the variable r. Thus r.0 is the whole of the matched string, r.1 is the first matched sub-expression, r.2 the second matched sub-expression, etc.

Similarly, pattern-case provides something like the shell’s Pattern Matching where * and ? are really .* and . respectively. In particular, see regex-pattern-string for how the string is processed.

function regcomp rx [flags]

POSIX regex(3)

compile the regular expression in rx suitable for subsequent use in regexec

The flags are: REG_EXTENDED, REG_ICASE, REG_NOSUB (ignored) and REG_NEWLINE.

This code defaults to REG_EXTENDED, so there is an extra REG_BASIC flag to disable REG_EXTENDED.

Param rx:

regular expression

Type rx:

string

Param flags:

regcomp flags

Type flags:

list of symbols

Return:

compiled regex(3)

Rtype:

C/pointer
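The "REG_EXTENDED unless REG_BASIC" default can be pictured in terms of the underlying C call. This is a minimal sketch, not Idio's implementation: the helper names compile_default_extended and matches, and the boolean basic parameter standing in for REG_BASIC, are hypothetical.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` as an extended regular
 * expression unless `basic` is true.  `extra` carries any other
 * regcomp(3) flags such as REG_ICASE or REG_NEWLINE.
 * Returns 0 on success or a regcomp(3) error code.
 */
int compile_default_extended(regex_t *preg, const char *pattern,
                             int extra, bool basic)
{
    int cflags = extra;

    if (basic) {
        cflags &= ~REG_EXTENDED;    /* fall back to basic syntax */
    } else {
        cflags |= REG_EXTENDED;     /* the default */
    }
    return regcomp(preg, pattern, cflags);
}

/* Returns true if `pattern` compiles and matches somewhere in `str`. */
bool matches(const char *pattern, const char *str, int extra, bool basic)
{
    regex_t re;

    if (compile_default_extended(&re, pattern, extra, basic) != 0) {
        return false;
    }
    bool ok = regexec(&re, str, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

With the default, + in "^a+b$" is a repetition operator, so the expression matches "aab"; with basic set, + is an ordinary character and the same expression matches the literal string "a+b" instead.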

function regexec rx str [flags]

POSIX regex(3)

match the regular expression in rx against the string str where rx was compiled using regcomp

The flags are: REG_NOTBOL, REG_NOTEOL and REG_STARTEND (if supported, see below).

REG_VERBOSE: return verbose results (see below).

On a successful match an array of the subexpressions in rx is returned with the first (zero-th) being the entire matched string.

If a subexpression in rx matched, the corresponding array element will be the matched string.

If a subexpression in rx did not match, the corresponding array element will be #f.

Param rx:

compiled regular expression

Type rx:

C/pointer

Param str:

string to match against

Type str:

string

Param flags:

regexec flags

Type flags:

list of symbols

Return:

see below

Rtype:

array or #f

By default regexec returns an array of matching subexpressions or #f for no match.

If REG_VERBOSE is passed in flags then each element of the array is a list of the matched sub-expression, its starting offset and its ending offset plus one (suitable for substring).

REG_STARTEND (if supported) is a valid C flag and accepted here but is ignored as there is no means to pre-supply pmatch[0] (see regexec(3)).
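For reference, the information described above comes from the pmatch array filled in by the C regexec(3). The following is a sketch of the C side only, with a hypothetical function name print_groups; the offsets it prints are byte offsets, and how Idio presents them for its own strings is not shown here.

```c
#include <regex.h>
#include <stdio.h>

#define NGROUPS 3   /* whole match plus two sub-expressions */

/*
 * Match `str` against an extended regular expression and print each
 * sub-expression with its offsets.  Returns the number of array slots
 * that took part in the match, or -1 on no match or a compile error.
 */
int print_groups(const char *pattern, const char *str)
{
    regex_t re;
    regmatch_t pm[NGROUPS];
    int matched = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    if (regexec(&re, str, NGROUPS, pm, 0) != 0) {
        regfree(&re);
        return -1;
    }

    for (int i = 0; i < NGROUPS; i++) {
        if (pm[i].rm_so == -1) {
            /* sub-expression did not take part: the #f case above */
            printf("%d: (no match)\n", i);
            continue;
        }
        /* rm_eo is one past the last matched byte, so the length is
         * simply rm_eo - rm_so */
        int len = (int)(pm[i].rm_eo - pm[i].rm_so);
        printf("%d: \"%.*s\" [%d, %d)\n", i, len, str + pm[i].rm_so,
               (int)pm[i].rm_so, (int)pm[i].rm_eo);
        matched++;
    }

    regfree(&re);
    return matched;
}
```

Calling print_groups("(a+)|(b+)", "xxbbb") reports the whole match and the second sub-expression as "bbb" at offsets 2 and 5, and the first sub-expression as not matched.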

function regex-matches rx str

does rx match str?

Param rx:

regular expression

Type rx:

string

Param str:

string to match against

Type str:

string

Return:

see regexec

template regex-case e [clauses]

regex-case works like a simplified cond where e is the string to be matched against and the “conditions” in each clause are the regular expressions to test with.

Param e:

the string to be matched against

Type e:

string

Param clauses:

clauses like ("regex" expr)

Return:

whatever any matched clause’s consequent expression returns.

e will be evaluated and should return a string.

If the regular expression matches then the consequent expression is treated like an implicit => clause where the supplied parameter is r.

Thus r.0 represents the whole of the matched string, r.1 the first matched sub-expression, r.2 the second matched sub-expression, etc.

Example:

Suppose we want to match a common var=value assignment:

(regex-case (read-line)
  ("^([[:alpha:]][[:alnum:]_]*)=(.*)" {
    printf "%s is '%s'\n" r.1 r.2
  }))

Note

regex-case stashes the compiled regular expression for literal strings in a global table. This means that in loops the regular expression doesn’t need to be recompiled. It also means the compiled regular expressions are not reaped until Idio exits.
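The trade-off in this note (compile once, never free) can be sketched in C as a small table keyed by the pattern text. This is an illustration only: the fixed-size array, the names cached_regcomp and cached_compile_count, and the lookup by string comparison are all assumptions, not a description of Idio's table.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SIZE 16

struct cache_entry {
    const char *pattern;    /* key: must outlive the cache, e.g. a literal */
    regex_t re;             /* compiled form, never passed to regfree() */
};

static struct cache_entry cache[CACHE_SIZE];
static size_t cache_used = 0;
static size_t compile_count = 0;

/*
 * Return the compiled form of `pattern`, compiling it only the first
 * time it is seen.  Returns NULL if the pattern does not compile or
 * the table is full.
 */
const regex_t *cached_regcomp(const char *pattern)
{
    for (size_t i = 0; i < cache_used; i++) {
        if (strcmp(cache[i].pattern, pattern) == 0) {
            return &cache[i].re;
        }
    }
    if (cache_used == CACHE_SIZE) {
        return NULL;
    }
    if (regcomp(&cache[cache_used].re, pattern, REG_EXTENDED) != 0) {
        return NULL;
    }
    compile_count++;
    cache[cache_used].pattern = pattern;
    return &cache[cache_used++].re;
}

/* Number of calls to regcomp() made so far, for demonstration. */
size_t cached_compile_count(void)
{
    return compile_count;
}
```

Asking for the same pattern repeatedly inside a loop costs one regcomp call in total; the entries stay allocated until the process exits.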

See also

pattern-case

function regex-exact-string str

Return a regcomp(3)-safe version of str

Param str:

string to make safe

Type str:

string

Return:

regcomp-safe string

Rtype:

string

In particular, code points in the set $^.[()|*+?{ (see regex(7)) are escaped.
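The escaping can be sketched in C as follows. The function name escape_regex_specials is hypothetical and the code escapes exactly the set listed above, working on bytes; a backslash in the input is copied unchanged, which may differ from what Idio does.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of `str` in which every character from
 * the set $^.[()|*+?{ is preceded by a backslash.  The caller frees
 * the result.  Returns NULL if allocation fails.
 */
char *escape_regex_specials(const char *str)
{
    static const char specials[] = "$^.[()|*+?{";
    size_t len = strlen(str);
    char *out = malloc(2 * len + 1);    /* worst case: every byte escaped */

    if (out == NULL) {
        return NULL;
    }

    char *p = out;
    for (size_t i = 0; i < len; i++) {
        if (strchr(specials, str[i]) != NULL) {
            *p++ = '\\';
        }
        *p++ = str[i];
    }
    *p = '\0';
    return out;
}
```

For example, the input a.b*c becomes a\.b\*c, which as a regular expression matches only the literal text a.b*c.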

function regex-pattern-string str

Return a Pattern Matching version of str

Param str:

string to convert

Type str:

string

Return:

pattern-like string

Rtype:

string

In particular:

  • * is replaced with .*

  • ? is replaced with .

  • (simple) bracket expressions are allowed with optional *, + or ? qualifiers

    A simple bracket expression is one with no collating elements (e.g. [:alpha:]) or at most one collating element so long as it is the last element of the bracket expression list.

  • .^$|+ are (otherwise) escaped and become literals

  • { is escaped and is a literal therefore bounds ({n,m}) are not allowed

  • () are escaped and are literals therefore sub-expressions are not allowed
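The rules above can be approximated in C as shown below. This is a simplified sketch under stated assumptions: the names bracket_len and pattern_to_regex are hypothetical, bracket expressions are copied through without enforcing the "simple" restriction, and an unclosed [ is escaped as a literal, which the text above does not specify.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Length of the bracket expression starting at s[0] == '[', or 0 if it
 * is never closed.
 */
static size_t bracket_len(const char *s)
{
    size_t i = 1;

    if (s[i] == '^') i++;
    if (s[i] == ']') i++;               /* a leading ] is a literal */
    while (s[i] != '\0') {
        if (s[i] == '[' && s[i + 1] == ':') {
            /* skip a class such as [:alpha:] as one unit */
            const char *end = strstr(s + i + 2, ":]");
            if (end == NULL) return 0;
            i = (size_t)(end - s) + 2;
        } else if (s[i] == ']') {
            return i + 1;
        } else {
            i++;
        }
    }
    return 0;
}

/*
 * Convert a shell-style pattern to an extended regular expression.
 * The caller frees the result.  Returns NULL if allocation fails.
 */
char *pattern_to_regex(const char *pat)
{
    size_t len = strlen(pat);
    char *out = malloc(2 * len + 1);    /* each byte grows to at most two */
    if (out == NULL) return NULL;

    char *p = out;
    size_t i = 0;
    while (i < len) {
        char c = pat[i];
        size_t blen;

        if (c == '*') {                 /* * becomes .* */
            *p++ = '.'; *p++ = '*'; i++;
        } else if (c == '?') {          /* ? becomes . */
            *p++ = '.'; i++;
        } else if (c == '[' && (blen = bracket_len(pat + i)) > 0) {
            memcpy(p, pat + i, blen);   /* bracket expression kept as-is */
            p += blen; i += blen;
            if (pat[i] == '*' || pat[i] == '+' || pat[i] == '?') {
                *p++ = pat[i++];        /* optional qualifier kept as-is */
            }
        } else {
            if (strchr(".^$|+{()[", c) != NULL) *p++ = '\\';
            *p++ = c; i++;
        }
    }
    *p = '\0';
    return out;
}
```

With this sketch *BSD becomes .*BSD and a?.txt becomes a.\.txt, while [[:digit:]]+x is passed through unchanged.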

template pattern-case e [clauses]

pattern-case works like a simplified cond where e is the string to be matched against and the “conditions” in each clause are the pattern matches to test with.

Param e:

the string to be matched against

Type e:

string

Param clauses:

clauses like ("pattern-match" expr)

Return:

whatever any matched clause’s consequent expression returns.

e will be evaluated and should return a string.

Here, each pattern match has regex-pattern-string applied and is anchored to the entire string; the code then continues as for regex-case. The string manipulation is like:

sprintf "^%s$" (regex-pattern-string pattern-match)
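The effect of the anchoring can be seen with the plain C calls. In this sketch the function name matches_whole_string and the 256-byte buffer are arbitrary choices for illustration, and the regular expression passed in is assumed to have been converted from a pattern already.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Wrap an already-converted regular expression in ^...$ and test it
 * against `str`.  Returns false on no match, on a compile error, or if
 * the anchored expression does not fit in the local buffer.
 */
bool matches_whole_string(const char *regex, const char *str)
{
    char anchored[256];
    int n = snprintf(anchored, sizeof anchored, "^%s$", regex);

    if (n < 0 || (size_t)n >= sizeof anchored) {
        return false;
    }

    regex_t re;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) {
        return false;
    }
    bool ok = regexec(&re, str, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

Because of the anchors, .*BSD matches FreeBSD but not BSDi, and BSD on its own no longer matches FreeBSD as it would in an unanchored search.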

If the pattern matches then the consequent expression is treated like an implicit => clause where the supplied parameter is r.

Thus r.0 represents the whole of the matched string, r.1 the first matched sub-expression, r.2 the second matched sub-expression, etc.

Example:

Suppose we want an unreliable method to determine if this is a BSD-style operating system:

(pattern-case (collect-output uname -s)
  ("*BSD" {
    printf "%s is a BSD\n" r.0
  }))

Note

pattern-case stashes the compiled regular expression for literal strings in a global table. This means that in loops the regular expression doesn’t need to be recompiled. It also means the compiled regular expressions are not reaped until Idio exits.

See also

regex-case

Last built at 2024-10-13T06:10:47Z+0000 from 62cca4c (dev) for Idio 0.3.b.6