POSIX regex
Idio uses the POSIX regex(7) regular expression primitives regcomp and regexec. These are combined in the function regex-matches.
Slightly better for use in loops is the template regex-case which works like a simplified cond except the clause “conditions” are regular expressions to be matched.
regex-case then supplies the consequent block with the result of the call to regexec as the variable r. As such, r.0 is the whole of the matched string, r.1 is the first matched sub-expression, r.2 the second matched sub-expression, etc.
Similarly, pattern-case provides something like the shell’s Pattern Matching where * and ? are really .* and . respectively. In particular, see regex-pattern-string for how the string is processed.
- function regcomp rx [flags]
POSIX regex(3)
compile the regular expression in rx suitable for subsequent use in regexec
The flags are:
REG_EXTENDED
REG_ICASE
REG_NOSUB (ignored)
REG_NEWLINE
This code defaults to REG_EXTENDED so there is an extra REG_BASIC flag to disable REG_EXTENDED.
- Param rx:
regular expression
- Type rx:
string
- Param flags:
regcomp flags
- Type flags:
list of symbols
- Return:
compiled regex(3)
- Rtype:
C/pointer
- function regexec rx str [flags]
POSIX regex(3)
match the regular expression in rx against the string str where rx was compiled using regcomp
The flags are:
REG_NOTBOL
REG_NOTEOL
REG_STARTEND (if supported, see below)
REG_VERBOSE return verbose results
On a successful match an array of the subexpressions in rx is returned with the first (zero-th) being the entire matched string.
If a subexpression in rx matched, the corresponding array element will be the matched string.
If a subexpression in rx did not match, the corresponding array element will be #f.
- Param rx:
compiled regular expression
- Type rx:
C/pointer
- Param str:
string to match against
- Type str:
string
- Param flags:
regexec flags
- Type flags:
list of symbols
- Return:
see below
- Rtype:
array or #f
By default regexec returns an array of matching subexpressions or #f for no match.
If REG_VERBOSE is passed in flags then each element of the array is a list of the matched sub-expression, its starting offset and its ending offset plus one (suitable for substring).
REG_STARTEND (if supported) is a valid C flag and accepted here but is ignored as there is no means to pre-supply pmatch[0] (see regexec(3)).
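The shape of the two kinds of result can be modelled outside Idio. The following is a minimal Python sketch, not Idio code and not the Idio implementation: regexec_model is a hypothetical name, Python's re module stands in for POSIX regex, None stands in for #f, and the assumption that an unmatched sub-expression stays #f in verbose mode is mine rather than the documentation's.

```python
import re

def regexec_model(rx, s, verbose=False):
    """Model the result shapes described for regexec (hypothetical helper)."""
    m = rx.search(s)
    if m is None:
        return None  # stands in for #f: no match at all
    result = []
    for i in range(rx.groups + 1):  # element 0 is the entire match
        text = m.group(i)
        if text is None:
            # Sub-expression did not take part in the match.
            # Assumption: this is #f in both default and verbose results.
            result.append(None)
        elif verbose:
            # (matched text, start offset, offset just past the end)
            result.append((text, m.start(i), m.end(i)))
        else:
            result.append(text)
    return result
```

Because the end offset is one past the last matched character, slicing the subject string with the two offsets reproduces the matched text, which is what makes the verbose form convenient for a substring call.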
- function regex-matches rx str
does rx match str?
- Param rx:
regular expression
- Type rx:
string
- Param str:
string to match against
- Type str:
string
- Return:
see regexec
- template regex-case e [clauses]
regex-case works like a simplified cond where e is the string to be matched against and the “conditions” in each clause are the regular expressions to test with.
- Param e:
the string to be matched against
- Type e:
string
- Param clauses:
clauses like
("regex" expr)
- Return:
whatever any matched clause’s consequent expression returns.
e will be evaluated and should return a string.
If the regular expression matches then the consequent expression is treated like an implicit => clause where the supplied parameter is r.
Thus r.0 represents the whole of the matched string, r.1 the first matched sub-expression, r.2 the second matched sub-expression, etc.
- Example:
Suppose we want to match a common var=value assignment:
    (regex-case
     (read-line)
     ("^([[:alpha:]][[:alnum:]_]*)=(.*)" {
       printf "%s is '%s'\n" r.1 r.2
     }))
Note
regex-case stashes the compiled regular expression for literal strings in a global table. This means that in loops the regular expression doesn’t need to be recompiled. It also means the compiled regular expressions are not reaped until Idio exits.
- function regex-exact-string str
Return a regcomp(3)-safe version of str
- Param str:
string to make safe
- Type str:
string
- Return:
regcomp-safe string
- Rtype:
string
In particular, code points in the set $^.[()|*+?{ (see regex(7)) are escaped.
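A minimal Python sketch of this kind of escaping follows. It is a model only: the function names are hypothetical, the character set is the one listed above, and the use of a backslash prefix as the escape is an assumption.

```python
import re

# The set of code points the documentation lists as being escaped.
REGEX_SPECIALS = "$^.[()|*+?{"

def regex_exact_string(s):
    """Prefix each special code point with a backslash (assumed escape form)."""
    return "".join("\\" + ch if ch in REGEX_SPECIALS else ch for ch in s)

def contains_literal(haystack, needle):
    """Search for needle as literal text rather than as a regular expression."""
    return re.search(regex_exact_string(needle), haystack) is not None
```

Without the escaping, a needle such as 1+1=2? would be read as a regular expression and would match the text 11=2; with it, only the literal characters match.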
- function regex-pattern-string str
Return a Pattern Matching version of str
- Param str:
string to convert
- Type str:
string
- Return:
pattern-like string
- Rtype:
string
In particular:
- * is replaced with .*
- ? is replaced with .
- (simple) bracket expressions are allowed with optional *, + or ? qualifiers
  A simple bracket expression is one with no collating elements (eg. [:alpha:]) or at most one collating element so long as it is the last element of the bracket expression list.
- .^$|+ are (otherwise) escaped and become literals
- { is escaped and is a literal therefore bounds ({n,m}) are not allowed
- () are escaped and are literals therefore sub-expressions are not allowed
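The rules above can be approximated in a few lines of Python. This is a simplified sketch, not the Idio algorithm: pattern_to_regex is a hypothetical name, escaping is assumed to be a backslash prefix, a bracket expression is assumed to end at the first ] found after its first member, and bracket expressions containing elements such as [:alpha:] are deliberately not handled.

```python
# Code points that become literals outside a bracket expression.
ESCAPED = set(".^$|+{()")

def pattern_to_regex(pattern):
    """Convert a shell-like pattern to regex text (simplified model)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch == "[":
            # Look for the closing bracket, skipping the first member so
            # that a leading ] or ^ is treated as part of the expression.
            close = pattern.find("]", i + 2)
            if close == -1:
                out.append("\\[")  # no closing bracket: treat [ as a literal
            else:
                out.append(pattern[i:close + 1])  # copy the expression as-is
                i = close + 1
                # A following *, + or ? is kept as a regex qualifier.
                if i < len(pattern) and pattern[i] in "*+?":
                    out.append(pattern[i])
                    i += 1
                continue
        elif ch in ESCAPED:
            out.append("\\" + ch)
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

Note the context sensitivity: a * or ? directly after a bracket expression keeps its regular expression meaning, while anywhere else it is rewritten to .* or . respectively.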
- template pattern-case e [clauses]
pattern-case works like a simplified cond where e is the string to be matched against and the “conditions” in each clause are the pattern matches to test with.
- Param e:
the string to be matched against
- Type e:
string
- Param clauses:
clauses like
("pattern-match" expr)
- Return:
whatever any matched clause’s consequent expression returns.
e will be evaluated and should return a string.
Here, pattern matches have regex-pattern-string applied, are anchored to the entire string and the code continues like regex-case. The string manipulation is like:
sprintf "^%s$" (regex-pattern-string pattern-match)
If the pattern matches then the consequent expression is treated like an implicit => clause where the supplied parameter is r.
Thus r.0 represents the whole of the matched string, r.1 the first matched sub-expression, r.2 the second matched sub-expression, etc.
- Example:
Suppose we want an unreliable method to determine if this is a BSD-style operating system:
    (pattern-case
     (collect-output uname -s)
     ("*BSD" {
       printf "%s is a BSD\n" r.0
     }))
Note
pattern-case stashes the compiled regular expression for literal strings in a global table. This means that in loops the regular expression doesn’t need to be recompiled. It also means the compiled regular expressions are not reaped until Idio exits.
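The effect of anchoring the converted pattern to the entire string can be seen in a small Python model. The helper names are hypothetical, Python's re module stands in for POSIX regex, and the input is assumed to be regex text that has already been through a conversion like regex-pattern-string (so *BSD has become .*BSD).

```python
import re

def anchor_pattern(regex_text):
    """Mirror the sprintf "^%s$" step applied to the converted pattern."""
    return "^%s$" % regex_text

def matches_whole(regex_text, s):
    """True if the anchored pattern matches the string s."""
    return re.search(anchor_pattern(regex_text), s) is not None
```

With the anchors, .*BSD accepts FreeBSD but rejects BSDi, whereas an unanchored search for .*BSD would accept both because it only needs to find BSD somewhere in the string.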
Last built at 2024-10-13T06:10:47Z+0000 from 62cca4c (dev) for Idio 0.3.b.6