Various regex facts and findings

Using \t, \n & co in one-line nested regsubex calls

Nesting regsubex calls is useful for modifying parts of a text that lie within certain boundaries. For example, suppose you want to uppercase every a and b that appears between an X and a Y, where the string may contain multiple X-Y pairs, multiple a's and b's (in any order) and any other letters. The approach is to nest two regsubex calls: the outer one extracts each stretch of text to be modified, and the inner one performs the actual replacement. Doing all this on one line raises a question: how do you use markers like \t and \n in the <sub> field of the inner regsubex? Simply writing \t or \n there doesn't work on the same line, and without them the inner call has no way to refer to what it is replacing.

As it turns out, the problem is that mIRC does a search-and-replace on \t, \n (as well as \1, \2, \a etc.) before evaluation, and therefore can't tell which \t and \n belong to an inner regsubex call. As a result, the \t and \n meant for the inner regsubex are replaced with the contents of the outer regsubex's \t and \n, which produces the wrong output. The solution is to use the construction [[ \ $+ t ]] (and similar) in the <sub> field of the inner regsubex:

$regsubex(acaaXbaacababacaYacbbbbcccbaabbXbcccbaaacbbbcYabcac,/(X[^Y]*Y)/g,$regsubex(x,\t,/([ab])/g,$upper( [[ \ $+ t ]] )))

Among one-line forms, this, and only this, results in the correct output:

acaaXBAAcABABAcAYacbbbbcccbaabbXBcccBAAAcBBBcYabcac

Putting the whole inner regsubex call into a custom identifier and calling $that_identifier(\t) from the outer regsubex's <sub> field works just as well. Remember to give the regsubex calls unique names (I used 'x' for the inner one and an empty name for the outer one here) or things will go horribly wrong :)
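A minimal sketch of that custom-identifier approach follows. The alias name upper_ab and the local variable %text are hypothetical names chosen for illustration, and the code has not been tested against the mIRC version mentioned here.

```mirc
; Hypothetical helper (the name upper_ab is illustrative only).
; Uppercases every a and b in the text passed as the first parameter.
alias upper_ab {
  ; Copy the parameter into a local variable before handing it to $regsubex
  var %text = $1
  ; \t here belongs to this (inner) call only, so no [[ ]] trick is needed
  return $regsubex(x,%text,/([ab])/g,$upper(\t))
}
```

With the alias loaded in a remote script, the outer call would then read:

```mirc
//echo -a $regsubex(acaaXbaacababacaYacbbbbcccbaabbXbcccbaaacbbbcYabcac,/(X[^Y]*Y)/g,$upper_ab(\t))
```

Because the inner $regsubex now sits on a separate line of script, the outer call's search-and-replace never sees its \t.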

Tested on mIRC 6.21.

Saturn 2007/04/18 17:20

Discussion

Example breakdown

If you're wondering why only the above construction works, it may help to re-read jaytea's evaluation brackets article, specifically the part about brackets inside identifiers. As mentioned there, [[ and ]] are processed at the same stage as single [ ], i.e. before any actual identifier/variable evaluation. To understand what comes next, let's first have a brief look at how mIRC evaluates an identifier:

  1. Processes [ ] (pre-evaluating any variables/identifiers inside them) and [[ ]] (turning them into [ ]).
  2. Separates the identifier's parameters and evaluates each parameter (in left-to-right order).
  3. Computes the result of applying the identifier to those (evaluated) parameters.

In $regsubex(), the situation is slightly different; without knowing what's really going on behind the scenes, the process to us looks like this:

  1. Processes square brackets.
  2. Evaluates the first two parameters.
  3. Performs the regex match.
  4. Evaluates the third parameter (once for each match, so possibly several times if /g was used).
  5. Performs the substitutions and returns the result.

Now let's analyze Saturn's example. mIRC first turns [[ ]] into single [ ] and then starts evaluating the outer $regsubex's parameters. When it gets to the third parameter (i.e. <sub>), it first performs the aforementioned search-and-replace on \t, \n, \1, \2 etc. (see footnote 1). This replaces \t in the <text> parameter of the inner $regsubex with its value (or rather its internal representation). It then attempts to evaluate the inner $regsubex, i.e. it spawns a new instance of its script evaluation procedure. But thanks to the previous step, that procedure is now called to evaluate

$regsubex(x,<value of outer \t>,/([ab])/g,$upper( [ \ $+ t ] ))

The usual rules apply here; [ ] are processed first and any code inside them is pre-evaluated, so \ $+ t becomes \t. After the [] pre-processing, mIRC proceeds with the search-and-replace of \t, \1 etc, so the \t that was just constructed is given its value; this value comes from the regex matches of the inner $regsubex, since we are now in its context.

One can thus see why only the double-brackets construction works; by controlling the evaluation order, we can have the second \t constructed after the search-and-replace of the outer $regsubex but before the search-and-replace of the inner $regsubex.
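For contrast, here is the naive one-line form without the double brackets, which the first post reports does not work. The trace given after it is only a reading of the steps described above and has not been separately tested.

```mirc
; Naive form: both \t markers are visible to the outer call's search-and-replace
//echo -a $regsubex(acaaXbaacababacaYacbbbbcccbaabbXbcccbaaacbbbcYabcac,/(X[^Y]*Y)/g,$regsubex(x,\t,/([ab])/g,$upper(\t)))
```

Following those steps, the outer search-and-replace would rewrite both markers, so the inner call would be evaluated roughly as $regsubex(x,<value of outer \t>,/([ab])/g,$upper(<value of outer \t>)). Each a or b would then be replaced by the uppercased outer match rather than by its own uppercase form.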

Related case

You may have noticed above that \t in the <text> parameter of the inner $regsubex works fine (evaluating to the match from the outer $regsubex). Can you also use \t & co in the <sub> parameter of the inner $regsubex to refer to the outer \t again? The answer is no, and here's why: as we saw above, the <sub> parameter of $regsubex is evaluated after the regex match has been performed and the appropriate internal structures have been set. mIRC's search-and-replace replaces \t in the inner <sub> with its internal representation, hereafter called <t>. The problem is that <t> seems to be 'reset' by the inner $regsubex; in the context of the latter, <t> does not represent anything. The workaround is again [[ ]]: enclosing \t in them pre-evaluates its internal representation, so the inner $regsubex only sees the actual value of \t. A real-world example (I needed this functionality at one point) is the following:

$regsubex(5-7 10 14-18 20 23-29,/(\d+)-(\d+)/g,$regsubex(x,$str(.,$calc(\2 - \1 + 1)),/./g,$calc( [[ \1 + \ $+ n ]] - 1) $chr(32)))

What this does is expand each number range in the input to a space-separated list of consecutive integers. For this to work, matches from the outer $regsubex need to be used in the <sub> parameter of the inner $regsubex.

The attentive reader will have noticed a problem with this: if \t's value includes commas, parentheses etc., a syntax error will occur, for the same reason that

var %a = a,b | echo -ag $upper( [ %a ] )

generates an error. So unless you are sure that your input will not contain those special characters, you are better off with an alias.
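As a sketch of the alias route for the range-expansion example, the inner call could be moved into a helper like the one below. The alias name expand_range and the variables %first and %last are hypothetical names introduced here, and the code is untested. The parameters are copied into local variables before being used in <sub>; this is a precaution prompted by the footnote's remark that $1, $2 etc. are involved in the internal handling of \1, \2 and friends.

```mirc
; Hypothetical helper: returns the integers from $1 to $2, each followed by a space.
alias expand_range {
  ; Use locals inside <sub> rather than $1 and $2 (see the precaution above)
  var %first = $1, %last = $2
  ; One dot per number in the range; \n is the match number within this call
  return $regsubex(x,$str(.,$calc(%last - %first + 1)),/./g,$calc(%first + \n - 1) $chr(32))
}
```

The outer call then passes its captured groups as ordinary parameters:

```mirc
//echo -a $regsubex(5-7 10 14-18 20 23-29,/(\d+)-(\d+)/g,$expand_range(\1,\2))
```

Here the outer matches reach the helper as parameter values, so nothing has to be spliced into the inner call's source text with [[ ]].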

qwerty 2007/04/18 23:00


1) In reality those sequences are not directly replaced by their value (this would cause problems with special characters like commas, opening/closing parentheses etc) but by an intermediate internal representation; $1, $2 etc are involved in this, although the exact mechanism is unclear to me.
regex.txt · Last modified: 2011/10/17 23:51 (external edit)
 