The spaces problem, part 2: workarounds and their disadvantages

Author — Saturn 2006/10/24 20:45

Although preserving spaces is possible to a certain extent, one of the main problems is that displaying text with all of its spaces intact is not really possible without the help of a DLL. As a result, people have been working around the problem for many years, using hackish methods to achieve nearly the same result. This article gives an overview of these methods and the disadvantages of each of them. The focus here is on displaying text in the main display areas, e.g. in channel windows.

Bold pairs

One method of preventing mIRC from removing multiple consecutive spaces is to separate them by inserting invisible codes between each pair of consecutive spaces. Although two bold characters are typically used for that purpose, a pair of any other self-reversing control codes would do as well, such as reverse or underline. This is a rather simple alias to insert bold characters in an input string that contains multiple consecutive spaces:

alias insertbold return $regsubex($1-,/(?<= )(?= )/g,$chr(2) $+ $chr(2))
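
To make the effect of the lookaround regex easier to inspect outside mIRC, here is a minimal Python sketch of the same substitution. The function name `insert_bold` is hypothetical, and the sketch only models the string transformation, not mIRC's display behaviour:

```python
import re

BOLD = "\x02"  # the bold control code, i.e. $chr(2)

def insert_bold(text: str) -> str:
    """Insert a bold pair at every position that has a space on both sides."""
    # Zero-width match: a space behind and a space ahead of the position.
    return re.sub(r"(?<= )(?= )", BOLD + BOLD, text)

# Three spaces have two "gaps" between them, so two bold pairs are inserted.
print(repr(insert_bold("a   b")))  # 'a \x02\x02 \x02\x02 b'
```

Single spaces are left untouched, because no position between them has a space on both sides.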

One of the main disadvantages of using bold characters is how it affects display drawing speed. mIRC's text drawing routine draws as many characters as possible with one drawing call, but it is forced to issue a new call each time it finds a non-displayable character in the input text. All control codes are obviously non-displayable characters. Therefore, the bold characters of this method force mIRC to draw each space character individually, and this is actually noticeable in drawing speed if there are lots of multiple consecutive spaces on the current screen.

Another potential danger of this method is that each extra space takes three bytes. Given that for example a line of text from IRC could potentially contain mostly spaces, insertion of such pairs of bold characters may cause the line length to triple and thereby exceed mIRC's maximum line length limit.
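
The growth can be estimated before sending or displaying a line. The following Python sketch is only an illustration: `bold_pair_length` is a hypothetical helper, and it assumes one byte per character in the input:

```python
def bold_pair_length(text: str) -> int:
    """Length of text after a bold pair is inserted between adjacent spaces."""
    # Count every position where a space is directly followed by a space.
    adjacent = sum(1 for a, b in zip(text, text[1:]) if a == " " and b == " ")
    # Each such position gains two extra bytes (two bold characters).
    return len(text) + 2 * adjacent

# A run of n spaces grows to n + 2 * (n - 1) = 3n - 2 bytes.
print(bold_pair_length(" " * 100))  # 298
```

A scripter could compare such an estimate against the line length limit and fall back to another method when it would be exceeded.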

An advantage of using bold characters is that normal marking of the text in a window will copy a string with the original spaces to the clipboard, suitable for pasting into other environments. Of course this does not apply if the user preserves control codes by holding Ctrl while copying text.

On the other hand: pretty much anywhere outside the display area, the bold characters will show up as blocks, so this method is definitely not suitable for e.g. dialogs.

Hard spaces

A different and perhaps more widely used approach is to replace multiple consecutive (normal) spaces with hard spaces. A hard space is basically a character that looks like a space, but isn't interpreted as a space by applications and is therefore also not affected by mIRC's way of manipulating spaces. The hard space has character code 160 in the common Windows codepages (it lies outside 7-bit ASCII), and is therefore commonly referred to as "$chr(160)".

The main disadvantage of $chr(160) is that this character is not universally recognized as a hard space: people using different codepages, fonts or operating systems may not see $chr(160) as a space. Additionally, when copying the text, the characters will remain hard spaces even in contexts where this may not be desired, such as dialogs and text files.

Another disadvantage of hard spaces lies in the fact that they are indeed "hard", with the result that mIRC's text display code does not see them as spaces and will act differently with regard to, for example, text wrapping. A solution to this is to alternate between normal spaces and hard spaces, instead of simply replacing all normal spaces with hard spaces. For example with this alias:

alias hardspaces {
  return $regsubex($1-,/( +)(?= )/g,$str($chr(32) $+ $&
    $chr(160),$calc($len(\t) /2)) $+ $iif(2 \\ $len(\t),$chr(160)))
}

This yields the following kinds of output (with $chr(160) represented by an underscore). Note how it makes sure there is always a normal space before the second word, so that wrapped words don't start with a hard space:

a b
a_ b
a _ b
a __ b
a _ _ b
a _ __ b
a _ _ _ b
a _ _ __ b
a _ _ _ _ b
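
The alternation pattern above can be reproduced outside mIRC with a short Python sketch. This is a model of the alias's logic as read from its regex and replacement, not a port of mIRC's evaluation; the name `hard_spaces` and its `hard` parameter are assumptions made for illustration:

```python
import re

HARD = "\xa0"  # character code 160, the hard space

def hard_spaces(text: str, hard: str = HARD) -> str:
    """Alternate normal and hard spaces in every run of two or more spaces."""
    def repl(match: re.Match) -> str:
        n = len(match.group(1))  # all spaces of the run except the last one
        # n // 2 "space + hard space" pairs, plus one hard space if n is odd.
        return (" " + hard) * (n // 2) + (hard if n % 2 else "")
    # The lookahead leaves the final space of each run as a normal space.
    return re.sub(r"( +)(?= )", repl, text)

# Print the first rows of the table, using "_" for the hard space.
for count in range(1, 6):
    print(hard_spaces("a" + " " * count + "b", hard="_"))
```

Because the last space of each run is never part of the match, the word that follows is always preceded by a normal space.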

Discussion

An alternative implementation of the above functionality is:

alias hardspaces-alt {
  return $regsubex($1,/(?:(( )+)\1|( )())(?= )/g,$str(\2 $+ $chr(160),$len(\1)))
}

In this approach, usage of mIRC code in the <subtext> parameter is minimized, with the downside that the regular expression became more complex. So which is faster? The answer depends on the input strings. Generally, $hardspaces-alt becomes faster as the number of groups of multiple consecutive spaces increases. For example, if a string contains 50 groups of triple-spaces (3 spaces in a row), $hardspaces-alt is more than 100% faster than $hardspaces (because the latter evaluates more mIRC code per group). Also, $hardspaces-alt becomes slower as the total lengths of groups of spaces increase (because of its more complex regex). For example, with a string containing two groups of 200-spaces, $hardspaces is about 25% faster than $hardspaces-alt.

Conclusion: the optimal choice should be based on the structure of the expected input and, ideally, the results of further benchmarks by the scripter. In a somewhat common scenario, that is a string within the limits of an IRC message (shorter than around 500 characters) containing 3 groups of 5-spaces, $hardspaces-alt is about 70% faster. Needless to say that all these numbers are based on tests on my computer with strings I thought were representative, so the interested reader is encouraged to perform their own benchmarks and discuss any differences here. — qwerty

Unicode

A relatively new alternative to hard spaces is the use of Unicode; as characters can also be encoded as UTF-8 sequences, it is possible to use this to avoid mIRC's space stripping as well.

Unfortunately the normal space character cannot be encoded this way: it would be the UTF-8 sequence "192 160", but mIRC (rightly) does not accept overlong UTF-8 sequences that represent 7-bit characters. The hard space can be encoded this way though, as the UTF-8 sequence "194 160". This has the major advantage that it is the standardized equivalent of the hard space, and as such it will be interpreted correctly even with less standard codepages, fonts and operating environments, and can therefore be used in communication as well. This comes at the cost of two bytes per space character instead of one (once again bringing along a risk of exceeding the line length limit), as well as incompatibility with clients that do not support Unicode (e.g. older mIRC versions) or are set not to interpret it. The problem of text wrapping remains, even if you use alternative UTF-8 encoded soft spaces (for example U+2002 which is "226 128 130") as mIRC does not recognize these as soft anyway.
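
The two byte sequences can be checked with any strict UTF-8 implementation. The Python sketch below uses Python's own decoder as a stand-in (it says nothing about mIRC's internals); `is_valid_utf8` is a hypothetical helper:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The hard space U+00A0 encodes to the two bytes 194 160.
print(list("\u00a0".encode("utf-8")))    # [194, 160]

# 192 160 would be an overlong encoding of the normal space and is rejected.
print(is_valid_utf8(bytes([192, 160])))  # False
print(is_valid_utf8(bytes([194, 160])))  # True
```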

Similarly, an alternative to the double-bold approach is the use of a Unicode character with no width, for example U+2060 which is UTF-8 sequence "226 129 160". As this is one byte more than the double-bold trick (resulting in four bytes for each additional space), the merits of this approach are questionable.

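; usage: $utfseq(226 129 160)
; converts a space-separated list of byte values into the corresponding characters
; \1 is the captured number only; \t would also include the trailing whitespace
alias utfseq return $regsubex($1-,/(\d+)\s*/g,$chr(\1))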
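
For comparison, the same conversion can be sketched in Python to confirm which character a byte list stands for and what it costs per additional space. The Python function `utfseq` merely mirrors the alias's name and is an illustration, not part of mIRC:

```python
def utfseq(codes: str) -> bytes:
    """Turn a space-separated list of decimal byte values into bytes."""
    return bytes(int(n) for n in codes.split())

word_joiner = utfseq("226 129 160")

# The three bytes decode to the zero-width character U+2060.
print(hex(ord(word_joiner.decode("utf-8"))))  # 0x2060

# Cost per additional space: separator bytes plus the space itself.
print(len(word_joiner) + 1)     # 4 (U+2060 approach)
print(len(b"\x02\x02") + 1)     # 3 (double-bold approach)
```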

Conclusion

All of the methods above have significant downsides. As long as there is no proper built-in way of displaying text while preserving spaces, it is likely that scripters will continue to use the methods listed above in the future, but at least they can now do so while being aware of the problems.