RegexSearch - robmcmullen/peppy GitHub Wiki

Regex Search and Replace

The StyledTextCtrl (STC) supports only very simple regular expression matching; for full Python-style regular expressions, the text must be matched as a Python string outside of the STC. In the initial proof-of-concept implementation, for example, the STC text from the search start to the end of the document was converted to a Python string and the regex was matched against that:

#!python
        text = self.stc.GetTextRange(start, self.stc.GetTextLength())
        index = 0
        
        match = self.regex.search(text, index)
        if match:
            pos = start + len(text[:match.start(0)].encode('utf-8'))
            count = len(text[match.start(0):match.end(0)].encode('utf-8'))
            self.highlightSelection(pos, count)

This is slow, however, because the STC text has to be converted to a Python string on every match attempt. With a very long file the delay becomes noticeable, so a more efficient approach is needed.

Shadow text

A shadow copy of the text in the STC can be created instead. doFindNext can't manage this by itself, because it doesn't know when to take a new copy and when to reuse the existing one. It is guaranteed, however, that every time setFlags is called, the user has changed either the text or the search string. A changed search string doesn't matter here, but a change to the text is what we want to hook into.
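
The caching idea can be sketched roughly as follows. This is only an illustration, assuming the shadow is discarded in the flag-setting hook and copied again lazily; `FakeBuffer`, `ShadowCache`, `set_flags`, `verify_shadow` and the `copies` counter are hypothetical names, not part of peppy or wxPython.

```python
class FakeBuffer:
    """Hypothetical stand-in for the STC; it only holds text."""

    def __init__(self, text):
        self.text = text


class ShadowCache:
    """Keeps one copy of the buffer text until told that something changed."""

    def __init__(self, buf):
        self.buf = buf
        self.shadow = None
        self.copies = 0  # how many times the buffer has been copied

    def set_flags(self):
        # Assumed to be called whenever the user edits the text or the
        # search string: drop the shadow so the next search re-copies it.
        self.shadow = None

    def verify_shadow(self):
        # Copy the buffer only when no valid shadow exists.
        if self.shadow is None:
            self.shadow = self.buf.text
            self.copies += 1
        return self.shadow


buf = FakeBuffer("abc yy abc")
cache = ShadowCache(buf)
cache.verify_shadow()
cache.verify_shadow()        # reuses the existing shadow, no second copy
buf.text = "abc yy abc yy"   # the user edits the text...
cache.set_flags()            # ...and the hook invalidates the shadow
cache.verify_shadow()        # a fresh copy is taken here
```

Repeated searches between edits then cost one copy in total rather than one copy per match attempt.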

The shadow copy keeps a pristine version of the text from before any replacements were made. Once replacements have been made, however, the length of the text in the STC changes relative to the shadow copy. So, we need to maintain a pair of equivalent positions: the current position in the shadow and the corresponding position in the STC. Here's the shadow index compared to the STC index after replacing the first "yy" with "ZZZZ":

shadow:      xxxxxxxxxxxyyxxxxxxyyxxxxyyxxxxyyxxxxyyxxxxyyxxxx
shadow index:             ^
stc:         xxxxxxxxxxxZZZZxxxxxxyyxxxxyyxxxxyyxxxxyyxxxxyyxxxx
stc index:                  ^
After several replacements:
shadow:      xxxxxxxxxxxyyxxxxxxyyxxxxyyxxxxyyxxxxyyxxxxyyxxxx
shadow index:                                 ^
stc:         xxxxxxxxxxxZZZZxxxxxxZZZZxxxxZZZZxxxxZZZZxxxxyyxxxxyyxxxx
stc index:                                            ^
This is further complicated by the fact that Unicode characters can occupy more than one byte: Python strings count each character as one position, while the STC counts positions in bytes.
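
The difference between character and byte positions can be seen in a small standalone example. This is a sketch for Python 3, where `str` holds Unicode characters, and it assumes the STC stores its text as UTF-8; `char_to_byte_offset` is a hypothetical helper name.

```python
def char_to_byte_offset(text, char_index):
    """Return the UTF-8 byte offset of the character at char_index."""
    return len(text[:char_index].encode("utf-8"))


# "naive cafe yy" with two accented letters, each 2 bytes long in UTF-8.
text = "na\u00efve caf\u00e9 yy"
char_index = text.index("yy")                        # position in the Python string
byte_index = char_to_byte_offset(text, char_index)   # position a byte-based buffer would use

print(char_index, byte_index)
```

Here the two accented characters push the byte position two places past the character position, which is why the search code encodes a slice of the text to find the byte offset.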

So, after every replacement, the length of the STC text changes, and the equivalent positions in both the shadow and the STC must be updated.
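
The bookkeeping can be tried out without an STC at all. The following is a minimal sketch, not the peppy implementation: a `bytearray` stands in for the STC's byte buffer, `match.expand` is used to build the replacement instead of `regex.sub`, and `replace_all` is a hypothetical name. It assumes Python 3 and does not handle patterns that match the empty string.

```python
import re


def replace_all(text, pattern, replacement):
    """Replace every match, tracking character and byte positions in parallel."""
    regex = re.compile(pattern)
    shadow = text                          # pristine copy, never modified
    buf = bytearray(text.encode("utf-8"))  # stand-in for the STC's byte buffer
    shadow_pos = 0                         # character index into the shadow
    buf_pos = 0                            # equivalent byte index into buf
    while True:
        match = regex.search(shadow, shadow_pos)
        if match is None:
            break
        if match.start() == match.end():
            raise ValueError("zero-width matches are not handled in this sketch")
        # Skip the unmatched text, measured in bytes rather than characters.
        start = buf_pos + len(shadow[shadow_pos:match.start()].encode("utf-8"))
        old = match.group(0).encode("utf-8")
        new = match.expand(replacement).encode("utf-8")
        buf[start:start + len(old)] = new
        # The shadow advances by characters, the buffer by replacement bytes.
        shadow_pos = match.end()
        buf_pos = start + len(new)
    return buf.decode("utf-8"), shadow_pos, buf_pos


result, shadow_pos, buf_pos = replace_all("xxxxxxxxxxxyyxxxxxxyyxxxx", "yy", "ZZZZ")
print(result, shadow_pos, buf_pos)
```

With this input, the shadow position ends at 21 while the buffer position ends at 25, because each replacement adds two bytes to the buffer and leaves the shadow untouched.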

Here's the relevant code from FindRegexService in find_replace.py:

#!python
    def verifyShadow(self, start=-1, incremental=False):
        if self.shadow is None or start >= 0:
            if start < 0:
                sel = self.stc.GetSelection()
                if incremental:
                    start = min(sel)
                else:
                    start = max(sel)
            self.shadow = self.stc.GetTextRange(start, self.stc.GetTextLength())
            self.shadow_equiv_pos = 0
            self.stc_equiv_start = start
            self.stc_equiv_pos = start

    def doFindNext(self, start=-1, incremental=False):
        if not self.settings.find:
            return None, None
        self.getFlags()
        if self.regex is None:
            return _("Incomplete regex"), None
        
        self.verifyShadow(start, incremental)
        
        match = self.regex.search(self.shadow, self.shadow_equiv_pos)
        if match:
            # Because unicode characters are stored as utf-8 in the stc and the
            # positions in the stc correspond to the raw bytes, not the number
            # of unicode characters, we have to find out the offset to the
            # unicode chars in terms of raw bytes.
            pos = self.stc_equiv_pos + len(self.shadow[self.shadow_equiv_pos:match.start(0)].encode('utf-8'))
            count = len(self.shadow[match.start(0):match.end(0)].encode('utf-8'))
            self.stc_equiv_start = pos
            self.stc_equiv_pos = pos + count
            
            self.shadow_equiv_pos = match.end(0)
            
            dprint("match=%s shadow: (%d-%d) equiv=%d, stc: (%d-%d) equiv=%d" % (match.group(0), match.start(0), match.end(0), self.shadow_equiv_pos, pos, pos+count, self.stc_equiv_pos))
            self.stc.SetSelection(self.stc_equiv_start, self.stc_equiv_pos)
        
        else:
            pos = -1
        
        return pos, start
    
    def doReplace(self):
        self.verifyShadow()
        
        # We assume that doFindNext has been called, setting up the equivalent
        # start and end positions
        replacing = self.stc.GetTextRange(self.stc_equiv_start, self.stc_equiv_pos)
        try:
            replacement = self.regex.sub(self.settings.replace, replacing)
        except re.error as e:
            raise ReplacementError("Regex error: %s" % e)
        
        self.stc.SetTargetStart(self.stc_equiv_start)
        self.stc.SetTargetEnd(self.stc_equiv_pos)
        
        # The stc equivalent position must be adjusted for the difference in
        # numbers of bytes, not numbers of characters.
        self.stc_equiv_pos += len(replacement.encode('utf-8')) - len(replacing.encode('utf-8'))
        
        self.stc.ReplaceTarget(replacement)
        self.stc.SetSelection(self.stc_equiv_start, self.stc_equiv_pos)