Search and replace with regex - electricbookworks/constitution GitHub Wiki

When converting PDF text to markdown, you can speed up the process with regex search-and-replace.

Once we had cropped the original PDFs, saved them as .docx and converted them to markdown using pandoc, our markdown was still very rough. So our first clean-up step was to run a batch sequence of regex search-and-replaces on each document. While developing the sequence took time (about a day), it saved many days of manual clean-up.

(If you're new to regex, this is a good intro.)

Naturally the regex couldn't fix everything and there was still a lot of human clean-up to do (e.g. numbering headings, correcting list-item structure, fixing incorrect cases, etc.).

Our batch regex

To create and run our batch sequence, we used Sublime Text 3 with the RegReplace plugin.

Here are our regex searches (as we saved them in reg_replace.sublime-settings):

{
    "replacements": {
        "const-unblockquote": {
            "find": "\\n>",
            "replace": "\\n"
        },
        "const-remove-html-comments": {
            "find": "<!\\-\\- \\-\\->",
            "replace": ""
        },
        "const-setext-to-hash-h1s":  {
            "find": "(.*)\\n(.*)[\\=]{2,}",
            "replace": "# \\1\\n"
        },
        "const-setext-to-hash-h2s":  {
            "find": "(.*)\\n(.*)[\\-]{2,}",
            "replace": "## \\1\\n"
        },
        "const-unwrap-lines": {
            "find": "(?<!\\n)\\n(?!\\n)",
            "replace": ""
        },
        "const-reduce-text-spaces": {
            "find": "([A-Z,a-z,0-9,\\,\\.,\\;])[ ]{2,}([a-z,A-Z,0-9,\\,\\.,\\;])",
            "replace": "\\1 \\2"
        },
        "const-remove-linestart-spaces": {
            "find": "\\n ([^\\s])",
            "replace": "\\n\\1"
        },
        "const-remove-escape-slashes": {
            "find": "\\\\([\\(,\\).\\[,\\],\\.])",
            "replace": "\\1"
        },
        "const-remove-asterisks": {
            "find": "\\*",
            "replace": ""
        },
        "const-fix-unwrapped-setext-h1s": {
            "find": "\\n\\n([A-Z,a-z,0-9,\\s]*)[\\=]{2,}",
            "replace": "\\n\\n# \\1"
        },
        "const-fix-unwrapped-setext-h2s": {
            "find": "\\n\\n([A-Z,a-z,0-9,\\s]*)[\\-]{2,}",
            "replace": "\\n\\n## \\1"
        },
        "const-list-marker-spaces-to-tabs": {
            "find": "\\n(\\d)\\.[\\s]{2,}",
            "replace": "\\n\\1.\\t"
        },
        "const-remove-unnec-list-parent": {
            "find": "[\\d]+\\.\\s+(\\(1\\))",
            "replace": "(1)"
        },
        "const-all-parent-clauses-no-1": {
            "find": "(?P<heading>[#]+.+\\n\\n)\\d+\\.\\s",
            "replace": "\\g<heading>1.\\t"
        },
        "const-make-clauses-list-items": {
            "find": "\\n\\((\\d)\\)\\s",
            "replace": "\\n\\1.\\t"
        },
        "const-CHECK-indented-romans-fix": {
            "find": "\\n\\t[i,v]+\\.\\s+",
            "replace": "\\n\\t\\t1.\\t"
        },
        "const-make-subclauses-list-children": {
            "find": "\\n\\t[a-z]+\\.\\s+",
            "replace": "\\n\\t1.\\t"
        },
        "const-CHECK-unindented-romans-fix": {
            "find": "\\n[i,v]+\\.\\s+",
            "replace": "\\n\\t\\t1.\\t"
        },
        "const-indent-lower-alpha-subclauses": {
            "find": "\\n[a-z]\\.\\s+",
            "replace": "\\n\\t1.\\t"
        },
        "const-close-up-parent-child-list-items": {
            "find": "(\\#.*\\n\\n1\\.\\t(.)+\\n)\\n",
            "replace": "\\1"
        },
        "const-put-back-empty-lines-at-headings": {
            "find": "(.)\\n\\#",
            "replace": "\\1\\n\\n#"
        },
        "const-close-up-clause-list-items": {
            "find": "(\\t.*\\n)\\n(\\d)",
            "replace": "\\1\\2"
        },
        "const-close-up-subclause-list-items": {
            "find": "(\\t.*\\n)\\n(\\t\\d)",
            "replace": "\\1\\2"
        },
        "const-close-up-subsubclause-list-items": {
            "find": "(\\t.*\\n)\\n(\\t\\t\\d)",
            "replace": "\\1\\2"
        },
        "const-close-up-subsubclause-roman-list-items": {
            "find": "(\\t.*\\n)\\n(\\t\\t[i,v])",
            "replace": "\\1\\2"
        },
        "const-tab-spaces-to-tab-tab": {
            "find": "\\t    ",
            "replace": "\\t\\t"
        },
        "const-more-romans-to-list-items": {
            "find": "\\t([i,v]*)\\.",
            "replace": "\\t1."
        },
        "const-fix-parent-turn-into-subclause": {
            "find": "\\n(\\d\\.\\t)\\(a\\) ",
            "replace": "\\n\\1\\n\\t1.\\t"
        },
        "const-bracketed-romans-to-list-items": {
            "find": "\\n\\([i,v]+\\)\\s",
            "replace": "\\n\\t\\t1.\\t"
        },
        "const-markdown-square-bracketed-notes": {
            "find": "\\n(\\[.*\\]\\n)",
            "replace": "\\n\\n> \\1{:.note}\\n"
        },
        "const-remove-multiple-empty-lines": {
            "find": "\\n{3,}",
            "replace": "\\n\\n"
        }
    }
}
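
To see what a single entry does outside Sublime, here is a minimal sketch in plain Python. It is not RegReplace itself: it assumes the entry can be read as ordinary JSON and that Python's `re.sub` is close enough to what RegReplace does for this rule. The `rule_json` variable and the sample text are made up for illustration.

```python
import json
import re

# One entry copied from the settings above, kept as raw JSON text.
rule_json = r'''
{
    "find": "(.*)\\n(.*)[\\=]{2,}",
    "replace": "# \\1\\n"
}
'''

rule = json.loads(rule_json)

# After JSON decoding, each doubled backslash is a single regex backslash.
print(rule["find"])

# Hypothetical sample: a setext-style heading as pandoc might emit it.
sample = "Chapter 1\n=========\n\nSome text.\n"
result = re.sub(rule["find"], rule["replace"], sample)
print(repr(result))
```

The result still has an extra empty line after the heading, which is the kind of leftover that a later rule such as const-remove-multiple-empty-lines tidies up.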

Those were then chained together in our User Default.sublime-commands like this:

[
    // All constitution scrubs in one
    {
        "caption": "Constitution scrub: all in one",
        "command": "reg_replace",
        "args": {"replacements": [
            "const-unblockquote",
            "const-remove-html-comments",
            "const-setext-to-hash-h1s",
            "const-setext-to-hash-h2s",
            "const-unwrap-lines",
            "const-reduce-text-spaces",
            "const-remove-linestart-spaces",
            "const-remove-escape-slashes",
            "const-remove-asterisks",
            "const-fix-unwrapped-setext-h1s",
            "const-fix-unwrapped-setext-h2s",
            "const-list-marker-spaces-to-tabs",
            "const-remove-unnec-list-parent",
            "const-all-parent-clauses-no-1",
            "const-make-clauses-list-items",
            "const-CHECK-indented-romans-fix",
            "const-make-subclauses-list-children",
            "const-CHECK-unindented-romans-fix",
            "const-indent-lower-alpha-subclauses",
            "const-close-up-parent-child-list-items",
            "const-put-back-empty-lines-at-headings",
            "const-close-up-clause-list-items",
            "const-close-up-subclause-list-items",
            "const-close-up-subsubclause-list-items",
            "const-close-up-subsubclause-roman-list-items",
            "const-tab-spaces-to-tab-tab",
            "const-more-romans-to-list-items",
            "const-fix-parent-turn-into-subclause",
            "const-bracketed-romans-to-list-items",
            "const-markdown-square-bracketed-notes",
            "const-remove-escape-slashes",
            "const-remove-multiple-empty-lines",
            "const-reduce-text-spaces"
            ]
        }
    },
]
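
The idea of chaining can be sketched in plain Python as well. This is only an approximation under stated assumptions: it uses three of the rules above, rewritten as Python raw strings, and a hypothetical `scrub` helper that is not part of RegReplace.

```python
import re

# A small subset of the rules above, written as Python raw strings
# (single backslashes here, not the doubled ones the settings file needs).
RULES = {
    "const-unblockquote": (r"\n>", r"\n"),
    "const-remove-asterisks": (r"\*", ""),
    "const-remove-multiple-empty-lines": (r"\n{3,}", r"\n\n"),
}

# Order matters: each rule sees the output of the previous one.
SEQUENCE = [
    "const-unblockquote",
    "const-remove-asterisks",
    "const-remove-multiple-empty-lines",
]


def scrub(text, sequence=SEQUENCE, rules=RULES):
    """Apply each named rule in order and return the cleaned text."""
    for name in sequence:
        find, replace = rules[name]
        text = re.sub(find, replace, text)
    return text


# Hypothetical sample with a blockquote marker, bold markers and extra empty lines.
sample = "Intro\n> **Quoted** line\n\n\n\nNext paragraph\n"
print(repr(scrub(sample)))
```

Note that the space left behind by the removed `>` survives in this subset; in the full sequence, const-remove-linestart-spaces is the rule that deals with it.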

Then, to run the sequence in Sublime Text with the file to fix open, we opened the command palette (Ctrl+Shift+P on Windows), typed 'const' to find the 'Constitution scrub: all in one' command, and selected it. A few seconds later the work was done!

While we were fixing bugs in the regex, we included this in Default.sublime-commands, too:

    // Debug commands
    {
        "caption": "Constitution: debug regex",
        "command": "reg_replace",
        "args": {"replacements": [
            "const-all-parent-clauses-no-1"
            ], "find_only": true
        }
    },

By changing which replacement was listed in the arguments, and setting find_only to true, we could try out one regex at a time (command palette > 'Constitution: debug regex') and see what it found, or whether it threw an error.
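
A rough equivalent of this find-only check can be sketched in plain Python. It assumes Python's `re` module stands in for RegReplace; the `find_only` helper and the sample text are hypothetical.

```python
import re

# Pattern from "const-all-parent-clauses-no-1", as a Python raw string.
PATTERN = r"(?P<heading>[#]+.+\n\n)\d+\.\s"


def find_only(pattern, text):
    """Return (start, end, matched_text) for each match, changing nothing."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]


# Hypothetical sample: a heading followed by a numbered clause.
sample = "# Founding provisions\n\n7. The Republic\n"
for start, end, matched in find_only(PATTERN, sample):
    print(start, end, repr(matched))
```

If the pattern itself is malformed, `re.finditer` raises `re.error`, which is a quick way to catch a broken regex before it goes into the chain.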

Useful tips

Since we were learning as we went, we picked up a few things that may be new to you too:

  • When running a simple regex in Sublime's own find panel, you use the normal single-backslash escape character. But when you're running regex through another program (in this case RegReplace, whose settings are stored as JSON strings), you need double backslashes, because the backslashes themselves first have to be escaped for that program. Sometimes you need even more: matching a literal backslash takes four, as in const-remove-escape-slashes above.

  • RegReplace uses Python regex. In a replacement, a numbered group followed directly by a literal digit is ambiguous (\\11 would be read as group 11, not as group 1 followed by '1'), so we had to use Python's special syntax for naming a group:

    "find": "(?P<heading>[#]+.+\\n\\n)\\d+\\.\\s",
    "replace": "\\g<heading>1.\\t"
    
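
The named-group tip can be checked in plain Python. This is a small sketch that assumes Python's `re` module behaves like RegReplace for this pattern; the sample text is invented for illustration.

```python
import re

# Pattern from "const-all-parent-clauses-no-1", as a Python raw string.
pattern = r"(?P<heading>[#]+.+\n\n)\d+\.\s"
sample = "## Citizenship\n\n3. There is a common citizenship\n"

# \g<heading> ends the group reference explicitly, so the "1" that
# follows is a literal digit.
fixed = re.sub(pattern, r"\g<heading>1.\t", sample)
print(repr(fixed))

# With a numbered reference, "\11" is read as group 11, which this
# pattern does not have, so Python raises re.error.
try:
    re.sub(pattern, r"\11.\t", sample)
except re.error as exc:
    print("error:", exc)
```

Python also accepts `\g<1>` for a numbered group, which avoids the same ambiguity without naming the group.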