File conversion - electricbookworks/constitution GitHub Wiki

Apart from the English language version, we've worked from the official DOJ PDF versions of the constitution.

Our conversion process evolved as we worked. This page records the steps we settled on.

  1. Open PDF in Acrobat Pro

  2. Delete prelim and endmatter pages (we'll create those manually later in markdown)

  3. Crop pages to remove header and footer

  4. Save as Word (.docx)

  5. Convert .docx to markdown (.md) with pandoc:

    • At the command line, navigate to the folder containing the files to convert (using cd or, in Windows 10, by typing powershell into the address bar of File Explorer while in the relevant folder).
    • Run this command (changing file to the name of your file): pandoc -S -f docx -t markdown file.docx --output=file.md
      (The -S option works in pandoc 1.x. pandoc 2.0 removed it, so leave it out on newer versions.)

  6. Save the markdown file in the relevant language folder as scrub-en.md while we work on it (replacing en with the relevant language code).

  7. Run the scrub file through a batch regex search-and-replace.

  8. Clean up markdown manually (in a good code editor like Sublime Text 3 with the MarkdownEditing package installed and its syntax set to MultiMarkdown).

  9. Divide into separate files per chapter/schedule/annexure and create prelims, copy-pasting from PDF over a copy of the English-version prelims.
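The pandoc call in step 5 could be scripted when there are many .docx files to convert. This is a minimal sketch, not part of our recorded process: the function names and the `legacy_smart` parameter are made up for illustration, and it assumes pandoc is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path: Path, legacy_smart: bool = True) -> list[str]:
    """Build the pandoc argument list from step 5 for one .docx file."""
    md_path = docx_path.with_suffix(".md")
    command = ["pandoc"]
    if legacy_smart:
        # -S is the flag recorded in step 5. pandoc 2.0 removed it,
        # so pass legacy_smart=False when running a newer pandoc.
        command.append("-S")
    command += ["-f", "docx", "-t", "markdown", str(docx_path), f"--output={md_path}"]
    return command


def convert_folder(folder: Path, legacy_smart: bool = True) -> None:
    """Run pandoc once for every .docx file found directly inside folder."""
    for docx_path in sorted(folder.glob("*.docx")):
        # check=True raises CalledProcessError if pandoc exits with an error.
        subprocess.run(build_pandoc_command(docx_path, legacy_smart), check=True)
```

Calling `convert_folder(Path("."))` from the folder holding the Word files would write one .md file next to each .docx file.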
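The batch regex search-and-replace in step 7 might look something like the sketch below. The patterns in `RULES` are hypothetical examples of common cleanup after a PDF-to-Word-to-markdown conversion; the rules we actually ran are not recorded on this page.

```python
import re
from pathlib import Path

# Hypothetical rules for illustration only. Each entry is (pattern, replacement)
# and the rules are applied in order.
RULES = [
    (r"[ \t]+$", ""),         # strip trailing whitespace from every line
    (r"\n{3,}", "\n\n"),      # collapse runs of blank lines into one blank line
    (r"(?<=\S) {2,}", " "),   # squeeze repeated spaces that follow a word
]


def scrub(text: str, rules=RULES) -> str:
    """Apply every (pattern, replacement) rule to text, in order."""
    for pattern, replacement in rules:
        # MULTILINE lets $ match at the end of each line, not just the text.
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


def scrub_file(path: Path) -> None:
    """Rewrite a scrub file in place with the rules applied."""
    path.write_text(scrub(path.read_text(encoding="utf-8")), encoding="utf-8")
```

Keeping the rules in one ordered list means the same pass can be rerun on each language's scrub file, for example `scrub_file(Path("scrub-en.md"))`.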
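The division into separate files in step 9 could be partly automated once the markdown is clean. The sketch below assumes that every chapter, schedule or annexure begins at a level-1 heading (a line starting with `# `), which may not hold for every scrub file. The numbered file-naming scheme is also an invention for this example, not the naming used in the repository.

```python
import re
from pathlib import Path

# Assumption: each chapter, schedule or annexure starts at a level-1 heading.
HEADING = re.compile(r"^# +(.+)$", re.MULTILINE)


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (title, text) pairs, one per level-1 heading.

    Any text before the first heading is ignored, since the prelims are
    created separately.
    """
    matches = list(HEADING.finditer(markdown))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from its heading to the next heading, or to the end.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((match.group(1).strip(), markdown[match.start():end]))
    return sections


def write_sections(markdown: str, out_dir: Path) -> None:
    """Write each section to its own numbered file inside out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for number, (title, text) in enumerate(split_sections(markdown), start=1):
        # Hypothetical naming: keeps only ASCII letters and digits in the slug.
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        (out_dir / f"{number:02d}-{slug}.md").write_text(text, encoding="utf-8")
```

Creating the prelims still needs to be done by hand, as described in step 9.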