Wiki; GitHub Reconstruction - HWRM/KarosGraveyard GitHub Wiki
In May 2020, all 1,000-plus pages of the original Karos Graveyard were fully reconstructed on GitHub. The process is documented below:
- Downloaded the Karos Graveyard Offline CHM, with content from September 2007.
- Opened the KarosGraveyard.CHM file with 7-Zip and extracted all the HTML files.
- 18 HTML files had odd encoding and had to be resaved as UTF-8. Some punctuation characters turned into question marks and were later fixed manually.
- Used Pandoc v2.9.2.1 to convert the HTML pages to GitHub's Markdown format.
- Ran the following command in PowerShell to test the conversion on one page:
pandoc FileName.html -f html-native_divs-native_spans -t gfm --wrap=none -s -o FileName.md
- Ran the following command in PowerShell to convert all pages:
gci -r -i *.html |foreach{$md=$_.directoryname+"\"+$_.basename+".md";pandoc -f html-native_divs-native_spans -t gfm --wrap=none -s $_.name -o $md}
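For readers not working in PowerShell, the batch step above can be sketched in Python. This is only an illustrative sketch, not the command that was actually run: the names `build_pandoc_command` and `convert_all` are hypothetical, and it assumes `pandoc` is available on the PATH. It reuses the same Pandoc options and writes each `.md` file next to its `.html` source.

```python
import subprocess
from pathlib import Path

# Same options as the PowerShell commands above.
PANDOC_OPTIONS = ["-f", "html-native_divs-native_spans", "-t", "gfm", "--wrap=none", "-s"]


def build_pandoc_command(html_path):
    """Return the pandoc argument list that writes FileName.md next to FileName.html."""
    html_path = Path(html_path)
    md_path = html_path.with_suffix(".md")
    return ["pandoc", *PANDOC_OPTIONS, str(html_path), "-o", str(md_path)]


def convert_all(root):
    """Convert every .html file under root, recursively (assumes pandoc is on the PATH)."""
    for html_path in sorted(Path(root).rglob("*.html")):
        # check=True stops at the first page pandoc fails to convert.
        subprocess.run(build_pandoc_command(html_path), check=True)
```

Calling `convert_all(".")` from the directory holding the extracted HTML files would run the conversion. Passing the full path of each file, rather than only its name, keeps the sketch working for files in subdirectories.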
- Opened the directory containing all Markdown pages in Visual Studio Code. Ran a series of find and replace operations on all pages as documented below.
_Format_
Action To Take:
Text to Find
Text to Replace
Text to Find
Text to Replace
_RegEx Replace_
Change Header:
^Homeworld 2 : \[
**Homeworld Remastered Karos Graveyard** : [
\*\*\[Function Reference\]\(FunctionReference.html\)\*\* :: \[Scope Reference\]\(ScopeReference.html\) :: \[Variable Reference\]\(VariableReference.html\)\n\n
nothing
Remove extra leading spaces:
^ -
-
Remove link picture icons: (turn RegEx off for first operation)
http://wiki.hw2.info/images/url.png
url.png
!\[([^\[]+)\]\(url.png\)
nothing
Change comments heading:
Comments \\\[\[Hide comments/form\]\(.+
# Comments
There are .+ comments on this page. .+
# Comments
There is .+ comment on this page. .+
# Comments
There are no comments on this page. \\\[\[Add comment\]\(.+
# Comments
Change footer:
\[Page History\]\(.+\n
nothing
\[Valid XHTML 1.0 Transitional\]\(.+\n\n
nothing
Page was generated in .+ seconds
# Page Status
Updated Formatting? Initial
Updated for HWRM? Initial
_Normal Replace_
Fix internal links:
.html)
)
/edit.html "Create this page")
)
/edit "Create this page")
)
http://wiki.hw2.info/
nothing
Make old links usable:
(http://
(http://web.archive.org/*/
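To illustrate how rules like these chain together, here is a minimal Python sketch that applies a handful of them in the order listed above. It is not the tooling that was used (the replacements were run in Visual Studio Code); the `RULES` table and `apply_rules` are hypothetical names, only a subset of the rules is shown, and Python's `re` module is assumed to treat these particular patterns the same way the editor does.

```python
import re

# (text to find, text to replace, is_regex) triples copied from the list above.
RULES = [
    (r"^Homeworld 2 : \[", "**Homeworld Remastered Karos Graveyard** : [", True),
    ("http://wiki.hw2.info/images/url.png", "url.png", False),
    (r"!\[([^\[]+)\]\(url.png\)", "", True),
    (".html)", ")", False),
    ("http://wiki.hw2.info/", "", False),
    ("(http://", "(http://web.archive.org/*/", False),
]


def apply_rules(text, rules=RULES):
    """Apply each rule once, in order, and return the rewritten page text."""
    for find, replace, is_regex in rules:
        if is_regex:
            # MULTILINE lets ^ match at the start of every line; the lambda
            # keeps the replacement text literal (no backslash processing).
            text = re.sub(find, lambda match: replace, text, flags=re.MULTILINE)
        else:
            text = text.replace(find, replace)
    return text


sample = (
    "Homeworld 2 : [HomePage](http://wiki.hw2.info/HomePage.html)\n"
    "See [site](http://example.com)![url](http://wiki.hw2.info/images/url.png)\n"
)
print(apply_rules(sample))
```

Order matters here: the icon rules must run before `http://wiki.hw2.info/` is stripped, and the web.archive.org rule runs last so that it only touches the external links that remain.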
- Ran many additional find and replace operations as needed; the results can be seen in the Revision History for May and June 2020.
- Some multi-line code formatting did not convert well. Opened the directory containing all original HTML pages in Visual Studio Code and ran the series of RegEx find operations documented below, then manually corrected the code in the corresponding Markdown files.
^[^t]*?</tt>
manual fix
^[^>]*?</tt>
manual fix
"code"
manual fix
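The searches above can be approximated outside the editor with a short Python sketch that reports which lines of an HTML page need a manual look. `find_suspect_lines` is a hypothetical helper, and the line-by-line matching is an assumption about how the editor search behaved. On that reading, the two patterns flag a closing `</tt>` with no opening `<tt>` earlier on the same line, which suggests a code span split across lines.

```python
import re

# Per-line patterns from the list above; the third search is a plain string.
SUSPECT_PATTERNS = [re.compile(r"^[^t]*?</tt>"), re.compile(r"^[^>]*?</tt>")]
SUSPECT_LITERAL = '"code"'


def find_suspect_lines(html_text):
    """Return the 1-based numbers of lines matching any of the searches."""
    hits = []
    for number, line in enumerate(html_text.splitlines(), start=1):
        if SUSPECT_LITERAL in line or any(p.search(line) for p in SUSPECT_PATTERNS):
            hits.append(number)
    return hits
```

Each reported line still has to be compared by hand against the corresponding Markdown file; the sketch only narrows down where to look.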
- As of this writing, the conversion appears to have worked very effectively. The most notable loss is the HTML indent formatting: it did not appear possible to properly convert the old, inconsistent HTML indents into Markdown bullet trees, and attempts produced significant inconsistencies. Some pages may therefore need to be manually reformatted with Markdown bullet trees.
Updated Formatting? Yes
Updated for HWRM? Yes