Misc Tips and Tricks - HinTak/cxterm GitHub Wiki
Identifying encoding with file ...: "ISO-8859-anything" in the output basically means "8-bit text, such as European (with accents)"; Big5 and GB2312 files are also reported that way.
$ file pho*
phone.pdf: PDF document, version 1.5
phone-gb.txt: ISO-8859 text
phone-utf8.txt: UTF-8 Unicode text
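Since file cannot tell GB2312 or Big5 apart from other 8-bit text, one rough way to narrow things down is to see which encodings decode the file without error. The sketch below assumes an iconv that exits non-zero on invalid input; guess_encoding and its list of candidate encodings are hypothetical, chosen only for illustration. A successful decode does not prove the guess is right (many GB2312 byte pairs are also valid Big5), so treat the output as a shortlist.

```shell
#!/bin/sh
# Hypothetical helper: print every candidate encoding under which the
# file $1 decodes to UTF-8 without error. This is a heuristic shortlist,
# not a definitive detection.
guess_encoding() {
    for enc in utf8 gb2312 big5 gb18030; do
        if iconv -f "$enc" -t utf8 < "$1" > /dev/null 2>&1; then
            echo "$enc"
        fi
    done
}
```

For example, guess_encoding phone-gb.txt should list gb2312 but not utf8 if the file really is GB2312.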
Conversion (from, to, < input , > output):
$ iconv -f utf8 -t gb2312 < phone-utf8.txt > phone-gb.txt
Sometimes it fails, because the input contains characters the target encoding cannot represent; you can force the conversion, dropping unconvertible characters, with -c:
$ iconv -c -f utf8 -t gb2312 < phone-utf8.txt > phone-gb.txt
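Before reaching for -c, it can help to check whether a file converts cleanly at all. The following is a minimal sketch, assuming iconv exits non-zero when it meets a character it cannot convert (as GNU iconv does); converts_cleanly is a hypothetical helper name, not something shipped with cxterm or iconv.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the UTF-8 file $1 converts to the
# encoding $2 without error. The converted output is discarded.
converts_cleanly() {
    iconv -f utf8 -t "$2" < "$1" > /dev/null 2>&1
}

# Example use with the file name from the commands above.
if [ -f phone-utf8.txt ]; then
    if converts_cleanly phone-utf8.txt gb2312; then
        echo "phone-utf8.txt fits in gb2312"
    else
        echo "phone-utf8.txt needs -c or a larger encoding such as gb18030"
    fi
fi
```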
Convert to the much-larger gb18030 (without -c):
$ iconv -f utf8 -t gb18030 < phone-utf8.txt > phone-gb18030.txt
Then find the difference between the two converted versions with:
$ diff -a phone-gb.txt phone-gb18030.txt | grep -a '^<'
The above means: find the differences between the two files, treating them as text ("-a") even though they are not ASCII. diff prefixes differing lines with "<" (lines from the first file) and ">" (lines from the second). grep then extracts the lines which begin ("^") with "<", again with "-a" so that non-ASCII input is treated as text. We want the dropped version, since the gb18030 lines (extracted by '^>') won't display under cxterm either.
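The two conversions and the comparison can be wrapped into one step. This is only a sketch under a few assumptions: lossy_lines is a hypothetical helper name, the temporary files come from mktemp, and iconv -c is assumed to drop unconvertible characters silently as GNU iconv does. The output is still GB2312-encoded, with the leading "< " from diff removed.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of the UTF-8 file $1 that lose
# characters when converted to GB2312, as they look after the loss.
lossy_lines() {
    tmp_small=$(mktemp)
    tmp_large=$(mktemp)
    # Lossy conversion: characters outside GB2312 are dropped.
    iconv -c -f utf8 -t gb2312 < "$1" > "$tmp_small"
    # Conversion to gb18030, done without -c.
    iconv -f utf8 -t gb18030 < "$1" > "$tmp_large"
    # Keep only the lines from the lossy file and strip diff's "< " marker.
    diff -a "$tmp_small" "$tmp_large" | grep -a '^<' | LC_ALL=C sed 's/^< //'
    rm -f "$tmp_small" "$tmp_large"
}
```

Calling lossy_lines phone-utf8.txt inside cxterm would then show just the affected lines.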
In the cxterm source distribution, there is an emacs lisp file, cxterm/emacs/cemacs.el, which sets up emacs to co-operate with cxterm. So typically you do ... -nw -l emacs/cemacs.el ... (-nw means no window system, i.e. run inside the existing terminal):
$ emacs ... [other options] ... -nw -l emacs/cemacs.el ... [other options and arguments] ...
To persuade the GUI emacs (without -nw) to interpret iso-8859-X files as GB2312 or Big5:
$ LANG=zh_CN.GB2312 emacs phone-gb.txt
$ LANG=zh_TW.Big5 emacs phone-big5.txt
In /etc/httpd/conf/httpd.conf (server-wide, Apache 2.4.48 on Fedora), or .htaccess (per-directory):
AddDefaultCharset GB2312
AddCharset GB2312 .txt
These directives get "*.txt" files shown as GB2312 in web browsers. The default is Content-Type: text/html; charset=iso-8859-1 in the HTTP header.
See a related Apache bugzilla issue for background information.
See AddCharset in mod_mime and AddDefaultCharset in core Apache for details, and the IANA Registered Character Sets list.