Unicode string silently truncated - ebranca/owasp-pysec GitHub Wiki
Classification
- Affected Components : codecs
- Operating System : Linux
- Python Versions : 2.6.x, 2.7.x
- Reproducible : Yes
Source code
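# -*- coding: utf-8 -*-
import codecs
import io
import sys

# Python 2 has no ascii() builtin; fall back to repr().
try:
    ascii
except NameError:
    ascii = repr

# 0xF5 is never valid in UTF-8; 0xF4 starts a four-byte sequence that is cut short here.
b = b'\x41\xF5\x42\x43\xF4'

# Reference: decode the complete byte string in memory.
print("Correct-String %r" % ascii(b.decode('utf8', 'replace')))

with open('temp.bin', 'wb') as fout:
    fout.write(b)

# TEST1: read the file through codecs.open.
with codecs.open('temp.bin', encoding='utf8', errors='replace') as fin:
    print("TEST1-String %r" % ascii(fin.read()))

# TEST2: read the same file through io.open.
with io.open('temp.bin', 'rt', encoding='utf8', errors='replace') as fin:
    print("TEST2-String %r" % ascii(fin.read()))

sys.exit(0)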
Steps to Produce/Reproduce
To reproduce the problem, copy the source code into a file and run the script with the following command:
$ python -OOBRtt test.py
Alternatively, you can open Python in interactive mode:
$ python -OOBRtt <press enter>
Then copy the lines of code into the interpreter.
Description
Execution of the test script produces the following results:
Correct-String "u'A\\ufffdBC\\ufffd'"
TEST1-String "u'A\\ufffdBC'"
TEST2-String "u'A\\ufffdBC\\ufffd'"
The cause lies in the codecs module: its decoder sees the byte 0xF4, treats it as the first byte of a four-byte UTF-8 sequence, and waits for the remaining 3 bytes. Those bytes never arrive, so the incomplete sequence is dropped without any error and the resulting string is truncated.
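The buffering described above can be observed directly with an incremental decoder. This is a minimal sketch, not the internal code path of codecs.open: it assumes the stream reader decodes without ever signalling that the input is complete, which is what the results above suggest. The variable names are illustrative.

```python
import codecs

data = b'\x41\xF5\x42\x43\xF4'

decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')

# final=False means "more bytes may follow": the lone 0xF4 lead byte is
# kept in the decoder's buffer instead of being reported as an error.
partial = decoder.decode(data, final=False)

# final=True means "the input is complete": the buffered lead byte can no
# longer be completed and is passed to the 'replace' error handler.
flushed = decoder.decode(b'', final=True)

print(repr(partial))
print(repr(flushed))
```

If the final flush is never performed, the buffered byte is lost, which matches the truncated TEST1 output.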
Source string used as reference:
Correct-String "u'A\\ufffdBC\\ufffd'"
How the string is printed if processed by the codecs module:
TEST1-String "u'A\\ufffdBC'"
A better and safer approach would be to read the entire stream and only then proceed to the decoding phase, as done by the io module.
How the string is printed if processed by the io module:
TEST2-String "u'A\\ufffdBC\\ufffd'"
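The "read everything, then decode" approach can be sketched as follows. The helper name read_text_whole is hypothetical and not part of any library; the sketch assumes the file is small enough to fit in memory.

```python
import os
import tempfile


def read_text_whole(path, encoding='utf8', errors='replace'):
    """Read all bytes first, then decode them in one pass (hypothetical helper)."""
    with open(path, 'rb') as fin:
        raw = fin.read()
    # The decoder receives the complete input, so an incomplete trailing
    # sequence goes through the error handler instead of being held back.
    return raw.decode(encoding, errors)


# Usage with the byte string from the test script.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as fout:
        fout.write(b'\x41\xF5\x42\x43\xF4')
    text = read_text_whole(path)
finally:
    os.remove(path)

print(repr(text))
```

The trade-off is memory use: the whole file is held as bytes before decoding, so this suits small inputs better than large streams.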
Workaround
We are not aware of any easy solution other than avoiding 'codecs' in cases like the one examined.
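One way to avoid 'codecs' in such cases, based on the TEST2 result, is to open the file with io.open instead of codecs.open. The sketch below uses a hypothetical helper, read_text_io, and compares its result against the in-memory reference decode; it is an illustration, not a complete secure implementation.

```python
import io
import os
import tempfile


def read_text_io(path, encoding='utf8', errors='replace'):
    """Read a text file through io.open rather than codecs.open (hypothetical helper)."""
    with io.open(path, 'rt', encoding=encoding, errors=errors) as fin:
        return fin.read()


raw = b'\x41\xF5\x42\x43\xF4'
reference = raw.decode('utf8', 'replace')

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as fout:
        fout.write(raw)
    result = read_text_io(path)
finally:
    os.remove(path)

# A mismatch here would indicate that the reader dropped or altered data.
print(result == reference)
```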
Secure Implementation
WORK IN PROGRESS
References
- [Python module io][01]
- [Python module codecs][02]
- [Python bug 12508][03]

[01]: https://docs.python.org/2/library/io.html
[02]: https://docs.python.org/2/library/codecs.html
[03]: http://bugs.python.org/issue12508