
Python Exception Thrown By Libtidy Is Amusingly Impossible To Catch

I am trying to use the tidy_document() function from tidylib to format an html document as xhtml before I can post it somewhere and a couple of steps up the stack, an exception is

Solution 1:

I managed to reproduce the problem on Windows (I saved the HTML snippet in a file). Below is the last code variant.

code00.py:

#!/usr/bin/env python

import sys
import os
import threading

os.environ["PATH"] += os.pathsep + os.path.abspath(os.path.dirname(__file__))  # Built tidy.dll in the cwd, this is needed for it to be found
from tidylib import tidy_document


def main(*argv):
    print("main - TID: {0:d}".format(threading.get_ident()))
    mode = "rb"
    raw_content = open("content.html", mode=mode).read()
    enc = "utf-8" if len(sys.argv) < 2 else sys.argv[1]
    html_content = raw_content.decode(enc)
    print(html_content.encode(enc) == raw_content)
    with open("content_utf8.html", "w", encoding=enc) as fout:
        fout.write(html_content)
    try:
        xhtml_doc, errors = tidy_document(html_content)
    except UnicodeDecodeError as ude:
        print("Caught the exception:", ude)
    except UnicodeError as ue:
        print("Caught the exception:", ue)
    except Exception as ex:
        print("Caught the exception:", ex)
    except:
        print("Caught an exception")


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q059054833]> "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\Scripts\python.exe" code00.py
Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)] 64bit on win32

main - TID: 9528
True
Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
Traceback (most recent call last):
File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
Traceback (most recent call last):
File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 0: invalid start byte
Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
Traceback (most recent call last):
File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte

Done.

I tested (by temporarily modifying sink.py), and the callback does indeed run in the same thread as main. Then I looked more closely at the stack trace and figured it out:

  1. PyTidyLib calls some C code from the backend Tidy library (tidy.dll), via CTypes
  2. The (above) C code calls some Python code (Sink.put_byte), as a callback that was passed to it together with the arguments
  3. The (Python) code from the previous step raises an exception, but the underlying C code (that calls it) doesn't "know" how to pass it back to #1., as it has no Python "knowledge" whatsoever (so the exception "dies" there)

That's why you couldn't catch it in Python.
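One common way around this (a minimal sketch, not PyTidyLib's actual code) is to catch the exception inside the callback, store it, and re-raise it once the C call has returned. The names PUT_BYTE, make_capturing_callback and feed_bytes are hypothetical, and calling the ctypes function object directly from Python stands in for tidy.dll invoking the callback:

```python
import ctypes

# Hypothetical callback signature: returns nothing, takes one unsigned byte.
PUT_BYTE = ctypes.CFUNCTYPE(None, ctypes.c_ubyte)


def make_capturing_callback(func, errors):
    """Wrap func so that exceptions are stored in errors instead of being lost."""
    def wrapper(byte):
        try:
            func(byte)
        except Exception as exc:  # never let it escape into the C caller
            errors.append(exc)
    return PUT_BYTE(wrapper)


def put_byte(byte):
    # Mimics the failing line from the traceback: decode one byte on its own.
    bytes([byte]).decode("utf-8")


def feed_bytes(data):
    """Push data through the callback one byte at a time, then re-raise."""
    errors = []
    callback = make_capturing_callback(put_byte, errors)  # keep a reference alive
    for b in data:
        callback(b)  # stand-in for the C library calling back into Python
    if errors:
        raise errors[0]  # now a normal Python exception that can be caught


if __name__ == "__main__":
    try:
        feed_bytes(b"\xe7\xaa\xb6")
    except UnicodeDecodeError as ude:
        print("Caught the exception:", ude)
```

With this pattern the exception surfaces in regular Python code after the foreign call, so an ordinary try/except around feed_bytes works.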

I tried reading the file with several other encodings, but no luck. Then I did some more debugging, and it seems there are 3 invalid UTF-8 characters (\xE7, \xAA, \xB6, when combined with other ones) in your file. Of course, trying to decode a UTF-8 character out of a single byte seems strange to me, but that might be a PyTidyLib bug.
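To illustrate why single-byte decoding looks suspicious, here is a small sketch. It assumes the three bytes from the traceback appear consecutively in the file, which the output above suggests but does not prove. Each byte fails to decode on its own, while an incremental decoder fed one byte at a time handles the same input:

```python
import codecs

data = b"\xe7\xaa\xb6"  # assumption: the three bytes are adjacent in the file

# Decoding each byte separately fails, as in the traceback.
failures = []
for b in data:
    try:
        bytes([b]).decode("utf-8")
    except UnicodeDecodeError as ude:
        failures.append(ude.reason)

# An incremental decoder buffers incomplete sequences between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [decoder.decode(bytes([b])) for b in data]
text = "".join(chunks) + decoder.decode(b"", final=True)

print(len(failures), "single-byte failures")
print("incremental result has", len(text), "character(s)")
```

Under that assumption, the bytes form one multi-byte character rather than three broken ones, which would point at the byte-by-byte decoding rather than at the file.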


Update #0

Since I had to build tidy.dll to do all the tests (I didn't want to start Linux VMs or install the .whl under Cygwin), I also uploaded it (and other artifacts) to [GitHub]: CristiFati/Prebuilt-Binaries - Prebuilt-Binaries/HTML-Tidy/v5.7.28.
