.pr_agent_accepted_suggestions - Ktiseos-Nyx/Dataset-Tools GitHub Wiki

                     PR 119 (2025-07-13)                    
[general] Add JIS encoding support

✅ Add JIS encoding support

The method attempts to decode UserComment bytes with several encodings but does not handle JIS, one of the standard character-code headers defined by the EXIF specification. Add support for the JIS header so Japanese text in EXIF metadata is decoded correctly.

dataset_tools/metadata_engine/context_preparation.py [198-216]

 def _decode_usercomment_bytes_robust(self, data: bytes) -> str:
     """Try various decoding strategies for UserComment bytes. This is the secret sauce."""
     if not data:
         return ""
 
     # Check for standard EXIF UserComment format with encoding header
     if len(data) >= 8:
         try:
             codec_header = data[:8]
             comment_bytes = data[8:]
 
             # Try to decode based on the header
             try:
                 if codec_header.startswith(b"ASCII\x00"):
                     return comment_bytes.decode("ascii").strip("\x00")
                 if codec_header.startswith(b"UNICODE\x00"):  # A common variation for UTF-16
                     return comment_bytes.decode("utf-16le").strip("\x00")
                 if codec_header.startswith(b"UTF-8\x00"):
                     return comment_bytes.decode("utf-8").strip("\x00")
+                if codec_header.startswith(b"JIS\x00"):  # Support for Japanese encoding
+                    return comment_bytes.decode("shift_jis").strip("\x00")
             except UnicodeDecodeError:
                 pass  # Fall through to other methods

Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that adding JIS encoding support improves the robustness of EXIF UserComment decoding, which is a valuable enhancement for handling more image formats.
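As a sketch of the accepted change, the snippet below decodes a hypothetical UserComment payload carrying the 8-byte JIS header. The sample bytes and variable names are illustrative; the suggestion maps the JIS header to Python's `shift_jis` codec, though some files use `iso2022_jp` instead, so treat the codec choice as an assumption:

```python
# Illustrative sample: an EXIF UserComment with the 8-byte "JIS" header.
# The shift_jis mapping follows the accepted suggestion; some files may
# actually use iso2022_jp, so the codec choice here is an assumption.
header = b"JIS\x00\x00\x00\x00\x00"
payload = "こんにちは".encode("shift_jis")
data = header + payload

codec_header, comment_bytes = data[:8], data[8:]
if codec_header.startswith(b"JIS\x00"):
    decoded = comment_bytes.decode("shift_jis").strip("\x00")

print(decoded)  # こんにちは
```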


[possible issue] Handle invalid XML characters

✅ Handle invalid XML characters

The XML parser fails on input containing characters that are invalid in XML 1.0 (e.g. most control characters), which malformed XMP packets sometimes contain. Add a pre-processing step that removes these characters before parsing to improve robustness.

dataset_tools/metadata_engine/extractors/xmp_extractors.py [40-51]

 def _parse_xmp_string(self, xmp_content: str) -> ET.Element | None:
     """Parse XMP string content into XML tree."""
     if not xmp_content:
         return None
 
     try:
         # Clean up XMP content - remove any leading/trailing whitespace
         xmp_clean = xmp_content.strip()
-
+        
+        # Remove or replace invalid XML characters
+        xmp_clean = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', '', xmp_clean)
+        
         # Parse the XML
         root = ET.fromstring(xmp_clean)
         return root

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that malformed XML can cause parsing errors and proposes a reasonable fix to strip invalid characters, which improves the robustness of the XMP parser.
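A self-contained sketch of the stripping step, using a hypothetical XMP fragment polluted with a control character (`\x0b`), which is invalid in XML 1.0 and makes `ET.fromstring` raise `ParseError`:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XMP fragment containing a control character (\x0b);
# feeding it to ET.fromstring directly raises ET.ParseError.
raw = "<x:xmpmeta xmlns:x='adobe:ns:meta/'>bad\x0bchar</x:xmpmeta>"

# Strip characters outside the XML 1.0 character range, as in the suggestion.
clean = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', '', raw)
root = ET.fromstring(clean)
print(root.text)  # "badchar"
```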


[general] Move import to module level

✅ Move import to module level

The pathlib import is placed inside the loop, so the import statement executes on every iteration; after the first pass it only hits the sys.modules cache, but it is still needless work. Move the import to the top of the file with the other imports, per PEP 8 convention.

dataset_tools/background_operations.py [261-268]

 try:
     progress = int((i / total_files) * 100)
-    from pathlib import Path
-
     self.emit_progress(
         progress,
         "Processing %d/%d: %s" % (i + 1, total_files, Path(file_path).name),
     )

Suggestion importance[1-10]: 5


Why: The suggestion correctly points out that importing pathlib inside a loop is inefficient and against best practices, and moving it to the top level improves performance and code quality.
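The corrected layout can be sketched as below; `file_names` is a hypothetical helper used for illustration (the real code emits a progress message instead):

```python
from pathlib import Path  # moved to module level, per PEP 8

def file_names(paths):
    """Hypothetical helper: return the base name of each path."""
    # No per-iteration import: Path is bound once at module import time.
    return [Path(p).name for p in paths]

print(file_names(["a/b.txt", "c.png"]))  # ['b.txt', 'c.png']
```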



                     PR 105 (2025-07-07)                    
[general] Remove duplicate method

✅ Remove duplicate method

This method is defined twice in the file (lines 489-502 and 1010-1028) with similar functionality but slightly different implementations. Consolidate these duplicate methods to avoid confusion and maintain a single source of truth.

dataset_tools/metadata_engine/extractors/comfyui_extractors.py [446-458]

 def _clean_prompt_text(self, text: str) -> str:
     """
     Clean embedding prefixes and other artifacts from prompt text.
     """
     if not isinstance(text, str):
         return text
 
+    # Remove embedding prefixes
     text = re.sub(r'^embedding:negatives\\?', '', text, flags=re.IGNORECASE)
     text = re.sub(r'^embedding:', '', text, flags=re.IGNORECASE)
     text = re.sub(r'^negatives\\', '', text, flags=re.IGNORECASE)
+    
+    # Clean up whitespace
     text = text.strip()
 
     return text

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that the method _clean_prompt_text is defined twice in the file (lines 446-458 and 861-883), and removing this duplication improves code quality and maintainability.
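A self-contained sketch of the consolidated helper, written as a standalone function here (in the codebase it is a method named `_clean_prompt_text`); the sample prompt strings are illustrative:

```python
import re

def clean_prompt_text(text):
    """Strip embedding prefixes and surrounding whitespace from prompt text."""
    if not isinstance(text, str):
        return text
    # Remove embedding prefixes (order matters: longest prefix first)
    text = re.sub(r'^embedding:negatives\\?', '', text, flags=re.IGNORECASE)
    text = re.sub(r'^embedding:', '', text, flags=re.IGNORECASE)
    text = re.sub(r'^negatives\\', '', text, flags=re.IGNORECASE)
    return text.strip()

print(clean_prompt_text("embedding:negatives\\bad_hands"))  # bad_hands
print(clean_prompt_text("embedding:EasyNegative "))         # EasyNegative
```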


[general] Add missing newline at EOF

✅ Add missing newline at EOF

The file is missing a trailing newline, which violates PEP 8 and can trip up some tools (e.g. POSIX text utilities and diff output). Add a newline character at the end of the file.

dataset_tools/analysis/data_models.py [28-30]

+has_known_finetune_or_lora: bool = Field(
+    ..., description="A summary boolean that is True if any specific feature was found."
+)
 
-

Suggestion importance[1-10]: 2


Why: The suggestion correctly identifies a missing newline at the end of the file, which is a minor style issue, but the improved_code is identical to the existing_code.
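For reference, a minimal Python check for a trailing newline (the function name is illustrative; empty files are treated as compliant by choice):

```python
def ends_with_newline(path):
    """Return True if the file's last byte is a newline (empty files pass)."""
    with open(path, "rb") as f:
        data = f.read()
    return (not data) or data.endswith(b"\n")
```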


