Markdown Converter Guide - jnleyva816/NextMove GitHub Wiki

Markdown Converter Guide

📝 Universal HTML to Markdown conversion system for clean job descriptions

This guide covers the DescriptionConverter service implementation, which provides robust HTML to Markdown conversion for all job parsers in NextMove.


Table of Contents

  1. Overview
  2. Architecture
  3. Key Features
  4. Implementation Details
  5. Usage Examples
  6. Configuration
  7. Testing
  8. Integration
  9. Troubleshooting
  10. Performance
  11. Future Enhancements

Overview

The DescriptionConverter service is a universal HTML to Markdown conversion system designed to transform messy HTML job descriptions into clean, readable Markdown format. This service is used by all job parsers to ensure consistent formatting across different job sites.

Problem Solved

Before the DescriptionConverter:

  • ❌ Inconsistent job description formatting across different sites
  • ❌ Raw HTML content difficult to read and process
  • ❌ Poor user experience with cluttered job descriptions
  • ❌ Manual formatting required for each job parser

After the DescriptionConverter:

  • ✅ Consistent, clean Markdown formatting for all job descriptions
  • ✅ Automatic HTML cleanup and preprocessing
  • ✅ Structured section extraction and formatting
  • ✅ Universal service used by all parsers

Architecture

System Design

flowchart TD
    A[Raw HTML Content] --> B[DescriptionConverter Service]
    B --> C[HTML Preprocessing]
    C --> D[Flexmark HTML to Markdown]
    D --> E[Markdown Post-processing]
    E --> F[Clean Markdown Output]
    
    C --> G[Remove Unwanted Elements]
    C --> H[Normalize List Structures]
    C --> I[Identify Headers]
    
    E --> J[Clean Multiple Newlines]
    E --> K[Fix Formatting Issues]
    E --> L[Structure Sections]

Core Components

  1. HTML Preprocessor: Cleans and prepares HTML content
  2. Flexmark Converter: Professional HTML to Markdown conversion
  3. Markdown Post-processor: Polishes the final output
  4. Section Extractor: Identifies and structures job description sections
  5. Fallback Handler: Provides plain text extraction when conversion fails

Key Features

🔧 HTML Preprocessing

  • Element Removal: Removes scripts, styles, navigation, and hidden elements
  • Empty Element Cleanup: Removes empty paragraphs and divs
  • Header Detection: Automatically identifies text that should be headers
  • List Normalization: Standardizes list structures for better conversion

🎨 Markdown Conversion

  • Professional Conversion: Uses the Flexmark library for robust HTML to Markdown conversion
  • Configurable Options: Optimized settings for job description formatting
  • Emphasis Preservation: Maintains important formatting like bold and italic text
  • Link Handling: Properly converts links and maintains readability

🏗️ Post-processing

  • Whitespace Cleanup: Removes excessive newlines and trailing spaces
  • Header Normalization: Ensures consistent header formatting
  • Section Structure: Organizes content into logical sections
  • Duplicate Removal: Eliminates duplicate headers and content
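
The duplicate-header cleanup listed above could look roughly like the following. This is a minimal sketch, not the service's actual code: the class name `DuplicateHeaderSketch` and the method `removeDuplicateHeaders` are hypothetical, and it only handles ATX headers (`#` to `######`), comparing them case-insensitively.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; names are assumptions, not part of NextMove
final class DuplicateHeaderSketch {

    static String removeDuplicateHeaders(String markdown) {
        Set<String> seen = new HashSet<>();
        StringBuilder out = new StringBuilder();
        for (String line : markdown.split("\n", -1)) {
            // ATX header: one to six '#' characters followed by whitespace
            boolean isHeader = line.matches("#{1,6}\\s.*");
            // Normalize so "## Benefits" and "## benefits " count as the same header
            String key = line.trim().toLowerCase(Locale.ROOT);
            if (isHeader && !seen.add(key)) {
                continue; // skip a header that was already emitted
            }
            out.append(line).append('\n');
        }
        // Remove the newline appended after the final kept line
        out.setLength(Math.max(0, out.length() - 1));
        return out.toString();
    }
}
```

Only header lines are deduplicated here; body text under a repeated header is kept and falls under the first occurrence.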

🔍 Structured Extraction

  • Section Identification: Automatically identifies common job description sections
  • Smart Parsing: Handles various HTML structures and layouts
  • Fallback Support: Graceful degradation when structured extraction fails

Implementation Details

Core Service Class

@Service
public class DescriptionConverter {
    
    private static final Logger logger = LoggerFactory.getLogger(DescriptionConverter.class);
    private final FlexmarkHtmlConverter htmlToMarkdownConverter;
    
    // Regex patterns for cleanup
    private static final Pattern MULTIPLE_NEWLINES = Pattern.compile("\\n{3,}");
    private static final Pattern TRAILING_SPACES = Pattern.compile(" +$", Pattern.MULTILINE);
    private static final Pattern EMPTY_HEADERS = Pattern.compile("^#+\\s*$", Pattern.MULTILINE);
    
    public DescriptionConverter() {
        // Configure Flexmark options for optimal conversion
        MutableDataSet options = new MutableDataSet();
        options.set(FlexmarkHtmlConverter.SETEXT_HEADINGS, false);
        options.set(FlexmarkHtmlConverter.OUTPUT_UNKNOWN_TAGS, false);
        options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_QUOTES, false);
        
        this.htmlToMarkdownConverter = FlexmarkHtmlConverter.builder(options).build();
    }
}

Main Conversion Method

public String convertToMarkdown(String htmlContent) {
    if (htmlContent == null || htmlContent.trim().isEmpty()) {
        return null;
    }
    
    try {
        logger.debug("Converting HTML content to Markdown, length: {}", htmlContent.length());
        
        // Step 1: Clean and preprocess the HTML
        String cleanedHtml = preprocessHtml(htmlContent);
        
        // Step 2: Convert to Markdown
        String markdown = htmlToMarkdownConverter.convert(cleanedHtml);
        
        // Step 3: Post-process the Markdown for better readability
        String finalMarkdown = postProcessMarkdown(markdown);
        
        logger.debug("Conversion complete, Markdown length: {}", finalMarkdown.length());
        return finalMarkdown;
        
    } catch (Exception e) {
        logger.error("Error converting HTML to Markdown", e);
        return extractPlainTextFallback(htmlContent);
    }
}
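
The `postProcessMarkdown` step is not shown in this guide. As a rough sketch of what it might do, the example below applies the three regex constants declared in the service class. The class name `MarkdownPostProcessorSketch` and the ordering of the steps are assumptions; the real method may do more, such as section structuring.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for postProcessMarkdown; not the actual implementation
final class MarkdownPostProcessorSketch {

    // Same patterns as the constants declared in DescriptionConverter
    private static final Pattern MULTIPLE_NEWLINES = Pattern.compile("\\n{3,}");
    private static final Pattern TRAILING_SPACES = Pattern.compile(" +$", Pattern.MULTILINE);
    private static final Pattern EMPTY_HEADERS = Pattern.compile("^#+\\s*$", Pattern.MULTILINE);

    static String postProcess(String markdown) {
        if (markdown == null) {
            return null;
        }
        // Strip trailing spaces at the end of every line
        String result = TRAILING_SPACES.matcher(markdown).replaceAll("");
        // Drop header markers that carry no title text
        result = EMPTY_HEADERS.matcher(result).replaceAll("");
        // Collapse runs of three or more newlines into a single blank line
        result = MULTIPLE_NEWLINES.matcher(result).replaceAll("\n\n");
        return result.trim();
    }
}
```

Empty headers are removed before newline collapsing so that the gaps they leave behind are also normalized.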

HTML Preprocessing

private String preprocessHtml(String html) {
    Document doc = Jsoup.parse(html);
    
    // Remove unwanted elements
    doc.select("script, style, nav, header, footer, .hidden, [style*='display:none']").remove();
    
    // Remove empty paragraphs, divs, and spans
    doc.select("p:empty, div:empty, span:empty").remove();
    
    // Convert common patterns to more semantic HTML
    Elements elements = doc.select("*");
    for (Element element : elements) {
        String text = element.ownText();
        
        // Convert text that looks like headers
        if (isLikelyHeader(text, element)) {
            element.tagName("h3");
        }
        
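        // Clean up excessive whitespace in this element's own text nodes only.
        // Calling element.text(...) here would replace all child elements
        // (such as <strong> or <a>) with plain text.
        for (TextNode node : element.textNodes()) {
            node.text(node.getWholeText().replaceAll("\\s+", " "));
        }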
    }
    
    // Normalize list structures
    normalizeListStructures(doc);
    
    return doc.body().html();
}
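
The `isLikelyHeader` check used above is not shown in this guide. One possible heuristic is sketched below, assuming a header is a short line that either ends with a colon or matches a known section keyword. The class name, the keyword list, and the 60-character limit are all assumptions, and the `Element` argument of the real method is omitted so the sketch has no jsoup dependency.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical heuristic; the real isLikelyHeader(text, element) may differ
final class HeaderHeuristicSketch {

    // Assumed section keywords, not taken from the NextMove source
    private static final Set<String> SECTION_KEYWORDS = Set.of(
        "requirements", "responsibilities", "qualifications", "benefits", "about us");

    static boolean isLikelyHeader(String text) {
        if (text == null) {
            return false;
        }
        String trimmed = text.trim();
        // Headers are short and non-empty
        if (trimmed.isEmpty() || trimmed.length() > 60) {
            return false;
        }
        // Strip one trailing colon ("Requirements:" becomes "Requirements")
        boolean endsWithColon = trimmed.endsWith(":");
        String label = endsWithColon
            ? trimmed.substring(0, trimmed.length() - 1).trim()
            : trimmed;
        if (label.isEmpty()) {
            return false;
        }
        // Sentences end with punctuation; headers usually do not
        char last = label.charAt(label.length() - 1);
        if (last == '.' || last == '?' || last == '!' || last == ',') {
            return false;
        }
        return endsWithColon || SECTION_KEYWORDS.contains(label.toLowerCase(Locale.ROOT));
    }
}
```

A text-only rule like this can misfire on short labels such as "Salary:" inside a table, which is presumably why the real method also receives the element for context.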

Structured Section Extraction

public String extractStructuredDescription(Document doc, String... possibleSelectors) {
    if (doc == null) {
        return null;
    }
    
    StringBuilder description = new StringBuilder();
    
    // Try to find the main description element
    Element descriptionElement = null;
    for (String selector : possibleSelectors) {
        descriptionElement = doc.selectFirst(selector);
        if (descriptionElement != null) {
            break;
        }
    }
    
    if (descriptionElement == null) {
        logger.warn("Could not find description element using provided selectors");
        return null;
    }
    
    // Look for structured sections
    List<Section> sections = extractSections(descriptionElement);
    
    if (!sections.isEmpty()) {
        // We found structured content
        for (Section section : sections) {
            if (section.title != null && !section.title.trim().isEmpty()) {
                description.append("## ").append(section.title).append("\n\n");
            }
            
            String sectionMarkdown = convertToMarkdown(section.content);
            if (sectionMarkdown != null && !sectionMarkdown.trim().isEmpty()) {
                description.append(sectionMarkdown).append("\n\n");
            }
        }
    } else {
        // No structured sections found, convert the whole thing
        description.append(convertElementToMarkdown(descriptionElement));
    }
    
    return description.toString().trim();
}
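
The `Section` type and the assembly loop above can be isolated for illustration. In the sketch below, the `Section` class mirrors the `title` and `content` fields used in the method, but its exact shape is an assumption, as are the names `SectionAssemblySketch` and `assemble`. The converter is passed in as a function so the example runs without Flexmark.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the section assembly step
final class SectionAssemblySketch {

    /** Assumed shape of the Section type returned by extractSections. */
    static final class Section {
        final String title;   // heading text, may be null
        final String content; // raw HTML of the section body

        Section(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    static String assemble(List<Section> sections, UnaryOperator<String> toMarkdown) {
        StringBuilder out = new StringBuilder();
        for (Section section : sections) {
            // Emit a level-2 header only when the section has a title
            if (section.title != null && !section.title.trim().isEmpty()) {
                out.append("## ").append(section.title.trim()).append("\n\n");
            }
            // Convert the body and skip it when nothing is left
            String body = toMarkdown.apply(section.content);
            if (body != null && !body.trim().isEmpty()) {
                out.append(body.trim()).append("\n\n");
            }
        }
        return out.toString().trim();
    }
}
```

In the service itself, the function argument would be `convertToMarkdown`; a stub is enough for unit-testing the assembly logic on its own.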

Usage Examples

Basic HTML to Markdown Conversion

@Autowired
private DescriptionConverter descriptionConverter;

// Convert HTML job description to Markdown
String htmlContent = "<h2>Job Requirements</h2><ul><li>5+ years experience</li><li>Java expertise</li></ul>";
String markdown = descriptionConverter.convertToMarkdown(htmlContent);

// Result:
// ## Job Requirements
// 
// * 5+ years experience
// * Java expertise

Converting JSoup Elements

// When you already have a JSoup Element
Document doc = Jsoup.connect("https://job-site.com/job/123").get();
Element descriptionElement = doc.selectFirst(".job-description");

String markdown = descriptionConverter.convertElementToMarkdown(descriptionElement);

Structured Description Extraction

// Extract structured job description with multiple possible selectors
Document doc = Jsoup.connect("https://job-site.com/job/123").get();

String structuredMarkdown = descriptionConverter.extractStructuredDescription(
    doc,
    ".job-description",
    ".description-content",
    "#job-details"
);

Integration with Job Parsers

@Service
public class EnhancedJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    public JobApplication parseJob(String jobUrl) throws IOException {
        Document doc = Jsoup.connect(jobUrl).get();
        
        // Extract basic job information
        String title = doc.selectFirst(".job-title").text();
        String company = doc.selectFirst(".company-name").text();
        
        // Convert description using the universal converter
        String description = descriptionConverter.extractStructuredDescription(
            doc,
            ".job-description",
            ".description"
        );
        
        return JobApplication.builder()
            .title(title)
            .company(company)
            .description(description)
            .build();
    }
}

Configuration

Flexmark Configuration Options

The service uses optimized Flexmark settings for job description conversion:

MutableDataSet options = new MutableDataSet();

// Use ATX headings (##) instead of Setext headings
options.set(FlexmarkHtmlConverter.SETEXT_HEADINGS, false);

// Skip unknown HTML tags for cleaner output
options.set(FlexmarkHtmlConverter.OUTPUT_UNKNOWN_TAGS, false);

// Keep simple quotes instead of typographic quotes
options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_QUOTES, false);

// Keep simple punctuation
options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_SMARTS, false);

// Don't wrap auto-detected links
options.set(FlexmarkHtmlConverter.WRAP_AUTO_LINKS, false);

// Skip HTML comments
options.set(FlexmarkHtmlConverter.RENDER_COMMENTS, false);

// Use . for numeric lists
options.set(FlexmarkHtmlConverter.DOT_ONLY_NUMERIC_LISTS, true);

Logging Configuration

Enable debug logging for conversion monitoring:

# Enable DescriptionConverter debug logging
logging.level.com.jnleyva.nextmove_backend.service.DescriptionConverter=DEBUG

# Keep jsoup logging at WARN to reduce noise
logging.level.org.jsoup=WARN

Testing

Comprehensive Test Suite

The DescriptionConverter test suite includes 100+ test cases covering various scenarios. A representative excerpt:

@ExtendWith(MockitoExtension.class)
class DescriptionConverterTest {
    
    private DescriptionConverter descriptionConverter;
    
    @BeforeEach
    void setUp() {
        descriptionConverter = new DescriptionConverter();
    }
    
    @Test
    void testBasicHtmlToMarkdown() {
        String html = "<h2>Title</h2><p>Description with <strong>bold</strong> text.</p>";
        String result = descriptionConverter.convertToMarkdown(html);
        
        assertThat(result).contains("## Title");
        assertThat(result).contains("**bold**");
    }
    
    @Test
    void testListConversion() {
        String html = "<ul><li>First item</li><li>Second item</li></ul>";
        String result = descriptionConverter.convertToMarkdown(html);
        
        assertThat(result).contains("* First item");
        assertThat(result).contains("* Second item");
    }
    
    @Test
    void testComplexJobDescription() {
        String html = loadTestHtml("complex-job-description.html");
        String result = descriptionConverter.convertToMarkdown(html);
        
        assertThat(result).isNotNull();
        assertThat(result).doesNotContain("<script>");
        assertThat(result).doesNotContain("<style>");
        assertThat(result.length()).isGreaterThan(100);
    }
}

Test Categories

  1. Basic Conversion Tests: Simple HTML to Markdown conversion
  2. Preprocessing Tests: HTML cleanup and preprocessing
  3. Post-processing Tests: Markdown cleanup and formatting
  4. Structured Extraction Tests: Section identification and extraction
  5. Error Handling Tests: Fallback scenarios and error conditions
  6. Performance Tests: Large content handling and performance metrics
  7. Integration Tests: End-to-end testing with real job sites

Test Data

Test files include real-world examples from various job sites:

  • greenhouse-job-description.html - Greenhouse format job descriptions
  • meta-job-description.html - Meta/Facebook job postings
  • microsoft-job-description.html - Microsoft careers page content
  • complex-nested-structure.html - Complex nested HTML structures
  • malformed-html.html - Invalid or malformed HTML content

Integration

Job Parser Integration

All job parsers now use the DescriptionConverter for consistent formatting:

Greenhouse Job Parser

@Service
public class GreenhouseJobParser extends BaseJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    protected String extractDescription(Document doc) {
        return descriptionConverter.extractStructuredDescription(
            doc,
            "#content",
            ".application-description"
        );
    }
}

Microsoft Job Parser

@Service
public class MicrosoftJobParser extends BaseJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    protected String extractDescription(Document doc) {
        return descriptionConverter.extractStructuredDescription(
            doc,
            "[data-automation-id='jobPostingDescription']",
            ".job-description"
        );
    }
}

Meta Job Parser

@Service
public class MetaJobParser extends BaseJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    protected String extractDescription(Document doc) {
        return descriptionConverter.extractStructuredDescription(
            doc,
            "[data-testid='job-description']",
            ".job-description-content"
        );
    }
}

Dependency Configuration

Add the Flexmark dependency to pom.xml:

<dependency>
    <groupId>com.vladsch.flexmark</groupId>
    <artifactId>flexmark-html2md-converter</artifactId>
    <version>0.64.8</version>
</dependency>

Troubleshooting

Common Issues

1. Poor Conversion Quality

Problem: HTML converts to poorly formatted Markdown

Solution:

  • Check HTML preprocessing settings
  • Verify Flexmark configuration options
  • Review input HTML structure

2. Missing Content

Problem: Some HTML content is not appearing in Markdown

Solution:

  • Check if content is being removed in preprocessing
  • Verify element selectors are correct
  • Enable debug logging to trace conversion steps

3. Performance Issues

Problem: Conversion is slow for large HTML content

Solution:

  • Implement caching for repeated conversions
  • Optimize HTML preprocessing
  • Consider async processing for large batches
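
Async processing of a batch could be sketched as follows, assuming the converter is thread-safe. The class name `AsyncBatchSketch` and the method `convertAll` are hypothetical; the converter is passed as a function so the example does not depend on the service itself.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical batch helper; assumes the converter is safe to call concurrently
final class AsyncBatchSketch {

    static List<String> convertAll(List<String> htmlDocs,
                                   UnaryOperator<String> converter,
                                   ExecutorService executor) {
        // Submit every conversion before waiting on any of them
        List<CompletableFuture<String>> futures = htmlDocs.stream()
            .map(html -> CompletableFuture.supplyAsync(() -> converter.apply(html), executor))
            .collect(Collectors.toList());
        // Joining in submission order keeps results aligned with the inputs
        return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }
}
```

The caller owns the executor and should shut it down when the batch work is finished.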

4. Encoding Issues

Problem: Special characters not displaying correctly

Solution:

  • Ensure proper character encoding in HTML input
  • Check Flexmark encoding settings
  • Verify database column encoding
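
A common source of encoding problems is decoding page bytes with the wrong charset before they ever reach the converter. The sketch below uses only the JDK to show the effect; the class `EncodingSketch` and its methods are hypothetical helpers, not part of the service.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating charset mismatches
final class EncodingSketch {

    /** Decodes raw page bytes with an explicit charset instead of the platform default. */
    static String decode(byte[] rawHtml, Charset charset) {
        return new String(rawHtml, charset);
    }

    /** U+FFFD appears when bytes were not valid in the charset used to decode them. */
    static boolean hasDecodingDamage(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as ISO-8859-1, then decoded two ways
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hasDecodingDamage(decode(latin1Bytes, StandardCharsets.UTF_8)));
        System.out.println(hasDecodingDamage(decode(latin1Bytes, StandardCharsets.ISO_8859_1)));
    }
}
```

When the HTML is read from a stream, jsoup's `Jsoup.parse(InputStream, String charsetName, String baseUri)` overload accepts the charset explicitly, which avoids relying on detection.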

Debug Configuration

// Enable detailed debug logging
logger.debug("Input HTML length: {}", htmlContent.length());
logger.debug("Cleaned HTML length: {}", cleanedHtml.length());
logger.debug("Final Markdown length: {}", finalMarkdown.length());

// Log conversion steps
if (logger.isDebugEnabled()) {
    logger.debug("Preprocessing removed {} characters", 
                htmlContent.length() - cleanedHtml.length());
    logger.debug("Conversion result preview: {}", 
                finalMarkdown.substring(0, Math.min(200, finalMarkdown.length())));
}

Health Monitoring

@Component
public class DescriptionConverterHealthIndicator implements HealthIndicator {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    public Health health() {
        try {
            // Test basic conversion functionality
            String testHtml = "<h2>Test</h2><p>Test content</p>";
            String result = descriptionConverter.convertToMarkdown(testHtml);
            
            if (result != null && result.contains("## Test")) {
                return Health.up()
                    .withDetail("converter", "operational")
                    .withDetail("flexmark", "ready")
                    .build();
            } else {
                return Health.down()
                    .withDetail("converter", "conversion_failed")
                    .build();
            }
        } catch (Exception e) {
            return Health.down()
                .withDetail("converter", "error")
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

Performance

Optimization Strategies

  1. Caching: Cache converted content to avoid repeated processing
  2. Async Processing: Process large batches asynchronously
  3. Memory Management: Optimize memory usage for large HTML content
  4. Preprocessing Optimization: Minimize DOM manipulation operations
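
The caching strategy could be sketched as a thin wrapper around the converter. This is a minimal illustration under stated assumptions: the class name `CachingConverterSketch` is hypothetical, the cache is unbounded and keyed by the full HTML string, and a production version would likely need size limits or eviction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical caching wrapper; unbounded, for illustration only
final class CachingConverterSketch {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final UnaryOperator<String> delegate;

    CachingConverterSketch(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    String convert(String html) {
        if (html == null) {
            return null; // ConcurrentHashMap rejects null keys
        }
        // The delegate runs once per distinct input; a null result
        // is not stored and will be recomputed on the next call
        return cache.computeIfAbsent(html, delegate);
    }
}
```

Wrapping the service would look like `new CachingConverterSketch(descriptionConverter::convertToMarkdown)`.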

Performance Metrics

Typical conversion times (indicative only; actual figures depend on hardware and HTML complexity):

  • Small HTML (< 1KB): < 5ms
  • Medium HTML (1-10KB): 5-25ms
  • Large HTML (10-100KB): 25-100ms
  • Very Large HTML (> 100KB): 100-500ms

Benchmarking

@Test
void performanceTest() {
    String largeHtml = loadTestHtml("large-job-description.html");
    
    // System.nanoTime() is monotonic, so it is better suited to measuring elapsed time
    long startTime = System.nanoTime();
    String result = descriptionConverter.convertToMarkdown(largeHtml);
    long conversionTimeMs = (System.nanoTime() - startTime) / 1_000_000;
    
    assertThat(conversionTimeMs).isLessThan(1000); // Should complete in < 1 second
    assertThat(result).isNotNull();
    assertThat(result.length()).isGreaterThan(0);
}

Future Enhancements

Planned Features

  • Custom Markdown Templates: Configurable output templates for different job types
  • AI-Powered Section Detection: Use ML to improve section identification
  • Multi-language Support: Support for non-English job descriptions
  • Custom Styling: Configurable Markdown styling options
  • Batch Processing: Optimized batch conversion for multiple jobs
  • Analytics: Conversion quality metrics and analytics

Advanced Configuration

  • Plugin System: Extensible plugin architecture for custom processors
  • Rule Engine: Configurable rules for HTML preprocessing
  • Template Engine: Custom templates for different job sites
  • Quality Scoring: Automatic quality assessment of conversions

📝 Note: The DescriptionConverter is designed to be maintainable and extensible. All job parsers should use this service for consistent formatting across the platform.
