Markdown Converter Guide

📝 Universal HTML to Markdown conversion system for clean job descriptions

This guide covers the DescriptionConverter service implementation, which provides robust HTML to Markdown conversion for all job parsers in NextMove.

Overview
Architecture
Key Features
Implementation Details
Usage Examples
Configuration
Testing
Integration
Troubleshooting
Performance
Future Enhancements

Overview

The DescriptionConverter service is a universal HTML to Markdown conversion system designed to transform messy HTML job descriptions into clean, readable Markdown format. This service is used by all job parsers to ensure consistent formatting across different job sites.

Problem Solved

Before the DescriptionConverter:

❌ Inconsistent job description formatting across different sites
❌ Raw HTML content difficult to read and process
❌ Poor user experience with cluttered job descriptions
❌ Manual formatting required for each job parser

After the DescriptionConverter:

✅ Consistent, clean Markdown formatting for all job descriptions
✅ Automatic HTML cleanup and preprocessing
✅ Structured section extraction and formatting
✅ Universal service used by all parsers

Architecture

System Design

flowchart TD
    A[Raw HTML Content] --> B[DescriptionConverter Service]
    B --> C[HTML Preprocessing]
    C --> D[Flexmark HTML to Markdown]
    D --> E[Markdown Post-processing]
    E --> F[Clean Markdown Output]
    
    C --> G[Remove Unwanted Elements]
    C --> H[Normalize List Structures]
    C --> I[Identify Headers]
    
    E --> J[Clean Multiple Newlines]
    E --> K[Fix Formatting Issues]
    E --> L[Structure Sections]

Core Components

HTML Preprocessor: Cleans and prepares HTML content
Flexmark Converter: Professional HTML to Markdown conversion
Markdown Post-processor: Polishes the final output
Section Extractor: Identifies and structures job description sections
Fallback Handler: Provides plain text extraction when conversion fails
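
The way these components chain together can be sketched without any library dependencies. This is an illustrative outline only: `ConversionPipeline` and its constructor parameters are hypothetical names, and each stage is passed in as a plain function so the control flow (three stages, then fallback on failure) is visible on its own.

```java
import java.util.function.UnaryOperator;

/** Hypothetical illustration of the staged pipeline; not the actual service class. */
final class ConversionPipeline {
    private final UnaryOperator<String> preprocess;   // HTML Preprocessor
    private final UnaryOperator<String> convert;      // HTML to Markdown converter
    private final UnaryOperator<String> postProcess;  // Markdown Post-processor
    private final UnaryOperator<String> fallback;     // Fallback Handler

    ConversionPipeline(UnaryOperator<String> preprocess,
                       UnaryOperator<String> convert,
                       UnaryOperator<String> postProcess,
                       UnaryOperator<String> fallback) {
        this.preprocess = preprocess;
        this.convert = convert;
        this.postProcess = postProcess;
        this.fallback = fallback;
    }

    String run(String html) {
        // Blank input yields null rather than an empty document
        if (html == null || html.trim().isEmpty()) {
            return null;
        }
        try {
            return postProcess.apply(convert.apply(preprocess.apply(html)));
        } catch (RuntimeException e) {
            // Any stage failure degrades to the fallback, applied to the original input
            return fallback.apply(html);
        }
    }
}
```

Keeping the fallback outside the three main stages means a failure anywhere in the chain still produces some output for the caller.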

Key Features

🔧 HTML Preprocessing

Element Removal: Removes scripts, styles, navigation, and hidden elements
Empty Element Cleanup: Removes empty paragraphs and divs
Header Detection: Automatically identifies text that should be headers
List Normalization: Standardizes list structures for better conversion

🎨 Markdown Conversion

Professional Conversion: Uses Flexmark library for robust HTML to Markdown conversion
Configurable Options: Optimized settings for job description formatting
Emphasis Preservation: Maintains important formatting like bold and italic text
Link Handling: Properly converts links and maintains readability

🏗️ Post-processing

Whitespace Cleanup: Removes excessive newlines and trailing spaces
Header Normalization: Ensures consistent header formatting
Section Structure: Organizes content into logical sections
Duplicate Removal: Eliminates duplicate headers and content
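
Duplicate header removal is not shown elsewhere in this guide, so the following is only one plausible approach, assuming duplicates are detected by comparing header text case-insensitively. `DuplicateHeaderFilter` is a hypothetical name, and only the repeated header line is dropped; the text beneath it is kept.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: drops ATX header lines whose text has already appeared. */
final class DuplicateHeaderFilter {
    // One to six '#' characters, whitespace, then the header text
    private static final Pattern HEADER = Pattern.compile("^#{1,6}\\s+(.*)$");

    static String removeDuplicateHeaders(String markdown) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String line : markdown.split("\n", -1)) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                String key = m.group(1).trim().toLowerCase(Locale.ROOT);
                if (!seen.add(key)) {
                    continue; // header text seen before: skip this line only
                }
            }
            kept.add(line);
        }
        return String.join("\n", kept);
    }
}
```

Dropping a header line can leave extra blank lines behind, so in this sketch the filter would run before the whitespace cleanup step.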

🔍 Structured Extraction

Section Identification: Automatically identifies common job description sections
Smart Parsing: Handles various HTML structures and layouts
Fallback Support: Graceful degradation when structured extraction fails
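
How sections are identified is not spelled out in this guide. A simple keyword lookup is one way it could work; the sketch below assumes that approach. `SectionClassifier`, the canonical section names, and the keyword lists are all hypothetical and would need tuning against real postings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: maps a header's text to a canonical section name. */
final class SectionClassifier {
    // Insertion order matters: the first section with a matching keyword wins
    private static final Map<String, List<String>> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("Responsibilities", List.of("responsibilit", "what you'll do", "what you will do", "duties"));
        KEYWORDS.put("Requirements", List.of("requirement", "qualification", "skills"));
        KEYWORDS.put("Benefits", List.of("benefit", "perks", "what we offer"));
        KEYWORDS.put("About", List.of("about us", "about the company", "about the role", "who we are"));
    }

    static Optional<String> classify(String headerText) {
        if (headerText == null) {
            return Optional.empty();
        }
        String normalized = headerText.toLowerCase(Locale.ROOT).trim();
        for (Map.Entry<String, List<String>> entry : KEYWORDS.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (normalized.contains(keyword)) {
                    return Optional.of(entry.getKey());
                }
            }
        }
        return Optional.empty(); // unknown header: caller keeps the original title
    }
}
```

Returning an empty `Optional` for unrecognized headers lets the caller fall back to the original title, in line with the graceful degradation described above.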

Implementation Details

Core Service Class

@Service
public class DescriptionConverter {
    
    private static final Logger logger = LoggerFactory.getLogger(DescriptionConverter.class);
    private final FlexmarkHtmlConverter htmlToMarkdownConverter;
    
    // Regex patterns for cleanup
    private static final Pattern MULTIPLE_NEWLINES = Pattern.compile("\\n{3,}");
    private static final Pattern TRAILING_SPACES = Pattern.compile(" +$", Pattern.MULTILINE);
    private static final Pattern EMPTY_HEADERS = Pattern.compile("^#+\\s*$", Pattern.MULTILINE);
    
    public DescriptionConverter() {
        // Configure Flexmark options for optimal conversion
        MutableDataSet options = new MutableDataSet();
        options.set(FlexmarkHtmlConverter.SETEXT_HEADINGS, false);
        options.set(FlexmarkHtmlConverter.OUTPUT_UNKNOWN_TAGS, false);
        options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_QUOTES, false);
        
        this.htmlToMarkdownConverter = FlexmarkHtmlConverter.builder(options).build();
    }
}

Main Conversion Method

public String convertToMarkdown(String htmlContent) {
    if (htmlContent == null || htmlContent.trim().isEmpty()) {
        return null;
    }
    
    try {
        logger.debug("Converting HTML content to Markdown, length: {}", htmlContent.length());
        
        // Step 1: Clean and preprocess the HTML
        String cleanedHtml = preprocessHtml(htmlContent);
        
        // Step 2: Convert to Markdown
        String markdown = htmlToMarkdownConverter.convert(cleanedHtml);
        
        // Step 3: Post-process the Markdown for better readability
        String finalMarkdown = postProcessMarkdown(markdown);
        
        logger.debug("Conversion complete, Markdown length: {}", finalMarkdown.length());
        return finalMarkdown;
        
    } catch (Exception e) {
        logger.error("Error converting HTML to Markdown", e);
        return extractPlainTextFallback(htmlContent);
    }
}
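
The `postProcessMarkdown` step called above is not listed in this guide. A minimal sketch of what it might do, assuming it simply applies the three cleanup patterns declared in the service class, could look like this (`MarkdownPostProcessor` is a hypothetical standalone wrapper used here so the snippet compiles on its own):

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the post-processing step using the documented patterns. */
final class MarkdownPostProcessor {
    private static final Pattern MULTIPLE_NEWLINES = Pattern.compile("\\n{3,}");
    private static final Pattern TRAILING_SPACES = Pattern.compile(" +$", Pattern.MULTILINE);
    private static final Pattern EMPTY_HEADERS = Pattern.compile("^#+\\s*$", Pattern.MULTILINE);

    static String postProcess(String markdown) {
        if (markdown == null) {
            return "";
        }
        // Normalize Windows line endings so the newline pattern sees "\n" only
        String result = markdown.replace("\r\n", "\n");
        // Strip trailing spaces first so "##   " is recognized as an empty header
        result = TRAILING_SPACES.matcher(result).replaceAll("");
        result = EMPTY_HEADERS.matcher(result).replaceAll("");
        // Collapse last: removing empty headers can leave extra blank lines
        result = MULTIPLE_NEWLINES.matcher(result).replaceAll("\n\n");
        return result.trim();
    }
}
```

The order matters in this sketch: newline collapsing runs last because the earlier steps can create new runs of blank lines.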

HTML Preprocessing

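// Requires org.jsoup.nodes.TextNode in addition to Document and Element
private String preprocessHtml(String html) {
    Document doc = Jsoup.parse(html);
    
    // Remove unwanted elements (inline styles may be written with or without a space)
    doc.select("script, style, nav, header, footer, .hidden, "
            + "[style*='display:none'], [style*='display: none']").remove();
    
    // Remove empty paragraphs, divs, and spans
    doc.select("p:empty, div:empty, span:empty").remove();
    
    // Convert common patterns to more semantic HTML.
    // "body *" limits the scan to descendants of <body>, so <html>, <head>,
    // and <body> themselves are never retagged.
    for (Element element : doc.select("body *")) {
        // Collapse whitespace runs in each direct text node. Calling
        // element.text(...) here would replace all of the element's children
        // and silently drop nested lists, links, and emphasis.
        for (TextNode textNode : element.textNodes()) {
            textNode.text(textNode.getWholeText().replaceAll("\\s+", " "));
        }
        
        // Convert text that looks like headers
        if (isLikelyHeader(element.ownText(), element)) {
            element.tagName("h3");
        }
    }
    
    // Normalize list structures
    normalizeListStructures(doc);
    
    return doc.body().html();
}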
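
`isLikelyHeader` is referenced above but not listed. The heuristic below is a guess at the kind of rule it might apply, not the actual implementation: short text, no sentence-ending punctuation, and at least one header signal (trailing colon, all caps, or bold tag). To stay dependency-free it takes the tag name as a string instead of a jsoup `Element`; `HeaderHeuristic`, the length limit, and the excluded tag list are assumptions.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical header heuristic; thresholds and tag lists are illustrative. */
final class HeaderHeuristic {
    private static final int MAX_HEADER_LENGTH = 60;
    // Tags that should never be promoted to a header
    private static final Set<String> EXCLUDED_TAGS = Set.of(
            "li", "td", "th", "a", "pre", "code", "h1", "h2", "h3", "h4", "h5", "h6");

    static boolean isLikelyHeader(String text, String tagName) {
        if (text == null) {
            return false;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty() || trimmed.length() > MAX_HEADER_LENGTH) {
            return false;
        }
        String tag = tagName == null ? "" : tagName.toLowerCase(Locale.ROOT);
        if (EXCLUDED_TAGS.contains(tag)) {
            return false;
        }
        // Full sentences usually end with punctuation; headers usually do not
        char last = trimmed.charAt(trimmed.length() - 1);
        if (last == '.' || last == '!' || last == '?' || last == ',') {
            return false;
        }
        boolean endsWithColon = last == ':';
        boolean allCaps = trimmed.equals(trimmed.toUpperCase(Locale.ROOT))
                && trimmed.chars().anyMatch(Character::isLetter);
        boolean emphasized = tag.equals("strong") || tag.equals("b");
        return endsWithColon || allCaps || emphasized;
    }
}
```

Requiring a positive signal, rather than treating every short line as a header, keeps short list-like paragraphs such as "Remote" or "Full time" from being promoted.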

Structured Section Extraction

public String extractStructuredDescription(Document doc, String... possibleSelectors) {
    if (doc == null) {
        return null;
    }
    
    StringBuilder description = new StringBuilder();
    
    // Try to find the main description element
    Element descriptionElement = null;
    for (String selector : possibleSelectors) {
        descriptionElement = doc.selectFirst(selector);
        if (descriptionElement != null) {
            break;
        }
    }
    
    if (descriptionElement == null) {
        logger.warn("Could not find description element using provided selectors");
        return null;
    }
    
    // Look for structured sections
    List<Section> sections = extractSections(descriptionElement);
    
    if (!sections.isEmpty()) {
        // We found structured content
        for (Section section : sections) {
            if (section.title != null && !section.title.trim().isEmpty()) {
                description.append("## ").append(section.title).append("\n\n");
            }
            
            String sectionMarkdown = convertToMarkdown(section.content);
            if (sectionMarkdown != null && !sectionMarkdown.trim().isEmpty()) {
                description.append(sectionMarkdown).append("\n\n");
            }
        }
    } else {
        // No structured sections found, convert the whole thing
        description.append(convertElementToMarkdown(descriptionElement));
    }
    
    return description.toString().trim();
}

Usage Examples

Basic HTML to Markdown Conversion

@Autowired
private DescriptionConverter descriptionConverter;

// Convert HTML job description to Markdown
String htmlContent = "<h2>Job Requirements</h2><ul><li>5+ years experience</li><li>Java expertise</li></ul>";
String markdown = descriptionConverter.convertToMarkdown(htmlContent);

// Result:
// ## Job Requirements
// 
// * 5+ years experience
// * Java expertise

Converting JSoup Elements

// When you already have a JSoup Element
Document doc = Jsoup.connect("https://job-site.com/job/123").get();
Element descriptionElement = doc.selectFirst(".job-description");

String markdown = descriptionConverter.convertElementToMarkdown(descriptionElement);

Structured Description Extraction

// Extract structured job description with multiple possible selectors
Document doc = Jsoup.connect("https://job-site.com/job/123").get();

String structuredMarkdown = descriptionConverter.extractStructuredDescription(
    doc,
    ".job-description",
    ".description-content",
    "#job-details"
);

Integration with Job Parsers

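@Service
public class EnhancedJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    // Jsoup.connect(...).get() throws IOException, so the method must declare or handle it
    public JobApplication parseJob(String jobUrl) throws IOException {
        Document doc = Jsoup.connect(jobUrl).get();
        
        // Extract basic job information; selectFirst returns null when nothing matches
        String title = textOrNull(doc.selectFirst(".job-title"));
        String company = textOrNull(doc.selectFirst(".company-name"));
        
        // Convert description using the universal converter
        String description = descriptionConverter.extractStructuredDescription(
            doc,
            ".job-description",
            ".description"
        );
        
        return JobApplication.builder()
            .title(title)
            .company(company)
            .description(description)
            .build();
    }
    
    // Avoids a NullPointerException when a selector matches no element
    private static String textOrNull(Element element) {
        return element != null ? element.text() : null;
    }
}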

Configuration

Flexmark Configuration Options

The service uses optimized Flexmark settings for job description conversion:

MutableDataSet options = new MutableDataSet();

// Use ATX headings (##) instead of Setext headings
options.set(FlexmarkHtmlConverter.SETEXT_HEADINGS, false);

// Skip unknown HTML tags for cleaner output
options.set(FlexmarkHtmlConverter.OUTPUT_UNKNOWN_TAGS, false);

// Keep simple quotes instead of typographic quotes
options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_QUOTES, false);

// Keep simple punctuation
options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_SMARTS, false);

// Don't wrap auto-detected links
options.set(FlexmarkHtmlConverter.WRAP_AUTO_LINKS, false);

// Skip HTML comments
options.set(FlexmarkHtmlConverter.RENDER_COMMENTS, false);

// Use . for numeric lists
options.set(FlexmarkHtmlConverter.DOT_ONLY_NUMERIC_LISTS, true);

Logging Configuration

Enable debug logging for conversion monitoring:

# Enable DescriptionConverter debug logging
logging.level.com.jnleyva.nextmove_backend.service.DescriptionConverter=DEBUG

# Keep jsoup at WARN so HTML parsing does not add log noise
logging.level.org.jsoup=WARN

Testing

Comprehensive Test Suite

The DescriptionConverter includes 100+ test cases covering various scenarios:

@ExtendWith(MockitoExtension.class)
class DescriptionConverterTest {
    
    private DescriptionConverter descriptionConverter;
    
    @BeforeEach
    void setUp() {
        descriptionConverter = new DescriptionConverter();
    }
    
    @Test
    void testBasicHtmlToMarkdown() {
        String html = "<h2>Title</h2><p>Description with <strong>bold</strong> text.</p>";
        String result = descriptionConverter.convertToMarkdown(html);
        
        assertThat(result).contains("## Title");
        assertThat(result).contains("**bold**");
    }
    
    @Test
    void testListConversion() {
        String html = "<ul><li>First item</li><li>Second item</li></ul>";
        String result = descriptionConverter.convertToMarkdown(html);
        
        assertThat(result).contains("* First item");
        assertThat(result).contains("* Second item");
    }
    
    @Test
    void testComplexJobDescription() {
        String html = loadTestHtml("complex-job-description.html");
        String result = descriptionConverter.convertToMarkdown(html);
        
        assertThat(result).isNotNull();
        assertThat(result).doesNotContain("<script>");
        assertThat(result).doesNotContain("<style>");
        assertThat(result.length()).isGreaterThan(100);
    }
}

Test Categories

Basic Conversion Tests: Simple HTML to Markdown conversion
Preprocessing Tests: HTML cleanup and preprocessing
Post-processing Tests: Markdown cleanup and formatting
Structured Extraction Tests: Section identification and extraction
Error Handling Tests: Fallback scenarios and error conditions
Performance Tests: Large content handling and performance metrics
Integration Tests: End-to-end testing with real job sites

Test Data

Test files include real-world examples from various job sites:

greenhouse-job-description.html - Greenhouse format job descriptions
meta-job-description.html - Meta/Facebook job postings
microsoft-job-description.html - Microsoft careers page content
complex-nested-structure.html - Complex nested HTML structures
malformed-html.html - Invalid or malformed HTML content

Integration

Job Parser Integration

All job parsers now use the DescriptionConverter for consistent formatting:

Greenhouse Job Parser

@Service
public class GreenhouseJobParser extends BaseJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    protected String extractDescription(Document doc) {
        return descriptionConverter.extractStructuredDescription(
            doc,
            "#content",
            ".application-description"
        );
    }
}

Microsoft Job Parser

@Service
public class MicrosoftJobParser extends BaseJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    protected String extractDescription(Document doc) {
        return descriptionConverter.extractStructuredDescription(
            doc,
            "[data-automation-id='jobPostingDescription']",
            ".job-description"
        );
    }
}

Meta Job Parser

@Service
public class MetaJobParser extends BaseJobParser {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    protected String extractDescription(Document doc) {
        return descriptionConverter.extractStructuredDescription(
            doc,
            "[data-testid='job-description']",
            ".job-description-content"
        );
    }
}

Dependency Configuration

Add Flexmark dependency to pom.xml:

<dependency>
    <groupId>com.vladsch.flexmark</groupId>
    <artifactId>flexmark-html2md-converter</artifactId>
    <version>0.64.8</version>
</dependency>

Troubleshooting

Common Issues

1. Poor Conversion Quality

Problem: HTML converts to poorly formatted Markdown
Solution:

Check HTML preprocessing settings
Verify Flexmark configuration options
Review input HTML structure

2. Missing Content

Problem: Some HTML content is not appearing in Markdown
Solution:

Check if content is being removed in preprocessing
Verify element selectors are correct
Enable debug logging to trace conversion steps

3. Performance Issues

Problem: Conversion is slow for large HTML content
Solution:

Implement caching for repeated conversions
Optimize HTML preprocessing
Consider async processing for large batches
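
For the async suggestion, one possible shape is a batch helper built on `CompletableFuture`. This is a sketch under assumptions: `BatchConverter` is a hypothetical helper, the converter is passed in as a plain function (for example a method reference to `convertToMarkdown`), and the caller owns the executor.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical helper: converts a batch of HTML documents concurrently. */
final class BatchConverter {
    static List<String> convertAll(List<String> htmlDocs,
                                   Function<String, String> converter,
                                   ExecutorService executor) {
        // Submit every conversion first so they can run in parallel
        List<CompletableFuture<String>> futures = htmlDocs.stream()
                .map(html -> CompletableFuture.supplyAsync(() -> converter.apply(html), executor))
                .collect(Collectors.toList());
        // Then wait for each result; output order matches input order
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

Note that `join` rethrows a failed conversion as a `CompletionException`, so a caller that wants partial results would need to handle failures per future.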

4. Encoding Issues

Problem: Special characters not displaying correctly
Solution:

Ensure proper character encoding in HTML input
Check Flexmark encoding settings
Verify database column encoding
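
A quick way to confirm an encoding mismatch is to decode the same bytes with two charsets and compare. The sketch below uses only the standard library; `EncodingCheck` is a hypothetical helper. UTF-8 bytes read as ISO-8859-1 turn `café` into `cafÃ©`, while ISO-8859-1 bytes read as UTF-8 produce the U+FFFD replacement character.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for diagnosing charset mismatches in fetched HTML. */
final class EncodingCheck {
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // U+FFFD appears when bytes were not valid in the charset used for decoding
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }
}
```

If the raw bytes are available, decoding them with the charset the site actually declares before the HTML reaches the converter is usually the simplest fix.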

Debug Configuration

// Enable detailed debug logging
logger.debug("Input HTML length: {}", htmlContent.length());
logger.debug("Cleaned HTML length: {}", cleanedHtml.length());
logger.debug("Final Markdown length: {}", finalMarkdown.length());

// Log conversion steps
if (logger.isDebugEnabled()) {
    logger.debug("Preprocessing removed {} characters", 
                htmlContent.length() - cleanedHtml.length());
    logger.debug("Conversion result preview: {}", 
                finalMarkdown.substring(0, Math.min(200, finalMarkdown.length())));
}

Health Monitoring

@Component
public class DescriptionConverterHealthIndicator implements HealthIndicator {
    
    @Autowired
    private DescriptionConverter descriptionConverter;
    
    @Override
    public Health health() {
        try {
            // Test basic conversion functionality
            String testHtml = "<h2>Test</h2><p>Test content</p>";
            String result = descriptionConverter.convertToMarkdown(testHtml);
            
            if (result != null && result.contains("## Test")) {
                return Health.up()
                    .withDetail("converter", "operational")
                    .withDetail("flexmark", "ready")
                    .build();
            } else {
                return Health.down()
                    .withDetail("converter", "conversion_failed")
                    .build();
            }
        } catch (Exception e) {
            return Health.down()
                .withDetail("converter", "error")
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

Performance

Optimization Strategies

Caching: Cache converted content to avoid repeated processing
Async Processing: Process large batches asynchronously
Memory Management: Optimize memory usage for large HTML content
Preprocessing Optimization: Minimize DOM manipulation operations
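
The caching strategy could be sketched as a thin wrapper that keys results on a hash of the input. Everything here is an assumption for illustration: `CachingConverter` is a hypothetical class, the delegate stands in for `convertToMarkdown`, and the map is unbounded, so a real deployment would likely want eviction.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical caching wrapper; unbounded, so suitable only as a sketch. */
final class CachingConverter {
    private final Function<String, String> delegate;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    CachingConverter(Function<String, String> delegate) {
        this.delegate = delegate;
    }

    String convert(String html) {
        if (html == null) {
            return null;
        }
        // Key on a digest so large HTML strings are not retained as map keys.
        // A null result from the delegate is simply not cached.
        return cache.computeIfAbsent(sha256(html), key -> delegate.apply(html));
    }

    private static String sha256(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation
            throw new IllegalStateException(e);
        }
    }
}
```

Caching pays off mainly when the same posting is parsed repeatedly, for example on re-crawls; for one-off conversions the hashing cost is pure overhead.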

Performance Metrics

Typical conversion performance:

Small HTML (< 1KB): < 5ms
Medium HTML (1-10KB): 5-25ms
Large HTML (10-100KB): 25-100ms
Very Large HTML (> 100KB): 100-500ms

Benchmarking

@Test
void performanceTest() {
    String largeHtml = loadTestHtml("large-job-description.html");
    
    // Warm up once so class loading and JIT compilation do not dominate the measurement
    descriptionConverter.convertToMarkdown(largeHtml);
    
    // System.nanoTime() is monotonic, which makes it the safer choice for elapsed time
    long startTime = System.nanoTime();
    String result = descriptionConverter.convertToMarkdown(largeHtml);
    long conversionTimeMs = (System.nanoTime() - startTime) / 1_000_000;
    
    assertThat(conversionTimeMs).isLessThan(1000L); // Should complete in < 1 second
    assertThat(result).isNotNull();
    assertThat(result.length()).isGreaterThan(0);
}

Future Enhancements

Planned Features

Custom Markdown Templates: Configurable output templates for different job types
AI-Powered Section Detection: Use ML to improve section identification
Multi-language Support: Support for non-English job descriptions
Custom Styling: Configurable Markdown styling options
Batch Processing: Optimized batch conversion for multiple jobs
Analytics: Conversion quality metrics and analytics

Advanced Configuration

Plugin System: Extensible plugin architecture for custom processors
Rule Engine: Configurable rules for HTML preprocessing
Template Engine: Custom templates for different job sites
Quality Scoring: Automatic quality assessment of conversions

📝 Note: The DescriptionConverter is designed to be maintainable and extensible. All job parsers should use this service for consistent formatting across the platform.