Markdown Converter Guide - jnleyva816/NextMove GitHub Wiki
📝 Universal HTML to Markdown conversion system for clean job descriptions
This guide covers the DescriptionConverter service implementation, which provides robust HTML to Markdown conversion for all job parsers in NextMove.
- Overview
- Architecture
- Key Features
- Implementation Details
- Usage Examples
- Configuration
- Testing
- Integration
- Troubleshooting
- Performance
The DescriptionConverter service is a universal HTML to Markdown conversion system designed to transform messy HTML job descriptions into clean, readable Markdown format. This service is used by all job parsers to ensure consistent formatting across different job sites.
Before the DescriptionConverter:
- ❌ Inconsistent job description formatting across different sites
- ❌ Raw HTML content difficult to read and process
- ❌ Poor user experience with cluttered job descriptions
- ❌ Manual formatting required for each job parser
After the DescriptionConverter:
- ✅ Consistent, clean Markdown formatting for all job descriptions
- ✅ Automatic HTML cleanup and preprocessing
- ✅ Structured section extraction and formatting
- ✅ Universal service used by all parsers
flowchart TD
A[Raw HTML Content] --> B[DescriptionConverter Service]
B --> C[HTML Preprocessing]
C --> D[Flexmark HTML to Markdown]
D --> E[Markdown Post-processing]
E --> F[Clean Markdown Output]
C --> G[Remove Unwanted Elements]
C --> H[Normalize List Structures]
C --> I[Identify Headers]
E --> J[Clean Multiple Newlines]
E --> K[Fix Formatting Issues]
E --> L[Structure Sections]
- HTML Preprocessor: Cleans and prepares HTML content
- Flexmark Converter: Professional HTML to Markdown conversion
- Markdown Post-processor: Polishes the final output
- Section Extractor: Identifies and structures job description sections
- Fallback Handler: Provides plain text extraction when conversion fails
- Element Removal: Removes scripts, styles, navigation, and hidden elements
- Empty Element Cleanup: Removes empty paragraphs and divs
- Header Detection: Automatically identifies text that should be headers
- List Normalization: Standardizes list structures for better conversion
- Professional Conversion: Uses Flexmark library for robust HTML to Markdown conversion
- Configurable Options: Optimized settings for job description formatting
- Emphasis Preservation: Maintains important formatting like bold and italic text
- Link Handling: Properly converts links and maintains readability
- Whitespace Cleanup: Removes excessive newlines and trailing spaces
- Header Normalization: Ensures consistent header formatting
- Section Structure: Organizes content into logical sections
- Duplicate Removal: Eliminates duplicate headers and content
- Section Identification: Automatically identifies common job description sections
- Smart Parsing: Handles various HTML structures and layouts
- Fallback Support: Graceful degradation when structured extraction fails
@Service
public class DescriptionConverter {
private static final Logger logger = LoggerFactory.getLogger(DescriptionConverter.class);
private final FlexmarkHtmlConverter htmlToMarkdownConverter;
// Regex patterns for cleanup
private static final Pattern MULTIPLE_NEWLINES = Pattern.compile("\\n{3,}");
private static final Pattern TRAILING_SPACES = Pattern.compile(" +$", Pattern.MULTILINE);
private static final Pattern EMPTY_HEADERS = Pattern.compile("^#+\\s*$", Pattern.MULTILINE);
public DescriptionConverter() {
// Configure Flexmark options for optimal conversion
MutableDataSet options = new MutableDataSet();
options.set(FlexmarkHtmlConverter.SETEXT_HEADINGS, false);
options.set(FlexmarkHtmlConverter.OUTPUT_UNKNOWN_TAGS, false);
options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_QUOTES, false);
this.htmlToMarkdownConverter = FlexmarkHtmlConverter.builder(options).build();
}
}
public String convertToMarkdown(String htmlContent) {
if (htmlContent == null || htmlContent.trim().isEmpty()) {
return null;
}
try {
logger.debug("Converting HTML content to Markdown, length: {}", htmlContent.length());
// Step 1: Clean and preprocess the HTML
String cleanedHtml = preprocessHtml(htmlContent);
// Step 2: Convert to Markdown
String markdown = htmlToMarkdownConverter.convert(cleanedHtml);
// Step 3: Post-process the Markdown for better readability
String finalMarkdown = postProcessMarkdown(markdown);
logger.debug("Conversion complete, Markdown length: {}", finalMarkdown.length());
return finalMarkdown;
} catch (Exception e) {
logger.error("Error converting HTML to Markdown", e);
return extractPlainTextFallback(htmlContent);
}
}
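The `postProcessMarkdown` step called above is not shown in this guide. A minimal sketch of what it might look like, built only from the three cleanup patterns declared on the class, follows. The class name `MarkdownPostProcessorSketch`, the method name `postProcess`, and the order of the steps are assumptions made for illustration, not the actual NextMove implementation.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the Markdown post-processing step. */
final class MarkdownPostProcessorSketch {

    // The same three patterns that DescriptionConverter declares.
    private static final Pattern MULTIPLE_NEWLINES = Pattern.compile("\\n{3,}");
    private static final Pattern TRAILING_SPACES = Pattern.compile(" +$", Pattern.MULTILINE);
    private static final Pattern EMPTY_HEADERS = Pattern.compile("^#+\\s*$", Pattern.MULTILINE);

    static String postProcess(String markdown) {
        if (markdown == null) {
            return null;
        }
        // 1. Drop spaces at the end of every line.
        String result = TRAILING_SPACES.matcher(markdown).replaceAll("");
        // 2. Drop header markers that have no title text.
        result = EMPTY_HEADERS.matcher(result).replaceAll("");
        // 3. Collapse runs of blank lines last, because step 2 can create new ones.
        result = MULTIPLE_NEWLINES.matcher(result).replaceAll("\n\n");
        return result.trim();
    }

    public static void main(String[] args) {
        String raw = "## Title  \n\n\n\n##\n\nBody text   \n";
        System.out.println(postProcess(raw));
    }
}
```

With the sample input in `main`, the empty `##` line and the extra blank lines disappear, leaving the heading, one blank line, and the body text. Collapsing newlines after removing empty headers avoids leaving a gap where a header used to be.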
private String preprocessHtml(String html) {
Document doc = Jsoup.parse(html);
// Remove unwanted elements (inline display:none is matched with and without a space)
doc.select("script, style, nav, header, footer, .hidden, [style*='display:none'], [style*='display: none']").remove();
// Remove empty paragraphs and divs
doc.select("p:empty, div:empty, span:empty").remove();
// Convert common patterns to more semantic HTML
Elements elements = doc.select("*");
for (Element element : elements) {
String text = element.ownText();
// Convert text that looks like headers
if (isLikelyHeader(text, element)) {
element.tagName("h3");
}
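// Collapse whitespace runs in this element's own text nodes only.
// Element.text(String) is avoided here because it clears child elements such as <strong> and <a>.
for (TextNode textNode : element.textNodes()) {
textNode.text(textNode.getWholeText().replaceAll("\\s+", " "));
}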
}
// Normalize list structures
normalizeListStructures(doc);
return doc.body().html();
}
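The `isLikelyHeader` check used during preprocessing is not shown in this guide. The sketch below illustrates one possible text-based heuristic. Everything in it is an assumption: the class name, the 60-character cutoff, the list of candidate tags, and the rules themselves. It also takes the tag name as a string instead of a Jsoup `Element` so that it stays dependency-free.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical header heuristic; the real isLikelyHeader may use different rules. */
final class HeaderHeuristicSketch {

    private static final int MAX_HEADER_LENGTH = 60; // assumed cutoff
    private static final Set<String> CANDIDATE_TAGS = Set.of("p", "div", "span", "strong", "b");

    static boolean isLikelyHeader(String text, String tagName) {
        if (text == null || tagName == null) {
            return false;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty() || trimmed.length() > MAX_HEADER_LENGTH) {
            return false;
        }
        // Only promote generic containers, never list items or real headings.
        if (!CANDIDATE_TAGS.contains(tagName.toLowerCase(Locale.ROOT))) {
            return false;
        }
        // Full sentences usually end in '.', '!' or '?'; section labels usually do not.
        char last = trimmed.charAt(trimmed.length() - 1);
        if (last == '.' || last == '!' || last == '?') {
            return false;
        }
        boolean endsWithColon = last == ':';
        boolean allCaps = trimmed.chars().anyMatch(Character::isLetter)
                && trimmed.equals(trimmed.toUpperCase(Locale.ROOT));
        return endsWithColon || allCaps;
    }

    public static void main(String[] args) {
        System.out.println(isLikelyHeader("Requirements:", "p"));
        System.out.println(isLikelyHeader("We are hiring a senior engineer.", "p"));
    }
}
```

A heuristic like this trades recall for precision: short labels such as "Requirements:" or "WHAT YOU WILL DO" are promoted, while ordinary sentences and list items are left alone.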
public String extractStructuredDescription(Document doc, String... possibleSelectors) {
if (doc == null) {
return null;
}
StringBuilder description = new StringBuilder();
// Try to find the main description element
Element descriptionElement = null;
for (String selector : possibleSelectors) {
descriptionElement = doc.selectFirst(selector);
if (descriptionElement != null) {
break;
}
}
if (descriptionElement == null) {
logger.warn("Could not find description element using provided selectors");
return null;
}
// Look for structured sections
List<Section> sections = extractSections(descriptionElement);
if (!sections.isEmpty()) {
// We found structured content
for (Section section : sections) {
if (section.title != null && !section.title.trim().isEmpty()) {
description.append("## ").append(section.title).append("\n\n");
}
String sectionMarkdown = convertToMarkdown(section.content);
if (sectionMarkdown != null && !sectionMarkdown.trim().isEmpty()) {
description.append(sectionMarkdown).append("\n\n");
}
}
} else {
// No structured sections found, convert the whole thing
description.append(convertElementToMarkdown(descriptionElement));
}
return description.toString().trim();
}
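The `Section` type used by `extractStructuredDescription` is not defined in this guide. The sketch below assumes it is a simple holder with `title` and `content` fields, matching the field access above, and reproduces the assembly loop on its own. To stay dependency-free it treats `content` as already-converted Markdown, whereas the service passes section HTML through `convertToMarkdown` first.

```java
import java.util.List;

/** Hypothetical stand-alone version of the section assembly loop. */
final class SectionAssemblySketch {

    /** Assumed shape of Section: a plain holder with two fields. */
    static final class Section {
        final String title;
        final String content;

        Section(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    static String assemble(List<Section> sections) {
        StringBuilder description = new StringBuilder();
        for (Section section : sections) {
            // A titled section becomes a level-2 heading.
            if (section.title != null && !section.title.trim().isEmpty()) {
                description.append("## ").append(section.title.trim()).append("\n\n");
            }
            // Blank content is skipped so it does not add empty paragraphs.
            if (section.content != null && !section.content.trim().isEmpty()) {
                description.append(section.content.trim()).append("\n\n");
            }
        }
        return description.toString().trim();
    }

    public static void main(String[] args) {
        List<Section> sections = List.of(
                new Section("Requirements", "* Java expertise"),
                new Section(null, "Untitled introduction text"));
        System.out.println(assemble(sections));
    }
}
```

Note that a section with a title but no content still emits its heading, which mirrors the loop in `extractStructuredDescription`.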
@Autowired
private DescriptionConverter descriptionConverter;
// Convert HTML job description to Markdown
String htmlContent = "<h2>Job Requirements</h2><ul><li>5+ years experience</li><li>Java expertise</li></ul>";
String markdown = descriptionConverter.convertToMarkdown(htmlContent);
// Result:
// ## Job Requirements
//
// * 5+ years experience
// * Java expertise
// When you already have a JSoup Element
Document doc = Jsoup.connect("https://job-site.com/job/123").get();
Element descriptionElement = doc.selectFirst(".job-description");
String markdown = descriptionConverter.convertElementToMarkdown(descriptionElement);
// Extract structured job description with multiple possible selectors
Document doc = Jsoup.connect("https://job-site.com/job/123").get();
String structuredMarkdown = descriptionConverter.extractStructuredDescription(
doc,
".job-description",
".description-content",
"#job-details"
);
@Service
public class EnhancedJobParser {
@Autowired
private DescriptionConverter descriptionConverter;
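public JobApplication parseJob(String jobUrl) throws IOException {
// Connection.get() throws the checked IOException, so it must be declared or handled
Document doc = Jsoup.connect(jobUrl).get();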
// Extract basic job information
String title = doc.selectFirst(".job-title").text();
String company = doc.selectFirst(".company-name").text();
// Convert description using the universal converter
String description = descriptionConverter.extractStructuredDescription(
doc,
".job-description",
".description"
);
return JobApplication.builder()
.title(title)
.company(company)
.description(description)
.build();
}
}
The service uses optimized Flexmark settings for job description conversion:
MutableDataSet options = new MutableDataSet();
// Use ATX headings (##) instead of Setext headings
options.set(FlexmarkHtmlConverter.SETEXT_HEADINGS, false);
// Skip unknown HTML tags for cleaner output
options.set(FlexmarkHtmlConverter.OUTPUT_UNKNOWN_TAGS, false);
// Keep simple quotes instead of typographic quotes
options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_QUOTES, false);
// Keep simple punctuation
options.set(FlexmarkHtmlConverter.TYPOGRAPHIC_SMARTS, false);
// Don't wrap auto-detected links
options.set(FlexmarkHtmlConverter.WRAP_AUTO_LINKS, false);
// Skip HTML comments
options.set(FlexmarkHtmlConverter.RENDER_COMMENTS, false);
// Use . for numeric lists
options.set(FlexmarkHtmlConverter.DOT_ONLY_NUMERIC_LISTS, true);
Enable debug logging for conversion monitoring:
# Enable DescriptionConverter debug logging
logging.level.com.jnleyva.nextmove_backend.service.DescriptionConverter=DEBUG
# Limit jsoup logging to warnings and errors
logging.level.org.jsoup=WARN
The DescriptionConverter includes 100+ test cases covering various scenarios:
@ExtendWith(MockitoExtension.class)
class DescriptionConverterTest {
private DescriptionConverter descriptionConverter;
@BeforeEach
void setUp() {
descriptionConverter = new DescriptionConverter();
}
@Test
void testBasicHtmlToMarkdown() {
String html = "<h2>Title</h2><p>Description with <strong>bold</strong> text.</p>";
String result = descriptionConverter.convertToMarkdown(html);
assertThat(result).contains("## Title");
assertThat(result).contains("**bold**");
}
@Test
void testListConversion() {
String html = "<ul><li>First item</li><li>Second item</li></ul>";
String result = descriptionConverter.convertToMarkdown(html);
assertThat(result).contains("* First item");
assertThat(result).contains("* Second item");
}
@Test
void testComplexJobDescription() {
String html = loadTestHtml("complex-job-description.html");
String result = descriptionConverter.convertToMarkdown(html);
assertThat(result).isNotNull();
assertThat(result).doesNotContain("<script>");
assertThat(result).doesNotContain("<style>");
assertThat(result.length()).isGreaterThan(100);
}
}
- Basic Conversion Tests: Simple HTML to Markdown conversion
- Preprocessing Tests: HTML cleanup and preprocessing
- Post-processing Tests: Markdown cleanup and formatting
- Structured Extraction Tests: Section identification and extraction
- Error Handling Tests: Fallback scenarios and error conditions
- Performance Tests: Large content handling and performance metrics
- Integration Tests: End-to-end testing with real job sites
Test files include real-world examples from various job sites:
- greenhouse-job-description.html: Greenhouse format job descriptions
- meta-job-description.html: Meta/Facebook job postings
- microsoft-job-description.html: Microsoft careers page content
- complex-nested-structure.html: Complex nested HTML structures
- malformed-html.html: Invalid or malformed HTML content
All job parsers now use the DescriptionConverter for consistent formatting:
@Service
public class GreenhouseJobParser extends BaseJobParser {
@Autowired
private DescriptionConverter descriptionConverter;
@Override
protected String extractDescription(Document doc) {
return descriptionConverter.extractStructuredDescription(
doc,
"#content",
".application-description"
);
}
}
@Service
public class MicrosoftJobParser extends BaseJobParser {
@Autowired
private DescriptionConverter descriptionConverter;
@Override
protected String extractDescription(Document doc) {
return descriptionConverter.extractStructuredDescription(
doc,
"[data-automation-id='jobPostingDescription']",
".job-description"
);
}
}
@Service
public class MetaJobParser extends BaseJobParser {
@Autowired
private DescriptionConverter descriptionConverter;
@Override
protected String extractDescription(Document doc) {
return descriptionConverter.extractStructuredDescription(
doc,
"[data-testid='job-description']",
".job-description-content"
);
}
}
Add the Flexmark dependency to pom.xml:
<dependency>
<groupId>com.vladsch.flexmark</groupId>
<artifactId>flexmark-html2md-converter</artifactId>
<version>0.64.8</version>
</dependency>
Problem: HTML converts to poorly formatted Markdown
Solution:
- Check HTML preprocessing settings
- Verify Flexmark configuration options
- Review input HTML structure
Problem: Some HTML content is not appearing in Markdown
Solution:
- Check if content is being removed in preprocessing
- Verify element selectors are correct
- Enable debug logging to trace conversion steps
Problem: Conversion is slow for large HTML content
Solution:
- Implement caching for repeated conversions
- Optimize HTML preprocessing
- Consider async processing for large batches
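One way to approach the async suggestion is to fan a batch out over a thread pool with `CompletableFuture`. The sketch below is an illustration under assumptions: `AsyncBatchSketch` and `convertAll` are hypothetical names, and the converter is passed in as a `Function<String, String>` so that, in the service, a method reference such as `descriptionConverter::convertToMarkdown` could be supplied.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical batch helper; not part of the DescriptionConverter API. */
final class AsyncBatchSketch {

    static List<String> convertAll(List<String> htmlDocs,
                                   Function<String, String> converter,
                                   ExecutorService executor) {
        // Submit every document first so the conversions can overlap.
        List<CompletableFuture<String>> futures = htmlDocs.stream()
                .map(html -> CompletableFuture.supplyAsync(() -> converter.apply(html), executor))
                .collect(Collectors.toList());
        // Joining in submission order keeps results aligned with the input list.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            // Stand-in converter that strips tags; a real caller would pass the service method.
            Function<String, String> standIn = html -> html.replaceAll("<[^>]+>", "").trim();
            System.out.println(convertAll(List.of("<p>One</p>", "<p>Two</p>"), standIn, executor));
        } finally {
            executor.shutdown();
        }
    }
}
```

The pool size of 4 is arbitrary. Whether parallel conversion helps depends on batch size and on whether the converter is safe to call from several threads, which this guide does not state.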
Problem: Special characters not displaying correctly
Solution:
- Ensure proper character encoding in HTML input
- Check Flexmark encoding settings
- Verify database column encoding
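A common cause of this problem is decoding the raw response bytes with the wrong charset before the HTML ever reaches the converter. The sketch below shows the effect using only the JDK. `EncodingSketch` and `decodeHtml` are hypothetical names, and falling back to UTF-8 when no charset is declared is an assumption, not documented NextMove behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing why the input charset matters. */
final class EncodingSketch {

    static String decodeHtml(byte[] rawBytes, Charset declaredCharset) {
        // Assumed policy: trust the declared charset, otherwise fall back to UTF-8.
        Charset charset = declaredCharset != null ? declaredCharset : StandardCharsets.UTF_8;
        return new String(rawBytes, charset);
    }

    public static void main(String[] args) {
        // "Café" written with a Unicode escape so the source file encoding does not matter.
        byte[] utf8Bytes = "<p>Caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);

        String correct = decodeHtml(utf8Bytes, StandardCharsets.UTF_8);
        String garbled = decodeHtml(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The wrong charset turns the two-byte character into two separate characters.
        System.out.println(correct.length() + " chars when decoded as UTF-8");
        System.out.println(garbled.length() + " chars when decoded as ISO-8859-1");
    }
}
```

If the decoded string is already garbled at this point, no Flexmark or database setting downstream can repair it, so the input side is worth checking first.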
// Enable detailed debug logging
logger.debug("Input HTML length: {}", htmlContent.length());
logger.debug("Cleaned HTML length: {}", cleanedHtml.length());
logger.debug("Final Markdown length: {}", finalMarkdown.length());
// Log conversion steps
if (logger.isDebugEnabled()) {
logger.debug("Preprocessing removed {} characters",
htmlContent.length() - cleanedHtml.length());
logger.debug("Conversion result preview: {}",
finalMarkdown.substring(0, Math.min(200, finalMarkdown.length())));
}
@Component
public class DescriptionConverterHealthIndicator implements HealthIndicator {
@Autowired
private DescriptionConverter descriptionConverter;
@Override
public Health health() {
try {
// Test basic conversion functionality
String testHtml = "<h2>Test</h2><p>Test content</p>";
String result = descriptionConverter.convertToMarkdown(testHtml);
if (result != null && result.contains("## Test")) {
return Health.up()
.withDetail("converter", "operational")
.withDetail("flexmark", "ready")
.build();
} else {
return Health.down()
.withDetail("converter", "conversion_failed")
.build();
}
} catch (Exception e) {
return Health.down()
.withDetail("converter", "error")
.withDetail("error", e.getMessage())
.build();
}
}
}
- Caching: Cache converted content to avoid repeated processing
- Async Processing: Process large batches asynchronously
- Memory Management: Optimize memory usage for large HTML content
- Preprocessing Optimization: Minimize DOM manipulation operations
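The caching idea can be sketched as a thin wrapper around any conversion function. This is an illustration only: `CachingConverterSketch` is a hypothetical class, the cache is keyed by the full HTML string, and it is unbounded, so a production version would likely need size limits or eviction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical caching wrapper; not part of the DescriptionConverter API. */
final class CachingConverterSketch {

    private final Function<String, String> delegate;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    CachingConverterSketch(Function<String, String> delegate) {
        this.delegate = delegate;
    }

    String convert(String html) {
        // Mirror convertToMarkdown: blank input yields null and is never cached.
        if (html == null || html.trim().isEmpty()) {
            return null;
        }
        String cached = cache.get(html);
        if (cached != null) {
            return cached;
        }
        String converted = delegate.apply(html);
        // ConcurrentHashMap rejects null values, so only successful results are stored.
        if (converted != null) {
            cache.putIfAbsent(html, converted);
        }
        return converted;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        // Stand-in converter; a real caller would pass the service's conversion method.
        CachingConverterSketch converter =
                new CachingConverterSketch(html -> html.replaceAll("<[^>]+>", "").trim());
        System.out.println(converter.convert("<p>Hello</p>"));
        System.out.println(converter.convert("<p>Hello</p>")); // second call is served from the cache
    }
}
```

Using `get` followed by `putIfAbsent` means two threads may occasionally convert the same input at once, but it keeps slow conversions from running while a map bin is locked, as they would inside `computeIfAbsent`.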
Typical conversion performance:
- Small HTML (< 1KB): < 5ms
- Medium HTML (1-10KB): 5-25ms
- Large HTML (10-100KB): 25-100ms
- Very Large HTML (> 100KB): 100-500ms
@Test
void performanceTest() {
String largeHtml = loadTestHtml("large-job-description.html");
long startTime = System.nanoTime(); // monotonic clock, better suited to elapsed-time measurement
String result = descriptionConverter.convertToMarkdown(largeHtml);
long conversionTimeMs = (System.nanoTime() - startTime) / 1_000_000;
assertThat(conversionTimeMs).isLessThan(1000L); // Should complete in < 1 second
assertThat(result).isNotNull();
assertThat(result.length()).isGreaterThan(0);
}
- Custom Markdown Templates: Configurable output templates for different job types
- AI-Powered Section Detection: Use ML to improve section identification
- Multi-language Support: Support for non-English job descriptions
- Custom Styling: Configurable Markdown styling options
- Batch Processing: Optimized batch conversion for multiple jobs
- Analytics: Conversion quality metrics and analytics
- Plugin System: Extensible plugin architecture for custom processors
- Rule Engine: Configurable rules for HTML preprocessing
- Template Engine: Custom templates for different job sites
- Quality Scoring: Automatic quality assessment of conversions
📝 Note: The DescriptionConverter is designed to be maintainable and extensible. All job parsers should use this service for consistent formatting across the platform.