Data Extraction - JinsongRoh/pydoll-mcp GitHub Wiki

📊 Data Extraction - Web Data Extraction Methods

PyDoll MCP Server provides powerful tools for extracting data from modern websites. This guide demonstrates how to effectively extract data in various scenarios.

🎯 Data Extraction Tools Overview

Basic Extraction Tools

  • 🔍 element_tools: Find page elements and extract text
  • 📝 script_tools: Dynamic data extraction through JavaScript execution
  • 🌐 advanced_tools: Advanced analysis and network monitoring

Key Features

  • 🚫 Zero WebDriver: Direct Chrome DevTools Protocol usage
  • 🛡️ Security Bypass: Automatic handling of Cloudflare and reCAPTCHA challenges
  • 🔄 Dynamic Content: Data extraction after JavaScript rendering
  • 📡 Network Monitoring: API call and response capture

🛠️ Data Extraction Methods

1. Basic Text Extraction

Single Element Text Extraction

"Extract the title of this page"
"Find the price information and get it as text"
"Extract the text from the product description section"

Multiple Element Text Extraction

"Extract all product names"
"Get all link text from the page"
"Extract all text from the review list"
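
Under the hood, prompts like these typically resolve to a short script. A minimal sketch (the `.product-name` selector is a placeholder for the site's actual markup):

```javascript
// Extract the trimmed text of every element matching a selector,
// dropping empty entries.
function extractTexts(root, selector) {
    return Array.from(root.querySelectorAll(selector))
        .map(el => el.textContent.trim())
        .filter(text => text.length > 0);
}

// In a PyDoll script: return extractTexts(document, '.product-name');
```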

2. Attribute Value Extraction

Link URL Extraction

"Extract all link URLs from this page"
"Get the src attribute of product images"
"Extract the href attribute of download links"

Data Attribute Extraction

"Extract all values of data-price attributes"
"Get data from elements whose id attributes start with 'product-'"
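
A sketch of generic attribute collection (the `[data-price]` selector and `data-price` attribute name are placeholders for the site's actual markup):

```javascript
// Collect a given attribute's value from every matching element,
// skipping elements where the attribute is absent.
function extractAttributes(root, selector, attribute) {
    return Array.from(root.querySelectorAll(selector))
        .map(el => el.getAttribute(attribute))
        .filter(value => value !== null);
}

// In a PyDoll script:
// return extractAttributes(document, '[data-price]', 'data-price');
```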

3. Structured Data Extraction

Table Data Extraction

// Table data extraction using JavaScript
const tableData = [];
const rows = document.querySelectorAll('table tbody tr');

rows.forEach(row => {
    const cells = row.querySelectorAll('td');
    const rowData = {
        name: cells[0]?.textContent.trim(),
        price: cells[1]?.textContent.trim(),
        description: cells[2]?.textContent.trim()
    };
    tableData.push(rowData);
});

return tableData;

Card-style Data Extraction

// Product card data extraction
const products = [];
const cards = document.querySelectorAll('.product-card');

cards.forEach(card => {
    const product = {
        title: card.querySelector('.product-title')?.textContent.trim(),
        price: card.querySelector('.price')?.textContent.trim(),
        image: card.querySelector('img')?.src,
        rating: card.querySelector('.rating')?.textContent.trim(),
        link: card.querySelector('a')?.href
    };
    products.push(product);
});

return products;

4. Dynamic Content Extraction

Infinite Scroll Handling

"Scroll to the end of the page while extracting all product data"
"Collect all content that loads as you scroll"
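
One way to sketch the scroll loop in JavaScript (the round limit and wait time are illustrative defaults, not PyDoll settings):

```javascript
// Keep scrolling to the bottom until the page height stops growing
// (i.e. no more content is being lazy-loaded), with an upper bound on
// rounds so the loop always terminates.
async function scrollToEnd(maxRounds = 20, waitMs = 1000) {
    let lastHeight = 0;
    for (let i = 0; i < maxRounds; i++) {
        window.scrollTo(0, document.body.scrollHeight);
        // Give lazy-loaded content time to arrive.
        await new Promise(resolve => setTimeout(resolve, waitMs));
        const height = document.body.scrollHeight;
        if (height === lastHeight) break;  // nothing new appeared
        lastHeight = height;
    }
}

// In a PyDoll script: await scrollToEnd(); then run your extraction code.
```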

AJAX Loading Wait

// Wait for AJAX content loading then extract
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract dynamically loaded content
const dynamicContent = document.querySelectorAll('.dynamic-content');
const data = Array.from(dynamicContent).map(el => ({
    text: el.textContent.trim(),
    html: el.innerHTML
}));

return data;

5. Network API Data Extraction

API Response Monitoring

"Start network request monitoring and collect API response data"
"Capture API calls that occur when loading search results"

Specific API Endpoint Tracking

"Monitor API requests to the '/api/products' path"
"Extract product data from JSON responses"
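
PyDoll's own network monitoring runs over the Chrome DevTools Protocol; as an in-page alternative, response capture can be sketched by patching `fetch` (the `/api/products` fragment and the `__capturedApiData` variable name are placeholders, not PyDoll internals):

```javascript
// Patch window.fetch so JSON responses whose URL contains a path
// fragment are copied into window.__capturedApiData, which a later
// script call can collect and return.
function captureApiResponses(pathFragment) {
    window.__capturedApiData = [];
    const originalFetch = window.fetch;
    window.fetch = async (...args) => {
        const response = await originalFetch(...args);
        if (response.url.includes(pathFragment)) {
            // Clone first so the page can still consume the body.
            response.clone().json()
                .then(data => window.__capturedApiData.push(data))
                .catch(() => {});  // ignore non-JSON bodies
        }
        return response;
    };
}
```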

🔧 Advanced Extraction Techniques

1. Custom Data Extractor

Custom Extraction Rules

// Complex data extraction function
function extractPageData() {
    const data = {
        meta: {
            title: document.title,
            url: window.location.href,
            timestamp: new Date().toISOString()
        },
        content: {
            headings: Array.from(document.querySelectorAll('h1, h2, h3')).map(h => ({
                level: h.tagName.toLowerCase(),
                text: h.textContent.trim()
            })),
            paragraphs: Array.from(document.querySelectorAll('p')).map(p => p.textContent.trim()),
            links: Array.from(document.querySelectorAll('a[href]')).map(a => ({
                text: a.textContent.trim(),
                url: a.href
            })),
            images: Array.from(document.querySelectorAll('img')).map(img => ({
                src: img.src,
                alt: img.alt
            }))
        }
    };
    
    return data;
}

return extractPageData();

2. Form Data Extraction

Input Field Analysis

// Extract form field information
const forms = Array.from(document.querySelectorAll('form'));
const formData = forms.map(form => ({
    action: form.action,
    method: form.method,
    fields: Array.from(form.querySelectorAll('input, select, textarea')).map(field => ({
        name: field.name,
        type: field.type,
        placeholder: field.placeholder,
        required: field.required
    }))
}));

return formData;

3. Performance Metrics Extraction

Page Loading Performance Data

"Run page performance analysis and extract loading time data"
"Measure network request response times and analyze them"
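
A sketch using the browser's standard Navigation Timing API (the metric names in the returned object are illustrative, not a PyDoll schema):

```javascript
// Summarize page-load timing from the Navigation Timing API
// (all values in milliseconds, relative to navigation start).
function extractLoadMetrics(perf = performance) {
    const [nav] = perf.getEntriesByType('navigation');
    if (!nav) return null;
    return {
        dnsMs: nav.domainLookupEnd - nav.domainLookupStart,
        connectMs: nav.connectEnd - nav.connectStart,
        ttfbMs: nav.responseStart - nav.requestStart,
        domContentLoadedMs: nav.domContentLoadedEventEnd - nav.startTime,
        loadMs: nav.loadEventEnd - nav.startTime
    };
}

// In a PyDoll script: return extractLoadMetrics();
```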

📋 Real-World Usage Examples

E-commerce Site Data Extraction

1. "Start browser and navigate to https://example-shop.com"
2. "Navigate to the product listing page"
3. "Extract all product names, prices, and image URLs"
4. "Navigate to next pages and collect all product data"

News Site Content Extraction

1. "Open news site and navigate to main page"
2. "Extract headlines, summaries, and publication dates"
3. "Click on each article link to collect full content"

Social Media Data Extraction

1. "Log in to social media site"
2. "Scroll through timeline and collect post data"
3. "Extract likes, comments, and share counts together"

🛡️ Security Bypass and Special Situation Handling

Cloudflare Protected Sites

"Enable Cloudflare protection bypass and access protected site"
"Automatically solve Turnstile challenges and extract data"

reCAPTCHA Handling

"Automatically solve reCAPTCHA when it appears and proceed"
"Extract data from hidden content areas after authentication"

Login Required Sites

"Find login form, automatically log in, and extract data"
"Maintain session while collecting data from multiple pages"

📊 Data Formatting and Export

Structure Data in JSON Format

// Wrap extracted records in a metadata envelope
// (`items` holds records collected by an earlier extraction step)
const structuredData = {
    metadata: {
        source: window.location.href,
        extractedAt: new Date().toISOString(),
        totalItems: items.length
    },
    data: items.map(item => ({
        id: item.id,
        title: item.title,
        content: item.content,
        attributes: {
            price: item.price,
            category: item.category
        }
    }))
};

return structuredData;

Prepare Data in CSV Format

// Map extracted `items` to rows keyed by CSV column headers
const csvData = items.map(item => ({
    'Product Name': item.name,
    'Price': item.price,
    'Category': item.category,
    'Rating': item.rating,
    'URL': item.url
}));

return csvData;
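
The rows above still need serializing into CSV text; a small generic helper for that step (a sketch, not a PyDoll built-in):

```javascript
// Serialize an array of flat objects into CSV text, quoting fields
// that contain commas, quotes, or newlines.
function toCsv(rows) {
    if (rows.length === 0) return '';
    const headers = Object.keys(rows[0]);
    const escape = value => {
        const text = String(value ?? '');
        return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
    };
    const lines = [headers.join(',')];
    for (const row of rows) {
        lines.push(headers.map(h => escape(row[h])).join(','));
    }
    return lines.join('\n');
}
```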

🎯 Best Practices

1. Efficient Extraction Strategy

  • Selector Optimization: Use specific and stable CSS selectors
  • Wait Times: Extract only after content has finished loading
  • Error Handling: Handle missing elements gracefully instead of failing

2. Performance Optimization

  • Batch Processing: Extract multiple elements at once
  • Memory Management: Process large datasets in chunks
  • Network Efficiency: Block unnecessary resources

3. Data Quality Management

  • Validation: Check extracted values for completeness and expected formats
  • Normalization: Maintain consistent data formats
  • Deduplication: Filter out duplicate records
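
Normalization and deduplication can be combined in one pass; a sketch (keying records on a single field is an illustrative convention, not a PyDoll API):

```javascript
// Normalize whitespace and case on a chosen key field, and drop
// records whose key is blank or already seen.
function cleanItems(items, keyField) {
    const seen = new Set();
    const result = [];
    for (const item of items) {
        const key = String(item[keyField] ?? '').trim().toLowerCase();
        if (key === '' || seen.has(key)) continue;  // skip blanks and repeats
        seen.add(key);
        result.push({ ...item, [keyField]: key });
    }
    return result;
}
```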

🔍 Troubleshooting

Common Issues

Element Not Found

"Wait for element to load then try again"
"Try finding element using different selectors or XPath"

Dynamic Content Loading Failure

"Wait for page loading completion before attempting extraction"
"Check if content is generated after JavaScript execution"

Access Blocked

"Enable stealth mode and evade detection"
"Change user agent and retry"

Use this guide to get the most out of PyDoll MCP Server's data extraction capabilities. If you have further questions or need help with advanced usage, feel free to reach out anytime! 🚀