Datasets - halfpintutopia/DA-deutsch-englisch-einfluss GitHub Wiki

🧠 NLP Project Column Design Cheatsheet

A checklist for designing your dataset schema before scraping.

Note: Thinking ahead to how you'll analyze the data helps you avoid having to re-scrape later.


✅ Core Categories

🏷 Metadata (Source Info)

| Column Name | Purpose / Use Case |
| --- | --- |
| `url` | Trace the source for validation or re-scraping |
| `source_site` | Group/filter by platform (e.g., chip.de, reddit.com) |
| `category` | Categorize by content type (tech, business, lifestyle) |

🕒 Time

| Column Name | Purpose / Use Case |
| --- | --- |
| `date` | Use for time-series trends |
| `year` | Easily group/filter by year |

📄 Content Info

| Column Name | Purpose / Use Case |
| --- | --- |
| `text` | Main content for analysis |
| `title` | Optional: analyze headline trends |
| `word_count` | Normalize features (e.g., per 100 words) |
| `paragraphs` | Filter for article completeness/length |
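The content columns can be derived directly from the scraped text. A minimal sketch — the sample text and the blank-line paragraph convention are assumptions:

```python
# Sketch: deriving the content-info columns from a scraped article body.
# `raw_text` is a made-up example; paragraphs are assumed to be
# separated by blank lines.
raw_text = "Das Startup launcht eine neue App.\n\nDas Update kommt im Juli."

word_count = len(raw_text.split())
paragraphs = len([p for p in raw_text.split("\n\n") if p.strip()])

print(word_count, paragraphs)  # 11 2
```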

💬 Feature-Specific (Loanwords)

| Column Name | Purpose / Use Case |
| --- | --- |
| `loanwords` | Full list of detected loanwords |
| `all_loanwords` | Unique set for variety/density |
| `loanword_count` | Raw number of loanwords used |
| `loanword_density` | Normalized measure (loanwords/words) |
| `top_loanwords` | Compare common usage per domain/article |

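All of the loanword columns above can be computed in one pass over the tokenized text. A minimal sketch, assuming a tiny hand-made loanword set — a real project would load a curated anglicism lexicon:

```python
import re
from collections import Counter

# Assumption: a toy loanword set for illustration only.
LOANWORDS = {"app", "update", "startup", "launchen", "download"}

def loanword_features(text: str) -> dict:
    tokens = re.findall(r"\w+", text.lower())
    hits = [t for t in tokens if t in LOANWORDS]
    return {
        "loanwords": hits,                       # full list, in order
        "all_loanwords": sorted(set(hits)),      # unique set
        "loanword_count": len(hits),             # raw count
        "loanword_density": len(hits) / max(len(tokens), 1),
        "top_loanwords": Counter(hits).most_common(3),
    }

feats = loanword_features("Das Startup pusht ein Update für die App. Die App ist neu.")
```

Computing these once per article, at scrape time, is what saves the re-scraping the note above warns about.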
💡 Emotional / Semantic

| Column Name | Purpose / Use Case |
| --- | --- |
| `sentiment` | Compare tone vs. language borrowing |

🔧 Optional / Advanced (Derived Columns)

| Column Name | Purpose / Use Case |
| --- | --- |
| `contains_denglisch` | Boolean flag if loanword density > threshold |
| `loanword_score` | Weighted score (count × diversity) |
| `language_ratio` | Percentage of English tokens in the full text |

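The derived columns need only the feature columns above. A sketch, where the 0.05 density threshold, the count × diversity scoring, and using detected loanwords as a proxy for English tokens are all illustrative assumptions:

```python
# Sketch: derived columns computed from the already-extracted features.
def derived_columns(row: dict, density_threshold: float = 0.05) -> dict:
    unique = len(set(row["loanwords"]))
    return {
        # Flag articles above an (assumed) density threshold.
        "contains_denglisch": row["loanword_density"] > density_threshold,
        # Weighted score: raw count x number of distinct loanwords.
        "loanword_score": row["loanword_count"] * unique,
        # Proxy ratio: detected loanwords over total words.
        "language_ratio": row["loanword_count"] / max(row["word_count"], 1),
    }

row = {"loanwords": ["app", "app", "update"], "loanword_count": 3,
       "loanword_density": 0.06, "word_count": 50}
derived = derived_columns(row)
```

Because these are pure functions of other columns, they can be recomputed later if the threshold or scoring changes — no re-scraping needed.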
💡 How to Use It

  • 🧠 Ask: What will I want to compare, filter by, or normalize?
  • 🔍 Ask: How do I want to slice the dataset (by time, tone, domain, etc.)?
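Once each article is one row with these columns, slicing is just grouping. A stdlib sketch with made-up rows, averaging loanword density per year:

```python
from collections import defaultdict

# Assumption: toy rows shaped like the schema above.
rows = [
    {"year": 2020, "source_site": "chip.de", "loanword_density": 0.04},
    {"year": 2020, "source_site": "reddit.com", "loanword_density": 0.08},
    {"year": 2021, "source_site": "chip.de", "loanword_density": 0.06},
]

# Group densities by the chosen slice dimension (here: year).
by_year = defaultdict(list)
for r in rows:
    by_year[r["year"]].append(r["loanword_density"])

avg_density = {y: sum(v) / len(v) for y, v in by_year.items()}
print(avg_density)
```

Swapping `"year"` for `"source_site"` (or any other schema column) gives the other slices — which is exactly why those columns are worth planning before scraping.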