Datasets - halfpintutopia/DA-deutsch-englisch-einfluss GitHub Wiki

🧠 NLP Project Column Design Cheatsheet

A checklist for designing your dataset schema before scraping.

Note: Thinking ahead to how you'll analyze the data helps you avoid having to re-scrape later.


✅ Core Categories

🏷 Metadata (Source Info)

| Column Name | Purpose / Use Case |
| --- | --- |
| `url` | Trace the source for validation or re-scraping |
| `source_site` | Group/filter by platform (e.g., chip.de, reddit.com) |
| `category` | Categorize by content type (tech, business, lifestyle) |

🕒 Time

| Column Name | Purpose / Use Case |
| --- | --- |
| `date` | Use for time-series trends |
| `year` | Easily group/filter by year |

📄 Content Info

| Column Name | Purpose / Use Case |
| --- | --- |
| `text` | Main content for analysis |
| `title` | Optional: analyze headline trends |
| `word_count` | Normalize features (e.g., per 100 words) |
| `paragraphs` | Filter for article completeness/length |
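The content columns can be derived directly from the scraped text. A minimal sketch — the sample text and the blank-line paragraph convention are assumptions:

```python
# Sketch: deriving the content-info columns from a scraped article body.
# `raw_text` is a made-up example; paragraphs are assumed to be
# separated by blank lines.
raw_text = "Das Startup launcht eine neue App.\n\nDas Update kommt im Juli."

word_count = len(raw_text.split())
paragraphs = len([p for p in raw_text.split("\n\n") if p.strip()])

print(word_count, paragraphs)  # 11 2
```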

💬 Feature-Specific (Loanwords)

| Column Name | Purpose / Use Case |
| --- | --- |
| `loanwords` | Full list of detected loanwords |
| `all_loanwords` | Unique set for variety/density |
| `loanword_count` | Raw number of loanwords used |
| `loanword_density` | Normalized measure (loanwords/words) |
| `top_loanwords` | Compare common usage per domain/article |

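All of the loanword columns above can be computed in one pass over the tokenized text. A minimal sketch, assuming a tiny hand-made loanword set — a real project would load a curated anglicism lexicon:

```python
import re
from collections import Counter

# Assumption: a toy loanword set for illustration only.
LOANWORDS = {"app", "update", "startup", "launchen", "download"}

def loanword_features(text: str) -> dict:
    tokens = re.findall(r"\w+", text.lower())
    hits = [t for t in tokens if t in LOANWORDS]
    return {
        "loanwords": hits,                       # full list, in order
        "all_loanwords": sorted(set(hits)),      # unique set
        "loanword_count": len(hits),             # raw count
        "loanword_density": len(hits) / max(len(tokens), 1),
        "top_loanwords": Counter(hits).most_common(3),
    }

feats = loanword_features("Das Startup pusht ein Update für die App. Die App ist neu.")
```

Computing these once per article, at scrape time, is what saves the re-scraping the note above warns about.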
💡 Emotional / Semantic

| Column Name | Purpose / Use Case |
| --- | --- |
| `sentiment` | Compare tone vs. language borrowing |

🔧 Optional / Advanced (Derived Columns)

| Column Name | Purpose / Use Case |
| --- | --- |
| `contains_denglisch` | Boolean flag if loanword density > threshold |
| `loanword_score` | Weighted score (count × diversity) |
| `language_ratio` | Percentage of English tokens in the full text |

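The derived columns need only the feature columns above. A sketch, where the 0.05 density threshold, the count × diversity scoring, and using detected loanwords as a proxy for English tokens are all illustrative assumptions:

```python
# Sketch: derived columns computed from the already-extracted features.
def derived_columns(row: dict, density_threshold: float = 0.05) -> dict:
    unique = len(set(row["loanwords"]))
    return {
        # Flag articles above an (assumed) density threshold.
        "contains_denglisch": row["loanword_density"] > density_threshold,
        # Weighted score: raw count x number of distinct loanwords.
        "loanword_score": row["loanword_count"] * unique,
        # Proxy ratio: detected loanwords over total words.
        "language_ratio": row["loanword_count"] / max(row["word_count"], 1),
    }

row = {"loanwords": ["app", "app", "update"], "loanword_count": 3,
       "loanword_density": 0.06, "word_count": 50}
derived = derived_columns(row)
```

Because these are pure functions of other columns, they can be recomputed later if the threshold or scoring changes — no re-scraping needed.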
💡 How to Use It

  • 🧠 Ask: What will I want to compare, filter by, or normalize?
  • 🔍 Ask: How do I want to slice the dataset (by time, tone, domain, etc.)?
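Once each article is one row with these columns, slicing is just grouping. A stdlib sketch with made-up rows, averaging loanword density per year:

```python
from collections import defaultdict

# Assumption: toy rows shaped like the schema above.
rows = [
    {"year": 2020, "source_site": "chip.de", "loanword_density": 0.04},
    {"year": 2020, "source_site": "reddit.com", "loanword_density": 0.08},
    {"year": 2021, "source_site": "chip.de", "loanword_density": 0.06},
]

# Group densities by the chosen slice dimension (here: year).
by_year = defaultdict(list)
for r in rows:
    by_year[r["year"]].append(r["loanword_density"])

avg_density = {y: sum(v) / len(v) for y, v in by_year.items()}
print(avg_density)
```

Swapping `"year"` for `"source_site"` (or any other schema column) gives the other slices — which is exactly why those columns are worth planning before scraping.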