# 🧠 NLP Project Column Design Cheatsheet

A checklist for designing a dataset schema before scraping.

Note: Thinking ahead to how you'll analyze the data helps avoid repeated scraping runs.
## ✅ Core Categories

### 🏷 Metadata (Source Info)

| Column Name | Purpose / Use Case |
| ----------- | ------------------ |
| `url` | Trace source for validation or re-scraping |
| `source_site` | Group/filter by platform (e.g., chip.de, reddit.com) |
| `category` | Categorize by content type (tech, business, lifestyle) |
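As a quick illustration, here is one scraped record holding just the metadata columns (Python, which the sketches below also use). The URL and the `category` value are made up for the example:

```python
# One scraped record, metadata columns only.
record = {
    "url": "https://www.chip.de/news/beispiel-artikel",  # hypothetical URL
    "source_site": "chip.de",
    "category": "tech",  # content-type label (tech, business, lifestyle, ...)
}
```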
### 🕒 Time

| Column Name | Purpose / Use Case |
| ----------- | ------------------ |
| `date` | Use for time-series trends |
| `year` | Easily group/filter by year |
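Parsing `date` once at build time means `year` falls out for free. A minimal pandas sketch (the date strings are toy values):

```python
import pandas as pd

# Parse `date` once, then derive `year` for cheap grouping/filtering.
df = pd.DataFrame({"date": ["2023-05-14", "2024-01-02"]})  # toy values
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
```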
### 📄 Content Info

| Column Name | Purpose / Use Case |
| ----------- | ------------------ |
| `text` | Main content for analysis |
| `title` | Optional: analyze headline trends |
| `word_count` | Normalize features (e.g., per 100 words) |
| `paragraphs` | Filter for article completeness/length |
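`word_count` and `paragraphs` can be derived once at scrape time instead of re-computed per analysis. A minimal sketch, assuming paragraphs are separated by blank lines:

```python
def content_stats(text: str) -> dict:
    """Derive `word_count` and `paragraphs` from raw article text."""
    words = text.split()
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {"word_count": len(words), "paragraphs": len(paragraphs)}
```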
### 💬 Feature-Specific (Loanwords)

| Column Name | Purpose / Use Case |
| ----------- | ------------------ |
| `loanwords` | Full list of detected loanwords |
| `all_loanwords` | Unique set for variety/density |
| `loanword_count` | Raw number of loanwords used |
| `loanword_density` | Normalized measure (loanwords/words) |
| `top_loanwords` | Compare common usage per domain/article |
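All five loanword columns can come out of a single pass over the tokens. A minimal sketch with a stand-in lexicon (the real project would load its own loanword list):

```python
from collections import Counter

LOANWORDS = {"update", "download", "meeting", "team"}  # stand-in lexicon

def loanword_features(text: str, top_n: int = 5) -> dict:
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    hits = [t for t in tokens if t in LOANWORDS]
    counts = Counter(hits)
    return {
        "loanwords": hits,                   # full list, in document order
        "all_loanwords": sorted(set(hits)),  # unique set for variety
        "loanword_count": len(hits),
        "loanword_density": len(hits) / max(len(tokens), 1),
        "top_loanwords": [w for w, _ in counts.most_common(top_n)],
    }
```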
### 💡 Emotional / Semantic

| Column Name | Purpose / Use Case |
| ----------- | ------------------ |
| `sentiment` | Compare tone vs. language borrowing |
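For German text, one option is the textblob-de package; a minimal sketch (the project may well settle on a different scorer):

```python
from textblob_de import TextBlobDE  # pip install textblob-de

def sentiment_score(text: str) -> float:
    # Polarity in [-1.0, 1.0]: negative vs. positive tone.
    return TextBlobDE(text).sentiment.polarity
```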
### 🔧 Optional / Advanced (Derived Columns)

| Column Name | Purpose / Use Case |
| ----------- | ------------------ |
| `contains_denglisch` | Boolean flag if loanword density > threshold |
| `loanword_score` | Weighted score (count × diversity) |
| `language_ratio` | Percent of English tokens in full text |
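These derived columns fall straight out of the feature columns above. A minimal sketch, where the density threshold and the English-token count (e.g., from a separate language-ID step) are assumptions:

```python
DENSITY_THRESHOLD = 0.02  # assumed cutoff; tune against your own data

def derived_columns(features: dict, english_tokens: int, word_count: int) -> dict:
    return {
        "contains_denglisch": features["loanword_density"] > DENSITY_THRESHOLD,
        # Weighted score: raw count × diversity of distinct loanwords.
        "loanword_score": features["loanword_count"] * len(features["all_loanwords"]),
        "language_ratio": english_tokens / max(word_count, 1),
    }
```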
## 💡 How to Use It

- 🧠 Ask: what will I want to compare, filter by, or normalize?
- 🔍 Ask: how do I want to slice the dataset (time, tone, domain, etc.)? See the sketch below.
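Once the rows are in a pandas DataFrame, slicing is a one-liner; a sketch with toy values:

```python
import pandas as pd

# Toy rows; real data would come from the scraper.
df = pd.DataFrame([
    {"source_site": "chip.de", "year": 2023, "loanword_density": 0.031},
    {"source_site": "chip.de", "year": 2024, "loanword_density": 0.040},
    {"source_site": "reddit.com", "year": 2024, "loanword_density": 0.055},
])

# Mean loanword density per platform and year.
summary = (
    df.groupby(["source_site", "year"])["loanword_density"]
      .mean()
      .sort_values(ascending=False)
)
print(summary)
```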