Use case of Unstructured Data | Restaurant Review - rahul7838/quora.clone GitHub Wiki
Unstructured data, like the restaurant review below, can include a rating, free-form text, and attachments:
```
⭐⭐⭐⭐
Visited Joe's Pizza yesterday. Amazing margherita! 🍕
Service was a bit slow but worth the wait.
Price: $$
#foodie #nycfood
Uploaded: 3 photos of pizza
```
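Before deciding where to process it, it helps to see what "processing" means for a post like this. A small parser sketch (the field names `rating`, `price`, `hashtags`, and `photoCount` are illustrative, not from this wiki) could turn the raw post into structured fields:

```javascript
// Sketch: split a raw review post into structured fields.
// Field names (rating, price, hashtags, photoCount) are illustrative.
function parseRawReview(raw) {
  const lines = raw.split('\n').map((l) => l.trim()).filter(Boolean);
  const stars = (lines[0].match(/⭐/g) || []).length; // count star emoji
  const hashtags = raw.match(/#[\p{L}\d_]+/gu) || []; // e.g. #foodie
  const priceLine = lines.find((l) => l.startsWith('Price:'));
  const photoLine = lines.find((l) => l.startsWith('Uploaded:'));
  return {
    rating: stars,
    price: priceLine ? priceLine.replace('Price:', '').trim() : null,
    hashtags,
    photoCount: photoLine ? Number((photoLine.match(/\d+/) || ['0'])[0]) : 0
  };
}
```

Run on the sample above, this yields a rating of 4, price `$$`, two hashtags, and a photo count of 3.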
The text can contain emoji, hashtags, and other free-form elements, so the data needs processing before it can be stored and searched. The question is where that processing should happen. Below are the pros and cons of processing review data on the client vs. the server, followed by the recommended approach.
CLIENT-SIDE PROCESSING:
Pros:
- Reduces server load
- Faster feedback to user
- Can show real-time previews
Cons:
- JavaScript can be disabled/manipulated
- Different browsers might process differently
- Security concerns (can't trust client data)
- Mobile devices might have limited processing power
- Code duplication across platforms (web, mobile, etc.)
SERVER-SIDE PROCESSING:
Pros:
- Consistent processing logic
- More secure
- Access to better NLP/AI tools
- Easier to update algorithms
- Can handle heavy processing
- Single source of truth
Cons:
- Higher server load
- Slightly slower response time
- More server resources needed
RECOMMENDED APPROACH: Do minimal processing on client, main processing on server.
- Client-side:
```javascript
// Basic validation and formatting before sending to the server
function prepareReviewData(userInput) {
  return {
    text: userInput.text.trim(),
    rating: parseInt(userInput.rating, 10), // always pass the radix
    photos: userInput.photos.map(photo => ({
      file: photo,
      size: photo.size,
      type: photo.type
    })),
    visitDate: new Date(userInput.visitDate),
    clientTimestamp: new Date()
  };
}
```
- Server-side:
```javascript
async function processReview(rawReview) {
  // 1. Extract components
  const processed = {
    ...rawReview,
    analysis: {
      hashtags: extractHashtags(rawReview.text),
      emojis: extractEmojis(rawReview.text),
      topics: await analyzeTopics(rawReview.text),
      sentiment: await analyzeSentiment(rawReview.text),
      keywords: await extractKeywords(rawReview.text)
    }
  };

  // 2. Process images
  if (rawReview.photos) {
    processed.photos = await Promise.all(
      rawReview.photos.map(processPhoto)
    );
  }

  // 3. Store processed data
  const storedReview = await saveToDatabase(processed);
  await indexForSearch(processed);
  return storedReview;
}
```
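`extractHashtags` and `extractEmojis` are used above but never defined. Regex-based sketches could look like this (the `await`-ed helpers — `analyzeTopics`, `analyzeSentiment`, `extractKeywords` — would wrap an NLP service and are omitted):

```javascript
// Sketch implementations of the two synchronous helpers used above.
// These are assumptions; the wiki does not define them.
function extractHashtags(text) {
  // A hashtag: '#' followed by letters, digits, or underscores
  return text.match(/#[\p{L}\d_]+/gu) || [];
}

function extractEmojis(text) {
  // Unicode property escapes match pictographic characters (emoji)
  return text.match(/\p{Extended_Pictographic}/gu) || [];
}
```

Note that `\p{…}` property escapes require the `u` flag and Node 10+.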
- API Flow:
```javascript
app.post('/api/reviews', async (req, res) => {
  try {
    // 1. Basic validation
    const rawReview = validateReview(req.body);

    // 2. Process review
    const processedReview = await processReview(rawReview);

    // 3. Return processed review
    res.json({
      success: true,
      review: processedReview
    });
  } catch (error) {
    res.status(400).json({
      success: false,
      error: error.message
    });
  }
});
```
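`validateReview` is called above but not shown. A minimal sketch, assuming a review needs non-empty text and an integer rating from 1 to 5 (these rules are assumptions, not from the wiki):

```javascript
// Sketch: basic server-side validation for an incoming review.
// The exact rules are assumed; the wiki only names validateReview.
function validateReview(body) {
  if (!body || typeof body.text !== 'string' || body.text.trim() === '') {
    throw new Error('Review text is required');
  }
  const rating = Number(body.rating);
  if (!Number.isInteger(rating) || rating < 1 || rating > 5) {
    throw new Error('Rating must be an integer between 1 and 5');
  }
  return { text: body.text.trim(), rating, photos: body.photos || [] };
}
```

Throwing on bad input lets the route's `catch` block turn validation failures into a 400 response.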
WHY THIS APPROACH IS BETTER:
- Security:
- Server can validate all data
- Can't bypass processing rules
- Can apply rate limiting
- Consistency:
- Same processing for all clients
- Easier to maintain one codebase
- Uniform data quality
- Performance:
- Can use powerful server resources
- Better NLP/AI tools available
- Async processing possible
- Maintainability:
- Single place to update algorithms
- Easier to debug
- Better monitoring
- Flexibility:
- Can change processing without client updates
- Can A/B test different algorithms
- Can add new features easily
Remember: "Never trust client-side data" is a good security principle to follow!
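Rate limiting, listed under Security above, pairs naturally with that principle. A fixed-window, in-memory limiter takes only a few lines (a sketch; the limit and window values are illustrative, and a production deployment would usually back this with a shared store such as Redis):

```javascript
// Sketch: fixed-window in-memory rate limiter (per key, e.g. per user or IP).
// Returns an allow(key) function; limit and windowMs are illustrative values.
function createRateLimiter(limit, windowMs) {
  const hits = new Map(); // key -> { count, windowStart }
  return function allow(key, now = Date.now()) {
    const entry = hits.get(key);
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(key, { count: 1, windowStart: now }); // start a new window
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}
```

`allow(key)` returns `false` once a key exceeds `limit` requests inside one window; the route handler would reply with HTTP 429 in that case.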
We should also not make users wait for data processing. Below are several approaches to handling it asynchronously:
- Using Message Queue (RabbitMQ):
```javascript
// API Endpoint
app.post('/api/reviews', async (req, res) => {
  try {
    // 1. Store raw data in MongoDB first
    // (insertOne resolves to { insertedId } in the Node MongoDB driver)
    const { insertedId: reviewId } = await db.rawReviews.insertOne({
      userId: req.user.id,
      content: req.body,
      status: 'PENDING',
      createdAt: new Date()
    });

    // 2. Send to queue for processing
    await rabbitMQ.publish('review-queue', {
      reviewId,
      content: req.body
    });

    // 3. Quick response to user
    res.json({
      success: true,
      message: 'Review submitted successfully',
      reviewId
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});
```
```javascript
// Consumer Service
rabbitMQ.consume('review-queue', async (message) => {
  const { reviewId, content } = message;
  try {
    // Process review
    const processedData = await processReview(content);

    // Update MongoDB
    await db.reviews.insertOne(processedData);
    await db.rawReviews.updateOne(
      { _id: reviewId },
      { $set: { status: 'PROCESSED' } }
    );

    // Index in Elasticsearch
    await indexToElasticsearch(processedData);
  } catch (error) {
    // Handle error, maybe retry
    await db.rawReviews.updateOne(
      { _id: reviewId },
      { $set: { status: 'FAILED', error: error.message } }
    );
  }
});
```
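"Maybe retry" in the consumer's catch block can be made concrete with a bounded-retry helper (a sketch; the attempt count and backoff values are illustrative):

```javascript
// Sketch: retry an async step with exponential backoff, rethrowing the last
// error once maxAttempts is exhausted. Values are illustrative defaults.
async function withRetries(fn, maxAttempts = 3, baseDelayMs = 100) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Exponential backoff: baseDelayMs, 2x, 4x, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

The consumer would then call `await withRetries(() => processReview(content))` and only mark the review `FAILED` after all attempts are exhausted.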
- Using Kafka (Event Streaming):
```javascript
// Producer (API)
app.post('/api/reviews', async (req, res) => {
  try {
    // insertOne resolves to { insertedId } in the Node MongoDB driver
    const { insertedId: reviewId } = await db.rawReviews.insertOne({
      userId: req.user.id,
      content: req.body,
      status: 'PENDING'
    });

    // Publish to Kafka
    await kafka.send({
      topic: 'reviews',
      messages: [{
        key: reviewId.toString(),
        value: JSON.stringify({
          reviewId,
          content: req.body
        })
      }]
    });

    res.json({ success: true, reviewId });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});
```
```javascript
// Consumer
kafka.subscribe('reviews').on('message', async (message) => {
  const { reviewId, content } = JSON.parse(message.value);
  // Process similarly to the RabbitMQ consumer above
});
```
- Using Redis for Quick Access:
```javascript
app.post('/api/reviews', async (req, res) => {
  try {
    // Store in Redis first (for quick access)
    const reviewId = generateId();
    await redis.set(`review:${reviewId}`, JSON.stringify({
      content: req.body,
      status: 'PENDING'
    }), 'EX', 3600); // expire in 1 hour

    // Queue for processing
    await queue.add('process-review', {
      reviewId,
      content: req.body
    });

    res.json({ success: true, reviewId });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});
```
Which Approach to Choose?
- RabbitMQ when:
- Need guaranteed message delivery
- Processing order matters
- Simple queue functionality needed
- Lower latency is priority
- Kafka when:
- Need event streaming/replay
- High throughput required
- Want to maintain event history
- Multiple consumers needed
- Data replication important
- Redis + Queue when:
- Need very fast initial storage
- Want to show unprocessed reviews quickly
- Temporary storage is acceptable
- Simple queue setup preferred
My Recommendation: For a review system, I'd recommend Kafka because:
- Can replay events if processing fails
- Multiple services can consume events (processing, analytics, notifications)
- Scales well for high volume
- Good for future expansion (like analytics)
Example Architecture:
```
User -> API -> MongoDB (raw) + Kafka
                        |
                        v
               Processing Service
                        |
                        v
     MongoDB (processed) + Elasticsearch
```
This gives you:
- Instant user feedback
- Reliable processing
- Scalability
- Data consistency
- Future extensibility