Use case of Unstructured Data | Restaurant Review - rahul7838/quora.clone GitHub Wiki

Unstructured data, such as the review below, can combine a rating, free text, and attachments.

⭐⭐⭐⭐ 
Visited Joe's Pizza yesterday. Amazing margherita! 🍕
Service was a bit slow but worth the wait.
Price: $$ 
#foodie #nycfood
Uploaded: 3 photos of pizza

The text can contain emoji, hashtags, and other free-form elements.
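To make "free-form elements" concrete, here is a minimal sketch that pulls structured fields out of the sample review. The function names, the field names, and the assumption that the rating appears as leading star characters are illustrative only, not part of any library.

```javascript
// Sketch: extract structured fields from the free-text review shown above.
// All function and field names here are illustrative assumptions.

function extractHashtags(text) {
  // "#" followed by letters, digits, or underscores (Unicode-aware).
  return (text.match(/#[\p{L}\p{N}_]+/gu) || []).map(tag => tag.slice(1).toLowerCase());
}

function extractEmojis(text) {
  // Matches one pictographic code point at a time, so multi-code-point
  // sequences (ZWJ family emoji, for example) come back as separate pieces.
  return text.match(/\p{Extended_Pictographic}/gu) || [];
}

function parseReview(text) {
  // Leading run of star characters, each optionally followed by U+FE0F.
  const stars = text.match(/^\s*((?:\u2B50\uFE0F?)+)/);
  const price = text.match(/Price:\s*(\$+)/);
  const photos = text.match(/Uploaded:\s*(\d+)\s+photos?/i);
  return {
    rating: stars ? stars[1].match(/\u2B50/g).length : null,
    priceTier: price ? price[1].length : null,
    photoCount: photos ? Number(photos[1]) : 0,
    hashtags: extractHashtags(text),
    emojis: extractEmojis(text),
  };
}

const sample = [
  '⭐⭐⭐⭐',
  "Visited Joe's Pizza yesterday. Amazing margherita! 🍕",
  'Service was a bit slow but worth the wait.',
  'Price: $$',
  '#foodie #nycfood',
  'Uploaded: 3 photos of pizza',
].join('\n');

console.log(parseReview(sample));
```

Regex-based extraction like this is cheap enough to run anywhere; the heavier analysis (topics, sentiment) is what drives the client versus server question discussed next.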

This data needs to be processed before it is useful. The sections below compare processing review data on the client versus the server, then give a recommended approach:

CLIENT-SIDE PROCESSING:

Pros:

  • Reduces server load
  • Faster feedback to user
  • Can show real-time previews

Cons:

  • JavaScript can be disabled/manipulated
  • Different browsers might process differently
  • Security concerns (can't trust client data)
  • Mobile devices might have limited processing power
  • Code duplication across platforms (web, mobile, etc.)

SERVER-SIDE PROCESSING:

Pros:

  • Consistent processing logic
  • More secure
  • Access to better NLP/AI tools
  • Easier to update algorithms
  • Can handle heavy processing
  • Single source of truth

Cons:

  • Higher server load
  • Slightly slower response time
  • More server resources needed

RECOMMENDED APPROACH: Do minimal processing on client, main processing on server.

  1. Client-side:
// Basic validation and formatting
function prepareReviewData(userInput) {
  return {
    text: userInput.text.trim(),
    rating: parseInt(userInput.rating, 10), // explicit radix avoids surprises with odd input
    photos: userInput.photos.map(photo => ({
      file: photo,
      size: photo.size,
      type: photo.type
    })),
    visitDate: new Date(userInput.visitDate),
    clientTimestamp: new Date()
  };
}
  2. Server-side:
async function processReview(rawReview) {
  // 1. Extract components
  const processed = {
    ...rawReview,
    analysis: {
      hashtags: extractHashtags(rawReview.text),
      emojis: extractEmojis(rawReview.text),
      topics: await analyzeTopics(rawReview.text),
      sentiment: await analyzeSentiment(rawReview.text),
      keywords: await extractKeywords(rawReview.text)
    }
  };

  // 2. Process images
  if (rawReview.photos) {
    processed.photos = await Promise.all(
      rawReview.photos.map(processPhoto)
    );
  }

  // 3. Store processed data
  const storedReview = await saveToDatabase(processed);
  await indexForSearch(processed);

  return storedReview;
}
  3. API Flow:
app.post('/api/reviews', async (req, res) => {
  try {
    // 1. Basic validation
    const rawReview = validateReview(req.body);

    // 2. Process review
    const processedReview = await processReview(rawReview);

    // 3. Return processed review
    res.json({
      success: true,
      review: processedReview
    });
  } catch (error) {
    res.status(400).json({
      success: false,
      error: error.message
    });
  }
});
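The route above calls validateReview without defining it. A minimal sketch of what it might look like follows; the specific rules (a 1 to 5 rating, a 5000-character cap, the allowed image types) and the ValidationError class are assumptions chosen for illustration.

```javascript
// Hypothetical validateReview: the name matches the call in the route,
// but every rule below is an assumption, not the author's actual logic.
const MAX_TEXT_LENGTH = 5000;
const ALLOWED_PHOTO_TYPES = new Set(['image/jpeg', 'image/png', 'image/webp']);

class ValidationError extends Error {}

function validateReview(body) {
  if (body === null || typeof body !== 'object') {
    throw new ValidationError('Request body must be a JSON object');
  }

  const text = typeof body.text === 'string' ? body.text.trim() : '';
  if (text.length === 0 || text.length > MAX_TEXT_LENGTH) {
    throw new ValidationError(`text must be 1-${MAX_TEXT_LENGTH} characters`);
  }

  // Accept numbers or numeric strings only; everything else becomes NaN.
  const rawRating = body.rating;
  const rating =
    typeof rawRating === 'number' || typeof rawRating === 'string'
      ? Number(rawRating)
      : NaN;
  if (!Number.isInteger(rating) || rating < 1 || rating > 5) {
    throw new ValidationError('rating must be an integer from 1 to 5');
  }

  const photos = Array.isArray(body.photos) ? body.photos : [];
  for (const photo of photos) {
    if (!photo || !ALLOWED_PHOTO_TYPES.has(photo.type)) {
      throw new ValidationError(`unsupported photo type: ${photo && photo.type}`);
    }
  }

  // Return only the whitelisted fields so unexpected keys are dropped.
  return { text, rating, photos };
}
```

With a dedicated error class, the route's catch block could answer 400 only when `error instanceof ValidationError` and fall back to 500 for anything else, so genuine server faults are not reported as client mistakes.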

WHY THIS APPROACH IS BETTER:

  1. Security:
  • Server can validate all data
  • Can't bypass processing rules
  • Can apply rate limiting
  2. Consistency:
  • Same processing for all clients
  • Easier to maintain one codebase
  • Uniform data quality
  3. Performance:
  • Can use powerful server resources
  • Better NLP/AI tools available
  • Async processing possible
  4. Maintainability:
  • Single place to update algorithms
  • Easier to debug
  • Better monitoring
  5. Flexibility:
  • Can change processing without client updates
  • Can A/B test different algorithms
  • Can add new features easily

Remember: "Never trust client-side data" is a good security principle to follow!

Users should not have to wait for data processing to finish. Here are a few approaches that respond immediately and process in the background:

  1. Using Message Queue (RabbitMQ):
// API Endpoint
app.post('/api/reviews', async (req, res) => {
  try {
    // 1. Store raw data in MongoDB first
    //    (insertOne resolves to a result object; the new id is in insertedId)
    const { insertedId } = await db.rawReviews.insertOne({
      userId: req.user.id,
      content: req.body,
      status: 'PENDING',
      createdAt: new Date()
    });

    // 2. Send to queue for processing
    await rabbitMQ.publish('review-queue', {
      reviewId: insertedId,
      content: req.body
    });

    // 3. Quick response to user
    res.json({
      success: true,
      message: "Review submitted successfully",
      reviewId: insertedId
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

// Consumer Service
rabbitMQ.consume('review-queue', async (message) => {
  const { reviewId, content } = message;
  try {
    // Process review
    const processedData = await processReview(content);
    
    // Update MongoDB
    await db.reviews.insertOne(processedData);
    await db.rawReviews.updateOne(
      { _id: reviewId },
      { $set: { status: 'PROCESSED' } }
    );

    // Index in Elasticsearch
    await indexToElasticsearch(processedData);
  } catch (error) {
    // Handle error, maybe retry
    await db.rawReviews.updateOne(
      { _id: reviewId },
      { $set: { status: 'FAILED', error: error.message } }
    );
  }
});
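The "maybe retry" comment in the consumer can be sketched as a small helper that retries an async task with exponential backoff before giving up. withRetry and its options are hypothetical names, not part of RabbitMQ or any client library; the injectable sleep function is there only to make the helper easy to test.

```javascript
// Sketch: retry an async task with exponential backoff.
// withRetry, retries, baseDelayMs, and sleep are illustrative names.
async function withRetry(task, { retries = 3, baseDelayMs = 200, sleep } = {}) {
  const wait = sleep || (ms => new Promise(resolve => setTimeout(resolve, ms)));
  let lastError;

  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Delay doubles each time: baseDelayMs, 2x, 4x, ...
        await wait(baseDelayMs * 2 ** attempt);
      }
    }
  }

  // All attempts failed; the caller can now mark the review as FAILED.
  throw lastError;
}
```

In the consumer this would wrap the processing call, for example `await withRetry(() => processReview(content))`, leaving the existing catch block to record the FAILED status once retries are exhausted.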
  2. Using Kafka (Event Streaming):
// Producer (API)
app.post('/api/reviews', async (req, res) => {
  try {
    // insertOne resolves to a result object; the new id is in insertedId
    const { insertedId } = await db.rawReviews.insertOne({
      userId: req.user.id,
      content: req.body,
      status: 'PENDING'
    });

    // Publish to Kafka
    await kafka.send({
      topic: 'reviews',
      messages: [{
        key: insertedId.toString(),
        value: JSON.stringify({
          reviewId: insertedId,
          content: req.body
        })
      }]
    });

    res.json({ success: true, reviewId: insertedId });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

// Consumer
kafka.subscribe('reviews').on('message', async (message) => {
  const { reviewId, content } = JSON.parse(message.value);
  // Process similar to RabbitMQ example
});
  3. Using Redis for Quick Access:
app.post('/api/reviews', async (req, res) => {
  try {
    // Store in Redis first (for quick access)
    const reviewId = generateId();
    await redis.set(`review:${reviewId}`, JSON.stringify({
      content: req.body,
      status: 'PENDING'
    }), 'EX', 3600); // expire in 1 hour

    // Queue for processing
    await queue.add('process-review', {
      reviewId,
      content: req.body
    });

    res.json({ success: true, reviewId });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

Which Approach to Choose?

  1. RabbitMQ when:
  • Need guaranteed message delivery
  • Processing order matters
  • Simple queue functionality needed
  • Lower latency is priority
  2. Kafka when:
  • Need event streaming/replay
  • High throughput required
  • Want to maintain event history
  • Multiple consumers needed
  • Data replication important
  3. Redis + Queue when:
  • Need very fast initial storage
  • Want to show unprocessed reviews quickly
  • Temporary storage is acceptable
  • Simple queue setup preferred

My Recommendation: For a review system, Kafka is a good fit because:

  1. Can replay events if processing fails
  2. Multiple services can consume events (processing, analytics, notifications)
  3. Scales well for high volume
  4. Good for future expansion (like analytics)
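Replay has a consequence worth spelling out: the same event may reach the processing service more than once, so the handler should be idempotent. The sketch below is one way to do that. createIdempotentHandler is a hypothetical helper, and the in-memory Set stands in for a durable store such as a unique index on reviewId. It assumes events for one review are handled one at a time, which is what keying messages by review id is meant to provide within a partition.

```javascript
// Sketch: wrap a processing function so replayed or redelivered events
// are skipped. The Set is an in-memory stand-in (assumption) for a
// durable record of already-processed review ids.
function createIdempotentHandler(processFn, seen = new Set()) {
  return async function handle(event) {
    if (seen.has(event.reviewId)) {
      return { reviewId: event.reviewId, skipped: true };
    }
    const result = await processFn(event);
    // Record the id only after processing succeeds, so a failed
    // attempt can still be retried or replayed later.
    seen.add(event.reviewId);
    return { reviewId: event.reviewId, skipped: false, result };
  };
}
```

Because the id is recorded only after success, a crash mid-processing leads to a repeat rather than a lost review, which is usually the safer failure mode here.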

Example Architecture:

User -> API -> MongoDB (raw) + Kafka
                     ↓
            Processing Service
                     ↓
     MongoDB (processed) + Elasticsearch

This gives you:

  • Instant user feedback
  • Reliable processing
  • Scalability
  • Data consistency
  • Future extensibility
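The flow in the architecture above can be tried out without any infrastructure. In this sketch a Map stands in for the raw MongoDB collection and a plain array stands in for the Kafka topic. Both are assumptions for illustration, as are the function names. What it shows is that submission returns before any processing runs.

```javascript
// In-memory stand-ins (assumptions) for the raw store and the event stream.
const rawReviews = new Map(); // reviewId -> { content, status, ... }
const pendingEvents = [];     // events waiting for the processing service
let nextReviewId = 1;

// API side: store the raw review, enqueue an event, respond right away.
function submitReview(content) {
  const reviewId = String(nextReviewId++);
  rawReviews.set(reviewId, { content, status: 'PENDING' });
  pendingEvents.push({ reviewId, content });
  return { success: true, reviewId };
}

// Processing service: drain events and record the outcome per review.
async function drainEvents(processFn) {
  while (pendingEvents.length > 0) {
    const { reviewId, content } = pendingEvents.shift();
    const record = rawReviews.get(reviewId);
    try {
      record.processed = await processFn(content);
      record.status = 'PROCESSED';
    } catch (error) {
      record.status = 'FAILED';
      record.error = error.message;
    }
  }
}
```

A client would call submitReview, show the review as pending, and check the status later; drainEvents plays the role of the separate processing service.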