Deployment

Key challenges

ML Issues

Concept drift or data drift (the data changes); the change can be gradual or a sudden shock.

An example of concept drift: x = house size, y = house price.

Over time houses get more expensive, so the same house size maps to a higher price.

An example of data drift: people suddenly start building larger or smaller houses, so the input distribution of house sizes x changes over time.

So when deploying an ML system, one of the most important tasks is often to detect and manage changes, both concept drift (the mapping from x to y changes) and data drift (the distribution of x changes, even if the mapping to y does not).
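As a concrete illustration of detecting data drift on a numeric input such as house size, here is a minimal sketch that compares the training distribution against a recent production window with a two-sample Kolmogorov-Smirnov test (the feature, window sizes, and significance level are assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_data_drift(train_values, recent_values, alpha=0.01):
    """Flag data drift on one numeric feature by comparing the training
    distribution against recent production inputs (two-sample KS test)."""
    result = ks_2samp(train_values, recent_values)
    return {"statistic": result.statistic, "p_value": result.pvalue,
            "drift": result.pvalue < alpha}

# Example: house sizes at training time vs. a recent window where
# people have started building larger houses (data drift on x)
train_sizes = np.random.normal(loc=120, scale=30, size=5000)
recent_sizes = np.random.normal(loc=150, scale=40, size=1000)
print(detect_data_drift(train_sizes, recent_sizes))
```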

Software engineering issues

Checklist questions to help manage software engineering issues:

  1. Do you need real-time predictions or batch predictions?
  2. Does your prediction system run in the cloud or on the edge/browser (cars, mobile, factories)?
  3. Compute resources (CPU, GPU, memory)
    • Sometimes the prediction system must have the same resources used for training; otherwise we should compress the model or reduce its complexity.
  4. Latency, throughput (QPS - queries per second)
    • If we have a goal of 1,000 QPS, we must make sure that we have the compute resources to achieve it.
  5. Logging
    • It is good to log as much data as possible for analysis and review, or to provide more data for training in the future (see the logging sketch after this list).
  6. Security and privacy
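To illustrate the logging point above, here is a minimal sketch that appends every prediction request and response to a JSON-lines file so the data can later be analysed or fed back into training (the file path and field names are assumptions; production systems usually log to a dedicated service or data lake):

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("prediction_log.jsonl")  # assumed location

def log_prediction(features: dict, prediction, model_version: str) -> None:
    """Append one prediction event as a JSON line for later analysis or retraining."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage
log_prediction({"size_m2": 120, "rooms": 3}, prediction=350_000, model_version="v1.2")
```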

The first deployment of the model is only about 50% of the work; the second half starts after deployment, when there is a lot of work to feed data back and possibly update the model.

Deployment Patterns

Common deployment cases:

  • New product/capability
  • Automate, assist with manual task
  • Replace previous ML system

Key ideas:

  • Gradual ramp up with monitoring
  • Rollback

Example:

Shadow mode: the ML system shadows the human and runs in parallel. The ML system's output is not used for any decision in this phase; it is used to check whether the algorithm's predictions are accurate and to decide whether to let the ML system make predictions on its own.
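A minimal sketch of shadow mode, assuming a model object with a predict method and a simple logging callback: the ML prediction is computed and logged next to the human decision, but only the human decision is acted on.

```python
def handle_request(features, human_decision, ml_model, log_fn):
    """Shadow mode: the ML model runs in parallel with the human, but its
    output is only logged and never used for the actual decision."""
    ml_prediction = ml_model.predict(features)
    log_fn({"features": features, "human": human_decision, "ml": ml_prediction})
    return human_decision  # the human decision is what the system acts on

# Example with hypothetical stand-ins
class DummyModel:
    def predict(self, features):
        return "approve"

print(handle_request({"size_m2": 120}, human_decision="reject",
                     ml_model=DummyModel(), log_fn=print))
```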

Canary deployment: roll out to a small fraction of traffic initially, monitor the system, and ramp up traffic gradually. This helps you stop problems very quickly.
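A minimal sketch of the canary routing idea, assuming two model objects with a predict method; the 5% starting fraction is an assumption and would be ramped up gradually while monitoring:

```python
import random

CANARY_FRACTION = 0.05  # start with 5% of traffic, ramp up gradually while monitoring

def predict(features, current_model, canary_model):
    """Route a small fraction of requests to the canary model,
    the rest to the current production model."""
    if random.random() < CANARY_FRACTION:
        return canary_model.predict(features), "canary"
    return current_model.predict(features), "current"
```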

Blue green deployment: the old version of your software is called the blue version and the new version is called the green version. A router switches the traffic over to the green version.

This makes it easy to enable rollback.
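A minimal sketch of the router behind blue/green deployment, assuming both model versions stay running; switching and rolling back are each a single flag flip (the class and method names are illustrative):

```python
class BlueGreenRouter:
    """Keep both versions deployed; flip one flag to switch traffic
    to green, and flip it back to roll back to blue."""

    def __init__(self, blue_model, green_model):
        self.blue = blue_model    # old version
        self.green = green_model  # new version
        self.active = "blue"

    def predict(self, features):
        model = self.green if self.active == "green" else self.blue
        return model.predict(features)

    def switch_to_green(self):
        self.active = "green"

    def rollback(self):
        self.active = "blue"
```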

Degrees of automation

  • Human only
  • Shadow mode
  • AI assistance
  • Partial automation
  • Full automation

Monitoring

Monitoring dashboard metrics, for example (a sketch computing two of these appears after the steps below):
  • Server load
  • Fraction of non-null outputs
  • Fraction of missing input values

To decide what to monitor:
  1. Brainstorm the things that could go wrong
  2. Brainstorm a few metrics that will detect the problem
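Here is a minimal sketch computing two of the dashboard metrics above from a batch of logged prediction events (the event layout is an assumption):

```python
def dashboard_metrics(events):
    """Compute simple health metrics from a batch of logged prediction events.
    Each event is assumed to look like {"inputs": {...}, "output": ...}."""
    n = len(events)
    non_null_outputs = sum(1 for e in events if e["output"] is not None)
    missing_inputs = sum(
        1 for e in events for v in e["inputs"].values() if v is None
    )
    total_inputs = sum(len(e["inputs"]) for e in events)
    return {
        "fraction_non_null_output": non_null_outputs / n if n else 0.0,
        "fraction_missing_input_values": (
            missing_inputs / total_inputs if total_inputs else 0.0
        ),
    }

# Example batch of two logged events
events = [
    {"inputs": {"size_m2": 120, "rooms": 3}, "output": 350_000},
    {"inputs": {"size_m2": None, "rooms": 2}, "output": None},
]
print(dashboard_metrics(events))  # 0.5 non-null outputs, 0.25 missing inputs
```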

Examples (Speech recognition):

  • Software metrics:
    • Memory, compute, latency, throughput, server load
  • Input metrics (x):
    • Avg input length, input volume, number of missing values, avg image brightness
  • Output metrics (y):
    • Times the system returns null, times the user redoes the search, times the user switches to typing, click-through rate (CTR)

ML modeling and deployment form an iterative loop: deploy, monitor, collect data, and update the model.

  • Set thresholds for alarms (see the sketch below)
  • Adapt the metrics over time
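A minimal sketch of threshold-based alarms: each monitored metric is compared against a configurable bound, and the bounds are expected to be adapted over time (the metric names and values here are assumptions):

```python
# Thresholds are assumptions; in practice they are tuned and revised over time.
ALARM_THRESHOLDS = {
    "fraction_non_null_output": ("min", 0.95),
    "fraction_missing_input_values": ("max", 0.05),
    "p99_latency_ms": ("max", 200),
}

def check_alarms(metrics):
    """Return alarm messages for every metric that crossed its threshold."""
    alarms = []
    for name, (kind, bound) in ALARM_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (kind == "min" and value < bound) or (kind == "max" and value > bound):
            alarms.append(f"{name}={value} violates {kind} threshold {bound}")
    return alarms

# Example: a degraded output metric triggers one alarm
print(check_alarms({"fraction_non_null_output": 0.90, "p99_latency_ms": 120}))
```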

**Model maintenance**

  • Manual retraining
  • Automatic retraining

It is only by monitoring the system that we can spot a problem, go back to collect more data, and update the model to improve system performance (a sketch contrasting both options follows).
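A minimal sketch contrasting the two options: with manual retraining a monitoring alarm pages a human, with automatic retraining it triggers a retraining job directly (retrain_model and notify_team are hypothetical placeholders):

```python
def maintenance_step(alarms, automatic, retrain_model, notify_team):
    """React to monitoring alarms: alert a human (manual retraining) or
    trigger a retraining job directly (automatic retraining)."""
    if not alarms:
        return "ok"
    if automatic:
        retrain_model()      # hypothetical: launch a retraining pipeline on fresh data
        return "retraining_started"
    notify_team(alarms)      # hypothetical: page an engineer to investigate and retrain
    return "alert_sent"

# Example usage with placeholder callbacks
print(maintenance_step(
    alarms=["fraction_non_null_output=0.9 violates min threshold 0.95"],
    automatic=True,
    retrain_model=lambda: print("retraining job launched"),
    notify_team=lambda a: print("alert:", a),
))
```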

Pipeline Monitoring

  • Software metrics
  • Input and output metrics