Getting to Know MLflow (认识MLflow.md) - liuxiang/liuxiang.github.io GitHub Wiki

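I. MLflow

Three components

  • Tracking: model training. Recording model parameters and metrics by hand is tedious; Tracking records a model's configuration and results and displays them visually.
  • Projects: model project management. Model results are hard to reproduce; Projects uses conda to recreate the environment and dependencies a model needs, so that results can be reproduced.
  • Models: model package management. Deploying a developed model is hard; Models packages and wraps the model and provides deployment tooling.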

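Problem addressed: deploying ML is hard. Because there are so many deployment tools and target environments (for example REST serving, batch inference, or mobile applications), moving a model to production can be challenging. There is no standard way to move a model from an arbitrary library into these tools, so every new deployment introduces new risk. (Source: "MLflow: an open-source machine learning platform", Jianshu)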

1. Example training

https://github.com/mlflow/mlflow/tree/master/examples

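quickstart/mlflow_tracking.py is a basic example that introduces MLflow concepts.

pytorch uses a CNN on the MNIST dataset for character recognition. The example logs TensorBoard events and stores them as MLflow artifacts.
remote_store shows how to use a REST-based backing store for tracking.
sklearn_elasticnet_diabetes uses the sklearn diabetes dataset and ElasticNet to predict diabetes progression.
sklearn_elasticnet_wine is an example of an MLflow Project. It uses the Wine Quality dataset and Elastic Net to predict quality. The example uses an MLproject file to set up the Conda environment and to define parameter types, default values, the training entry point, and so on.
docker demonstrates how to create and run an MLflow Project that uses docker (rather than conda) to manage project dependencies.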

Environment setup

Anaconda3 : download
mlflow : pip install mlflow
examples: https://github.com/mlflow/mlflow/tree/master/examples

Basic training example: examples/quickstart/mlflow_tracking.py

python examples/quickstart/mlflow_tracking.py

import os
from random import random, randint

from mlflow import log_metric, log_param, log_artifacts

if __name__ == "__main__":
    print("Running mlflow_tracking.py")

    # Parameters
    log_param("param1", randint(0, 100))

    # Metrics
    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")

    log_artifacts("outputs")  # log every file under outputs/ as run artifacts
  • Output
.
├── mlruns
│   └── 0
│       ├── c866b38f23b441fe9d23c42db04ff42d
│       │   ├── artifacts
│       │   │   └── test.txt
│       │   ├── meta.yaml
│       │   ├── metrics
│       │   │   └── foo
│       │   ├── params
│       │   │   └── param1
│       │   └── tags
│       │       ├── mlflow.source.name
│       │       ├── mlflow.source.type
│       │       └── mlflow.user
│       └── meta.yaml
├── outputs
│   └── test.txt
└── quickstart
    └── mlflow_tracking.py
  • UI service (http://localhost:5000)

    Use the MLflow UI to compare the models you have produced. Run it from the working directory that contains the mlruns directory:

    $ mlflow ui                                      
    [2020-06-22 14:18:05 +0800] [26220] [INFO] Starting gunicorn 20.0.4
    [2020-06-22 14:18:05 +0800] [26220] [INFO] Listening at: http://127.0.0.1:5000 (26220)
    [2020-06-22 14:18:05 +0800] [26220] [INFO] Using worker: sync
    [2020-06-22 14:18:05 +0800] [26224] [INFO] Booting worker with pid: 26224
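
The three log_metric("foo", ...) calls in the quickstart end up in the file mlruns/0/<run-id>/metrics/foo shown in the tree above. A minimal sketch of reading such a file back without the UI follows. It assumes each line holds "timestamp value step" separated by spaces, which is how the local file store appears to write metrics in MLflow 1.x; the sample text and the names parse_metric_file and MetricPoint are hypothetical.

```python
from typing import List, NamedTuple


class MetricPoint(NamedTuple):
    timestamp_ms: int
    value: float
    step: int


def parse_metric_file(text: str) -> List[MetricPoint]:
    """Parse the content of a metrics/<key> file (assumed format: 'timestamp value step')."""
    points = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"unexpected metric line: {line!r}")
        # Older files may omit the step column; treat it as step 0.
        step = int(fields[2]) if len(fields) == 3 else 0
        points.append(MetricPoint(int(fields[0]), float(fields[1]), step))
    return points


# Hypothetical file content, shaped like three log_metric("foo", ...) calls.
SAMPLE = "1592806685000 0.25 0\n1592806685010 1.5 0\n1592806685020 2.75 0\n"

history = parse_metric_file(SAMPLE)
print([point.value for point in history])
```

For anything beyond a quick look, the tracking client or the REST endpoint 2.0/mlflow/metrics/get-history is the safer route, since the on-disk layout is an implementation detail.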
    

2. Running a model with mlflow and generating the model package (*.pkl)

2.1 Example model Project structure

└── mlflow-example
    ├── LICENSE.txt
    ├── MLproject           # Project metadata ★
    ├── README.md
    ├── conda.yaml          # dependency description
    ├── train.py            # training code ★
    └── wine-quality.csv    # input data

2.2 mlflow run examples/sklearn_elasticnet_wine

https://github.com/mlflow/mlflow/tree/master/examples/sklearn_elasticnet_wine

  • MLproject (Project metadata)
name: tutorial

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py {alpha} {l1_ratio}"
  • conda.yaml
name: tutorial
channels:
  - defaults
  - anaconda
  - conda-forge
dependencies:
  - python=3.6
  - scikit-learn=0.19.1
  - pip
  - pip:
    - mlflow>=1.0
  • train.py (log_model generates the model package, *.pkl)
...
        if tracking_url_type_store != "file":

            # Register the model
            # There are other ways to use the Model Registry, which depends on the use case,
            # please refer to the doc for more information:
            # https://mlflow.org/docs/latest/model-registry.html#api-workflow
            mlflow.sklearn.log_model(lr, "model", registered_model_name="ElasticnetWineModel")
        else:
            mlflow.sklearn.log_model(lr, "model")  # writes the model package
  • mlflow run examples/sklearn_elasticnet_wine -P alpha=0.4
$ mlflow run examples/sklearn_elasticnet_wine -P alpha=0.4
# Create the conda environment
2020/06/22 12:51:00 INFO mlflow.projects: === Creating conda environment mlflow-6284a367a61b51ccdf445333a216776597fb4efc ===
WARNING: The conda.compat module is deprecated and will be removed in a future release.
WARNING: The conda.compat module is deprecated and will be removed in a future release.
Collecting package metadata: done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.6.11
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base -c defaults conda

# Download dependencies
Downloading and Extracting Packages
tbb4py-2020.0        | 64 KB     | ###################################### | 100% 
mkl_fft-1.0.6        | 139 KB    | ###################################### | 100% 
scikit-learn-0.19.1  | 4.8 MB    | ###################################### | 100% 
numpy-1.15.4         | 35 KB     | ###################################### | 100% 
mkl_random-1.0.1     | 349 KB    | ###################################### | 100% 
tbb-2020.0           | 167 KB    | ###################################### | 100% 
numpy-base-1.15.4    | 4.1 MB    | ###################################### | 100% 
python-3.6.10        | 20.5 MB   | ###################################### | 100% 
pip-20.1.1           | 2.0 MB    | ###################################### | 100% 
mkl-2018.0.3         | 149.2 MB  | ###################################### | 100% 
certifi-2020.4.5.2   | 160 KB    | ###################################### | 100% 
scipy-1.1.0          | 15.4 MB   | ###################################### | 100% 
setuptools-47.3.0    | 643 KB    | ###################################### | 100% 
wheel-0.34.2         | 49 KB     | ###################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

# Install the pip dependencies into the conda virtual environment mlflow-*
Ran pip subprocess with arguments:
['/Users/liuxiang/anaconda3/envs/mlflow-6284a367a61b51ccdf445333a216776597fb4efc/bin/python', '-m', 'pip', 'install', '-r', '/Users/liuxiang/PycharmProjects/mlflow_demo/examples/sklearn_elasticnet_wine/condaenv.jio31sys.requirements.txt']
Pip subprocess output:
Collecting mlflow>=1.0
  Using cached mlflow-1.9.0-py3-none-any.whl (11.9 MB)
...
Successfully built querystring-parser pyyaml prometheus-flask-exporter databricks-cli alembic sqlalchemy

# The following set of dependencies was installed
Installing collected packages: six, protobuf, smmap, gitdb, gitpython, pycparser, cffi, cryptography, urllib3, idna, chardet, requests, oauthlib, requests-oauthlib, isodate, msrest, azure-core, azure-storage-blob, gunicorn, websocket-client, docker, querystring-parser, pyyaml, entrypoints, python-dateutil, prometheus-client, itsdangerous, click, Werkzeug, MarkupSafe, Jinja2, Flask, prometheus-flask-exporter, tabulate, databricks-cli, gorilla, cloudpickle, Mako, sqlalchemy, python-editor, alembic, sqlparse, pytz, pandas, mlflow
Successfully installed Flask-1.1.2 Jinja2-2.11.2 Mako-1.1.3 MarkupSafe-1.1.1 Werkzeug-1.0.1 alembic-1.4.2 azure-core-1.6.0 azure-storage-blob-12.3.2 cffi-1.14.0 chardet-3.0.4 click-7.1.2 cloudpickle-1.4.1 cryptography-2.9.2 databricks-cli-0.11.0 docker-4.2.1 entrypoints-0.3 gitdb-4.0.5 gitpython-3.1.3 gorilla-0.3.0 gunicorn-20.0.4 idna-2.9 isodate-0.6.0 itsdangerous-1.1.0 mlflow-1.9.0 msrest-0.6.16 oauthlib-3.1.0 pandas-1.0.5 prometheus-client-0.8.0 prometheus-flask-exporter-0.14.1 protobuf-3.12.2 pycparser-2.20 python-dateutil-2.8.1 python-editor-1.0.4 pytz-2020.1 pyyaml-5.3.1 querystring-parser-1.2.4 requests-2.24.0 requests-oauthlib-1.3.0 six-1.15.0 smmap-3.0.4 sqlalchemy-1.3.13 sqlparse-0.3.1 tabulate-0.8.7 urllib3-1.25.9 websocket-client-0.57.0

#
# To activate this environment, use
#
#     $ conda activate mlflow-6284a367a61b51ccdf445333a216776597fb4efc
#
# To deactivate an active environment, use
#
#     $ conda deactivate

2020/06/22 13:08:26 INFO mlflow.projects: === Created directory /var/folders/0q/89pc1xbn38bcg4zff6b3qnpc0000gn/T/tmp226r4wv1 for downloading remote URIs passed to arguments of type 'path' ===
2020/06/22 13:08:26 INFO mlflow.projects: === Running command 'source /Users/liuxiang/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-6284a367a61b51ccdf445333a216776597fb4efc 1>&2 && python train.py 0.4 0.1' in run with ID 'eaacfe12eee3436399f6125baeda013f' === 
Elasticnet model (alpha=0.400000, l1_ratio=0.100000):
  RMSE: 0.7410782793160982
  MAE: 0.5712718681984226
  R2: 0.22185255063708886
2020/06/22 13:09:22 INFO mlflow.projects: === Run (ID 'eaacfe12eee3436399f6125baeda013f') succeeded ===

1. Create the virtual environment and install dependencies: Creating conda environment mlflow-6284a367a61b51ccdf445333a216776597fb4efc

2. Running command (activate the virtual environment, then run the training script): source /Users/liuxiang/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-6284a367a61b51ccdf445333a216776597fb4efc 1>&2 && python train.py 0.4 0.1
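
The log above shows `-P alpha=0.4` turning the MLproject command template "python train.py {alpha} {l1_ratio}" into "python train.py 0.4 0.1". The sketch below illustrates that substitution: declared defaults are merged with overrides and formatted into the template. It is not MLflow's implementation (which also handles path/uri types and shell quoting); render_command and CASTS are hypothetical names.

```python
from typing import Any, Dict

# Parameter declarations copied from the MLproject file shown above.
PARAMETERS = {
    "alpha": {"type": "float", "default": 0.5},
    "l1_ratio": {"type": "float", "default": 0.1},
}
COMMAND = "python train.py {alpha} {l1_ratio}"

# Only the two simplest MLproject types are modelled in this sketch.
CASTS = {"float": float, "string": str}


def render_command(command: str,
                   parameters: Dict[str, Dict[str, Any]],
                   overrides: Dict[str, Any]) -> str:
    """Fill the command template with overrides, falling back to declared defaults."""
    values = {}
    for name, spec in parameters.items():
        if name in overrides:
            raw = overrides[name]
        elif "default" in spec:
            raw = spec["default"]
        else:
            raise ValueError(f"missing required parameter: {name}")
        cast = CASTS.get(spec.get("type", "string"), str)
        values[name] = cast(raw)  # e.g. "0.4" -> 0.4, rejects non-numeric input
    return command.format(**values)


# Equivalent of: mlflow run examples/sklearn_elasticnet_wine -P alpha=0.4
print(render_command(COMMAND, PARAMETERS, {"alpha": "0.4"}))
```

A parameter without a default (such as alpha in the docker example's MLproject) must be supplied with -P, which is the ValueError branch here.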

  • Running mlflow run examples/sklearn_elasticnet_wine a second time reuses the existing virtual environment

     mlflow run examples/sklearn_elasticnet_wine -P alpha=0.4
    2020/06/22 13:59:49 INFO mlflow.projects: === Created directory /var/folders/0q/89pc1xbn38bcg4zff6b3qnpc0000gn/T/tmp_f0csjn6 for downloading remote URIs passed to arguments of type 'path' ===
    2020/06/22 13:59:49 INFO mlflow.projects: === Running command 'source /Users/liuxiang/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-6284a367a61b51ccdf445333a216776597fb4efc 1>&2 && python train.py 0.4 0.1' in run with ID '261a7ce97e7b4dffbff3a74391ca3865' === 
    Elasticnet model (alpha=0.400000, l1_ratio=0.100000):
      RMSE: 0.7410782793160982
      MAE: 0.5712718681984226
      R2: 0.22185255063708886
    2020/06/22 13:59:56 INFO mlflow.projects: === Run (ID '261a7ce97e7b4dffbff3a74391ca3865') succeeded ===
    
  • After log_model, the model package is written to /Users/.../mlruns/0/261a7ce97e7b4dffbff3a74391ca3865/artifacts/model

    ├── artifacts
    │   └── model
    │       ├── MLmodel
    │       ├── conda.yaml
    │       └── model.pkl
    ├── meta.yaml
    ├── metrics
    │   ├── mae
    │   ├── r2
    │   └── rmse
    ├── params
    │   ├── alpha
    │   └── l1_ratio
    └── tags
        ├── mlflow.log-model.history
        ├── mlflow.project.backend
        ├── mlflow.project.entryPoint
        ├── mlflow.project.env
        ├── mlflow.source.name
        ├── mlflow.source.type
        └── mlflow.user
    
    
    # MLmodel (model metadata)
    artifact_path: model
    flavors:
      python_function:
        data: model.pkl
        env: conda.yaml
        loader_module: mlflow.sklearn
        python_version: 3.6.10
      sklearn:
        pickled_model: model.pkl
        serialization_format: cloudpickle
        sklearn_version: 0.19.1
    run_id: 23ff99150d52426ba7d083647ebf76fc
    utc_time_created: '2020-06-22 09:47:49.837802'
    
    # conda.yaml (environment dependencies)
    channels:
    - defaults
    - conda-forge
    dependencies:
    - python=3.6.10
    - scikit-learn=0.19.1
    - pip
    - pip:
      - mlflow
      - cloudpickle==1.4.1
    name: mlflow-env
    
    # model.pkl (the model artifact; how it is loaded is determined by flavors.python_function.loader_module in the MLmodel file)
    
  • conda.yaml

    name: tutorial
    channels:
      - defaults
    dependencies:
      - numpy>=1.14.3
      - pandas>=1.0.0
      - scikit-learn=0.19.1
      - pip:
        - mlflow
    
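  • The effect is similar to the above

    1. Create the virtual environment and install dependencies
    Creating conda environment mlflow-****
    
    2. Running command (activate the virtual environment, then run the training script)
    source /Users/liuxiang/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-**** 1>&2 && python train.py 0.4 0.1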
    

2.4 Example: managing multiple versions // TODO

3. Deploying a model

Example: --model-uri runs:/<run-id>/model

# Run the training code, which prints the mlflow run-id
$ python examples/sklearn_logistic_regression/train.py
Score: 0.666
Model saved in run <run-id>

# Start the mlflow REST service for the given model-uri
$ mlflow models serve --model-uri runs:/<run-id>/model
// 1. Create the virtual environment and install dependencies: Creating conda environment mlflow-****
// 2. Running command (activate the virtual environment, then start the REST service with gunicorn)
// INFO mlflow.pyfunc.backend: === Running command 'source /Users/liuxiang/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-255627e9113d091dfd14deabc838d1a982b4883f 1>&2 && gunicorn --timeout=60 -b 127.0.0.1:5000 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'

# Model prediction REST API
$ curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "columns": ["a", "b", "c"],
    "data": [[1, 2, 3], [4, 5, 6]]
}'
[1, 0]  // result
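
The curl call above can also be issued from Python with only the standard library. The sketch below builds the same split-oriented body and request; it assumes the MLflow 1.x scoring server used in this walkthrough, which accepts a top-level {"columns": ..., "data": ...} object (later MLflow versions, as far as I know, expect the body wrapped in a different key, so check the docs for your version). The helper names are hypothetical, and predict() is only defined here, not called, because it needs a running server.

```python
import json
import urllib.request
from typing import Any, List, Sequence


def build_split_payload(columns: Sequence[str], rows: Sequence[Sequence[Any]]) -> str:
    """Serialize rows in pandas 'split' orientation, checking row widths first."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"row {row!r} does not match columns {list(columns)!r}")
    return json.dumps({"columns": list(columns), "data": [list(r) for r in rows]})


def build_request(url: str, payload: str) -> urllib.request.Request:
    """Create the POST request that the curl example sends."""
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-split"},
        method="POST",
    )


def predict(url: str, columns: Sequence[str], rows: Sequence[Sequence[Any]],
            timeout: float = 10.0) -> List[Any]:
    """Send the request; requires `mlflow models serve` to be listening at url."""
    request = build_request(url, build_split_payload(columns, rows))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_split_payload(["a", "b", "c"], [[1, 2, 3], [4, 5, 6]])
request = build_request("http://127.0.0.1:5000/invocations", payload)
print(request.get_method(), request.full_url)
```

With the server from this section running, predict("http://127.0.0.1:5000/invocations", ["a", "b", "c"], [[1, 2, 3], [4, 5, 6]]) would return the parsed JSON response.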

Example: /Users/.../mlruns/0/7c1a0d5c42844dcdb8f5191146925174/artifacts/model

1. Save the model (artifact)

# Reference: https://github.com/mlflow/mlflow/tree/master/examples/sklearn_elasticnet_wine/train.py
mlflow.sklearn.log_model(lr, "model")

# Running `mlflow run examples/sklearn_elasticnet_wine` calls `mlflow.sklearn.log_model`, which writes the model package to
**/mlruns/0/7c1a0d5c42844dcdb8f5191146925174/artifacts/model (it contains the model metadata file MLmodel and the model file model.pkl)
- The `MLmodel` metadata file tells MLflow how to load the model
- The `model.pkl` file is the serialized, trained linear regression model
# MLmodel
artifact_path: model

flavors:
  python_function:
    data: model.pkl   # model artifact
    env: conda.yaml   # dependency description
    loader_module: mlflow.sklearn
    python_version: 3.6.10
    
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.19.1
    
run_id: 23ff99150d52426ba7d083647ebf76fc
utc_time_created: '2020-06-22 09:47:49.837802'
# conda.yaml
channels:
- defaults
- conda-forge
dependencies:
- python=3.6.10
- scikit-learn=0.19.1
- pip
- pip:
  - mlflow
  - cloudpickle==1.4.1
name: mlflow-env

2. Deploy as a REST service (one serve process per model)

$ mlflow models serve -m /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/9806fbb1d1944a8f9761356b76de42ad/artifacts/model -p 1234

2020/06/22 15:47:00 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
# Create the conda virtual environment
2020/06/22 15:47:05 INFO mlflow.projects: === Creating conda environment mlflow-e60aad19f9c07786c0e7c02eae8e55269edb64dd ===
WARNING: The conda.compat module is deprecated and will be removed in a future release.
WARNING: The conda.compat module is deprecated and will be removed in a future release.
Collecting package metadata: done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.6.11
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base -c defaults conda


Preparing transaction: done
Verifying transaction: done
Executing transaction: done

# Install the pip dependencies ("Using cached" means the pip cache was used)
Ran pip subprocess with arguments:
['/Users/liuxiang/anaconda3/envs/mlflow-e60aad19f9c07786c0e7c02eae8e55269edb64dd/bin/python', '-m', 'pip', 'install', '-r', '/Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/9806fbb1d1944a8f9761356b76de42ad/artifacts/model/condaenv.orjbg04k.requirements.txt']
Pip subprocess output:
Collecting mlflow
  Using cached mlflow-1.9.0-py3-none-any.whl (11.9 MB)
Processing /Users/liuxiang/Library/Caches/pip/wheels/69/38/7a/072b5863ca334d012821a287fd1d066cea33abdcda3ef2f878/querystring_parser-1.2.4-py3-none-any.whl
... 
Installing collected packages: pytz, six, python-dateutil, pandas, sqlparse, gorilla, querystring-parser, protobuf, chardet, urllib3, idna, requests, azure-core, oauthlib, requests-oauthlib, isodate, msrest, pycparser, cffi, cryptography, azure-storage-blob, click, smmap, gitdb, gitpython, websocket-client, docker, tabulate, databricks-cli, Werkzeug, MarkupSafe, Jinja2, itsdangerous, Flask, prometheus-client, prometheus-flask-exporter, pyyaml, gunicorn, sqlalchemy, cloudpickle, python-editor, Mako, alembic, entrypoints, mlflow
Successfully installed Flask-1.1.2 Jinja2-2.11.2 Mako-1.1.3 MarkupSafe-1.1.1 Werkzeug-1.0.1 alembic-1.4.2 azure-core-1.6.0 azure-storage-blob-12.3.2 cffi-1.14.0 chardet-3.0.4 click-7.1.2 cloudpickle-1.4.1 cryptography-2.9.2 databricks-cli-0.11.0 docker-4.2.1 entrypoints-0.3 gitdb-4.0.5 gitpython-3.1.3 gorilla-0.3.0 gunicorn-20.0.4 idna-2.9 isodate-0.6.0 itsdangerous-1.1.0 mlflow-1.9.0 msrest-0.6.16 oauthlib-3.1.0 pandas-1.0.5 prometheus-client-0.8.0 prometheus-flask-exporter-0.14.1 protobuf-3.12.2 pycparser-2.20 python-dateutil-2.8.1 python-editor-1.0.4 pytz-2020.1 pyyaml-5.3.1 querystring-parser-1.2.4 requests-2.24.0 requests-oauthlib-1.3.0 six-1.15.0 smmap-3.0.4 sqlalchemy-1.3.13 sqlparse-0.3.1 tabulate-0.8.7 urllib3-1.25.9 websocket-client-0.57.0

#
# To activate this environment, use
#
#     $ conda activate mlflow-e60aad19f9c07786c0e7c02eae8e55269edb64dd
#
# To deactivate an active environment, use
#
#     $ conda deactivate

# Activate the conda virtual environment and start the REST service with gunicorn
2020/06/22 15:49:20 INFO mlflow.pyfunc.backend: === Running command 'source /Users/liuxiang/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-e60aad19f9c07786c0e7c02eae8e55269edb64dd 1>&2 && gunicorn --timeout=60 -b 127.0.0.1:1234 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2020-06-22 15:49:22 +0800] [34419] [INFO] Starting gunicorn 20.0.4
[2020-06-22 15:49:22 +0800] [34419] [INFO] Listening at: http://127.0.0.1:1234 (34419)
[2020-06-22 15:49:22 +0800] [34419] [INFO] Using worker: sync
[2020-06-22 15:49:22 +0800] [34433] [INFO] Booting worker with pid: 34433
3. Prediction over REST (HTTP)
curl -X POST -H "Content-Type:application/json; format=pandas-split" \
	--data '{"columns":["alcohol", "chlorides", "citric acid", "density", "fixed acidity", "free sulfur dioxide", "pH", "residual sugar", "sulphates", "total sulfur dioxide", "volatile acidity"],"data":[[12.8, 0.029, 0.48, 0.98, 6.2, 29, 3.33, 1.2, 0.39, 75, 0.66]]}' \
	http://127.0.0.1:1234/invocations

Response: [3.943440394399964]

Source: "MLflow Series 1: MLflow getting-started tutorial (Python)", ZH奶酪, cnblogs (博客园)

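Example: --no-conda (do not use a conda virtual environment)

If conda is not wanted, make sure the environment you run in already has the required dependencies installed, then add --no-conda to the command.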

$ mlflow models serve -m /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/9806fbb1d1944a8f9761356b76de42ad/artifacts/model -p 1234 --no-conda

2020/06/22 16:06:09 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
# With --no-conda, no virtual environment is created; the current host environment is used directly
# The REST service starts immediately
2020/06/22 16:06:09 INFO mlflow.pyfunc.backend: === Running command 'gunicorn --timeout=60 -b 127.0.0.1:1234 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2020-06-22 16:06:09 +0800] [36346] [INFO] Starting gunicorn 20.0.4
[2020-06-22 16:06:09 +0800] [36346] [INFO] Listening at: http://127.0.0.1:1234 (36346)
[2020-06-22 16:06:09 +0800] [36346] [INFO] Using worker: sync
[2020-06-22 16:06:09 +0800] [36349] [INFO] Booting worker with pid: 36349
/Users/liuxiang/anaconda3/envs/mlflow_demo/lib/python3.7/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.linear_model.coordinate_descent module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.linear_model. Anything that cannot be imported from sklearn.linear_model is now part of the private API.
  warnings.warn(message, FutureWarning)
/Users/liuxiang/anaconda3/envs/mlflow_demo/lib/python3.7/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator ElasticNet from version 0.19.1 when using version 0.23.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
  • When a dependency is missing

    $ pip freeze |grep scikit-learn
    scikit-learn==0.23.1
    
    $ pip uninstall scikit-learn
    • Uninstall the dependency from the host, then check whether --no-conda installs it into the host environment (result: it does not; the server fails at startup)
    mlflow models serve -m /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/9806fbb1d1944a8f9761356b76de42ad/artifacts/model -p 1234 --no-conda
    2020/06/22 16:12:03 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
    2020/06/22 16:12:03 INFO mlflow.pyfunc.backend: === Running command 'gunicorn --timeout=60 -b 127.0.0.1:1234 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
    [2020-06-22 16:12:03 +0800] [36790] [INFO] Starting gunicorn 20.0.4
    [2020-06-22 16:12:03 +0800] [36790] [INFO] Listening at: http://127.0.0.1:1234 (36790)
    [2020-06-22 16:12:03 +0800] [36790] [INFO] Using worker: sync
    [2020-06-22 16:12:03 +0800] [36794] [INFO] Booting worker with pid: 36794
    [2020-06-22 16:12:05 +0800] [36794] [ERROR] Exception in worker process
    Traceback (most recent call last):
     ...
      File "/Users/liuxiang/anaconda3/envs/mlflow_demo/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 466, in load_model
        model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
      File "/Users/liuxiang/anaconda3/envs/mlflow_demo/lib/python3.7/site-packages/mlflow/sklearn.py", line 280, in _load_pyfunc
        return _load_model_from_local_file(path)
      File "/Users/liuxiang/anaconda3/envs/mlflow_demo/lib/python3.7/site-packages/mlflow/sklearn.py", line 271, in _load_model_from_local_file
        return pickle.load(f)
    ModuleNotFoundError: No module named 'sklearn'
    [2020-06-22 16:12:05 +0800] [36794] [INFO] Worker exiting (pid: 36794)
    [2020-06-22 16:12:05 +0800] [36790] [INFO] Shutting down: Master
    [2020-06-22 16:12:05 +0800] [36790] [INFO] Reason: Worker failed to boot.
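
Since --no-conda never installs anything, the failure above only surfaces once the gunicorn worker tries to unpickle the model. A small preflight check run before `mlflow models serve --no-conda` can report missing modules earlier. The sketch below is an assumption-laden helper, not part of MLflow: it takes importable module names (for example sklearn, not the pip name scikit-learn) that you list by hand from the model's conda.yaml.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found in the current interpreter."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Module names chosen by hand for the ElasticNet wine model; adjust per model.
REQUIRED = ["sklearn", "cloudpickle", "mlflow"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Install these before using --no-conda:", ", ".join(missing))
    else:
        print("Host environment has all required modules.")
```

This only checks presence, not versions, so it would not have caught the 0.19.1 versus 0.23.1 unpickling warning shown earlier.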

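Supported engine types (Built-In Model Flavors)

See: https://www.mlflow.org/docs/latest/models.html#fields-in-the-mlmodel-format (modelers should follow this document when writing the training script train.py, which calls mlflow.pyfunc.save_model to write a model package in MLflow format)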

Python Function (python_function)
R Function (crate)
H2O (h2o)
Keras (keras)
MLeap (mleap)
PyTorch (pytorch)
Scikit-learn (sklearn)
Spark MLlib (spark)
TensorFlow (tensorflow)
ONNX (onnx)
MXNet Gluon (gluon)
XGBoost (xgboost)
LightGBM (lightgbm)

Model format (pyfunc)

https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-filesystem-format

# Format
./dst-path/
    ./MLmodel: configuration
    <code>: code packaged with the model (specified in the MLmodel file)
    <data>: data packaged with the model (specified in the MLmodel file)
    <env>: Conda environment definition (specified in the MLmodel file)
    
# Example
├── MLmodel
├── code
│   └── sklearn_iris.py
├── data
│   └── model.pkl
└── mlflow_env.yml

### cat MLmodel
python_function:
  code: code
  data: data/model.pkl
  loader_module: mlflow.sklearn
  env: mlflow_env.yml
  main: sklearn_iris

Command-line prediction (mlflow models predict)

See: https://www.mlflow.org/docs/latest/cli.html#mlflow-models-predict

mlflow models predict [OPTIONS]
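
As an illustration of how the command-line path fits together, the sketch below writes a split-oriented JSON input file and assembles an argument list for `mlflow models predict`. The flags -m, -i, -o and -t are the ones I understand the CLI reference linked above to document; treat them as assumptions and confirm with `mlflow models predict --help`. The file names and helper functions are hypothetical, and nothing is executed against MLflow here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List, Sequence


def write_split_input(path: Path, columns: Sequence[str],
                      rows: Sequence[Sequence[Any]]) -> Path:
    """Write the input rows as a split-oriented JSON file."""
    body = {"columns": list(columns), "data": [list(r) for r in rows]}
    path.write_text(json.dumps(body), encoding="utf-8")
    return path


def build_predict_command(model_uri: str, input_path: Path, output_path: Path) -> List[str]:
    """Assemble the CLI call; pass the list to subprocess.run to execute it."""
    return ["mlflow", "models", "predict",
            "-m", model_uri,
            "-i", str(input_path),
            "-o", str(output_path),
            "-t", "json"]


work_dir = Path(tempfile.mkdtemp())
input_file = write_split_input(work_dir / "input.json",
                               ["a", "b", "c"], [[1, 2, 3], [4, 5, 6]])
command = build_predict_command("add_n_model", input_file, work_dir / "predictions.json")
print(" ".join(command))
```

Unlike `mlflow models serve`, this runs one batch and exits, which suits offline scoring.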

4. Custom Python models

https://mlflow.org/docs/latest/models.html#custom-python-models

  • pyfunc.py: saving the model
cat > pyfunc.py << 'EOF'
import mlflow.pyfunc
import pandas as pd

# Define the model class
class AddN(mlflow.pyfunc.PythonModel):

    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        # Add n to every column of the input DataFrame
        return model_input.apply(lambda column: column + self.n)

# Construct and save the model
model_path = "add_n_model"
add5_model = AddN(n=5)
mlflow.pyfunc.save_model(path=model_path, python_model=add5_model)  # writes the model directory add_n_model

# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(model_path)

# Evaluate the model
model_input = pd.DataFrame([range(10)])
model_output = loaded_model.predict(model_input)
assert model_output.equals(pd.DataFrame([range(5, 15)]))

EOF

# Generate the model (the script writes ./add_n_model), then deploy it
python pyfunc.py
mlflow models serve -m add_n_model -p 5000 --no-conda
  • Output model (add_n_model)
├── add_n_model
│   ├── MLmodel
│   ├── conda.yaml
│   └── python_model.pkl

# MLmodel
flavors:
  python_function:
    cloudpickle_version: 1.4.1
    env: conda.yaml
    loader_module: mlflow.pyfunc.model
    python_model: python_model.pkl
    python_version: 3.7.7
utc_time_created: '2020-06-24 09:19:20.559331'

# conda.yaml
channels:
- defaults
- conda-forge
dependencies:
- python=3.7.7
- pip
- pip:
  - mlflow
  - cloudpickle==1.4.1
name: mlflow-env
  • Test
# split-oriented
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "columns": ["a", "b", "c"],
    "data": [[1, 2, 3], [4, 5, 6]]
}'

# record-oriented (fine for vector rows, loses ordering for JSON records)
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json; format=pandas-records' -d '[
    {"a": 1,"b": 2,"c": 3},
    {"a": 4,"b": 5,"c": 6}
]'
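
The two request bodies above carry the same table in different orientations. A short sketch of converting the record-oriented form into the split-oriented one, which keeps the column order explicit, is shown below. records_to_split is a hypothetical helper; it relies on Python dicts preserving insertion order (Python 3.7+) and rejects records whose keys differ.

```python
from typing import Any, Dict, List


def records_to_split(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Convert [{"a": 1, ...}, ...] into {"columns": [...], "data": [[...], ...]}."""
    if not records:
        return {"columns": [], "data": []}
    columns = list(records[0].keys())  # column order taken from the first record
    data = []
    for index, record in enumerate(records):
        if set(record) != set(columns):
            raise ValueError(f"record {index} has different keys: {sorted(record)}")
        data.append([record[column] for column in columns])
    return {"columns": columns, "data": data}


records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5, "c": 6},
]
print(records_to_split(records))
```

Sending the split form avoids depending on how the server orders the keys of each JSON record.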

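Question: must a custom Python model always be exported as *.pkl through mlflow.pyfunc.save_model? > My current understanding is yes. For a customer's existing models to run in the mlflow service, the model's Python code has to be saved as *.pkl with mlflow.pyfunc.save_model.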

5. Deploying MLflow models to custom targets

https://mlflow.org/docs/latest/models.html#deployment-to-custom-targets

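In addition to the built-in deployment tools, MLflow provides a pluggable mlflow.deployments Python API and an mlflow deployments CLI for deploying models to custom targets and environments. To deploy to a custom target, you must first install the appropriate third-party Python plugin. See the list of known community-maintained plugins in the documentation linked above.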

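II. Client

Client capabilities

Can a model REST service be deployed through the API?
> Deploying a model depends on a conda virtual environment (or a fully prepared host environment), and the process is slow (1 to 3+ minutes). In addition, receiving requests itself needs a serving environment, and the environment that gets created is usually not the current service but a new model service. When model services are created repeatedly, controlling resources becomes a problem. At present mlflow does not support deploying a model service through its REST API.
> The mlflow service is essentially a service for training and comparing models (port 5000). It additionally supports deploying and running model services (port: <custom>).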

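How is a model service initialized?
> Container environment: declare `ENTRYPOINT ["./bin/start.sh"]` in the Dockerfile. The model to start is passed into the container through environment variables, and the model files are shared through a mounted volume.
> General cluster: mlflow does not support serving multiple models on one port. Using one port per model works normally. For now I consider it unsuitable for clustered deployment. (If the modelers supply the inference code, clustered deployment becomes feasible.)
> Wrapping with Seldon: seldon is similar to mlflow and adds model-deployment features. The deployment strategy still has to choose between the `container environment` and `general cluster` options.
> Model service command: mlflow models serve -m /Users/.../mlruns/0/9806fbb1d1944a8f9761356b76de42ad/artifacts/model -p 1234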
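
Because each `mlflow models serve` process hosts exactly one model on one port, a deployment wrapper has to keep a model-to-port mapping itself. The sketch below shows one way to do that: assign consecutive ports and build the argument list for each process. The model directory names, the starting port and both helper names are hypothetical; only flags that appear in this document (-m, -p, --host, --no-conda) are used.

```python
from typing import Dict, List, Sequence


def assign_ports(model_paths: Sequence[str], first_port: int = 1234) -> Dict[str, int]:
    """Give every model its own port, counting up from first_port."""
    return {path: first_port + offset for offset, path in enumerate(model_paths)}


def build_serve_command(model_path: str, port: int,
                        host: str = "127.0.0.1", use_conda: bool = True) -> List[str]:
    """Build the argv for one model server; hand it to subprocess.Popen to start it."""
    command = ["mlflow", "models", "serve",
               "-m", model_path, "-p", str(port), "--host", host]
    if not use_conda:
        command.append("--no-conda")  # rely on the host environment instead
    return command


# Hypothetical model directories.
ports = assign_ports(["models/wine", "models/add_n_model"])
for path, port in ports.items():
    print(" ".join(build_serve_command(path, port, use_conda=False)))
```

A reverse proxy in front of these ports could then route by model name, which is roughly what the container and Seldon options above provide out of the box.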

What API capabilities does MLflow provide?
> experiments; runs (model training); artifacts; registered models and model versions; tags; logging and history.

REST API

https://mlflow.org/docs/latest/rest-api.html

### Start the mlflow service
mlflow ui (run it next to the mlruns folder, usually the project root)

### Experiments
- Create experiment              2.0/mlflow/experiments/create
- List experiments               2.0/mlflow/experiments/list
- Get experiment                 2.0/mlflow/experiments/get
- Get experiment by name         2.0/mlflow/experiments/get-by-name
- Delete experiment              2.0/mlflow/experiments/delete
- Restore experiment             2.0/mlflow/experiments/restore
- Update experiment              2.0/mlflow/experiments/update
### Runs (model training)
- Create run                     2.0/mlflow/runs/create
- Delete run                     2.0/mlflow/runs/delete
- Restore run                    2.0/mlflow/runs/restore
- Get run                        2.0/mlflow/runs/get
- Log metric                     2.0/mlflow/runs/log-metric
- Log batch                      2.0/mlflow/runs/log-batch
- Log model                      2.0/mlflow/runs/log-model
- Search runs                    2.0/mlflow/runs/search
- Update run                     2.0/mlflow/runs/update
### Artifacts
- List artifacts                 2.0/mlflow/artifacts/list
### Registered models
- Create RegisteredModel         2.0/preview/mlflow/registered-models/create
- Get RegisteredModel            2.0/preview/mlflow/registered-models/get
- Rename RegisteredModel         2.0/preview/mlflow/registered-models/rename
- Update RegisteredModel         2.0/preview/mlflow/registered-models/update
- Delete RegisteredModel         2.0/preview/mlflow/registered-models/delete
- List RegisteredModels          2.0/preview/mlflow/registered-models/list
- Search RegisteredModels        2.0/preview/mlflow/registered-models/search
- Set registered model tag       2.0/preview/mlflow/registered-models/set-tag
- Get latest model versions      2.0/preview/mlflow/registered-models/get-latest-versions
### Model versions
- Create ModelVersion            2.0/preview/mlflow/model-versions/create
- Get ModelVersion               2.0/preview/mlflow/model-versions/get
- Update ModelVersion            2.0/preview/mlflow/model-versions/update
- Delete ModelVersion            2.0/preview/mlflow/model-versions/delete
- Search ModelVersions           2.0/preview/mlflow/model-versions/search
- Get download URI for ModelVersion artifacts 2.0/preview/mlflow/model-versions/get-download-uri
- Transition ModelVersion stage  2.0/preview/mlflow/model-versions/transition-stage
### Tags
- Set experiment tag             2.0/mlflow/experiments/set-experiment-tag
- Set run tag                    2.0/mlflow/runs/set-tag
- Delete run tag                 2.0/mlflow/runs/delete-tag
- Delete registered model tag    2.0/preview/mlflow/registered-models/delete-tag
- Set model version tag          2.0/preview/mlflow/model-versions/set-tag
- Delete model version tag       2.0/preview/mlflow/model-versions/delete-tag
### Logging and history
- Log param                      2.0/mlflow/runs/log-parameter
- Get metric history             2.0/mlflow/metrics/get-history
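
To show how the endpoint paths listed above are used, here is a minimal sketch with the standard library only. It assumes the tracking server started by `mlflow ui` listens on http://127.0.0.1:5000 and serves the paths under an /api/ prefix, as the REST API documentation linked above describes. The experiment name and run id are placeholders, the helper names are hypothetical, and send() is defined but not called because it needs a live server.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://127.0.0.1:5000"  # assumed address of the tracking server


def api_url(base_url: str, endpoint: str, query: Optional[Dict[str, str]] = None) -> str:
    """Join the server address, the /api/ prefix and an endpoint such as 2.0/mlflow/runs/get."""
    url = base_url.rstrip("/") + "/api/" + endpoint.lstrip("/")
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return url


def post_request(base_url: str, endpoint: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST, used by create/log/search style endpoints."""
    return urllib.request.Request(
        api_url(base_url, endpoint),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_request(base_url: str, endpoint: str, query: Dict[str, str]) -> urllib.request.Request:
    """Build a GET with query parameters, used by get/list style endpoints."""
    return urllib.request.Request(api_url(base_url, endpoint, query), method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> Dict[str, Any]:
    """Execute a request against a running server and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


create_experiment = post_request(BASE_URL, "2.0/mlflow/experiments/create", {"name": "wine-demo"})
get_run = get_request(BASE_URL, "2.0/mlflow/runs/get", {"run_id": "0123abcd"})  # placeholder id
print(create_experiment.full_url)
print(get_run.full_url)
```

With a server running, send(create_experiment) would return the new experiment's id as JSON.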

Java API (same capabilities as the REST API)

https://mlflow.org/docs/latest/java_api/index.html

III. Docker

Training a model in a Docker environment (in place of conda.yaml)

See: https://github.com/mlflow/mlflow/tree/master/examples/docker https://mlflow.org/docs/latest/projects.html#run-an-mlflow-project-on-kubernetes-experimental

  • build.sh: use a container image to prepare the model's runtime environment (equivalent to conda.yaml)

    docker build -t mlflow-docker-example -f Dockerfile .
  • Dockerfile

    FROM continuumio/miniconda:4.5.4
    
    RUN pip install "mlflow>=1.0" \
        && pip install azure-storage-blob==12.3.0 \
        && pip install numpy==1.14.3 \
        && pip install scipy \
        && pip install pandas==0.22.0 \
        && pip install scikit-learn==0.19.1 \
        && pip install cloudpickle
  • MLproject

    name: docker-example
    
    docker_env:
      image:  mlflow-docker-example  # the environment is declared as image: ** rather than conda.yaml
    
    entry_points:
      main:
        parameters:
          alpha: float
          l1_ratio: {type: float, default: 0.1}
        command: "python train.py --alpha {alpha} --l1-ratio {l1_ratio}"
    • kubernetes_config.json

      {
          "kube-context": "docker-for-desktop",
          "kube-job-template-path": "examples/docker/kubernetes_job_template.yaml",
          "repository-uri": "username/mlflow-kubernetes-example"
      }
    • kubernetes_job_template.yaml

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: "{replaced with MLflow Project name}"
        namespace: mlflow
      spec:
        ttlSecondsAfterFinished: 100
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: "{replaced with MLflow Project name}"
              image: "{replaced with URI of Docker image created during Project execution}"
              command: ["{replaced with MLflow Project entry point command}"]
            resources:
              limits:
                memory: 512Mi
              requests:
                memory: 256Mi
            restartPolicy: Never
  • Train the model (MLflow runs docker on the fly) and write the model package

    $ mlflow run examples/docker -P alpha=0.5
    
    # Build the docker image
    2020/06/22 16:45:00 INFO mlflow.projects: === Building docker image docker-example ===
    2020/06/22 16:45:24 INFO mlflow.projects: === Created directory /var/folders/0q/89pc1xbn38bcg4zff6b3qnpc0000gn/T/tmptuxdfz2j for downloading remote URIs passed to arguments of type 'path' ===
    
    # Run docker
    2020/06/22 16:45:24 INFO mlflow.projects: === Running command 'docker run --rm -v /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns:/mlflow/tmp/mlruns -v /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/7b2a60fb34fc4244b72200f53fd1ed22/artifacts:/Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/7b2a60fb34fc4244b72200f53fd1ed22/artifacts -e MLFLOW_RUN_ID=7b2a60fb34fc4244b72200f53fd1ed22 -e MLFLOW_TRACKING_URI=file:///mlflow/tmp/mlruns -e MLFLOW_EXPERIMENT_ID=0 docker-example:latest python train.py --alpha 0.5 --l1-ratio 0.1' in run with ID '7b2a60fb34fc4244b72200f53fd1ed22' === 
    /opt/conda/lib/python2.7/site-packages/mlflow/__init__.py:55: DeprecationWarning: MLflow support for Python 2 is deprecated and will be dropped in a future release. At that point, existing Python 2 workflows that use MLflow will continue to work without modification, but Python 2 users will no longer get access to the latest MLflow features and bugfixes. We recommend that you upgrade to Python 3 - see https://docs.python.org/3/howto/pyporting.html for a migration guide.
      "for a migration guide.", DeprecationWarning)
    Elasticnet model (alpha=0.500000, l1_ratio=0.100000):
      RMSE: 0.794793101903653
      MAE: 0.6189130834228139
      R2: 0.18411668718221796
    2020/06/22 16:45:56 INFO mlflow.projects: === Run (ID '7b2a60fb34fc4244b72200f53fd1ed22') succeeded ===

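    Running mlflow run examples/docker builds a new Docker image based on the mlflow-docker-example image; the new image also contains our project code.

    The resulting image is tagged mlflow-docker-example-<git-version>, where <git-version> is the git commit ID. Once the image is built, MLflow uses docker run to execute the default (main) project entry point inside the container.

    • Generated model package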

      /Users/***/mlruns/0/7b2a60fb34fc4244b72200f53fd1ed22
      
      ├── artifacts
      │   └── model # model package
      │       ├── MLmodel
      │       ├── conda.yaml
      │       └── model.pkl
      ├── meta.yaml
      ├── metrics
      │   ├── mae
      │   ├── r2
      │   └── rmse
      ├── params
      │   ├── alpha
      │   └── l1_ratio
      └── tags
          ├── mlflow.docker.image.id
          ├── mlflow.docker.image.uri
          ├── mlflow.log-model.history
          ├── mlflow.project.backend
          ├── mlflow.project.entryPoint
          ├── mlflow.project.env
          ├── mlflow.source.name
          ├── mlflow.source.type
          └── mlflow.user
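  • Deploying the model service (no container is started on the fly) (there should be options for running on k8s according to kubernetes_config.json)

    mlflow models serve -m /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/7b2a60fb34fc4244b72200f53fd1ed22/artifacts/model -p 1234 
    > Unlike mlflow run, serve does not create a container environment on the fly; it builds the environment from conda.yaml instead. With --no-conda it likewise uses the host environment. In neither case is a container environment built.
    
    Log: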
    mlflow models serve -m /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/7b2a60fb34fc4244b72200f53fd1ed22/artifacts/model -p 1234           
    2020/06/22 16:56:17 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
    2020/06/22 16:56:21 INFO mlflow.projects: === Creating conda environment mlflow-e7388c79eb8fda88c4d7418cb68daab65b2e0839 ===
    WARNING: The conda.compat module is deprecated and will be removed in a future release.
    WARNING: The conda.compat module is deprecated and will be removed in a future release.
    Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
    Collecting package metadata: done
    Solving environment: done
    • Prediction

      curl -X POST -H "Content-Type:application/json; format=pandas-split" \
      	--data '{"columns":["alcohol", "chlorides", "citric acid", "density", "fixed acidity", "free sulfur dioxide", "pH", "residual sugar", "sulphates", "total sulfur dioxide", "volatile acidity"],"data":[[12.8, 0.029, 0.48, 0.98, 6.2, 29, 3.33, 1.2, 0.39, 75, 0.66]]}' \
      	http://127.0.0.1:1234/invocations
      
      Response: [4.203048595214442]
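
The docker run line in the training log earlier in this section shows how the run is wired to the host: the mlruns directory is mounted at /mlflow/tmp/mlruns, the run's artifacts directory is mounted at the same path as on the host, and the run is identified through MLFLOW_* environment variables. The sketch below only reassembles that logged command from its parts to make the structure visible; it is a reconstruction from the log, not MLflow's code, and the host path and function name are hypothetical.

```python
import shlex
from typing import List

CONTAINER_MLRUNS = "/mlflow/tmp/mlruns"  # mount point seen in the log


def build_docker_run(image: str, run_id: str, experiment_id: str,
                     host_mlruns: str, entry_command: str) -> List[str]:
    """Rebuild the argv of the logged `docker run` call from its components."""
    artifacts = f"{host_mlruns}/{experiment_id}/{run_id}/artifacts"
    return [
        "docker", "run", "--rm",
        "-v", f"{host_mlruns}:{CONTAINER_MLRUNS}",   # tracking data
        "-v", f"{artifacts}:{artifacts}",            # artifacts keep their host path
        "-e", f"MLFLOW_RUN_ID={run_id}",
        "-e", f"MLFLOW_TRACKING_URI=file://{CONTAINER_MLRUNS}",
        "-e", f"MLFLOW_EXPERIMENT_ID={experiment_id}",
        image,
    ] + shlex.split(entry_command)


argv = build_docker_run(
    image="docker-example:latest",
    run_id="7b2a60fb34fc4244b72200f53fd1ed22",
    experiment_id="0",
    host_mlruns="/home/user/mlflow_demo/mlruns",  # hypothetical host path
    entry_command="python train.py --alpha 0.5 --l1-ratio 0.1",
)
print(" ".join(argv))
```

Because the tracking URI points at the mounted directory, metrics and the model package written inside the container appear under the host's mlruns tree, as in the listing above.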

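Deploying to k8s (not tested successfully)

mlflow models serve -m /Users/liuxiang/PycharmProjects/mlflow_demo/mlruns/0/7b2a60fb34fc4244b72200f53fd1ed22/artifacts/model -p 1234 
> Unlike mlflow run, serve does not create a container environment on the fly; it builds the environment from conda.yaml instead. With --no-conda it likewise uses the host environment. In neither case is a container environment built.
  • Recommendation: integrate with the k8s API directly and start the container with the model files mounted as a volume.
# Deploy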
docker run \
  -dp 8088:8088 \
  -v /Users/liuxiang/models:/home/admin/src \
  -w="/home/admin/src" \
  -e modelPath="sklearn_elasticnet_wine" \
  registry.tongdun.me/ml/centos7.2-common-anaconda3-mlflow:latest \
  /bin/bash -c "mlflow models serve -m \$modelPath -p 8088 --no-conda --host 0.0.0.0"

# Test
curl -X POST -H "Content-Type:application/json; format=pandas-split" \
  --data '{"columns":["alcohol", "chlorides", "citric acid", "density", "fixed acidity", "free sulfur dioxide", "pH", "residual sugar", "sulphates", "total sulfur dioxide", "volatile acidity"],"data":[[12.8, 0.029, 0.48, 0.98, 6.2, 29, 3.33, 1.2, 0.39, 75, 0.66]]}' \
  http://0.0.0.0:8088/invocations

build-docker (build the model into an image, i.e. containerized model deployment) (experimental feature)

https://mlflow.org/docs/latest/cli.html#mlflow-models-build-docker

# Build the model (some-run-uuid) into an image
mlflow models build-docker -m "runs:/some-run-uuid/my-model" -n "my-image-name"

# Start the container (serves the model)
docker run -p 5001:8080 "my-image-name"
docker run -p 5001:8080 -e DISABLE_NGINX=true "my-image-name"

dzinsouhpe/mlflow container environment

Reference: https://hub.docker.com/r/dzinsouhpe/mlflow/dockerfile Purpose: a container environment for mlflow that can be used for model training (MLProject)

  • Dockerfile

    FROM bluedata/centos7:latest
    LABEL maintainer="[email protected]"
    
    RUN yum install -y epel-release
    RUN yum install -y htop vim wget net-tools git unzip zip python3 libgfortran libgomp
    RUN yum group install -y "Development Tools"
    
    RUN pip3 install mlflow
    RUN pip3 install boto3
    RUN useradd mlflow
    RUN mkdir /opt/mlflow
    RUN mkdir /opt/mlflow/backend-store
    RUN mkdir /opt/mlflow/log
    RUN mkdir /opt/mlflow/bin
    COPY start.sh /opt/mlflow/bin/start.sh
    RUN chmod +x /opt/mlflow/bin/start.sh
    RUN chown -R mlflow:mlflow /opt/mlflow
    
    WORKDIR /opt/mlflow/
    USER mlflow
    
    EXPOSE 5000
    
    ENTRYPOINT ["./bin/start.sh"]
    • start.sh (mlflow server is similar to mlflow ui)
    #!/bin/bash
    export LC_ALL="en_US.UTF-8"
    mlflow server --backend-store-uri $MLFLOW_BACKEND_STORE --default-artifact-root $MLFLOW_ARTIFACT_ROOT --host 0.0.0.0

