Databento data catalog
Tutorial for NautilusTrader, a high-performance algorithmic trading platform and event-driven backtester.
info
We are currently working on this article.
Overview
This tutorial will walk through how to set up a Nautilus Parquet data catalog with various Databento schemas.
Prerequisites
- Python 3.10+ installed
- JupyterLab or similar installed (pip install -U jupyterlab)
- NautilusTrader latest release installed (pip install -U nautilus_trader)
- databento Python client library installed to make data requests (pip install -U databento)
- Databento account
Requesting data
We'll use a Databento historical client for the rest of this tutorial. You can either initialize one by passing your Databento API key to the constructor, or implicitly use the DATABENTO_API_KEY environment variable (as shown).
import databento as db
client = db.Historical()  # This will use the DATABENTO_API_KEY environment variable (recommended best practice)
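If you prefer to pass the key explicitly rather than rely on the environment variable, the constructor accepts it directly; the key below is a placeholder, not a real credential:

client = db.Historical(key="db-XXXXXXXXXXXXXXXX")  # Placeholder key - substitute your own Databento API key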
It's important to note that every historical streaming request from timeseries.get_range will incur a cost (even for the same data), therefore we need to:
- Know and understand the cost prior to making a request.
- Not make requests for the same data more than once (not efficient).
- Persist the responses to disk by writing zstd compressed DBN files (so that we don't have to request again).
We can use a metadata get_cost endpoint from the Databento API to get a quote on the cost, prior to each request. Each request sequence will first request the cost of the data, and then make a request only if the data doesn't already exist on disk.
Note the response returned is in USD, displayed as fractional cents.
The following request is only for a small amount of data (as used in the Medium article Building high-frequency trading signals in Python with Databento and sklearn), just to demonstrate the basic workflow.
from pathlib import Path
from databento import DBNStore
We'll prepare a directory for the raw Databento DBN format data, which we'll use for the rest of the tutorial.
DATABENTO_DATA_DIR = Path("databento")
DATABENTO_DATA_DIR.mkdir(exist_ok=True)
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="GLBX.MDP3",
    symbols=["ES.n.0"],
    stype_in="continuous",
    schema="mbp-10",
    start="2023-12-06T14:30:00",
    end="2023-12-06T20:30:00",
)
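If you want to guard against unexpectedly large requests, you can compare the quote against a budget before streaming anything. The following is a minimal sketch; the MAX_COST_USD threshold is an arbitrary value chosen for illustration, and metadata.get_cost returns the quote in USD as a float.

MAX_COST_USD = 5.0  # Arbitrary illustrative budget

cost = client.metadata.get_cost(
    dataset="GLBX.MDP3",
    symbols=["ES.n.0"],
    stype_in="continuous",
    schema="mbp-10",
    start="2023-12-06T14:30:00",
    end="2023-12-06T20:30:00",
)
if cost > MAX_COST_USD:
    raise RuntimeError(f"Quoted cost {cost} USD exceeds budget {MAX_COST_USD} USD")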
Use the historical API to request the data used in the Medium article.
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"
if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="GLBX.MDP3",
        symbols=["ES.n.0"],
        stype_in="continuous",
        schema="mbp-10",
        start="2023-12-06T14:30:00",
        end="2023-12-06T20:30:00",
        path=path,  # <-- Passing a `path` parameter will ensure the data is written to disk
    )
Inspect the data by reading it from disk and converting it to a pandas.DataFrame.
data = DBNStore.from_file(path)
df = data.to_df()
df
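A couple of quick checks on the frame can confirm the request covered the expected session; this is plain pandas, nothing Databento-specific:

print(len(df))                           # Number of MBP-10 updates in the file
print(df.index.min(), df.index.max())    # Time range covered by the data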
Write to data catalog
import shutil
from pathlib import Path
from nautilus_trader.adapters.databento.loaders import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog
CATALOG_PATH = Path.cwd() / "catalog"
# Clear if it already exists
if CATALOG_PATH.exists():
    shutil.rmtree(CATALOG_PATH)
CATALOG_PATH.mkdir()
# Create a catalog instance
catalog = ParquetDataCatalog(CATALOG_PATH)
Now that we've prepared the data catalog, we need a DatabentoDataLoader which we'll use to decode and load the data into Nautilus objects.
loader = DatabentoDataLoader()
Next, we'll load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient) by setting as_legacy_cython=False.
We also pass an instrument_id; this is not required, but it makes data loading faster because symbology mapping can be skipped.
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"
instrument_id = InstrumentId.from_str("ES.n.0")  # This should be the raw symbol (update)
depth10 = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
# Write data to catalog (this takes ~20 seconds or ~250,000/second for writing MBP-10 at the moment)
catalog.write_data(depth10)
# Test reading from catalog
depths = catalog.order_book_depth10()
len(depths)
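As a further sanity check, you can inspect the first depth update returned from the catalog; instrument_id and ts_event are standard attributes on Nautilus data objects:

first = depths[0]
print(first.instrument_id, first.ts_event)  # Instrument and event timestamp (UNIX nanoseconds)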
Preparing a month of AAPL trades
Now we'll expand on this workflow by preparing a month of AAPL trades on the Nasdaq exchange using the Databento trade schema, which will translate to Nautilus TradeTick objects.
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="XNAS.ITCH",
    symbols=["AAPL"],
    schema="trades",
    start="2024-01",
)
When requesting historical data with the Databento Historical data client, ensure you pass a path parameter to write the data to disk.
path = DATABENTO_DATA_DIR / "aapl-xnas-202401.trades.dbn.zst"
if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="XNAS.ITCH",
        symbols=["AAPL"],
        schema="trades",
        start="2024-01",
        path=path,  # <-- Passing a `path` parameter
    )
Inspect the data by reading it from disk and converting it to a pandas.DataFrame.
data = DBNStore.from_file(path)
df = data.to_df()
df
We'll use an InstrumentId of "AAPL.XNAS", where XNAS is the ISO 10383 MIC (Market Identifier Code) for the Nasdaq venue.
While passing an instrument_id to the loader isn't strictly necessary, it speeds up data loading by eliminating the need for symbology mapping. Additionally, setting the as_legacy_cython option to False further optimizes the process since we'll be writing the loaded data to the catalog. Although we could use legacy Cython objects, this method is more efficient for loading.
instrument_id = InstrumentId.from_str("AAPL.XNAS")
trades = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
Here we'll organize our data as a file per month; this is an arbitrary choice, as a file per day could be just as valid.
It may also be a good idea to create a function which can return the correct basename_template value for a given chunk of data.
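A minimal sketch of such a helper, assuming the chunk is a list of Nautilus data objects that expose ts_event as UNIX-epoch nanoseconds (the monthly_basename name is illustrative, not part of the NautilusTrader API):

import pandas as pd

def monthly_basename(chunk) -> str:
    # Derive a "YYYY-MM" basename from the first event timestamp in the chunk
    first_ts = pd.Timestamp(chunk[0].ts_event, unit="ns", tz="UTC")
    return first_ts.strftime("%Y-%m")

# Usage: catalog.write_data(trades, basename_template=monthly_basename(trades))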
# Write data to catalog
catalog.write_data(trades, basename_template="2024-01")
trades = catalog.trade_ticks([instrument_id])
len(trades)
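As a final check, you can look at the first trade returned from the catalog; the attributes below are standard on Nautilus TradeTick objects:

first = trades[0]
print(first.instrument_id, first.price, first.size, first.aggressor_side, first.ts_event)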