# Databento data catalog
Tutorial for NautilusTrader, a high-performance algorithmic trading platform and event-driven backtester.
View source on GitHub.
> **info**
> We are currently working on this article.
## Overview
This tutorial will walk through how to set up a Nautilus Parquet data catalog with various Databento schemas.
## Prerequisites
- Python 3.10+ installed
- JupyterLab or similar installed (`pip install -U jupyterlab`)
- NautilusTrader latest release installed (`pip install -U nautilus_trader`)
- databento Python client library installed to make data requests (`pip install -U databento`)
- Databento account
## Requesting data
We'll use a Databento historical client for the rest of this tutorial. You can either initialize one by passing your Databento API key to the constructor, or implicitly use the `DATABENTO_API_KEY` environment variable (as shown).
```python
import databento as db

client = db.Historical()  # This will use the DATABENTO_API_KEY environment variable (recommended best practice)
```
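If you prefer not to rely on the environment variable, the key can also be passed explicitly to the constructor (the key below is a placeholder, not a real credential):

```python
# Alternatively, pass the API key explicitly (placeholder value shown;
# avoid hard-coding real keys in source control)
client = db.Historical(key="db-XXXXXXXXXXXXXXXXXXXXXXXXX")
```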
It's important to note that every historical streaming request from `timeseries.get_range` will incur a cost (even for the same data), therefore we need to:

- Know and understand the cost prior to making a request.
- Not make requests for the same data more than once (not efficient).
- Persist the responses to disk by writing zstd compressed DBN files (so that we don't have to request again).
We can use the metadata `get_cost` endpoint from the Databento API to get a quote on the cost prior to each request. Each request sequence will first request the cost of the data, and then make a request only if the data doesn't already exist on disk.

Note the response returned is in USD, displayed as fractional cents.
The following request is only for a small amount of data (as used in the Medium article *Building high-frequency trading signals in Python with Databento and sklearn*), just to demonstrate the basic workflow.
```python
from pathlib import Path

from databento import DBNStore
```
We'll prepare a directory for the raw Databento DBN format data, which we'll use for the rest of the tutorial.
```python
DATABENTO_DATA_DIR = Path("databento")
DATABENTO_DATA_DIR.mkdir(exist_ok=True)
```
```python
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="GLBX.MDP3",
    symbols=["ES.n.0"],
    stype_in="continuous",
    schema="mbp-10",
    start="2023-12-06T14:30:00",
    end="2023-12-06T20:30:00",
)
```
Use the historical API to request the data used in the Medium article.
```python
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="GLBX.MDP3",
        symbols=["ES.n.0"],
        stype_in="continuous",
        schema="mbp-10",
        start="2023-12-06T14:30:00",
        end="2023-12-06T20:30:00",
        path=path,  # <-- Passing a `path` parameter will ensure the data is written to disk
    )
```
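Since the same quote-then-download sequence repeats for every dataset in this tutorial, it can be wrapped in a small helper. The function below is a minimal sketch, not part of either API; the helper name and the `max_cost_usd` guard are illustrative assumptions:

```python
def request_if_absent(client, path, max_cost_usd=5.0, **request_params):
    """Quote the cost, then stream to `path` only if the file is absent and affordable."""
    if path.exists():
        return  # Already persisted to disk; don't pay for the same data twice
    cost = client.metadata.get_cost(**request_params)
    if cost > max_cost_usd:
        raise RuntimeError(f"Request would cost {cost} USD, above the {max_cost_usd} USD limit")
    client.timeseries.get_range(**request_params, path=path)
```

The request above could then be expressed as `request_if_absent(client, path, dataset="GLBX.MDP3", symbols=["ES.n.0"], stype_in="continuous", schema="mbp-10", start="2023-12-06T14:30:00", end="2023-12-06T20:30:00")`.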
Inspect the data by reading from disk and converting to a `pandas.DataFrame`.
```python
data = DBNStore.from_file(path)

df = data.to_df()
df
```
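A couple of quick checks on the frame will confirm you received the expected slice (standard pandas; Databento's `to_df` indexes the frame by timestamp by default):

```python
# Sanity-check the downloaded slice with standard pandas
print(len(df))                         # Number of MBP-10 updates
print(df.index.min(), df.index.max())  # Should span the requested time range
```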
## Write to data catalog
```python
import shutil
from pathlib import Path

from nautilus_trader.adapters.databento.loaders import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog
```
```python
CATALOG_PATH = Path.cwd() / "catalog"

# Clear if it already exists
if CATALOG_PATH.exists():
    shutil.rmtree(CATALOG_PATH)
CATALOG_PATH.mkdir()

# Create a catalog instance
catalog = ParquetDataCatalog(CATALOG_PATH)
```
Now that we've prepared the data catalog, we need a `DatabentoDataLoader` which we'll use to decode and load the data into Nautilus objects.
```python
loader = DatabentoDataLoader()
```
Next, we'll load Rust pyo3 objects to write to the catalog (we could use legacy Cython objects, but this is slightly more efficient) by setting `as_legacy_cython=False`.

We also pass an `instrument_id`, which is not required but makes data loading faster as symbology mapping is not required.
```python
path = DATABENTO_DATA_DIR / "es-front-glbx-mbp10.dbn.zst"
instrument_id = InstrumentId.from_str("ES.n.0")  # This should be the raw symbol (update)

depth10 = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
```
```python
# Write data to catalog (this takes ~20 seconds or ~250,000/second for writing MBP-10 at the moment)
catalog.write_data(depth10)
```
```python
# Test reading from catalog
depths = catalog.order_book_depth10()
len(depths)
```
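As a further sanity check, you can inspect one of the decoded records. The attribute names below are assumptions based on the common Nautilus data object API and may vary slightly between versions:

```python
# Inspect the first decoded record (attribute names assumed from the
# common Nautilus data object API; verify against your installed version)
first = depths[0]
print(first.instrument_id)
print(first.ts_event)  # Event timestamp in UNIX nanoseconds
```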
## Preparing a month of AAPL trades
Now we'll expand on this workflow by preparing a month of AAPL trades on the Nasdaq exchange using the Databento trade schema, which will translate to Nautilus `TradeTick` objects.
```python
# Request cost quote (USD) - this endpoint is 'free'
client.metadata.get_cost(
    dataset="XNAS.ITCH",
    symbols=["AAPL"],
    schema="trades",
    start="2024-01",
)
```
When requesting historical data with the Databento Historical data client, ensure you pass a `path` parameter to write the data to disk.
```python
path = DATABENTO_DATA_DIR / "aapl-xnas-202401.trades.dbn.zst"

if not path.exists():
    # Request data
    client.timeseries.get_range(
        dataset="XNAS.ITCH",
        symbols=["AAPL"],
        schema="trades",
        start="2024-01",
        path=path,  # <-- Passing a `path` parameter
    )
```
Inspect the data by reading from disk and converting to a `pandas.DataFrame`.
```python
data = DBNStore.from_file(path)

df = data.to_df()
df
```
We'll use an `InstrumentId` of `"AAPL.XNAS"`, where XNAS is the ISO 10383 MIC (Market Identifier Code) for the Nasdaq venue.
While passing an `instrument_id` to the loader isn't strictly necessary, it speeds up data loading by eliminating the need for symbology mapping. Additionally, setting the `as_legacy_cython` option to `False` further optimizes the process since we'll be writing the loaded data to the catalog. Although we could use legacy Cython objects, this method is more efficient for loading.
```python
instrument_id = InstrumentId.from_str("AAPL.XNAS")

trades = loader.from_dbn_file(
    path=path,
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
```
Here we'll organize our data as a file per month. This is an arbitrary choice; a file per day could be just as valid.
It may also be a good idea to create a function which can return the correct `basename_template` value for a given chunk of data, as sketched below.
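A helper along these lines could derive a monthly template from the first record in a chunk. This is a sketch assuming each record exposes a `ts_event` UNIX-nanosecond timestamp, as Nautilus data objects do; the function name is illustrative:

```python
import pandas as pd


def monthly_basename_template(chunk) -> str:
    """Return a 'YYYY-MM' basename for a chunk of Nautilus data objects."""
    # Nautilus data objects carry `ts_event` as UNIX nanoseconds
    first_ts = pd.Timestamp(chunk[0].ts_event, unit="ns", tz="UTC")
    return first_ts.strftime("%Y-%m")
```

With such a helper, the write below could become `catalog.write_data(trades, basename_template=monthly_basename_template(trades))`.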
```python
# Write data to catalog
catalog.write_data(trades, basename_template="2024-01")
```
```python
trades = catalog.trade_ticks([instrument_id])
len(trades)
```
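Queries can also be narrowed to a time window rather than reading the whole month back. The `start`/`end` parameters below are assumed from the standard `ParquetDataCatalog` query API; verify against your installed version:

```python
# Read back only a sub-range of the month (parameter names assumed from
# the ParquetDataCatalog query API)
subset = catalog.trade_ticks(
    instrument_ids=[instrument_id],
    start="2024-01-02",
    end="2024-01-03",
)
len(subset)
```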