components llm_rag_crawl_url - Azure/azureml-assets GitHub Wiki
Crawls the given URL and nested links to `max_crawl_depth`. Data is stored to `output_path`.
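The depth-limited behaviour described above can be illustrated with a minimal sketch. This is not the component's implementation, which is not shown on this page. It is a plain breadth-first walk over a hypothetical in-memory link graph (`LINKS`), with a hypothetical `crawl` function whose parameter names mirror the component's inputs. The time, size, and redirect limits are not modelled here.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages, so the
# sketch runs without network access.
LINKS = {
    "https://example.com/": ["https://example.com/a", "http://example.com/b"],
    "https://example.com/a": ["https://example.com/a/deep"],
    "http://example.com/b": [],
    "https://example.com/a/deep": [],
}


def crawl(url, max_crawl_depth=1, max_files=1000, support_http=False):
    """Breadth-first walk from `url`, returning visited URLs in order."""
    visited = []
    seen = {url}
    queue = deque([(url, 0)])  # (url, depth); the start URL is depth 0
    while queue and len(visited) < max_files:
        current, depth = queue.popleft()
        if current.startswith("http://") and not support_http:
            continue  # plain-http links are skipped unless enabled
        visited.append(current)
        if depth >= max_crawl_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(current, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Under these assumptions, a depth of 0 visits only the starting URL, matching the input description "0 doesn't crawl any nested links".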
Version: 0.0.32
Preview
View in Studio: https://ml.azure.com/registries/azureml/components/llm_rag_crawl_url/version/0.0.32
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
url | URL to crawl. | string | | False | |
max_crawl_depth | Maximum depth to crawl. 0 doesn't crawl any nested links. | integer | 1 | True | |
max_crawl_time | Maximum time in seconds to crawl. | integer | 60 | True | |
max_download_time | Maximum time in seconds to wait for a page to download. | integer | 15 | True | |
max_file_size | Maximum file size in bytes to download. | integer | 5000000 | True | |
max_redirects | Maximum number of redirects to follow. | integer | 3 | True | |
max_files | Maximum number of files to download. | integer | 1000 | True | |
support_http | Whether to support crawling http links. | boolean | False | True | |
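As a quick reference, the defaults in the table above can be collected into a plain mapping. The sketch below is only an illustration: `DEFAULTS` and `build_inputs` are hypothetical names, not part of the component or of any Azure ML SDK. It fills in the documented defaults, requires `url` (the only non-optional input), and rejects names that do not appear in the table.

```python
# Defaults copied from the inputs table; `url` has no default and is required.
DEFAULTS = {
    "max_crawl_depth": 1,
    "max_crawl_time": 60,        # seconds
    "max_download_time": 15,     # seconds
    "max_file_size": 5000000,    # bytes
    "max_redirects": 3,
    "max_files": 1000,
    "support_http": False,
}


def build_inputs(url, **overrides):
    """Return a full input mapping, rejecting names not in the table."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")
    if not url:
        raise ValueError("url is required")
    # Later entries win, so overrides replace the documented defaults.
    return {"url": url, **DEFAULTS, **overrides}
```

For example, `build_inputs("https://example.com", max_crawl_depth=0)` keeps every other default and disables following nested links.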
Name | Description | Type |
---|---|---|
output_path | Where to save crawled data. | uri_folder |
Environment: azureml:llm-rag-embeddings@latest