components llm_rag_crawl_url - Azure/azureml-assets GitHub Wiki

LLM - Crawl URL to Retrieve Data

llm_rag_crawl_url

Overview

Crawls the given URL and its nested links up to a depth of max_crawl_depth. The crawled data is stored in output_path.
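To make the depth limit concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the component's implementation, which this page does not show. The function `crawl_order`, the `links` map, and the `SITE` data are hypothetical, and an in-memory dictionary stands in for real page downloads. Only the parameter names `max_crawl_depth` and `max_files` come from the inputs of this component.

```python
from collections import deque


def crawl_order(start_url, links, max_crawl_depth=1, max_files=1000):
    """Return the URLs a breadth-first crawl would visit, in order.

    `links` is a hypothetical in-memory map of URL -> list of linked URLs,
    used here in place of real HTTP requests.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth from the start URL)
    while queue and len(visited) < max_files:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_crawl_depth:
            continue  # depth budget used up: do not follow this page's links
        for linked in links.get(url, []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return visited


# Hypothetical site structure used only for illustration.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/b": ["https://example.com/"],
}

only_root = crawl_order("https://example.com/", SITE, max_crawl_depth=0)
one_level = crawl_order("https://example.com/", SITE, max_crawl_depth=1)
```

In this sketch, a depth of 0 visits only the start URL, which matches the documented behavior that 0 does not crawl any nested links. A depth of 1 also visits the pages linked directly from the start URL.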

Version: 0.0.32

Tags

Preview

View in Studio: https://ml.azure.com/registries/azureml/components/llm_rag_crawl_url/version/0.0.32

Inputs

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| url | URL to crawl. | string | | False | |
| max_crawl_depth | Maximum depth to crawl. 0 doesn't crawl any nested links. | integer | 1 | True | |
| max_crawl_time | Maximum time in seconds to crawl. | integer | 60 | True | |
| max_download_time | Maximum time in seconds to wait for a page to download. | integer | 15 | True | |
| max_file_size | Maximum file size in bytes to download. | integer | 5000000 | True | |
| max_redirects | Maximum number of redirects to follow. | integer | 3 | True | |
| max_files | Maximum number of files to download. | integer | 1000 | True | |
| support_http | Whether to support crawling http links. | boolean | False | True | |
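The inputs and their defaults can be mirrored in code when preparing values to pass to the component. The following is a sketch only: the `CrawlUrlInputs` class is hypothetical and not part of any Azure SDK, and the non-negative check is an assumption rather than documented validation. The field names and default values are taken from the inputs listed above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CrawlUrlInputs:
    """Hypothetical container for the llm_rag_crawl_url inputs."""

    url: str                        # required, no default
    max_crawl_depth: int = 1        # 0 doesn't crawl any nested links
    max_crawl_time: int = 60        # seconds
    max_download_time: int = 15     # seconds per page
    max_file_size: int = 5_000_000  # bytes
    max_redirects: int = 3
    max_files: int = 1000
    support_http: bool = False

    def __post_init__(self):
        if not self.url:
            raise ValueError("url is required")
        # Assumption: negative limits are treated as invalid in this sketch.
        for name, value in asdict(self).items():
            if isinstance(value, int) and not isinstance(value, bool) and value < 0:
                raise ValueError(f"{name} must be non-negative, got {value}")


# Only url must be supplied; every other input falls back to its default.
inputs = CrawlUrlInputs(url="https://example.com/", max_crawl_depth=2)
input_values = asdict(inputs)
```

The resulting `input_values` dictionary uses the same keys as the component's input names, so it can be adapted to whatever job submission mechanism is in use.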

Outputs

| Name | Description | Type |
| --- | --- | --- |
| output_path | Where to save crawled data. | uri_folder |

Environment

azureml:llm-rag-embeddings@latest
