CrawlingStocksReports - GachonCapstoneTeam/TTS_JAVA GitHub Wiki

📄 Crawling Stock Report

Stock Report Crawling: this code crawls the stock analysis reports (종목분석 리포트) published on Naver Reports, following the layout of the stock analysis report pages. The crawling is done with BeautifulSoup, and the crawled data is stored in MongoDB.


✅ Key Features

| Feature | Description |
| --- | --- |
| Stock analysis report crawling | Iterates over the list pages of Naver Securities stock analysis reports and collects each report's title, summary, date, and so on |
| Body content collection | Visits each report's detail page and parses the body content |
| PDF content merging | If a PDF report exists, downloads the PDF, converts it to text, and merges it with the body content |
| Storage in MongoDB | Stores all crawled data in a MongoDB collection so it can be used for analysis and lookup |
| Deduplication | Skips reports that already exist in the DB and inserts only new data |
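
The deduplication step listed in the table could look roughly like the sketch below. It assumes a report is identified by its Title, 증권사 (securities firm), and 작성일 (date) fields; the page does not say which fields the project actually compares, and `report_key` and `filter_new_reports` are hypothetical helper names.

```python
def report_key(report):
    """Build a hashable identity for a report (the key fields are an assumption)."""
    return (report.get("Title"), report.get("증권사"), report.get("작성일"))


def filter_new_reports(crawled, existing_keys):
    """Return only the reports whose key is not already stored."""
    seen = set(existing_keys)
    fresh = []
    for report in crawled:
        key = report_key(report)
        if key in seen:
            continue  # already in the DB, or repeated within this crawl
        seen.add(key)
        fresh.append(report)
    return fresh
```

With pymongo, `existing_keys` could be built from a `collection.find(...)` query over the same fields, and a non-empty result passed to `collection.insert_many(...)`.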

[Image: report crawling]


🔧 Main Logic

Crawling and storing all reports - fetch_stock_reports()

def fetch_stock_reports(category_name, category_url, pages, industry_data)
  • Collects stock analysis reports by category and returns the results as a list, including the stock name, securities firm, industry, and body content

for page in range(1, pages + 1):
    url = f"{category_url}?&page={page}"
    ...
  • Loops over the given number of pages, requesting and parsing each report list page in order
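
As a minimal sketch of the paging loop above, the helper below only builds the list-page URLs and leaves the HTTP request out. `build_page_urls` is a hypothetical name, and the example URL is a placeholder, not the project's actual category URL.

```python
def build_page_urls(category_url, pages):
    """Return one list-page URL per page number, starting at page 1."""
    # Mirrors the "?&page=N" format used in the snippet above.
    return [f"{category_url}?&page={page}" for page in range(1, pages + 1)]


urls = build_page_urls("https://example.com/research/company_list", 3)
```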

table = soup.find("table", {"class": "type_1"})
  • Finds the table tag on the Naver report page in which the reports are actually listed

for row in table.find_all("tr")[2:]:
    ...
  • Extracts the report information from the tr elements that follow the header
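
The effect of the `[2:]` slice can be illustrated with a small self-contained sketch. To stay runnable without third-party packages, it uses the standard library's `html.parser` in place of BeautifulSoup. The sample HTML is invented and much simpler than the real page, and it assumes the first two rows are a header row and a spacer row.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell texts of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


SAMPLE_HTML = """
<table class="type_1">
  <tr><th>Stock</th><th>Title</th></tr>
  <tr><td colspan="2"></td></tr>
  <tr><td>Stock A</td><td>First report</td></tr>
  <tr><td>Stock B</td><td>Second report</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE_HTML)
data_rows = collector.rows[2:]  # skip the two leading non-data rows
```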

stock_link = cols[0].find("a", class_="stock_item")
  • If the stock name is wrapped in an <a> tag, extracts the stock code and name from the link; otherwise uses only the text
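
The branch described above can be sketched as follows. The sketch assumes the link's href carries the stock code in a `code` query parameter; the page does not show the real URL format. `parse_stock_cell` is a hypothetical helper that takes the already-extracted text and href, not a BeautifulSoup element.

```python
from urllib.parse import parse_qs, urlparse


def parse_stock_cell(text, href=None):
    """Return (item_name, code); code is None when there is no usable link."""
    item_name = text.strip()
    if not href:
        return item_name, None  # plain-text cell: only the name is available
    query = parse_qs(urlparse(href).query)
    codes = query.get("code")  # assumed parameter name
    return item_name, (codes[0] if codes else None)
```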

pdf_content = "" if pdf_url == "PDF 없음" else pdf.download_and_process_pdf2(pdf_url, company)
report_content = fetch_report_details(detail_url)
  • If a PDF URL exists, downloads the PDF and converts it to text with the download_and_process_pdf2 function
  • The body of the HTML report is collected as well
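
A sketch of how the two text sources might be gathered and combined is shown below. `collect_report_text` is a hypothetical helper. The PDF downloader and the detail-page fetcher are passed in as callables so the example runs without network access. Joining the two texts with a blank line is an assumption, since the page says the contents are merged but not how.

```python
NO_PDF = "PDF 없음"  # sentinel from the snippet above, meaning "no PDF"


def collect_report_text(pdf_url, detail_url, company, fetch_pdf, fetch_details):
    """Return (report_content, pdf_content, merged_text) for one report."""
    # Only call the PDF downloader when a real URL is present.
    pdf_content = "" if pdf_url == NO_PDF else fetch_pdf(pdf_url, company)
    report_content = fetch_details(detail_url)
    # Assumption: merge by joining the non-empty parts with a blank line.
    merged = "\n\n".join(part for part in (report_content, pdf_content) if part)
    return report_content, pdf_content, merged
```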

industry_list = [
    industry for industry, stock_list in industry_data.items() if item_name in stock_list
]
industry_value = industry_list[0] if industry_list else "Unknown"
  • Matches the industry by stock name; stocks with no industry data are classified as "Unknown"
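
The lookup above can be wrapped in a small function, sketched here under the assumption that `industry_data` maps each industry name to a list of stock names, as the snippet's use of `.items()` and `in stock_list` suggests. `find_industry` is a hypothetical name and the sample data is invented.

```python
def find_industry(item_name, industry_data):
    """Return the first industry whose stock list contains item_name."""
    for industry, stock_list in industry_data.items():
        if item_name in stock_list:
            return industry  # same result as taking industry_list[0]
    return "Unknown"


sample_industry_data = {
    "Semiconductors": ["Stock A", "Stock B"],
    "Banking": ["Stock C"],
}
```

Returning on the first match avoids building the full list when only its first element is used.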

reports.append({
    'Category': category_name,
    '종목명': item_name,        # stock name
    '코드': code,               # stock code
    '업종': industry_value,     # industry
    'Title': title,
    '증권사': company,          # securities firm
    'PDF URL': pdf_url,
    '작성일': date,             # publication date
    'Views': views,
    'Content': report_content,
    'PDF Content': pdf_content,
})
  • ํ•˜๋‚˜์˜ ๋ฆฌํฌํŠธ๋ฅผ ๋”•์…”๋„ˆ๋ฆฌ ํ˜•ํƒœ๋กœ ๊ตฌ์„ฑํ•˜๊ณ , ์ „์ฒด ๋ฆฌ์ŠคํŠธ์— ์ถ”๊ฐ€

🔗 Related Repositories