CrawlingIndustryReport - GachonCapstoneTeam/TTS_JAVA GitHub Wiki

📄 Crawling - Industry Report

Industry Report Crawling collects the industry-analysis reports provided by Naver Research; the code follows the screen layout of the industry-analysis report pages. Crawling is done with BeautifulSoup, and the crawled data is stored in MongoDB.


✅ Key Features

| Feature | Description |
| --- | --- |
| Industry report crawling | Iterates over the list pages of Naver industry-analysis reports, collecting key fields such as sector name, title, date, summary, and PDF link |
| PDF content extraction | Downloads the PDF file attached to each report and extracts its text for storage |
| HTML body collection | Additionally collects the HTML body text from each report's detail page |
| Securities-firm filtering | Only reports from the major securities firms listed in `pdf.SECURITIES_CONFIGS` are crawled |
| MongoDB storage | All crawled data is stored in a MongoDB collection for later analysis and lookup |
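The MongoDB storage step can be sketched as follows. This is a minimal sketch assuming `pymongo`; the `research` / `industry_reports` database and collection names, the connection URL, and the upsert key are assumptions, not taken from the project. Keying the upsert on title plus date keeps re-crawls from inserting duplicates:

```python
def report_filter(report):
    """Upsert filter: title plus date identifies a report across re-crawls (assumed key)."""
    return {"Title": report["Title"], "작성일": report["작성일"]}

def save_reports(reports, collection):
    """Insert new reports; overwrite previously crawled versions of the same report."""
    for report in reports:
        collection.update_one(report_filter(report), {"$set": report}, upsert=True)

# Usage (deployment details are assumptions; adjust to the actual setup):
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://localhost:27017")["research"]["industry_reports"]
#   save_reports(crawled_reports, coll)
```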

๋ฆฌํฌํŠธ แ„แ…ณแ„…แ…ฉแ†ฏแ„…แ…ตแ†ผ


🔧 Main Logic

Crawling and storing all reports - fetch_industry_reports()

def fetch_industry_reports(category_name, category_url, pages):
  • Collects industry-analysis reports by category and returns a list of results including stock name, securities firm, sector, and body content

for page in range(1, pages + 1):
    url = f"{category_url}?&page={page}"
    ...
  • Iterates for the given number of pages, requesting and parsing each report list page in sequence
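The page loop can be sketched as two small helpers. Only the URL construction mirrors the original f-string; `fetch_list_page` and its `User-Agent` header are illustrative assumptions about how each page would be requested:

```python
def list_page_urls(category_url, pages):
    """Build one list-page URL per page; Naver research pages are 1-indexed."""
    return [f"{category_url}?&page={page}" for page in range(1, pages + 1)]

def fetch_list_page(url):
    """Illustrative fetch; a browser-like User-Agent is often needed for finance.naver.com."""
    import requests
    from bs4 import BeautifulSoup
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")
```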

if not table:
    print(f"Table not found for URL: {url}")
    continue
  • If no table is found on the page, that page is skipped

for row in table.find_all("tr")[2:]:
    ...
  • ํ—ค๋”๋ฅผ ์ œ์™ธํ•œ ์š”์†Œ์—์„œ ๋ฆฌํฌํŠธ ์ •๋ณด๋ฅผ ์ถ”์ถœ

stock_link = cols[0].find("a", class_="stock_item")
if stock_link:
    item_name = stock_link.text.strip()  # stock name
    code = stock_link["href"].split("=")[-1]  # stock code
else:
    item_name = cols[0].text.strip()
    code = None  # no stock code available
  • Industry-analysis reports are not per-stock reports, so there is no separate stock code
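The branch above can be exercised on a small HTML fragment; this standalone helper restates the same logic, and the sample `<td>` markup is illustrative rather than copied from Naver:

```python
from bs4 import BeautifulSoup

def parse_first_col(col):
    """Return (name, code); industry rows have no stock_item link, so code is None."""
    stock_link = col.find("a", class_="stock_item")
    if stock_link:
        return stock_link.text.strip(), stock_link["href"].split("=")[-1]
    return col.text.strip(), None

# Company-report rows carry a stock link; industry rows carry plain sector text.
linked = BeautifulSoup(
    '<td><a class="stock_item" href="/item/main.naver?code=005930">삼성전자</a></td>',
    "html.parser").td
plain = BeautifulSoup("<td> 반도체 </td>", "html.parser").td
```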

item_name = cols[0].text.strip()  # here item_name is the sector name, so it is stored under '업종'
title = cols[1].text.strip()
detail_link = cols[1].find("a")["href"]
detail_url = f"https://finance.naver.com/research/{detail_link}" if not detail_link.startswith("http") else detail_link
company = cols[2].text.strip()  # securities firm
  • Crawls the fields needed to build the result list
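The conditional that builds `detail_url` can be isolated into a helper; the base URL comes from the original code, while the relative path in the test call is hypothetical:

```python
def absolute_detail_url(detail_link, base="https://finance.naver.com/research/"):
    """The list page mixes relative and absolute links; normalize them to absolute URLs."""
    return detail_link if detail_link.startswith("http") else f"{base}{detail_link}"
```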

pdf_content = "" if pdf_url == "PDF 없음" else pdf.download_and_process_pdf2(pdf_url, company)
report_content = fetch_report_details(detail_url)
  • If a PDF URL exists, the PDF is downloaded and converted to text via the download_and_process_pdf2 function
  • The HTML report body is also collected
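`download_and_process_pdf2` lives in the project's `pdf` module and is not shown on this page. As a rough stand-in, the guard around the "PDF 없음" ("no PDF") sentinel can be separated from the actual extractor, which is injected as a callable (the extractor could be e.g. `pypdf` — an assumption, not necessarily what the project uses):

```python
def extract_pdf_content(pdf_url, downloader):
    """The list page uses the literal 'PDF 없음' when a report has no attachment."""
    if pdf_url == "PDF 없음":
        return ""
    # downloader fetches the PDF and returns its extracted text
    return downloader(pdf_url)
```

In the crawler, `downloader` would play the role of `lambda url: pdf.download_and_process_pdf2(url, company)`.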

reports.append({
    'Category': category_name,
    '종목명': "",  # industry-analysis reports leave the stock name empty
    '업종': item_name,  # item_name is stored as the sector
    'Title': title,
    '증권사': company,
    'PDF URL': pdf_url,
    '작성일': date,
    'Views': views,
    'Content': report_content,
    'PDF Content': pdf_content,
})
  • Each report is assembled as a dictionary and appended to the overall list

🔗 Related Repositories