Python_5

§ Web scraping

https://github.com/imguru-mooc/python_intermediate/tree/master/3.%EB%A9%94%ED%83%80%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D

1. Sending a request to a web server and receiving the response

  • Browser -----→ request -----→ web server
  • Browser ←---- response ←---- web server
  import requests

  url = "http://github.seccae.com/"
  resp = requests.get(url)            # send an HTTP GET request to the server
  print(resp)                         # the Response object, e.g. <Response [200]>
  print(resp.__dict__)                # all attributes of the Response object

  html = resp.text                    # response body decoded as a string
  print(html)

  url2 = "http://github.seccae.com/"
  resp2 = requests.get(url2)          # a second, identical request
  print(resp2)
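  • Instead of dumping resp.__dict__, the individual attributes of the Response object are usually easier to read. A minimal sketch of the commonly used ones, using http://example.com/ as a stand-in URL:

  import requests

  resp = requests.get("http://example.com/")

  print(resp.status_code)               # HTTP status code, e.g. 200
  print(resp.headers['Content-Type'])   # response headers (case-insensitive dict)
  print(resp.encoding)                  # encoding used when decoding resp.text
  print(resp.text[:200])                # first part of the decoded body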
  • Robots exclusion standard (robots.txt)

    • If you go to http://naver.com/robots.txt, Naver's robots.txt is downloaded.
    • In the rules below, Disallow: / blocks every path, but Allow: /$ permits the URL that ends right after the slash (the $ anchors the end of the path), so only the root page itself is open to crawlers.
  • Even without clicking through a web site by hand, a parser lets you pull the data out of its pages programmatically.

User-agent: *
Disallow: /
Allow : /$ 
  import requests

  url = "http://github.seccae.com/"
  filename = "robots.txt"

  file_path = url + filename          # e.g. http://github.seccae.com/robots.txt
  print(file_path)
  resp = requests.get(file_path)      # fetch the site's robots.txt
  print(resp.text)                    # the crawling rules as plain text
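  • Instead of reading the downloaded robots.txt by eye, the standard library module urllib.robotparser can apply the rules for you. A minimal sketch against the Naver rules quoted above; note that the stock parser implements the original robots.txt standard and may not honor the non-standard $ end-of-URL wildcard, so its answers can differ from a Google-style reading of the same file.

  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser()
  rp.set_url("http://naver.com/robots.txt")
  rp.read()                                        # download and parse the rules

  # can_fetch(user_agent, url) answers whether that agent may crawl the URL
  print(rp.can_fetch("*", "http://naver.com/"))
  print(rp.can_fetch("*", "http://naver.com/somepage"))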

2. Creating a BeautifulSoup object

  • BeautifulSoup is an HTML parser for Python; it is a third-party library installed as beautifulsoup4 and imported from the bs4 package.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Seoul_Metropolitan_Subway"
resp = requests.get(url)
html_src = resp.text

soup = BeautifulSoup(html_src, 'html.parser')   # parse the page with the stdlib html.parser
print(type(soup))                               # <class 'bs4.BeautifulSoup'>
print("\n")

print(soup.head)        # the <head> element of the page
print("\n")
print(soup.body)        # the <body> element of the page
print("\n")

print('title tag element: ', soup.title)
print('title tag name: ', soup.title.name)
print('title tag string: ', soup.title.string)
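
  • Beyond single-tag shortcuts like soup.title, the find() and find_all() methods search the whole parsed tree. A minimal sketch that pulls the links out of the same Wikipedia page (how many there are depends on the live page):

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Seoul_Metropolitan_Subway"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

first_link = soup.find('a')            # first <a> tag in the document
all_links = soup.find_all('a')         # list of every <a> tag
print(first_link)
print(len(all_links))

# a Tag exposes its attributes like a dict; get() returns None if href is missing
for link in all_links[:10]:
    print(link.get('href'))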