Python_5

§ Web scraping

https://github.com/imguru-mooc/python_intermediate/tree/master/3.%EB%A9%94%ED%83%80%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D

1. Sending a request to a web server and receiving the response

  • Browser -----→ request -----→ web server
  • Browser ←---- response ←---- web server
  import requests

  url = "http://github.seccae.com/"
  resp = requests.get(url)            # send an HTTP GET request to the server
  print(resp)                         # the Response object, e.g. <Response [200]>
  print(resp.__dict__)                # all attributes of the Response object

  html = resp.text                    # response body decoded as a string
  print(html)

  url2 = "http://github.seccae.com/"
  resp2 = requests.get(url2)          # a second, identical request
  print(resp2)
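  • Instead of dumping resp.__dict__, the individual attributes of the Response object are usually easier to read. A minimal sketch of the commonly used ones, using http://example.com/ as a stand-in URL:

  import requests

  resp = requests.get("http://example.com/")

  print(resp.status_code)               # HTTP status code, e.g. 200
  print(resp.headers['Content-Type'])   # response headers (case-insensitive dict)
  print(resp.encoding)                  # encoding used when decoding resp.text
  print(resp.text[:200])                # first part of the decoded body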
  • Robots exclusion standard (robots.txt)

    • If you go to http://naver.com/robots.txt, Naver's robots.txt is downloaded.
    • In the rules below, Disallow: / blocks every path, but Allow: /$ permits the URL that ends right after the slash (the $ anchors the end of the path), so only the root page itself is open to crawlers.
  • Even without clicking through a web site by hand, a parser lets you pull the data out of its pages programmatically.

User-agent: *
Disallow: /
Allow : /$ 
  import requests

  url = "http://github.seccae.com/"
  filename = "robots.txt"

  file_path = url + filename          # e.g. http://github.seccae.com/robots.txt
  print(file_path)
  resp = requests.get(file_path)      # fetch the site's robots.txt
  print(resp.text)                    # the crawling rules as plain text
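  • Instead of reading the downloaded robots.txt by eye, the standard library module urllib.robotparser can apply the rules for you. A minimal sketch against the Naver rules quoted above; note that the stock parser implements the original robots.txt standard and may not honor the non-standard $ end-of-URL wildcard, so its answers can differ from a Google-style reading of the same file.

  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser()
  rp.set_url("http://naver.com/robots.txt")
  rp.read()                                        # download and parse the rules

  # can_fetch(user_agent, url) answers whether that agent may crawl the URL
  print(rp.can_fetch("*", "http://naver.com/"))
  print(rp.can_fetch("*", "http://naver.com/somepage"))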

2. Creating a BeautifulSoup object

  • BeautifulSoup is an HTML parser for Python; it is a third-party library installed as beautifulsoup4 and imported from the bs4 package.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Seoul_Metropolitan_Subway"
resp = requests.get(url)
html_src = resp.text

soup = BeautifulSoup(html_src, 'html.parser')   # parse the page with the stdlib html.parser
print(type(soup))                               # <class 'bs4.BeautifulSoup'>
print("\n")

print(soup.head)        # the <head> element of the page
print("\n")
print(soup.body)        # the <body> element of the page
print("\n")

print('title tag element: ', soup.title)
print('title tag name: ', soup.title.name)
print('title tag string: ', soup.title.string)
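
  • Beyond single-tag shortcuts like soup.title, the find() and find_all() methods search the whole parsed tree. A minimal sketch that pulls the links out of the same Wikipedia page (how many there are depends on the live page):

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Seoul_Metropolitan_Subway"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

first_link = soup.find('a')            # first <a> tag in the document
all_links = soup.find_all('a')         # list of every <a> tag
print(first_link)
print(len(all_links))

# a Tag exposes its attributes like a dict; get() returns None if href is missing
for link in all_links[:10]:
    print(link.get('href'))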