Python_5 - jjin-choi/study_note GitHub Wiki
§ Web scraping
1. Sending a request to a web server and receiving a response
- Browser -----→ request -----→ web server
- Browser ←---- response ←---- web server
```python
import requests

url = "http://github.seccae.com/"
resp = requests.get(url)
print(resp)           # Response object, e.g. <Response [200]>
print(resp.__dict__)  # all attributes of the Response object

html = resp.text      # response body as a string
print(html)

# A second request works the same way
url2 = "http://github.seccae.com/"
resp2 = requests.get(url2)
print(resp2)
```
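The request/response round trip above can be exercised without touching a real site by pointing `requests` at a throwaway local server. A minimal sketch, assuming nothing beyond the standard library plus `requests` (the handler, port, and HTML body are all made up for illustration):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class Hello(BaseHTTPRequestHandler):
    """Tiny hypothetical handler that always returns one HTML page."""
    def do_GET(self):
        body = b"<html><head><title>demo</title></head><body>hi</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 asks the OS for any free port; serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/"
resp = requests.get(url)
print(resp.status_code)               # 200
print(resp.ok)                        # True for 2xx/3xx statuses
print(resp.headers["Content-Type"])   # header lookup is case-insensitive
print(resp.text[:30])                 # body decoded to str, like resp.text above
server.shutdown()
```

The same `status_code`, `ok`, `headers`, and `text` attributes are what `print(resp.__dict__)` dumps in the snippet above.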
Robots Exclusion Standard (robots.txt)
- Visiting http://naver.com/robots.txt downloads naver's robots.txt.
- In the rule set below, everything under the root is disallowed, but `Allow: /$` makes an exception for URLs that end exactly at `/` (the `$` anchors the end of the URL). In other words, only the front page itself may be crawled.
- Even without clicking through a site by hand, a parser lets you collect its data programmatically.
```
User-agent: *
Disallow: /
Allow: /$
```
```python
import requests

url = "http://github.seccae.com/"
filename = "robots.txt"

file_path = url + filename
print(file_path)                # http://github.seccae.com/robots.txt
resp = requests.get(file_path)
print(resp.text)                # contents of robots.txt
```
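Instead of reading robots.txt by eye, the standard library can interpret it for you. A minimal sketch with `urllib.robotparser`, using a made-up rule set and `example.com` URLs; note that `urllib.robotparser` does plain prefix matching and does not understand the `$` end-anchor extension shown above, so the sketch uses simple prefix rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: /private/ is off limits, the rest is allowed.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # against a live site: rp.set_url(".../robots.txt"); rp.read()

print(rp.can_fetch("*", "https://example.com/"))           # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

Checking `can_fetch()` before each `requests.get()` is the polite way to honor the Robots Exclusion Standard in a scraper.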
2. Creating a BeautifulSoup object
- BeautifulSoup (the third-party `bs4` package) is a widely used HTML parser for Python.
```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Seoul_Metropolitan_Subway"
resp = requests.get(url)
html_src = resp.text
soup = BeautifulSoup(html_src, 'html.parser')
print(type(soup))   # <class 'bs4.BeautifulSoup'>
print("\n")
print(soup.head)    # the <head> element
print("\n")
print(soup.body)    # the <body> element
print("\n")
print('title tag element: ', soup.title)
print('title tag name: ', soup.title.name)
print('title tag string: ', soup.title.string)
```
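Beyond `soup.title`, the usual next step is searching the parse tree with `find()` / `find_all()`. A self-contained sketch on a made-up HTML snippet (standing in for a downloaded page, so it runs without network access):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for resp.text from a real page.
html_src = """
<html>
  <head><title>Line map</title></head>
  <body>
    <ul>
      <li class="line" id="l1">Line 1</li>
      <li class="line" id="l2">Line 2</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html_src, "html.parser")

print(soup.title.string)              # Line map
first = soup.find("li")               # first matching tag only
print(first["id"], first.get_text())  # l1 Line 1

# find_all returns every match; class_ avoids the Python keyword "class"
names = [li.get_text() for li in soup.find_all("li", class_="line")]
print(names)                          # ['Line 1', 'Line 2']
```

`find_all()` with a tag name plus `class_` or `id` filters is the workhorse for pulling repeated items (rows, links, list entries) out of a scraped page.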