Web Crawling - kamchur/note GitHub Wiki

[Reference links]
server status

  • crawling : collecting specific data from web pages and classifying it

  • scraping : collecting data

  • bot : stores page information in a DB ahead of time; when a user sends a request, the result is pulled from the DB, so searches are fast

  • POST carries its data in the request payload, so it can hold a lot

  • GET carries its data inside the URL, so it cannot hold much (the character limit depends on the browser)

  • A JSON array can be thought of as a Python 'list'

  • A JSON object can be thought of as a Python 'dictionary'

  • With HTML, a 10,000-character response may contain only about 1,000 characters of usable data,
    whereas a JSON response might use 1,000 of its 2,000 characters. JSON is denser and smaller, so it transfers faster
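The points above about JSON types and about GET versus POST can be sketched with the standard library alone. The payload text and the `example.com` address are made up for illustration; nothing is sent over the network.

```python
import json
from urllib.parse import urlencode

# A JSON object becomes a Python dict; a JSON array becomes a Python list.
payload_text = '{"keywords": ["twitter", "tweet"], "timeUnit": "month"}'
payload = json.loads(payload_text)
print(type(payload).__name__)              # dict
print(type(payload["keywords"]).__name__)  # list

# GET: the parameters are encoded into the URL itself.
query = urlencode({"q": "python", "page": 2})
get_url = "https://example.com/search?" + query
print(get_url)  # https://example.com/search?q=python&page=2

# POST: the parameters travel in the request body instead of the URL.
post_body = json.dumps(payload).encode("utf-8")
print(len(post_body), "bytes would be sent in the body")
```

Because the POST data sits in the body, its size is not tied to the URL length limit that constrains GET.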


Crawling policy

url = */robots.txt

User-agent : a description of the client that is making the request.
It lets the server see which client is asking: its OS,
its browser, and the browser version. The client sends this information
to the server with every request, under the name User-Agent.

Developers of the server-side web application can read this value.
User-agent: *    # the rules that follow apply to every crawler (all user agents)
Baiduspider : the crawler of Baidu, the Chinese counterpart of Google
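As a rough illustration of how such rules are read, the standard library's `urllib.robotparser` can evaluate a robots.txt file. The rules, the crawler name `my-crawler`, and the `example.com` URLs below are invented for the example and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content: everyone is kept out of /private/,
# and Baiduspider is kept out of the whole site.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Baiduspider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

public_ok = parser.can_fetch("my-crawler", "https://example.com/public/page")
private_ok = parser.can_fetch("my-crawler", "https://example.com/private/page")
baidu_ok = parser.can_fetch("Baiduspider", "https://example.com/public/page")
print(public_ok, private_ok, baidu_ok)  # True False False
```

For a live site, `set_url()` followed by `read()` would load the real file instead of the inline text.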

There is no legal penalty for crawling as such
Excessive crawling that degrades a service can become a problem: obstruction of business
Taking a company's proprietary assets can amount to intellectual property infringement

When crawling, it is best to use an API whenever possible
- 1. requests : json : analyze the web page's API traffic and collect the data : naver stock
- 2. requests : json : get an application key for an officially provided API and collect the data : naver api(papago, trend)
- 3. requests : html, BeautifulSoup(css selector) : download the web page's html code and collect the data : gmarket
- 4. selenium : browser - python : control a browser from Python code and collect the data : ted
- Preferred order when crawling : 2 > 1 > 3 > 4 
Learning HTML and CSS
Why learn them? Browsers work with three languages:
HTML, CSS, JS(Javascript)

> Role of HTML
HTML defines the text content and where each part of the layout goes
It also defines links, i.e. where a click takes you

> Role of CSS
Text size and, for a button, its color, size, and style

> Role of JS
Handles events: what should happen when a button is clicked

To pick out or click a piece of data
you need to know where it sits in the HTML.
The notation for describing that position is called a CSS Selector.
Once you know it,
when the client receives data (html) from the server
you can use a CSS Selector to point at a specific position
and collect just that data 
What is correlation analysis? 
- An analysis method that checks what relationship exists between two sets of data
- It is measured with the 'correlation coefficient', which can only be computed on numeric columns, not on 'object' dtype columns

Correlation
The closer to `1`, the stronger the positive correlation
The closer to `-1`, the stronger the negative correlation
The closer to `0`, the weaker the (linear) relationship
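To make the three cases concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The helper name `pearson` and the tiny number lists are illustrative assumptions; in practice pandas' `DataFrame.corr()` computes the same quantity.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

base = [1.0, 2.0, 3.0, 4.0]
positive = pearson(base, [2.0, 4.0, 6.0, 8.0])    # moves together
negative = pearson(base, [8.0, 6.0, 4.0, 2.0])    # moves in opposite directions
unrelated = pearson(base, [1.0, -1.0, -1.0, 1.0]) # no linear relationship
print(positive, negative, unrelated)
```

The inputs must already be numbers, which is why string columns such as `'1,234.5'` have to be converted to `float` first.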
URL analysis

https:// - Protocol
naver.com - Domain
news - sub Domain
80 - port (80 is the default for http; https defaults to 443)
/main/ - path 
read.nhn - page
?mode=LSD&mid.... - query
#da_727145 - fragment	> jumps to a position inside the page, the same way clicking a `###` heading name on velog moves you to that heading
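The same breakdown can be reproduced with `urllib.parse`. The URL below is assembled from the parts listed above as an assumption: the scheme is written as `http` so that it matches port 80, and the query is shortened to `mode=LSD` because the rest of the original query is not given.

```python
from urllib.parse import urlsplit, parse_qs

# Illustrative URL built from the parts in the list; the query is deliberately shortened.
url = "http://news.naver.com:80/main/read.nhn?mode=LSD#da_727145"
parts = urlsplit(url)

print(parts.scheme)           # http
print(parts.hostname)         # news.naver.com
print(parts.port)             # 80
print(parts.path)             # /main/read.nhn
print(parse_qs(parts.query))  # {'mode': ['LSD']}
print(parts.fragment)         # da_727145
```

The fragment is handled by the browser only; it is not sent to the server as part of the request.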

requests

check the url
request the data
parse the data

requests example:naver trend search
import json

import pandas as pd
import requests

# Application key values: placeholders, replace them with the values issued for your Naver application
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"

# Integrated search trend
url = "https://openapi.naver.com/v1/datalab/search"
params = {
    "startDate": "2018-01-01",
    "endDate": "2022-01-31",
    "timeUnit": "month",
    "keywordGroups": [
        {"groupName": "트위터", "keywords": ["트위터", "트윗"]},          # Twitter
        {"groupName": "페이스북", "keywords": ["페이스북", "페북"]},      # Facebook
        {"groupName": "인스타그램", "keywords": ["인스타그램", "인스타"]},  # Instagram
    ]
}
# application key value
headers = {
    'Content-Type':'application/json',
    'X-Naver-Client-Id':CLIENT_ID,
    'X-Naver-Client-Secret':CLIENT_SECRET,
}

# Request data from the server : `json.dumps()` turns params into a JSON string (Korean text is escaped as \uXXXX)
response = requests.post(url, data=json.dumps(params), headers=headers)    # <Response [200]>

# Parse the received data (change its form)
datas = response.json()["results"]

# Convert to DataFrames
dfs = []
for data in datas:
    df = pd.DataFrame(data["data"])
    df["title"] = data["title"] # 트위터, 페이스북, 인스타그램
    dfs.append(df)

# Processing : assumes the results come back in the same order as keywordGroups
df = pd.DataFrame({
    'date': [period['period'] for period in datas[0]['data']],
    'twitter': [ratio['ratio'] for ratio in datas[0]['data']],
    'facebook': [ratio['ratio'] for ratio in datas[1]['data']],
    'instagram': [ratio['ratio'] for ratio in datas[2]['data']],
})

# show chart
df.plot(figsize=(20,5))
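The reshaping step can also be sketched without pandas. The sample below only imitates the shape that the code above reads from `response.json()["results"]` (a `title` plus a `data` list of `period` and `ratio`); the titles and ratio values are invented, and `to_wide_rows` is a hypothetical helper name.

```python
# Invented sample shaped like response.json()["results"]; the numbers are not real trend data.
results = [
    {"title": "twitter", "data": [{"period": "2018-01-01", "ratio": 10.0},
                                   {"period": "2018-02-01", "ratio": 12.5}]},
    {"title": "facebook", "data": [{"period": "2018-01-01", "ratio": 40.0},
                                    {"period": "2018-02-01", "ratio": 38.0}]},
]

def to_wide_rows(results):
    """Return one dict per period, with one column per keyword-group title."""
    rows = {}
    for group in results:
        for point in group["data"]:
            row = rows.setdefault(point["period"], {"date": point["period"]})
            row[group["title"]] = point["ratio"]
    return [rows[period] for period in sorted(rows)]

rows = to_wide_rows(results)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give one column per keyword group. Keying each column by `title` avoids relying on the position of a group inside the list.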


request example:daum exchange
import pandas as pd
import requests

url = "https://finance.daum.net/api/exchanges/summaries"
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "referer": "https://finance.daum.net/exchanges",
}
response = requests.get(url, headers=headers)

datas = response.json()["data"]
df = pd.DataFrame(datas)
columns = ["date", "currencyCode", "currencyName", "country", "name", "basePrice"]
df[columns].head()


request example:zigbang latitude, longitude
import requests
import pandas as pd
import geohash2

addr = "망원동"    # Mangwon-dong
url = f"https://apis.zigbang.com/v2/search?leaseYn=N&q={addr}&serviceType=원룸"    # 원룸 : one-room (studio)
response = requests.get(url)
data = response.json()["items"][0]
lat, lng = data["lat"], data["lng"]

# Encode the latitude and longitude into a geohash, !pip install geohash2
geohash = geohash2.encode(lat, lng, precision=5)

# Look up listing ids by geohash (전세|월세 : lump-sum deposit lease | monthly rent)
url = f"https://apis.zigbang.com/v2/items?deposit_gteq=0&domain=zigbang\
&geohash={geohash}&needHasNoFiltered=true&rent_gteq=0&sales_type_in=전세|월세\
&service_type_eq=원룸"
response = requests.get(url)
datas = response.json()["items"]
# len(datas), datas[0]
ids = [data["item_id"] for data in datas]

# Fetch the listing details by listing id
url = "https://apis.zigbang.com/v2/items/list"
params = {
    "domain": "zigbang",
    "withCoalition": "true",
    "item_ids": ids
}
response = requests.post(url, params)

# parsing
datas = response.json()["items"]
df = pd.DataFrame(datas)

columns = ["item_id", "sales_type", "deposit", "rent", "size_m2", "floor", "building_floor",
           "address1", "manage_cost"]
filtered_column_df = df[columns]

# Keep only the rows whose address contains 망원동
result_df = filtered_column_df[filtered_column_df["address1"].str.contains("망원동")]
result_df = result_df.reset_index(drop=True)
result_df.tail(2)
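For intuition about what the geohash step produces, here is a from-scratch sketch of the standard geohash encoding. It is not the source of the `geohash2` package; `geohash_encode` is a hypothetical name, and the test coordinates (42.6, -5.6) are a widely used textbook example rather than anything from zigbang.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, precision=5):
    """Encode a latitude/longitude pair into a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:      # upper half: emit 1 and raise the lower bound
            value = (value << 1) | 1
            rng[0] = mid
        else:                 # lower half: emit 0 and lower the upper bound
            value = value << 1
            rng[1] = mid
        use_lng = not use_lng
        bits += 1
        if bits == 5:         # every 5 bits become one base32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

print(geohash_encode(42.6, -5.6, precision=5))  # ezs42
```

Each extra character narrows the cell, so nearby points share a prefix. That is why a 5-character geohash can be used to ask for all listings in one area.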


Collecting KRW/USD exchange rate data
import pandas as pd
import requests

def usa_data(pagesize, page, code):
    url = f'https://api.stock.naver.com/marketindex/exchange/{code}/prices?page={page}&pageSize={pagesize}'
    response = requests.get(url)
    data = response.json()
    return pd.DataFrame(data)

# Fetch page 1 with a page size of 60 rows
usd = usa_data(60, 1, 'FX_USDKRW')    # USD/KRW code

# Preprocessing
# kospi and kosdaq : DataFrames with a closePrice column, collected beforehand (that code is not included in this note)
df = kospi.copy()   # copy() makes an independent DataFrame, so the edits below do not change kospi itself
df['kosdaq'] = kosdaq['closePrice']
df['usd'] = usd['closePrice']
df = df.rename(columns={'closePrice':'kospi'})

# Change the column data type : str > float
# df[column].apply() : returns the result of passing every value through the function
df['kospi'] = df['kospi'].apply(lambda data:float(data.replace(',', '')))
df['kosdaq'] = df['kosdaq'].apply(lambda data:float(data.replace(',', '')))
df['usd'] = df['usd'].apply(lambda data:float(data.replace(',', '')))

df[['kospi', 'kosdaq', 'usd']].corr()
# kospi - kosdaq : 0.984 : close to 1, a strong positive correlation
# kospi - usd : -0.878 : close to -1, a strong negative correlation 


CSS Selector

BeautifulSoup

BeautifulSoup is what lets you use CSS selectors from Python

# select a single element : select_one()
# tag name : span
# id : #kt
# class : .kt
# attr : [value='kt']

# select multiple elements : select()
# :not(.kt1)   : elements that do not have the class kt1
# :nth-child(n)
# .wrap > p   : select from the elements one level below
# .wrap p   : select from all descendant elements
# .kt1, .kt2   : select every element that matches either selector

How to select the n-th element
: .py:nth-child(2)
    ※ Caution: this does not select the 2nd of the .py elements;
    it selects an element that is the 2nd child of its parent and also has the class .py

Hierarchical element selection (3 ways)
    1. All descendant elements: use a plain space ( )
        (.wrap p)	# p elements anywhere inside .wrap
    2. Elements one level below (.wrap > p)
    3. Multiple selectors (.no1, .no2)	# selects the elements of both classes
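The selector rules above can be tried on a small inline document. The HTML snippet, its class names, and the `texts` helper are made up for this sketch; it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML: three .py elements directly under .wrap, plus one nested paragraph.
html = """
<div class="wrap">
  <p class="py">one</p>
  <span class="py">two</span>
  <p class="py">three</p>
  <div><p class="no1">nested</p></div>
</div>
"""
dom = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Text of every element matched by the selector, in document order."""
    return [element.text for element in dom.select(selector)]

print(texts(".py:nth-child(2)"))  # ['two'] : the 2nd child of its parent that also has class py
print(texts(".wrap > p"))         # ['one', 'three'] : direct children only
print(texts(".wrap p"))           # ['one', 'three', 'nested'] : any depth
print(texts(".no1, span"))        # ['two', 'nested'] : either selector
print(texts("p:not(.py)"))        # ['nested'] : p elements without class py
```

Note that `.py:nth-child(2)` returns the `span`, not the second `p`, which is the caution described in the note.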

selector example:collect naver relation keyword
import pandas as pd
import requests
from bs4 import BeautifulSoup

# URL
keyword = 'kt'
url = f'https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query={keyword}'

# A static page returns 'HTML' rather than 'JSON', so the response has to be parsed
response = requests.get(url)
dom = BeautifulSoup(response.text, 'html.parser')

# select() : select several elements
# select_one() : select one element
elements = dom.select('.lst_related_srch > .item')   # the BeautifulSoup object is created so that CSS selectors can be used

# Collect the text of all 10 items
# keywords = [element.select_one('.tit').text for element in elements]
keywords = []
for element in elements:
    keyword = element.select_one('.tit').text
    keywords.append(keyword)
# keywords
# ['삼성전자', 'kt 고객센터', 'kt 인터넷', 'ky', '환율', 'kr', '날씨', 'kt 대리점', 'SKT', 'kt 고객센터 전화번호']

selector example:gmarket best 200
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Developer tools > Network tab > refresh > click BestSellers > 
url = 'http://corners.gmarket.co.kr/Bestsellers'

response = requests.get(url)

# HTML(str) > BeautifulSoup Object
dom = BeautifulSoup(response.text, 'html.parser')
type(dom)     # bs4.BeautifulSoup

# Select the 200 product elements
# Elements tab > click the element > 
elements = dom.select('#gBestWrap > div > div:nth-child(5) > div > ul > li')    # drop the nth-child part from the final li to select every li
len(elements)    # 200

# Turn the contents of the elements into a DataFrame
element = elements[0]

# Build one dictionary per item
# The image URL is in the img tag's data-original attribute
items = []
for element in elements:
    data = {
    'title':element.select_one('.itemname').text,
    'o_price':element.select_one('.o-price').text,
    's_price':element.select_one('.s-price > strong').text,
    'img':'http:' + element.select_one('img').get('data-original'),
    }
    items.append(data)

df = pd.DataFrame(items)
df.tail(2)



Image download code
import pandas as pd
import requests, os
# os : package for working with the filesystem (deleting, copying, moving files)

# create the directory
path = 'data'
if not os.path.exists(path):
    os.makedirs(path)    # creates a folder named data

# load the csv file (it contains the image URL paths)
df = pd.read_csv('gmarket.csv')

# read the topmost image URL http://gdimg.gmarket.co.kr/2266434001/still/30...
img_link = df.loc[0, 'img']     # loc[row label, column name]

# image download : fetch the image with requests
response = requests.get(img_link)

# save the image
with open(f'{path}/test.png', 'wb') as file:   # writing: wb (write binary), reading: rb
    file.write(response.content)    # the data is binary rather than text, so use 'content'

# display the image
from PIL import Image as pil
pil.open(f'{path}/test.png')

# save several files with generated names
for idx, data in df[:3].iterrows():
    # filename = '0' * (3 - len(str(idx))) + str(idx) + '.png'
    filename = '%03d.png' % (idx)
    print(idx,filename, data['img'])
    response = requests.get(data['img'])
    with open(f'{path}/{filename}', 'wb') as file:
        file.write(response.content)

pil.open(f'{path}/001.png')
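The save loop can be exercised without any network access. In this sketch the byte strings stand in for `response.content` (they are not valid PNG data), `save_blobs` is a hypothetical helper, and a temporary directory replaces the `data` folder.

```python
import tempfile
from pathlib import Path

def save_blobs(blobs, directory):
    """Write each bytes object as 000.png, 001.png, ... and return the written paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    paths = []
    for idx, blob in enumerate(blobs):
        target = directory / f"{idx:03d}.png"     # same zero padding as '%03d.png' % idx
        target.write_bytes(blob)                  # binary write, like mode 'wb'
        paths.append(target)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    fake_images = [b"first", b"second", b"third"]  # placeholders for downloaded bytes
    saved = save_blobs(fake_images, Path(tmp) / "data")
    names = [p.name for p in saved]
    sizes = [p.stat().st_size for p in saved]
    print(names)  # ['000.png', '001.png', '002.png']
```

Zero-padded names keep the files in numeric order when they are sorted alphabetically.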


selenium

Collects data through a browser
A package used, for example by QA teams, to automate testing and inspection
It collects data by controlling a browser from Python


Setup

Search > chrome driver
Check your Chrome browser version
Click the three dots at the top right
Help > About Chrome
104.0.5112.79 (only the leading three digits need to match)
Download chromedriver_win32.zip

chromedriver is needed so that
calling `Chrome()` from selenium's `webdriver`, as below,
can open a `Chrome` browser window

# find_element() == BeautifulSoup.select_one() , find_elements() == BeautifulSoup.select()
# pandas and numpy are Python-only, whereas selenium supports many languages and browsers
# !pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
# open the browser
driver = webdriver.Chrome()

# go to a page
driver.get('url')

# resize the browser window
driver.set_window_size(200, 600)

# scroll the browser (html css javascript) > run javascript code
driver.execute_script('window.scrollTo(200, 300)') 

# scroll back to the top
driver.execute_script('window.scrollTo(0, 0)')

# handling alerts : a web page may pop up a small dialog, and crawling can only continue once it is closed
# open an alert
driver.execute_script("alert('hello selenium!');")

# close the alert
alert = driver.switch_to.alert
alert.accept()

# type a string into an input box; open the site separately and
# use the developer tools to find the selector in the html
driver.find_element(By.CSS_SELECTOR, '#q').send_keys('파이썬')    # '파이썬' means 'Python'; '#q' because inspecting the element showed id='q'

# click the search button; classes chained with `.` (no space) select the element that has both classes
driver.find_element(By.CSS_SELECTOR, '.inner_search > .ico_pctop.btn_search').click()

# close the browser
driver.quit()

Text data collection:TED
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# open the browser
driver = webdriver.Chrome()

# go to the page
driver.get('https://ted.com/talks')

# collect the title text
# css-selector : #banner-secondary
sub_title = driver.find_element(By.CSS_SELECTOR, '#banner-secondary').text   # .text is needed to get the title string

# click Korean in the select box
# #languages > optgroup > [lang='ko'] or #languages [lang='ko'] : the space makes the selector search all descendants
driver.find_element(By.CSS_SELECTOR, "#languages [lang='ko']").click()

# once the Korean list is shown, collect the title and link of every talk
elements = driver.find_elements(By.CSS_SELECTOR, '#browse-results > .row > div')
len(elements)    # 36

# first item in the list
element = elements[0]

# .ga-link alone matches too many elements, so it does not give the exact value
# h4 is the tag of the parent element, written in front of the class selector
title = element.find_element(By.CSS_SELECTOR, 'h4 > .ga-link').text     
link = element.find_element(By.CSS_SELECTOR, 'h4 > .ga-link').get_attribute('href')   # read an attribute value of the element

# build a DataFrame
data = []
for element in elements:
    data.append({
        'title' : element.find_element(By.CSS_SELECTOR, 'h4 > .ga-link').text,
        'link' : element.find_element(By.CSS_SELECTOR, 'h4 > .ga-link').get_attribute('href'),
    })

df = pd.DataFrame(data)


HeadLess

A way to crawl by running the browser only in memory, without showing it on screen
Makes selenium usable in environments without window (GUI) support
Chrome version : supported from 60 onward
In headless mode no browser is shown on screen, but internally it still has a viewport
and computes the rendered page. The headless viewport size is 800x600

# check the Chrome version
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
version = driver.capabilities['browserVersion']
driver.quit()
version
Collecting data without opening a browser window
# speed is about the same whether or not the browser window is shown
options = webdriver.ChromeOptions()    # holds the Chrome launch options
options.add_argument('headless')    # run without a visible window; the window size would be a separate 'window-size=width,height' argument

# with these options set, data can be collected without a browser window appearing
driver = webdriver.Chrome(options=options)    
driver.get('https://ted.com/talks')
sub_title = driver.find_element(By.CSS_SELECTOR, '#banner-secondary').text
driver.quit()
sub_title

iframe: collecting posts from Naver Joonggonara
# an iframe embeds another web page inside the page
# a well-known service built on crawling is the Banksalad app

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# open the web browser and go to the page
# url : address of the Naver Joonggonara cafe (not given in this note)
driver = webdriver.Chrome()
driver.get(url)

# elements may not be found while the page is still rendering, so give the data time to load
time.sleep(3)   # 3 seconds

# 2. Type '맥북' (MacBook) in the search box and click the search button
# + selenium was made for testing web pages, so clicking an area that is not visible on the page
# can raise an error; scroll first so that the target layout is visible
# send_keys types text into the search box
# search box ID : topLayerQueryInput
keyword = '맥북'
driver.find_element(By.CSS_SELECTOR, '#topLayerQueryInput').send_keys(keyword)

# code that clicks the search button
# searchBoard();return false; is the button's javascript onClick handler; calling it directly is also much faster
driver.execute_script('searchBoard();')

# collect the post list (very important part ★) : iframe
selector = '.article-board'    # table tbody tr
elements = driver.find_elements(By.CSS_SELECTOR, '.article-board > table > tbody > tr')
len(elements)    # 0 means the elements could not be reached

# They cannot be selected because, walking up from the post list, it sits inside an iframe: a page within the page
# Data inside an iframe cannot be reached from the outer document.
# This looks similar to the focus problem between an Android Activity and a popup window

# move the driver into the iframe, id = 'cafe_main'
iframe = driver.find_element(By.CSS_SELECTOR, "#cafe_main")    # select the iframe element
iframe

# switch into that frame
driver.switch_to.frame(iframe)  

### to return to the original default frame, call this after the data below has been collected
# driver.switch_to.default_content()

# after moving the driver into the iframe, the same selector now finds the data
selector = '.article-board'
elements = driver.find_elements(By.CSS_SELECTOR, '.article-board > table > tbody > tr')
len(elements)   # 15

# get the title and writer of each post
data = []
for element in elements:
    data.append({
        'title':element.find_element(By.CSS_SELECTOR, ".article").text,
        'writer':element.find_element(By.CSS_SELECTOR, '.p-nick').text,
    })
    
df = pd.DataFrame(data)
df.tail(2)

# close the browser
driver.quit()    # releases the memory so the computer can use it for other work
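A fixed `time.sleep(3)` either waits too long or not long enough. One alternative is to poll until the elements appear, sketched here in plain Python. `wait_until` and `fake_find_elements` are hypothetical names, and the fake finder only simulates a page that finishes rendering on the third check.

```python
import time

def wait_until(condition, timeout=3.0, poll=0.1):
    """Call condition() repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Stand-in for find_elements on a page that is still rendering: empty twice, then 15 rows.
attempts = {"count": 0}

def fake_find_elements():
    attempts["count"] += 1
    return ["row"] * 15 if attempts["count"] >= 3 else []

rows = wait_until(fake_find_elements, timeout=2.0, poll=0.01)
print(len(rows), attempts["count"])  # 15 3
```

With a real driver the condition could be `lambda: driver.find_elements(By.CSS_SELECTOR, selector)`. Selenium also ships `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.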



scrapy


sklearn

premierleague.csv : uses the reference file provided by instructor Park Doojin (박두진)

Points-prediction AI model

import pandas as pd

df = pd.read_csv('premierleague.csv')

feature = df[['gf', 'ga']]    # goals for, goals against
target = df[['points']]

# !pip install scikit-learn

from sklearn.linear_model import LinearRegression
import numpy as np
# model object
model = LinearRegression().fit(feature, target)
# the fitted model lives in RAM; to keep using it, it has to be saved to disk

# prediction: with 80 goals for and 36 goals against, how many points?
model.predict([[80, 36]])    # array([[78.90247916]])

np.round(model.predict([[80, 36]]))   # array([[79.]])
# when the computer is turned off, the model in RAM is lost, so the model is saved to disk
# model > save to SSD or HDD
# how to save the model object (memory > storage)
import pickle

# ram > ssd
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)    # model is the points-prediction model fitted above

# ssd > ram
with open('model.pkl', 'rb') as file:
    load_model = pickle.load(file)

np.round(load_model.predict([[80, 36]]))    # array([[79.]])
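For intuition about what the fit step computes, here is a minimal ordinary least squares sketch in plain Python, solved through the normal equations. It is not scikit-learn's implementation. The function names are hypothetical, and the toy table is invented so that points = gf - 0.5 * ga + 20 holds exactly; it is not taken from premierleague.csv.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_least_squares(features, targets):
    """Return [coef_1, ..., coef_k, intercept] that minimizes the squared error."""
    rows = [list(f) + [1.0] for f in features]  # the trailing 1.0 is the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Invented toy data: (goals for, goals against) and points generated from a known formula.
features = [(80, 36), (60, 50), (45, 45), (70, 60), (30, 70)]
targets = [gf - 0.5 * ga + 20 for gf, ga in features]

coef_gf, coef_ga, intercept = fit_least_squares(features, targets)
prediction = coef_gf * 80 + coef_ga * 36 + intercept
print(round(coef_gf, 6), round(coef_ga, 6), round(intercept, 6), round(prediction, 6))
```

Because the toy targets follow the formula exactly, the fit recovers its coefficients. Real match data is noisy, so a fitted model only approximates the points.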
