Datalake : s3, parquet - helloMinji/chatbot_spotify GitHub Wiki

  • Import the required packages
import json
from datetime import datetime
import boto3        # AWS SDK
import jsonpath     # extracts only specific JSON paths; pip3 install jsonpath --user
import pandas as pd
import requests

image

1. Using the top-tracks API

1. Define the required keys in advance

  • The jsonpath package returns the value that matches a key within the given path.
  • This avoids parquet errors and fetches only the values that are needed.
    • key: the name of the value to fetch
    • value: the key in the API response, i.e. the path (the items marked in red; see the image)

The value of external_urls is a dictionary whose key is spotify, so it is addressed with a dotted path in the following form.

	top_track_keys = {
		"id": "id",
		"name": "name",
		"popularity": "popularity",
		"external_url": "external_urls.spotify"
	}
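
To see what a dotted path such as `external_urls.spotify` selects, the lookup can be imitated in plain Python. This is only an illustrative stand-in, not the `jsonpath` package itself: the helper `resolve_path` and the sample track are hypothetical and made up for this sketch.

```python
def resolve_path(record, path):
    """Follow a dotted path through nested dictionaries.

    Hypothetical stand-in for illustration; the wiki code uses the
    jsonpath package instead. Returns None when a step is missing.
    """
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# Made-up track shaped like one element of the top-tracks response.
sample_track = {
    "id": "track123",
    "name": "Example Song",
    "popularity": 80,
    "external_urls": {"spotify": "https://open.spotify.com/track/track123"},
}

top_track_keys = {
    "id": "id",
    "name": "name",
    "popularity": "popularity",
    "external_url": "external_urls.spotify",
}

# One flat record per track: output column name -> value found at the path.
flat = {k: resolve_path(sample_track, v) for k, v in top_track_keys.items()}
```

The nested `external_urls` dictionary is reduced to a single flat column, which is the property that keeps the later parquet conversion simple.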


2. Fetch data through the API

    top_tracks = []

Each pass of the for loop builds a URL containing an id fetched from the cursor (part 1 of the code).

    for (id, ) in cursor.fetchall():
        URL = "https://api.spotify.com/v1/artists/{}/top-tracks".format(id)
        params = {
            'country': 'US'
        }
        r = requests.get(URL, params=params, headers=headers)
        raw = json.loads(r.text) # store the data returned by the API


3. Update the data

The for loop that was added when jsonpath was introduced

    for i in raw['tracks']:
  • Each element of the value stored under the tracks key of the raw dictionary is assigned to i in turn.
    (Everything from the start to the end of the part marked with a pink v is that value, i.e. i!)
  • Type of i: dictionary

Update the id, name, popularity, and external_url data

    top_track = {}
    
    for k, v in top_track_keys.items():
        top_track.update({k: jsonpath.jsonpath(i, v)})
  • From the dictionary i, fetch the value whose key (path) is v.
  • That is the jsonpath result! So k and the jsonpath result are added to top_track as a key-value pair.

artist_id is already known, so jsonpath is not used for it

        top_track.update({'artist_id': id})  
        top_tracks.append(top_track)
        
    track_ids = [i['id'][0] for i in top_tracks]
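
The `[0]` in `track_ids` is there because `jsonpath.jsonpath` hands back a list of matches rather than a bare value. As far as the `jsonpath` package is concerned, a path with no match appears to produce `False` instead of a list; treat that as an assumption. Under that assumption, a small guard like the hypothetical `first_match` below avoids an error on tracks where a path did not resolve. The records are hand-written to mimic what the loop stores.

```python
def first_match(result):
    """Return the first match from a jsonpath-style result, or None.

    Assumes the result is either a list of matches or a falsy value
    (False or an empty list) when the path matched nothing.
    """
    if not result:
        return None
    return result[0]


# Hand-written records that mimic what the loop above stores in top_tracks.
top_tracks = [
    {"id": ["track1"], "name": ["Song A"], "popularity": [71], "artist_id": "artistX"},
    {"id": ["track2"], "name": ["Song B"], "popularity": [64], "artist_id": "artistX"},
    {"id": False, "name": False, "popularity": False, "artist_id": "artistX"},
]

# Keep only the ids that actually resolved.
track_ids = [
    first_match(t["id"]) for t in top_tracks if first_match(t["id"]) is not None
]
```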


4. DataFrame -> Parquet ⭐

  • Because parquet (a columnar, compressed format) is a better fit than json for storage on s3
    top_tracks = pd.DataFrame(top_tracks)
    top_tracks.to_parquet('top-tracks.parquet', engine='pyarrow', compression="snappy")
  • Only the required fields were selected above, so the parquet conversion raises no errors.
  • Usually the data is saved as json first; when new data arrives, it is converted to parquet using an already-defined schema and placed in a different bucket. This keeps the process completely error-free!
    👉 Building this process is what a data pipeline is 👈
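
The "already-defined schema" idea can be pictured as a fixed set of columns that every incoming record is coerced into before it is written to parquet. The sketch below uses plain Python only; `TOP_TRACK_SCHEMA` and `normalize_record` are hypothetical names, and the column types are assumptions rather than the schema actually used in this project.

```python
# Assumed target schema: column name -> type to cast to.
TOP_TRACK_SCHEMA = {
    "id": str,
    "name": str,
    "popularity": int,
    "external_url": str,
    "artist_id": str,
}


def normalize_record(record, schema=TOP_TRACK_SCHEMA):
    """Keep only the schema columns, cast each value, fill gaps with None."""
    row = {}
    for column, cast in schema.items():
        value = record.get(column)
        row[column] = None if value is None else cast(value)
    return row


# A made-up incoming record with a wrongly typed field and extra nested data.
raw_record = {
    "id": "track1",
    "name": "Song A",
    "popularity": "71",
    "artist_id": "artistX",
    "album": {"name": "Some Album"},
}

row = normalize_record(raw_record)
```

Because every row ends up with the same columns and types, a list of such rows can be turned into a DataFrame whose parquet schema does not change from one load to the next.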


5. Save to s3 ⭐

    dt = datetime.utcnow().strftime("%Y-%m-%d")
  • utcnow: the current date and time in UTC

Create a partition for the fetched data

    s3 = boto3.resource('s3')
    object = s3.Object('spotify-artists', 'top-tracks/dt={}/top-tracks.parquet'.format(dt))
  • argument1: the bucket name
  • argument2: the object key, which creates the partition.
    Partitioning by date makes it possible to fetch the most recent data or the data for any desired period.
    with open('top-tracks.parquet', 'rb') as data:  # closes the file after the upload
        object.put(Body=data)
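
The date-partitioned key layout can be tried out without touching s3. In this sketch, `partition_key` and `latest_partition` are hypothetical helpers, and the dates are made up. It uses `datetime.now(timezone.utc)`, a timezone-aware alternative to `utcnow()`, which is deprecated since Python 3.12.

```python
from datetime import datetime, timezone


def partition_key(prefix, day, filename):
    """Build a key of the form '<prefix>/dt=YYYY-MM-DD/<filename>'."""
    return "{}/dt={}/{}".format(prefix, day.strftime("%Y-%m-%d"), filename)


def latest_partition(keys):
    """Return the key with the most recent dt= value, or None if empty."""
    def dt_of(key):
        for part in key.split("/"):
            if part.startswith("dt="):
                return part[3:]
        return ""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return max(keys, key=dt_of) if keys else None


today_key = partition_key(
    "top-tracks", datetime.now(timezone.utc), "top-tracks.parquet"
)

# Made-up keys standing in for objects already stored in the bucket.
keys = [
    partition_key("top-tracks", datetime(2020, 3, 1), "top-tracks.parquet"),
    partition_key("top-tracks", datetime(2020, 3, 15), "top-tracks.parquet"),
    partition_key("top-tracks", datetime(2020, 2, 28), "top-tracks.parquet"),
]
newest = latest_partition(keys)
```

Keeping the date in `YYYY-MM-DD` form is what lets "most recent" be found with a simple string comparison.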

image

2. Using the Audio features API (batch)

1. Create the batches

        tracks_batch = [track_ids[i: i+100] for i in range(0, len(track_ids), 100)] # the audio-features endpoint accepts at most 100 ids per request

        audio_features = []
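
The slicing above can be checked on made-up ids. The sketch wraps the same list comprehension in a hypothetical `make_batches` helper and shows how the last, shorter batch comes out.

```python
def make_batches(items, size=100):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# 250 made-up track ids: expect two full batches and one batch of 50.
track_ids = ["track{}".format(n) for n in range(250)]
tracks_batch = make_batches(track_ids)
batch_sizes = [len(batch) for batch in tracks_batch]

# Each batch is later joined with commas for the ids query parameter.
ids_param = ",".join(tracks_batch[-1][:3])
```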

2. Fetch data using the batches

        for i in tracks_batch:
            ids = ','.join(i)
            URL = "https://api.spotify.com/v1/audio-features/?ids={}".format(ids)
            r = requests.get(URL, headers=headers)
            raw = json.loads(r.text)

3. Update the data

            audio_features.extend(raw['audio_features'])
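
Spotify endpoints that take several ids are documented as returning a `null` placeholder for an id that cannot be resolved. Assuming that also applies to this response, those entries would otherwise end up in the list passed to `pd.DataFrame`. The sketch below filters them out first; `collect_features` is a hypothetical helper and the response is hand-written.

```python
def collect_features(raw, collected):
    """Append the non-null audio feature objects from one response.

    Returns how many objects were added.
    """
    found = [f for f in raw.get("audio_features", []) if f is not None]
    collected.extend(found)
    return len(found)


# Hand-written response with one unresolved id (None after json.loads).
raw = {
    "audio_features": [
        {"id": "track1", "danceability": 0.7},
        None,
        {"id": "track3", "danceability": 0.4},
    ]
}

audio_features = []
added = collect_features(raw, audio_features)
```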

4. DataFrame -> Parquet ⭐

        audio_features = pd.DataFrame(audio_features)
        audio_features.to_parquet('audio-features.parquet', engine='pyarrow', compression='snappy')

5. Save to s3 ⭐

        s3 = boto3.resource('s3')
        object = s3.Object('spotify-artists', 'audio-features/dt={}/audio-features.parquet'.format(dt))
        with open('audio-features.parquet', 'rb') as data:  # closes the file after the upload
            object.put(Body=data)