Deep RL for SS RTB

Deep Reinforcement Learning for Sponsored Search Real-time Bidding

Paper: https://arxiv.org/abs/1803.00259
See also: Paper Collection of Real-Time Bidding

Overview

  • A reinforcement learning solution to the dynamic (constantly changing) auction environment
  1. How to handle the changing-environment problem?
    • Designs a robust MDP model over hour-level auction data
    • Aggregates impression data appropriately (e.g. by hour) and observes periodic patterns in the state transitions
  2. Proposes a control-by-model framework for SS-RTB
    • Rather than predicting bid prices directly, it builds a bidding model for the impressions of each hour and bids in real time through that model.
    • That is, it learns an optimal policy over the parameters that control the real-time bidding model.

    <-> control-by-action, where the policy would emit a bid price (action) per auction

  3. Massive-agent learning algorithm
    • Reward: a competitive (private) reward combined with a cooperative (global) reward

Problem Definition

  • Goal
    • Maximize the purchase amount PUR_AMT_d earned as income on day d while minimizing the cost COST_d
    • Constraint: PUR_AMT_d must not fall below the advertiser's expectation g
  • Formulation (written out in display form below):
    • max PUR_AMT_d / COST_d s.t. PUR_AMT_d >= g
    • equivalently, max PUR_AMT_d s.t. COST_d = c
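
In display form (a transcription of the two bullets above; the second program is the first with the cost pinned at a fixed budget c, so only the numerator remains to be maximized):

```latex
\max \frac{\mathrm{PUR\_AMT}_d}{\mathrm{COST}_d}
\quad \text{s.t.} \quad \mathrm{PUR\_AMT}_d \ge g
\qquad\Longleftrightarrow\qquad
\max\ \mathrm{PUR\_AMT}_d
\quad \text{s.t.} \quad \mathrm{COST}_d = c
```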

Dataset

  • 1000 large ads from Alibaba's search auction platform
    • Two days of data (one day's collection for training, the other day's for testing)
    • About 100 million (100M) auctions per day

Methodology

Robust MDP model

  • According to Figure 2 (Patterns of Different Level Clicks), the hour-level curves show the most similar patterns across the two days, which motivates the hour-level aggregation.
  • State s: <b, t, auct>
    • b: the budget left for the ad
      • not the budget preset by the advertiser
      • the cost the ad expects to expend in the remaining auction steps
    • t: the step number in the sequence
    • auct: feature vector
      • aggregated statistical features of auctions in time period t
      • click_cnt, imp_cnt, cost, ctr, cvr, ppc, etc.
  • Action a
    • the decision of generating the real-time bidding price for each auction
    • rather than generating bid prices directly,
    • control-by-model: a linear approximator
    • Set alpha (see the sketch after this list)
      • opt_bidprice = alpha * PCVR

      PCVR: predicted conversion rate
      Previous studies have shown that the optimal bid price has a linear relationship with impression-level evaluations (e.g. CTR).

  • Reward r(s, a)
    • The income (in terms of PUR_AMT) gained according to a specific action a under state s
  • Episode ep
    • one day
    • 24 steps per day (one per hour); each episode is expected to share the same general pattern of experience
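
A minimal sketch of the control-by-model action, as referenced in the Action item above. The linear form opt_bidprice = alpha * PCVR is from the paper; the function names, the concrete alpha value, and the PCVR numbers are hypothetical illustration:

```python
# Control-by-model: the agent's hourly action sets/adjusts alpha; the linear
# model alpha * PCVR then prices every auction of that hour in real time.
# (The alpha value and PCVR numbers below are illustrative assumptions.)

def bid_price(alpha: float, pcvr: float) -> float:
    # Linear bidding model from the paper: opt_bidprice = alpha * PCVR.
    return alpha * pcvr

def run_hour(alpha: float, pcvrs: list[float]) -> list[float]:
    # One agent decision per hour-level step t, not one per auction
    # (contrast: control-by-action would emit a bid price per auction).
    return [bid_price(alpha, p) for p in pcvrs]

# Example: alpha set to 120.0 for this hour; three impressions arrive with
# predicted conversion rates 0.01, 0.03, 0.002.
print(run_hour(120.0, [0.01, 0.03, 0.002]))  # ≈ [1.2, 3.6, 0.24]
```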

Algorithm

See Algorithm 1 (DQN Learning) in the paper.
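
For reference, a condensed generic DQN sketch (experience replay plus a periodic target-network sync), written here in PyTorch. This is not the authors' code: the layer sizes, learning rate, optimizer, memory size, batch size, and sync period follow the hyper-parameter table below, while GAMMA, N_ACTIONS, and the contents of the replay memory are placeholder assumptions:

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA = 15, 7, 0.99  # N_ACTIONS and GAMMA: placeholders

def make_q_net() -> nn.Module:
    # 4-layer DNN with sizes [15, 300, 200, 100] (hyper-parameter table below)
    # and an output head over the discrete actions.
    return nn.Sequential(
        nn.Linear(STATE_DIM, 300), nn.ReLU(),
        nn.Linear(300, 200), nn.ReLU(),
        nn.Linear(200, 100), nn.ReLU(),
        nn.Linear(100, N_ACTIONS),
    )

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-4)  # per the table
memory: deque = deque(maxlen=1_000_000)  # replay memory size per the table

def train_step(step: int, batch_size: int = 300) -> None:
    if len(memory) < batch_size:
        return
    # Sample a minibatch of (s, a, r, s') tensors from replay memory.
    s, a, r, s2 = map(torch.stack, zip(*random.sample(memory, batch_size)))
    with torch.no_grad():
        # Bellman target: r + gamma * max_a' Q_target(s', a')
        # (terminal-state masking omitted for brevity).
        y = r + GAMMA * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10_000 == 0:  # target network update period per the table
        target_net.load_state_dict(q_net.state_dict())
```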

Massive-agent Model

Figure 3: Massive-agent Framework

  • Deploying an independent agent per ad, each learning an optimal policy over its own state -> global performance drops because the agents compete with each other
  • Core idea: a private competitive objective plus a public cooperative objective (see the sketch below)
    1. Cooperative reward: the global reward of all the agents
    2. Competitive reward: the agent's own reward
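
A toy sketch of how the two reward terms could be blended per agent. The split into cooperative (global) and competitive (own) terms is from the paper; the simple weighted sum and the weight w are assumptions made for illustration:

```python
# Toy blend of the massive-agent reward terms. The weighted sum and the
# weight `w` are illustrative assumptions; the paper's exact combination
# rule may differ.

def mixed_reward(step_rewards: list[float], i: int, w: float = 0.5) -> float:
    competitive = step_rewards[i]    # agent i's own (private) reward
    cooperative = sum(step_rewards)  # global reward shared by all agents
    return w * competitive + (1.0 - w) * cooperative

# Example: three ads' step rewards. Agent 0's training signal mixes its own
# income with the fleet-wide income, so a bidding war that raises its private
# reward while lowering everyone else's is penalized.
print(mixed_reward([1.0, 0.4, 0.2], i=0))  # ≈ 0.5*1.0 + 0.5*1.6 = 1.3
```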

Hyper-parameter settings

| Hyper-parameter | Setting |
| --- | --- |
| Target network update period | 10,000 steps |
| Memory size | 1 million |
| Learning rate | 0.0001 (RMSProp) |
| Batch size | 300 |
| Network structure | 4-layer DNN |
| DNN layer sizes | [15, 300, 200, 100] |

Evaluation

  • Models compared
    • Baseline: Keyword-level bidding (KB)
    • RL with auction-level MDP (AMDP)
    • RL with robust MDP (RMDP)
    • Massive-agent RL with robust MDP (M-RMDP)
  • Evaluation metric
    • PUR_AMT / COST under the same COST constraint

Offline

(i) How does the DRL algorithm work for search auction data?
(ii) Does RMDP outperform AMDP under changing environment?
(iii) Is the multi-agent problem well handled by M-RMDP?

  1. Single-agent analysis
  2. Multi-agent analysis

Online

  • Metrics
    • Key metric: PUR_AMT / COST
    • Additional metrics: CVR, ROI, PPC
  1. Single-agent analysis
  2. Multi-agent analysis
  3. Convergence analysis
    • Converges only after roughly 150M sample batches have been processed -> the DQN algorithm is data-expensive
    • Finding the optimal policy requires a massive amount of data and episodes