Deep RL for SS RTB
Paper: https://arxiv.org/abs/1803.00259
Paper Collection of Real-Time Bidding
- A reinforcement learning solution for the dynamic-environment problem
- How to handle a changing environment?
- Designs a robust MDP model over hour-level auction data
- Aggregates impression data appropriately (e.g. by hour) and observes periodic patterns in the state transitions
- Proposes a control-by-model framework for SS-RTB
- Rather than predicting the bid price directly, it builds a bidding model for the impressions of the given hour and bids in real time with that model.
- That is, it learns an optimal policy over the parameter that controls the real-time bidding model
- (in contrast to control-by-action)
- Massive-agent learning algorithm
- Reward: competitive reward + cooperative reward
- Goal
- Maximize the purchase amount PUR_AMT_d that comes in as revenue on day d, and minimize the cost COST_d
- Constraint: PUR_AMT_d must not be smaller than the advertiser's expectation g
- Formula (written out in LaTeX below):
- max PUR_AMT_d / COST_d  s.t.  PUR_AMT_d >= g
- which reduces to: max PUR_AMT_d  s.t.  COST_d = c
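The same objective written out; nothing beyond the notation above is assumed:

```math
\max \; \frac{\mathrm{PUR\_AMT}_d}{\mathrm{COST}_d}
\quad \text{s.t.} \quad \mathrm{PUR\_AMT}_d \ge g
\qquad \Longrightarrow \qquad
\max \; \mathrm{PUR\_AMT}_d
\quad \text{s.t.} \quad \mathrm{COST}_d = c
```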
- 1000 large ads from Alibaba's search auction platform
- Two days of data (one day for training, the other day for testing)
- About 100 million (100M) auctions per day
- According to Figure 2 (Patterns of Different Level Clicks), the hour-level curves show the most similar pattern across the two days.
- State s: <b, t, auct>
- b: the budget left for the ad
- not the budget preset by the advertiser
- but the cost that the ad expects to spend in the remaining auction steps
- t: the step number in the sequence
- auct: a feature vector
- aggregated statistical features of the auctions in time period t
- e.g. click_cnt, imp_cnt, cost, ctr, cvr, ppc (a rough aggregation sketch follows below)
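A rough sketch of how such hour-level aggregation could be computed, assuming a pandas DataFrame of per-impression logs with hypothetical column names (`timestamp`, `click`, `conversion`, `cost`); the paper itself only names the aggregated features.

```python
import pandas as pd

def hourly_auction_features(logs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-impression logs into hour-level features (click_cnt, imp_cnt, cost, ctr, cvr, ppc)."""
    g = logs.groupby(logs["timestamp"].dt.hour)    # one group per hour of the day
    feats = pd.DataFrame({
        "click_cnt": g["click"].sum(),             # clicks in the hour
        "imp_cnt": g.size(),                       # impressions in the hour
        "cost": g["cost"].sum(),                   # total spend in the hour
        "conv_cnt": g["conversion"].sum(),         # conversions (helper column)
    })
    feats["ctr"] = feats["click_cnt"] / feats["imp_cnt"]
    feats["cvr"] = feats["conv_cnt"] / feats["click_cnt"]
    feats["ppc"] = feats["cost"] / feats["click_cnt"]
    return feats.drop(columns=["conv_cnt"])
```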
- Action a
- the decision of generating the real-time bid price for each auction
- rather than generating the bid price directly,
- control-by-model: a linear approximator (sketched below)
- the action sets the parameter alpha
- opt_bidprice = alpha * PCVR
- PCVR: predicted conversion rate
- Previous studies have shown that the optimal bid price has a linear relationship with an impression-level evaluation (e.g. CTR)
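A minimal sketch of what control-by-model means here, using a hypothetical helper name: the hourly action fixes a single alpha, and every auction arriving in that hour is priced in real time as alpha * PCVR, instead of predicting a bid per auction.

```python
from typing import Iterable, List

def price_auctions(alpha: float, pcvrs: Iterable[float]) -> List[float]:
    """Apply one model parameter alpha to all auctions of the hour: bid = alpha * PCVR."""
    return [alpha * pcvr for pcvr in pcvrs]

# Example: one alpha chosen by the policy for this hour, three auctions arriving in it.
print(price_auctions(2.5, [0.010, 0.002, 0.030]))  # -> [0.025, 0.005, 0.075]
```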
- Reward r(s, a)
- The income (in terms of PUR_AMT) gained by taking a specific action a under state s
- Episode ep
- 1 day
- 24 steps per day; each episode is expected to share the same general experience (see the MDP sketch below)
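A minimal sketch of the resulting hour-level MDP, assuming a hypothetical `HourlySimulator` helper (not from the paper): state <b, t, auct>, action alpha, reward = the PUR_AMT gained in that hour, episode = one day of 24 steps.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class State:
    b: float                # budget left for the ad (expected remaining spend)
    t: int                  # step number within the day (0..23)
    auct: Sequence[float]   # aggregated auction features for hour t

class HourlySimulator:
    """Hypothetical stand-in for replaying one hour of logged auction data."""

    def features(self, t: int) -> Sequence[float]:
        # Placeholder hour-level features (click_cnt, imp_cnt, cost, ctr, cvr, ppc).
        return [0.0] * 6

    def run_hour(self, t: int, alpha: float) -> Tuple[float, float]:
        # Placeholder outcome: (pur_amt, cost) obtained by bidding alpha * PCVR all hour.
        return 0.0, 0.0

def run_episode(policy: Callable[[State], float], sim: HourlySimulator, budget: float) -> float:
    """Roll out one episode (one day = 24 hourly steps) and return the total PUR_AMT."""
    b, total_pur_amt = budget, 0.0
    for t in range(24):
        s = State(b=b, t=t, auct=sim.features(t))
        alpha = policy(s)                        # action: set alpha for this hour
        pur_amt, cost = sim.run_hour(t, alpha)
        total_pur_amt += pur_amt                 # reward r(s, a) = PUR_AMT gained at step t
        b -= cost                                # spending reduces the remaining budget
    return total_pur_amt
```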
See Algorithm 1 (DQN Learning) in the paper.
Figure 3: Massive-agent Framework
- Deploying an independent agent for each ad (each learning an optimal policy over its own state) -> global performance degrades due to competition among the agents
- Core idea: private competitive objective + public cooperative objective
- Cooperative reward: global reward of all the agents
- Competitive reward: the agent's own reward (a simple reward-mixing sketch follows below)
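A sketch of combining the two reward terms; the convex mixing weight `lam` is an assumption for illustration, as the summary only states that each agent gets a private competitive reward plus a public cooperative reward shared by all agents.

```python
from typing import Sequence

def mixed_reward(own_reward: float, all_rewards: Sequence[float], lam: float = 0.5) -> float:
    """Blend the agent's competitive reward with the cooperative (global) reward of all agents."""
    competitive = own_reward          # private objective: the agent's own reward
    cooperative = sum(all_rewards)    # public objective: global reward of all the agents
    return lam * competitive + (1.0 - lam) * cooperative
```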
| Hyper-parameter | Setting |
|---|---|
| Target network update period | 10000 steps |
| Memory size | 1 million |
| Learning rate | 0.0001 (RMSProp) |
| Batch size | 300 |
| Network structure | 4-layer DNN |
| DNN layer sizes | [15, 300, 200, 100] |
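A sketch of a DQN setup matching the table above (PyTorch is assumed; the paper does not specify a framework). The layer sizes [15, 300, 200, 100] are read here as input 15 -> hidden 300 -> 200 -> 100, with an assumed output head over a discretized set of alpha values.

```python
import copy
import torch.nn as nn
import torch.optim as optim

N_ACTIONS = 7              # assumed size of the discretized alpha action space
BATCH_SIZE = 300           # batch size from the table
MEMORY_SIZE = 1_000_000    # replay memory size of 1 million (transitions assumed)
TARGET_UPDATE = 10_000     # sync target network every 10000 steps

# 4-layer DNN with sizes [15, 300, 200, 100] plus the action-value head.
q_net = nn.Sequential(
    nn.Linear(15, 300), nn.ReLU(),
    nn.Linear(300, 200), nn.ReLU(),
    nn.Linear(200, 100), nn.ReLU(),
    nn.Linear(100, N_ACTIONS),
)
target_net = copy.deepcopy(q_net)                       # updated every TARGET_UPDATE steps
optimizer = optim.RMSprop(q_net.parameters(), lr=1e-4)  # learning rate 0.0001 with RMSProp
```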
- Compared models
- Baseline: Keyword-level bidding (KB)
- RL with auction-level MDP (AMDP)
- RL with robust MDP (RMDP)
- Massive-agent RL with robust MDP (M-RMDP)
- Evaluation metric
- PUR_AMT / COST under the same COST constraint
- (i) How does the DRL algorithm work for search auction data?
- (ii) Does RMDP outperform AMDP under a changing environment?
- (iii) Is the multi-agent problem handled well by M-RMDP?
- Metrics
- Key metric: PUR_AMT / COST
- Additional metrics: CVR, ROI, PPC
- Single-agent analysis
- Multi-agent analysis
- Convergence analysis
- Convergence is reached only after about 150M sample batches -> the DQN algorithm is data-expensive
- Finding the optimal policy requires a huge amount of data and episodes