Deep RL for SS RTB

Deep Reinforcement Learning for Sponsored Search Real-time Bidding

Paper: https://arxiv.org/abs/1803.00259
See also: Paper Collection of Real-Time Bidding

Overview

  • A reinforcement learning solution to the dynamic (constantly changing) auction environment
  1. How to handle the changing-environment problem?
    • Designs a robust MDP model over hour-level auction data
    • Aggregates impression data appropriately (e.g. by hour) and observes periodic patterns in the state transitions
  2. Proposes a control-by-model framework for SS-RTB
    • Rather than predicting bid prices directly, it builds a bidding model for the impressions of each hour and bids in real time through that model.
    • That is, it learns an optimal policy over the parameters that control the real-time bidding model.

    <-> control-by-action, where the policy would emit a bid price (action) per auction

  3. Massive-agent learning algorithm
    • Reward: a competitive (private) reward combined with a cooperative (global) reward

Problem Definition

  • Goal
    • Maximize the purchase amount PUR_AMT_d earned as income on day d while minimizing the cost COST_d
    • Constraint: PUR_AMT_d must not fall below the advertiser's expectation g
  • Formulation (written out in display form below):
    • max PUR_AMT_d / COST_d s.t. PUR_AMT_d >= g
    • equivalently, max PUR_AMT_d s.t. COST_d = c
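
In display form (a transcription of the two bullets above; the second program is the first with the cost pinned at a fixed budget c, so only the numerator remains to be maximized):

```latex
\max \frac{\mathrm{PUR\_AMT}_d}{\mathrm{COST}_d}
\quad \text{s.t.} \quad \mathrm{PUR\_AMT}_d \ge g
\qquad\Longleftrightarrow\qquad
\max\ \mathrm{PUR\_AMT}_d
\quad \text{s.t.} \quad \mathrm{COST}_d = c
```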

Dataset

  • 1000 large ads from Alibaba's search auction platform
    • Two days of data (one day's collection for training, the other day's for testing)
    • About 100 million (100M) auctions per day

Methodology

Robust MDP model

  • According to Figure 2 (Patterns of Different Level Clicks), the hour-level curves show the most similar patterns across the two days, which motivates the hour-level aggregation.
  • State s: <b, t, auct>
    • b: the budget left for the ad
      • not the budget preset by the advertiser
      • the cost the ad expects to expend in the remaining auction steps
    • t: the step number in the sequence
    • auct: feature vector
      • aggregated statistical features of auctions in time period t
      • click_cnt, imp_cnt, cost, ctr, cvr, ppc, etc.
  • Action a
    • the decision of generating the real-time bidding price for each auction
    • rather than generating bid prices directly,
    • control-by-model: a linear approximator
    • Set alpha (see the sketch after this list)
      • opt_bidprice = alpha * PCVR

      PCVR: predicted conversion rate
      Previous studies have shown that the optimal bid price has a linear relationship with impression-level evaluations (e.g. CTR).

  • Reward r(s, a)
    • The income (in terms of PUR_AMT) gained according to a specific action a under state s
  • Episode ep
    • one day
    • 24 steps per day (one per hour); each episode is expected to share the same general pattern of experience
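
A minimal sketch of the control-by-model action, as referenced in the Action item above. The linear form opt_bidprice = alpha * PCVR is from the paper; the function names, the concrete alpha value, and the PCVR numbers are hypothetical illustration:

```python
# Control-by-model: the agent's hourly action sets/adjusts alpha; the linear
# model alpha * PCVR then prices every auction of that hour in real time.
# (The alpha value and PCVR numbers below are illustrative assumptions.)

def bid_price(alpha: float, pcvr: float) -> float:
    # Linear bidding model from the paper: opt_bidprice = alpha * PCVR.
    return alpha * pcvr

def run_hour(alpha: float, pcvrs: list[float]) -> list[float]:
    # One agent decision per hour-level step t, not one per auction
    # (contrast: control-by-action would emit a bid price per auction).
    return [bid_price(alpha, p) for p in pcvrs]

# Example: alpha set to 120.0 for this hour; three impressions arrive with
# predicted conversion rates 0.01, 0.03, 0.002.
print(run_hour(120.0, [0.01, 0.03, 0.002]))  # ≈ [1.2, 3.6, 0.24]
```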

Algorithm

See Algorithm 1 (DQN Learning) in the paper.
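
For reference, a condensed generic DQN sketch (experience replay plus a periodic target-network sync), written here in PyTorch. This is not the authors' code: the layer sizes, learning rate, optimizer, memory size, batch size, and sync period follow the hyper-parameter table below, while GAMMA, N_ACTIONS, and the contents of the replay memory are placeholder assumptions:

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA = 15, 7, 0.99  # N_ACTIONS and GAMMA: placeholders

def make_q_net() -> nn.Module:
    # 4-layer DNN with sizes [15, 300, 200, 100] (hyper-parameter table below)
    # and an output head over the discrete actions.
    return nn.Sequential(
        nn.Linear(STATE_DIM, 300), nn.ReLU(),
        nn.Linear(300, 200), nn.ReLU(),
        nn.Linear(200, 100), nn.ReLU(),
        nn.Linear(100, N_ACTIONS),
    )

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-4)  # per the table
memory: deque = deque(maxlen=1_000_000)  # replay memory size per the table

def train_step(step: int, batch_size: int = 300) -> None:
    if len(memory) < batch_size:
        return
    # Sample a minibatch of (s, a, r, s') tensors from replay memory.
    s, a, r, s2 = map(torch.stack, zip(*random.sample(memory, batch_size)))
    with torch.no_grad():
        # Bellman target: r + gamma * max_a' Q_target(s', a')
        # (terminal-state masking omitted for brevity).
        y = r + GAMMA * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10_000 == 0:  # target network update period per the table
        target_net.load_state_dict(q_net.state_dict())
```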

Massive-agent Model

Figure 3: Massive-agent Framework

  • Deploying an independent agent per ad, each learning an optimal policy over its own state -> global performance drops because the agents compete with each other
  • Core idea: a private competitive objective plus a public cooperative objective (see the sketch below)
    1. Cooperative reward: the global reward of all the agents
    2. Competitive reward: the agent's own reward
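
A toy sketch of how the two reward terms could be blended per agent. The split into cooperative (global) and competitive (own) terms is from the paper; the simple weighted sum and the weight w are assumptions made for illustration:

```python
# Toy blend of the massive-agent reward terms. The weighted sum and the
# weight `w` are illustrative assumptions; the paper's exact combination
# rule may differ.

def mixed_reward(step_rewards: list[float], i: int, w: float = 0.5) -> float:
    competitive = step_rewards[i]    # agent i's own (private) reward
    cooperative = sum(step_rewards)  # global reward shared by all agents
    return w * competitive + (1.0 - w) * cooperative

# Example: three ads' step rewards. Agent 0's training signal mixes its own
# income with the fleet-wide income, so a bidding war that raises its private
# reward while lowering everyone else's is penalized.
print(mixed_reward([1.0, 0.4, 0.2], i=0))  # ≈ 0.5*1.0 + 0.5*1.6 = 1.3
```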

Hyper-parameter settings

| Hyper-parameter | Setting |
| --- | --- |
| Target network update period | 10,000 steps |
| Memory size | 1 million |
| Learning rate | 0.0001 (RMSProp) |
| Batch size | 300 |
| Network structure | 4-layer DNN |
| DNN layer sizes | [15, 300, 200, 100] |

Evaluation

  • Models compared
    • Baseline: Keyword-level bidding (KB)
    • RL with auction-level MDP (AMDP)
    • RL with robust MDP (RMDP)
    • Massive-agent RL with robust MDP (M-RMDP)
  • Evaluation metric
    • PUR_AMT / COST under the same COST constraint

Offline

(i) How does the DRL algorithm work for search auction data?
(ii) Does RMDP outperform AMDP under changing environment?
(iii) Is the multi-agent problem well handled by M-RMDP?

  1. Single-agent analysis
  2. Multi-agent analysis

Online

  • Metrics
    • Key metric: PUR_AMT / COST
    • Additional metrics: CVR, ROI, PPC
  1. Single-agent analysis
  2. Multi-agent analysis
  3. Convergence analysis
    • Converges only after roughly 150M sample batches have been processed -> the DQN algorithm is data-expensive
    • Finding the optimal policy requires a massive amount of data and episodes