Naive Intelligence

Naive Intelligence (NI) is based on Bayesian inference and Monte Carlo simulation. The key part that distinguishes NI from EI is the Monte Carlo simulation.

If the math doesn't render properly, you might consider using the MathJax Plugin for GitHub.


General Idea:

NI observes the bets of the other players and infers the distribution of their dice. Then, instead of making decisions based on the $z$-score, it runs simulations based on the perceived distribution. That is, starting from the current step, we generate rollout results for each player (except ourselves, since our own dice are known) and simulate each player's choices based on their dice. As always, we have to make assumptions about the principles by which the other players respond. The simulation results then give us a more accurate estimate of the payoff of the different options, and the AI chooses the option that maximizes the expected payoff, as sketched below.
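
To make the idea concrete, here is a minimal sketch of such a rollout loop, written against assumed interfaces: `sample_dice` draws one opponent hand from the inferred distribution, and `simulate_round` plays out the rest of the round with the sampled hands and returns our payoff. Both are placeholders for the repository's actual inference and game logic.

```python
from typing import Callable, List

def estimate_payoff(option,
                    my_dice: List[int],
                    opponents: List[str],
                    sample_dice: Callable[[str], List[int]],
                    simulate_round: Callable[..., float],
                    n_sims: int = 1000) -> float:
    """Monte Carlo estimate of the expected payoff of taking `option` now."""
    total = 0.0
    for _ in range(n_sims):
        # Our own dice are known; each opponent's hand is sampled from the
        # distribution inferred from their betting history.
        hands = {"me": my_dice}
        hands.update({p: sample_dice(p) for p in opponents})
        # Play out the rest of the round under the assumed opponent policies.
        total += simulate_round(option, hands)
    return total / n_sims

def choose_option(options, **kwargs):
    """Pick the option whose simulated expected payoff is largest."""
    return max(options, key=lambda o: estimate_payoff(o, **kwargs))
```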

Payoff & reward function:

First of all, let's talk about the payoff and the "payoff". The ideal payoff (or reward) function would be the probability of winning from a given configuration. However, this is not feasible, since we don't know that probability.

A very good proxy is the number of dice one has: this is the greedy algorithm. If in each round we lose the fewest dice, we are most likely to end up the winner. This is essentially the reward function used in ZI and EI. I should point out that the actual reward function used is slightly different: when raising the bet, we raise to the bet that causes the minimal imminent loss in the following turn rather than over the entire round. This is the greedy algorithm at another level.

As one can imagine, the greedy algorithm, though often a simple and good one, isn't necessarily the best. To overcome this drawback, we can use simulations. Fortunately, we have knowledge of the other players' distributions from Bayesian inference, which is the foundation of the simulation.

As for the reward function, we consider the following model:

Suppose the numbers of dice owned by the players are $n_1,n_2,\cdots,n_k$. Then we can use the reward function $$\omega_i(n_1,n_2,\cdots,n_k)=\frac{\phi(n_i)}{\sum_{j=1}^k\phi(n_j)},$$ where $\phi$ is the relative power function. For instance, if $\phi(x)=x$, this gives essentially the same reward function as the greedy algorithm: it does not care which opponent loses a die. In contrast, if we choose $\phi(x)=x^2$, the reward favors targeting the stronger opponents: it strongly prefers that the players with more dice lose them. This behavior makes perfect sense: given the opportunity, we'd rather weaken our strongest opponent than a weaker one, as the game favors the players with more dice.
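
Written out in code, the reward is just this ratio. The function and argument names below are illustrative, and $\phi$ is passed in so the two choices above can be compared directly:

```python
from typing import Callable, Sequence

def reward(dice_counts: Sequence[int], i: int,
           phi: Callable[[int], float] = lambda x: x * x) -> float:
    """omega_i = phi(n_i) / sum_j phi(n_j), the relative-power reward.

    phi(x) = x reproduces the greedy reward (indifferent to which opponent
    loses a die); the default phi(x) = x**2 rewards weakening the players
    who currently hold the most dice.
    """
    return phi(dice_counts[i]) / sum(phi(n) for n in dice_counts)

# Example: player 1's reward share with counts (5, 3, 2), before and after
# the leader (player 0) loses one die, under phi(x) = x**2.
print(reward((5, 3, 2), 1), reward((4, 3, 2), 1))
```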

Monte Carlo Simulation:

We run simulations based on the inferred distribution, so we need to know the error of the resulting estimate. Suppose we run the simulation $N$ times and want the relative error of an estimated probability $p$ to be at most $\beta\%$ at confidence level $\alpha$. Then we need $$a\cdot\frac{\sqrt{Np(1-p)}}{N}\leq \frac{p\beta}{100},$$ where $a$ satisfies $\operatorname{erf}(a/\sqrt{2})=\alpha$, i.e., $a$ is the two-sided standard normal quantile for confidence level $\alpha$. Equivalently, $$N\geq \frac{10000\cdot a^2(1-p)}{\beta^2 p}.$$
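
The bound is easy to evaluate numerically. The sketch below uses SciPy's `erfinv`; the value of $p$ has to be a rough prior guess for the probability being estimated, and the function name is illustrative:

```python
import math
from scipy.special import erfinv

def required_simulations(alpha: float, beta: float, p: float) -> int:
    """Smallest N giving relative error <= beta% at confidence level alpha
    for an event of (roughly) probability p, via the bound above."""
    a = math.sqrt(2.0) * erfinv(alpha)   # two-sided normal quantile: erf(a / sqrt(2)) = alpha
    return math.ceil(10_000 * a * a * (1.0 - p) / (beta * beta * p))

# e.g. 95% confidence, 5% relative error, p around 0.3
print(required_simulations(0.95, 5.0, 0.3))
```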

Discounting:

Since the simulations are based on inferred data, the deeper a single simulation goes, the less confident we are about its result, so we use a discounting factor to account for this loss of confidence. In practice, compared with NI without discounting, we found that NI with discounting behaves much more like a human player and is in general slightly better.
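
One simple way to realize such a discount (a sketch, not necessarily the scheme used in the repository) is to shrink a payoff observed at rollout depth $d$ toward a neutral baseline by a factor $\gamma^d$:

```python
def discounted(payoff: float, depth: int,
               gamma: float = 0.95, neutral: float = 0.0) -> float:
    """Pull a simulated payoff toward `neutral` as the rollout gets deeper,
    reflecting lower confidence in outcomes built on inferred data.
    Both `gamma` and `neutral` are illustrative constants."""
    return neutral + (gamma ** depth) * (payoff - neutral)
```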