Cross Entropy - aragorn/home GitHub Wiki

Entropy

  • Physics or Statistical Thermodynamics
    • https://en.wikipedia.org/wiki/Entropy
      In statistical thermodynamics, entropy (usual symbol S) is a measure of the number of microscopic configurations Ω that a thermodynamic system can have when in a state as specified by certain macroscopic variables.
      entropy: $S = k_B \ln \Omega$
      change in entropy: $\Delta S = \int \frac{\delta Q_{\mathrm{rev}}}{T}$
  • Information Theory
    • https://en.wikipedia.org/wiki/Entropy_(information_theory)
      In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. 'Messages' can be modeled by any flow of information.
      • Named after Boltzmann's Η-theorem, Shannon defined the entropy Η (Greek capital letter eta) of a discrete random variable X with possible values {x1, ..., xn} and probability mass function P(X) as:
        entropy of X: $\mathrm{H}(X) = \mathrm{E}[\mathrm{I}(X)] = \mathrm{E}[-\ln(P(X))]$
        Here E is the expected value operator, and I is the information content of X. I(X) is itself a random variable.
      • entropy of X: $\mathrm{H}(X) = \sum_{i=1}^{n} P(x_i)\,\mathrm{I}(x_i) = -\sum_{i=1}^{n} P(x_i)\,\log_b P(x_i)$
        where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the unit of entropy is shannon for b = 2, nat for b = e, and hartley for b = 10.[6] When b = 2, the units of entropy are also commonly referred to as bits. (A short code sketch computing this quantity in bits follows this list.)
  • Explanation of entropy using colored flashing lights
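
To make the discrete entropy formula above concrete, here is a minimal Python sketch that computes H(X) in bits (b = 2) from a probability mass function. The function name shannon_entropy and the example distributions are illustrative assumptions, not part of the source.

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy H(X) = -sum_i P(x_i) * log_b P(x_i).

    `probs` is a discrete probability mass function given as a list of
    probabilities summing to 1; zero-probability outcomes contribute 0.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries 1 bit of entropy per toss; a biased coin carries less,
# because its outcomes are more predictable.
print(shannon_entropy([0.5, 0.5]))  # 1.0
print(shannon_entropy([0.9, 0.1]))  # ~0.469
```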

Entropy and Information Gain

http://stackoverflow.com/questions/1859554/what-is-entropy-and-information-gain

Kullback–Leibler divergence

https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

In probability theory and information theory, the Kullback–Leibler divergence,[1][2] also called information divergence, information gain, relative entropy, KLIC, or KL divergence, is a measure (but not a metric) of the non-symmetric difference between two probability distributions P and Q. The Kullback–Leibler divergence was originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions; Kullback himself preferred the name discrimination information.[3] It is discussed in Kullback's historic text, Information Theory and Statistics.[2]

Expressed in the language of Bayesian inference, the Kullback–Leibler divergence from Q to P, denoted D_KL(P‖Q), is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P.[4] In applications, P typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q typically represents a theory, model, description, or approximation of P.
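As a concrete sketch of this definition, the Python snippet below computes D_KL(P‖Q) = Σ_x p(x) log_b(p(x)/q(x)) for two discrete distributions over the same outcomes. The function name kl_divergence and the sample distributions are illustrative assumptions, not part of the source.

```python
import math

def kl_divergence(p, q, base=2):
    """Kullback-Leibler divergence D_KL(P||Q) = sum_x p(x) * log_b(p(x) / q(x)).

    Assumes p and q are discrete distributions over the same outcomes and that
    q(x) > 0 wherever p(x) > 0 (otherwise the divergence is infinite).
    """
    return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]  # "true" distribution
q = [0.9, 0.1]  # model / approximation of p
print(kl_divergence(p, q))                          # > 0: information lost using q for p
print(kl_divergence(p, p))                          # 0.0: no divergence from itself
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: non-symmetric, hence not a metric
```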

Cross Entropy

Wikipedia

https://en.wikipedia.org/wiki/Cross_entropy

In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution p.

๋™์ผํ•œ ์‚ฌ๊ฑด ์ง‘ํ•ฉ์— ๋Œ€ํ•œ ๋‘ ํ™•๋ฅ  ๋ถ„ํฌ p, q๊ฐ€ ์žˆ์„ ๋•Œ, "์‹ค์ œ" ๋ถ„ํฌ์ธ p ๊ฐ€ ์•„๋‹ˆ๋ผ "๊ฐ€์ƒ" ํ™•๋ฅ ๋ถ„ํฌ q ์— ์ตœ์ ํ™”๋œ ๋ถ€ํ˜ธํ™” ๋ฐฉ์‹์„ ์ ์šฉํ•˜์˜€์„ ๋•Œ, ์‚ฌ๊ฑด ์ง‘ํ•ฉ์˜ ์‚ฌ๊ฑด์„ ์‹๋ณ„ํ•˜๋Š”๋ฐ ํ•„์š”ํ•œ ํ‰๊ท  ๋น„ํŠธ์ˆ˜๋ฅผ ์ธก์ •ํ•œ๋‹ค.

cross entropy for discrete p and q: $H(p, q) = -\sum_{x} p(x)\,\log q(x)$
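
The sketch below computes the discrete cross entropy above and checks the standard identity H(p, q) = H(p) + D_KL(p‖q), which ties this section to the Kullback–Leibler divergence section. The helper names and the example distributions are illustrative assumptions, not part of the source.

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_x p(x) * log_b p(x)."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

def kl_divergence(p, q, base=2):
    """D_KL(p||q) = sum_x p(x) * log_b(p(x) / q(x))."""
    return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

def cross_entropy(p, q, base=2):
    """Cross entropy H(p, q) = -sum_x p(x) * log_b q(x) for discrete p and q."""
    return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]  # "true" distribution of events
q = [0.9, 0.1]  # "unnatural" distribution the coding scheme is optimized for
# Average bits needed to identify an event drawn from p when the code is built for q.
print(cross_entropy(p, q))   # ~1.737 bits, more than the 1.0 bit needed under p itself
print(entropy(p))            # 1.0
print(math.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))  # True
```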

Explanations

Bernoulli process

Wikipedia

https://en.wikipedia.org/wiki/Bernoulli_process

A Bernoulli process is a finite or infinite sequence of independent random variables X1, X2, X3, ..., such that for each i, the value of Xi is either 0 or 1, and for all values of i, the probability p that Xi = 1 is the same.

Independence of the trials implies that the process is memoryless. Given that the probability p is known, past outcomes provide no information about future outcomes. (If p is unknown, however, the past informs about the future indirectly, through inferences about p.)
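
As a small illustration of this definition, the sketch below simulates independent Bernoulli trials and estimates p from the observed outcomes; with p known the past says nothing about the future, while with p unknown the empirical frequency is how past outcomes inform inferences about p. The function name and parameter values are illustrative assumptions, not part of the source.

```python
import random

def bernoulli_process(p, n, seed=0):
    """Simulate n independent Bernoulli trials, each 1 with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

trials = bernoulli_process(p=0.3, n=100_000)
# Empirical frequency of 1s: an estimate of the unknown p, close to 0.3.
print(sum(trials) / len(trials))
```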