# Models

## The Units Mix Model

See [the write-up](https://www.overleaf.com/10674832mcjzvbpkwtdj#/40006644/) for details.

At run time (`forward()`):
 - extract the features into a sparse vector (`Featurizer`)
 - for each output, look up the weight of every nonzero feature, multiply it by
   the value of the feature, and sum (sparse dot product, i.e. a linear layer)
 - softmax the output vector
 - sample an action (`UnitsMixModel`)
 - TODO random projection of the features (random kitchen sinks)
 - TODO polynomial kernel (order 2 or 3) of the features
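
The implemented steps above (sparse linear layer, softmax, sampling) can be sketched in plain Python as follows. This is only an illustration, not the actual code: the function names, the `dict`-based sparse vector, and the `weights[feature_index][output]` layout are all assumptions made for the example.

```python
import math
import random


def sparse_linear(features, weights, n_outputs):
    """Sparse dot product: features maps index -> value,
    weights maps index -> list of n_outputs floats (assumed layout)."""
    logits = [0.0] * n_outputs
    for idx, value in features.items():
        row = weights.get(idx)
        if row is None:  # unseen feature: contributes nothing
            continue
        for j in range(n_outputs):
            logits[j] += row[j] * value
    return logits


def softmax(logits):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Sample an output index according to the softmax probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def forward_sketch(features, weights, n_outputs, rng=random):
    """Hypothetical stand-in for forward(): returns (action, probs)."""
    probs = softmax(sparse_linear(features, weights, n_outputs))
    return sample_action(probs, rng), probs
```

Returning the probabilities alongside the sampled action keeps everything needed later for training (output probas and chosen output) in one place.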

At train time (`backward()`):
 - TODO Zero-order optim: instead of sampling the actions, take the max, but
   sample `u` on the unit sphere (or from N(0,1)) before each episode, use
   `w + delta * u` as the weights, and do `w += lr * R * u` updates.
 - TODO REINFORCE: for a batch tuple (weights, [[inputs]], [[outputs]],
   [rewards]), compute the policy gradient loss and take a gradient step. Here
   we will only have one -1/1 reward per game and don't want to discount, so
   R=r. For step `i` with action `j`: `l(w) = - R * log(softmax(w*x)[j])`, and
   the gradient is `- R * (dsoftmax(w*x)[j]/dw) / softmax(w*x)[j]`. For n
   games, use `L(w) = avg_n[avg_i[l(w)]]`.
 - TODO advantage instead of reward (first with running average baseline)!!!
 - TODO Actor-Critic or PPO.
 - TODO Off-policy version with importance sampling weighting: multiply the
   gradients (for each state `x` and action `j`) by
   `c = softmax(w*x)[j]/output[j]` (probability according to the current
   policy divided by probability according to the behavior policy).
 - TODO Off-policy version with Retrace(lambda): `lambda * min(1, c)`.
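
For the REINFORCE item, the per-step loss `l(w) = - R * log(softmax(w*x)[j])` has a simple closed-form gradient for a linear softmax policy: with respect to the logits it is `R * (p - onehot(j))`, and each weight row gets that vector scaled by its feature value. The sketch below is one possible way to write it, not the planned implementation. The function names and the `weights[feature_index][output]` layout are assumptions, and the optional `behavior_prob` argument applies the importance weight `c` from the off-policy item.

```python
import math


def policy_probs(features, weights, n_outputs):
    """softmax(w*x) for a sparse x (dict index -> value)."""
    logits = [0.0] * n_outputs
    for idx, value in features.items():
        row = weights.get(idx)
        if row is not None:
            for k in range(n_outputs):
                logits[k] += row[k] * value
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss(features, weights, n_outputs, action, reward):
    """l(w) = -R * log(softmax(w*x)[action])."""
    probs = policy_probs(features, weights, n_outputs)
    return -reward * math.log(probs[action])


def reinforce_grad(features, weights, n_outputs, action, reward,
                   behavior_prob=None):
    """Gradient of l(w) w.r.t. the weight rows of the nonzero features.

    If behavior_prob (probability the behavior policy gave to `action`)
    is provided, the gradient is multiplied by c = pi(action) / behavior_prob.
    """
    probs = policy_probs(features, weights, n_outputs)
    c = 1.0 if behavior_prob is None else probs[action] / behavior_prob
    # dl/dlogit_k = R * (p_k - 1[k == action])
    dlogits = [c * reward * (probs[k] - (1.0 if k == action else 0.0))
               for k in range(n_outputs)]
    return {idx: [d * value for d in dlogits]
            for idx, value in features.items()}


def sgd_step(weights, grad, lr, n_outputs):
    """In-place gradient descent step on the touched rows only."""
    for idx, g in grad.items():
        row = weights.setdefault(idx, [0.0] * n_outputs)
        for k in range(n_outputs):
            row[k] -= lr * g[k]
```

Because only rows of nonzero features receive a gradient, the update stays as sparse as the forward pass. Averaging these gradients over steps and games gives the gradient of `L(w)`.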

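The zero-order item in the list above needs no gradients at all: draw a direction `u` once per episode, play with `w + delta * u`, then move `w` along `u` in proportion to the reward. A minimal sketch follows, assuming a flat list of weights. The helper names are hypothetical, and how the episode is played to obtain `R` is left out.

```python
import math
import random


def sample_direction(dim, rng, on_sphere=True):
    """Draw u from N(0, I); optionally normalize it onto the unit sphere."""
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if on_sphere:
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
    return u


def perturbed_weights(w, u, delta):
    """Weights used to play one episode: w + delta * u."""
    return [wi + delta * ui for wi, ui in zip(w, u)]


def zero_order_update(w, u, reward, lr):
    """After the episode: w += lr * R * u (returns a new list)."""
    return [wi + lr * reward * ui for wi, ui in zip(w, u)]
```

With a -1/1 reward, a win moves `w` toward the direction that was tried and a loss moves it away by the same amount.
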
We need to parallelize the playing of games for wall-clock efficiency. We can
do the updates locally and efficiently only if we don't have to send gradients
to other peers. One solution is a weight server (e.g. for starters, ZMQ wrapped
around my current hack) that updates the weights (sync SGD) and receives
inputs/outputs/rewards from clients. Another (my hack here) is to play a game
and save to disk:
 - weights
 - win/loss
 - for each inference:
   - inputs features
   - output probas
   - chosen output

We then train synchronously over the outputs of the games that have finished,
cancel the outstanding jobs, and relaunch jobs with the new weights.
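
One possible on-disk layout for the per-game record listed above is a single JSON file per game. This is a sketch under assumptions, not the format the hack actually uses: the field names, the JSON encoding, and the write-then-rename step are all illustrative choices.

```python
import json
import os


def save_episode(path, weights, won, steps):
    """Write one game to disk.

    weights: dict feature index -> list of per-output weights
    won:     True for a win, False for a loss
    steps:   list of dicts with "features" (index -> value),
             "probs" (output probas) and "action" (chosen output)
    """
    record = {
        # JSON object keys must be strings, so indices are stringified.
        "weights": {str(idx): row for idx, row in weights.items()},
        "won": bool(won),
        "steps": [
            {
                "features": {str(idx): v
                             for idx, v in step["features"].items()},
                "probs": step["probs"],
                "action": step["action"],
            }
            for step in steps
        ],
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(record, f)
    # Rename last so a trainer scanning the directory never sees a partial file.
    os.replace(tmp_path, path)


def load_episode(path):
    """Read a game back, restoring integer feature indices."""
    with open(path) as f:
        record = json.load(f)
    record["weights"] = {int(k): v for k, v in record["weights"].items()}
    for step in record["steps"]:
        step["features"] = {int(k): v for k, v in step["features"].items()}
    return record
```

Storing the weights with each game lets the trainer tell which policy produced the data, which is what the off-policy corrections need.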
