artificial intelligence - Neural Network and Temporal Difference Learning


I have read a few papers and lectures on temporal difference learning (some of which pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions:

- Where does the prediction value V_t come from, and, subsequently, how do we get V_(t+1)?

- What is getting back-propagated when TD is used with a neural net? That is, where does the error that gets propagated come from when using TD?

The backward and forward views can be confusing, but when you are dealing with something like a simple game-playing program, things are actually pretty simple in practice. I'm not looking at the reference you're using, so let me just provide a general overview.

Suppose I have a function approximator like a neural network, and that it has two functions, train and test, for training on a particular output and testing (predicting) the outcome of a state, or the outcome of taking an action in a given state.
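As a concrete illustration, here is a minimal Python sketch of what such an approximator's interface could look like. The table-based Approximator class, its learning rate, and the 0.5 default prediction are illustrative assumptions standing in for a real neural network:

# Minimal stand-in for the function approximator described above.
# A lookup table plays the role of the neural network; this is a
# sketch of the train/test interface, not a real learner. (The class
# name, alpha, and the 0.5 default are assumptions, not from the post.)
class Approximator:
    def __init__(self, learning_rate=0.1):
        self.values = {}            # state -> predicted outcome
        self.alpha = learning_rate

    def test(self, state):
        # Predict the outcome of a state (0.5 when we know nothing yet).
        return self.values.get(state, 0.5)

    def train(self, state, target):
        # Nudge the prediction for this state toward the target output.
        v = self.test(state)
        self.values[state] = v + self.alpha * (target - v)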

Suppose I have a trace of play from playing a game, where I used the test method to tell me what move to make at each point, and suppose that I lose at the end of the game (V = 0). Suppose the states are s_1, s_2, s_3, ..., s_n.

The Monte Carlo approach says that we train the function approximator (e.g. our neural network) on each of the states in the trace using the final score of the trace. So, given this trace, you would call something like:

train(s_n, 0)
train(s_n-1, 0)
...
train(s_1, 0)

That is, I'm asking every state to predict the final outcome of the trace.
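In code, that Monte Carlo pass over a finished trace might look like the following sketch (train_monte_carlo is an illustrative name, and it assumes the hypothetical Approximator above):

def train_monte_carlo(approx, trace, final_outcome=0.0):
    # Every state in the trace is trained directly on the final
    # outcome of the game (0 for the loss in our example).
    for state in reversed(trace):    # s_n, s_n-1, ..., s_1
        approx.train(state, final_outcome)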

The dynamic programming approach says that we train based on the result of the next state. Our training would look something like:

train(s_n, 0)
train(s_n-1, test(s_n))
...
train(s_1, test(s_2))

That is, I'm asking the function approximator to predict what the next state predicts, where the last state predicts the final outcome of the trace.
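Here is a sketch of that bootstrapping pass, again against the hypothetical Approximator (only the last state sees the actual outcome; every other state is trained on its successor's prediction):

def train_dynamic_programming(approx, trace, final_outcome=0.0):
    # The last state is trained on the real outcome: train(s_n, 0).
    approx.train(trace[-1], final_outcome)
    # Every earlier state is trained on what its successor predicts:
    # train(s_i, test(s_i+1)), working backward from s_n-1 to s_1.
    for i in range(len(trace) - 2, -1, -1):
        approx.train(trace[i], approx.test(trace[i + 1]))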

TD learning mixes between these two, where λ = 1 corresponds to the first case (Monte Carlo) and λ = 0 corresponds to the second case (dynamic programming). Suppose that we use λ = 0.5. Then our training would be:

train(s_n, 0)
train(s_n-1, 0.5*0 + 0.5*test(s_n))
train(s_n-2, 0.25*0 + 0.25*test(s_n) + 0.5*test(s_n-1))
...

Now, what I've written here isn't quite correct, because you don't re-test the approximator at each step. Instead, you start with a prediction value (V = 0 in our example) and then update it for training the next step with the next predicted value: V = λ·V + (1-λ)·test(s_i).
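Putting that running update into code gives a compact sweep like the sketch below. Note that this only illustrates the blending scheme just described, not full TD(λ) with eligibility traces, and it still leans on the toy Approximator from earlier:

def train_td_lambda(approx, trace, final_outcome=0.0, lam=0.5):
    v = final_outcome                # start from the final prediction value
    for state in reversed(trace):    # s_n, s_n-1, ..., s_1
        approx.train(state, v)
        # The running update from above: v = λ·v + (1-λ)·test(s_i).
        v = lam * v + (1 - lam) * approx.test(state)

With lam=1 the target stays pinned at the final outcome (the Monte Carlo case), and with lam=0 each state is trained on its successor's prediction (the dynamic programming case), matching the two extremes above.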

This learns much faster than the Monte Carlo and dynamic programming methods, because you aren't asking the algorithm to learn such extreme values (ignoring the current prediction entirely, or ignoring the final outcome entirely).

