artificial intelligence - Neural Network and Temporal Difference Learning


I have read a few papers and lectures on temporal difference learning (some of which pertain to neural nets, such as Sutton's tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.

- Where does the prediction value V_t come from? And, subsequently, how do we get V_(t+1)?

- What is getting back-propagated when TD is used with a neural net? That is, where does the error that gets propagated come from when using TD?

The backward and forward views can be confusing, but when dealing with a simple game-playing program, things are pretty simple in practice. I'm not looking at the reference you're using, so let me provide a general overview.

Suppose I have a function approximator like a neural network, and that it has two functions, train and predict, for training on a particular output and predicting the outcome of a state (or the outcome of taking an action in a given state).
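For concreteness, here is a minimal sketch of what such an approximator could look like in Python. The class name ValueNet, the single tanh hidden layer, the sigmoid output, and the learning rate are all illustrative choices of mine, not anything from the question. It also shows the answer to the second question above: what gets back-propagated is the error between the network's prediction and whatever target train() is given.

    import numpy as np

    class ValueNet:
        """A tiny one-hidden-layer value network with train() and predict()."""
        def __init__(self, state_dim, hidden=32, lr=0.01):
            rng = np.random.default_rng(0)
            self.w1 = rng.normal(0.0, 0.1, (state_dim, hidden))
            self.w2 = rng.normal(0.0, 0.1, (hidden, 1))
            self.lr = lr

        def predict(self, state):
            # Forward pass: tanh hidden layer, sigmoid output in (0, 1).
            h = np.tanh(state @ self.w1)
            v = 1.0 / (1.0 + np.exp(-(h @ self.w2)))
            return float(v[0])

        def train(self, state, target):
            # One gradient step on the squared error (prediction - target)^2.
            h = np.tanh(state @ self.w1)                 # hidden activations
            v = 1.0 / (1.0 + np.exp(-(h @ self.w2)))     # current prediction
            delta = (v - target) * v * (1.0 - v)         # output-layer error
            grad_h = (self.w2 @ delta) * (1.0 - h ** 2)  # back-prop to hidden
            self.w2 -= self.lr * np.outer(h, delta)
            self.w1 -= self.lr * np.outer(state, grad_h)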

Suppose I have a trace of play from playing a game, where I used the predict method to tell me which move to make at each point, and suppose that I lose at the end of the game (V = 0). Suppose the states are s_1, s_2, s_3, ..., s_n.

The Monte Carlo approach says that we train the function approximator (e.g. our neural network) on each of the states in the trace using the trace's final score. So, given this trace, you would call something like:

    train(s_n, 0)
    train(s_n-1, 0)
    ...
    train(s_1, 0)

That is, I'm asking every state to predict the final outcome of the trace. A sketch of that loop is below.
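Assuming a net with the hypothetical train/predict interface sketched above, and trace as a list of state vectors, the Monte Carlo update is just:

    def train_monte_carlo(net, trace, final_outcome):
        # Every state in the trace is trained toward the game's final
        # outcome (iterating s_n..s_1 to match the listing above; for
        # Monte Carlo the order is immaterial in principle).
        for state in reversed(trace):
            net.train(state, final_outcome)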

The dynamic programming approach says that we instead train based on the result of the next state. So my training would be something like:

    train(s_n, 0)
    train(s_n-1, predict(s_n))
    ...
    train(s_1, predict(s_2))

That is, I'm asking the function approximator to predict what the next state predicts, where the last state predicts the final outcome of the trace. A sketch follows.
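With the same hypothetical net and trace as before, that bootstrapped update looks like this; note that only the last state is trained on the true outcome:

    def train_bootstrap(net, trace, final_outcome):
        # The last state is trained toward the actual final outcome...
        net.train(trace[-1], final_outcome)
        # ...and every earlier state toward the current prediction for
        # its successor, walking backward through the trace.
        for i in range(len(trace) - 2, -1, -1):
            net.train(trace[i], net.predict(trace[i + 1]))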

TD learning mixes between these two, where λ=1 corresponds to the first case (Monte Carlo) and λ=0 corresponds to the second case (dynamic programming). Suppose that we use λ=0.5. Then our training would be:

    train(s_n, 0)
    train(s_n-1, 0.5*0 + 0.5*predict(s_n))
    train(s_n-2, 0.25*0 + 0.25*predict(s_n) + 0.5*predict(s_n-1))
    ...

Now, what I've written here isn't quite correct, because you don't actually re-run the approximator over all the later states at each step. Instead you start with the prediction value (V = 0 in our example) and, moving backward through the trace, fold each newly predicted value into a running target: V = λ·V + (1-λ)·predict(s_i), which is then the value used to train s_(i-1).
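Putting that together, a sketch of the full backward pass (hypothetical names as in the earlier sketches), where the running target V carries the mixed value back through the trace:

    def train_td_lambda(net, trace, final_outcome, lam=0.5):
        v = final_outcome                  # V for the last state (0 for a loss)
        net.train(trace[-1], v)
        for i in range(len(trace) - 2, -1, -1):
            # Blend the value carried back from later states with the
            # prediction for the successor, then train on that target.
            v = lam * v + (1.0 - lam) * net.predict(trace[i + 1])
            net.train(trace[i], v)

Expanding this loop for λ=0.5 reproduces exactly the 0.5/0.25 weights in the listing above, but with only one predict call per step.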

This learns much faster than the Monte Carlo and dynamic programming methods because you aren't asking the algorithm to learn such extreme values (i.e., ignoring either the current predictions or the final outcome).

