Re: [Bug-gnubg] TD(lambda) training for neural networks -- a question


From: boomslang
Subject: Re: [Bug-gnubg] TD(lambda) training for neural networks -- a question
Date: Thu, 21 May 2009 11:30:00 +0000 (GMT)

 Hi Øystein / others,


 Thanks for your quick answer.

 I didn't know gnubg used just TD(0); that does make things
 easier for me. The Sutton/Barto you're referring to... is
 that the book "Reinforcement Learning: An Introduction"?

 
 I do have a question about this supervised training, though.
 Could you give an indication of how many games it takes to
 get a good kick start with TD(0), and of how big the database
 of positions/rollouts should be for the supervised training?
 
 
 Thanks again, I appreciate your help.
 
 --boomslang
 
 
 
> --- On Thu, 21/5/09, Øystein Johansen <address@hidden> wrote:
> 
> > From: Øystein Johansen <address@hidden>
> > Subject: Re: [Bug-gnubg] TD(lambda) training for neural networks -- a question
> > To: "boomslang" <address@hidden>
> > Cc: address@hidden
> > Date: Thursday, 21 May, 2009, 10:18 AM
> > boomslang wrote:
> > > Hi all,
> > > 
> > > I have a question regarding TD(lambda) training by Tesauro (see
> > > http://www.research.ibm.com/massive/tdl.html#h2:learning_methodology).
> > > 
> > > The formula for adapting the weights of the neural net is
> > > 
> > >   w(t+1) - w(t) = a * [Y(t+1) - Y(t)] * sum(lambda^(t-k) * nabla(w)Y(k); k=1..t)
> > > 
> > > I would like to know if nabla(w)Y(k) in the formula above is the
> > > gradient of Y(k) with respect to the weights of the net at time t
> > > (i.e. the current net) or to the weights of the net at time k.
> > > I assume the former.
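
(For concreteness: the sum in the update above is usually carried
incrementally as an eligibility trace, e(t) = lambda * e(t-1) +
nabla(w)Y(t), with each gradient taken at the current weights, i.e.
the "former" reading. A minimal Python sketch, using a toy linear
evaluator Y(t) = w . x(t) in place of the neural net; illustrative
only, not gnubg code:

import numpy as np

def td_lambda_episode(w, features, alpha=0.1, lam=0.7):
    # features: encoded positions x(1)..x(T) from one game;
    # for the linear stand-in, nabla(w)Y(t) is just x(t).
    e = np.zeros_like(w)                  # eligibility trace
    for t in range(len(features) - 1):
        x_t, x_next = features[t], features[t + 1]
        delta = w @ x_next - w @ x_t      # Y(t+1) - Y(t)
        e = lam * e + x_t                 # discounted sum of past gradients
        w = w + alpha * delta * e         # the quoted update rule
    return w

)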
> > 
> > That really doesn't matter much, I believe. I guess,
> as you
> > that it is
> > the former. You can check this with Sutton/Barto I
> guess.
> > 
> > However: This equation was never implemented in gnubg!
> All
> > TD training
> > that was done in gnubg, (and that's a long time ago
> and
> > abandoned at an
> > early stage), was done with lambda = 0. Notice how
> lambda =
> > 0 simplifies
> > the equation. There will only be one term -- when t =
> k.
> > This simplifies
> >  the implementation to only take into account the
> previous
> > position when
> > updating the weights. Can be simply solved with
> backprop.
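
(With lambda = 0 the trace above collapses to the single k = t term,
so each move is one backprop-style step toward the next position's
evaluation. A sketch, under the same toy linear-evaluator assumption:

import numpy as np

def td0_step(w, x_t, x_next, alpha=0.1):
    delta = w @ x_next - w @ x_t      # TD error Y(t+1) - Y(t)
    return w + alpha * delta * x_t    # only the k = t gradient term

# e.g. one update on dummy feature vectors
w = td0_step(np.zeros(3), np.array([1., 0., 1.]), np.array([0., 1., 1.]))

)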
> > 
> > Our experience is: TD is nice for kickstarting the training
> > process, but supervised training is the real thing. Make a big
> > database of positions and the rollout results for those
> > positions, and train supervised.
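
(The supervised stage is then ordinary regression on the rollout
database. A sketch of that setup; the array layout, the linear
model, and the delta-rule fit are assumptions for illustration, not
gnubg's actual training code:

import numpy as np

def train_supervised(X, y, epochs=50, alpha=0.01, seed=0):
    # X: (n_positions, n_features) encoded positions
    # y: (n_positions,) rollout results, e.g. cubeless equities
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # stochastic passes
            err = y[i] - X[i] @ w           # rollout target minus prediction
            w += alpha * err * X[i]         # LMS / delta rule step
    return w

)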
> > 
> > If you still would like to do TD training with your system, I
> > really recommend looking at Sutton/Barto.
> > 
> > Good luck!
> > -Øystein