Create a training set for the ML method from sJ:
- Sample trajectories from the current policy.
- Build the training set from learning tuples (s, r), where r is the cumulative reward of the episode (paid only at the end) and s is any state visited on that policy episode (rollout).
- Learn new parameters of the ML method (sJ) from this training set.
- Return to the first step and repeat.
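
The loop above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the five-state chain environment, the per-state averaging used as a stand-in for the ML method, and the epsilon-greedy policy built from the fitted values are all hypothetical choices that do not come from the original description.

```python
import random

def rollout(policy, step, start_state, max_len=50):
    """Run one episode; return the visited states and the reward paid at the end."""
    states, s = [start_state], start_state
    for _ in range(max_len):
        s, done, reward = step(s, policy(s))
        states.append(s)
        if done:
            return states, reward
    return states, 0.0  # episode cut off before any reward was paid

def build_training_set(policy, step, start_state, n_episodes):
    """Pair every state on each rollout with that rollout's cumulative reward."""
    data = []
    for _ in range(n_episodes):
        states, r = rollout(policy, step, start_state)
        data.extend((s, r) for s in states)
    return data

def fit_values(data):
    """Stand-in for the ML method (assumption): average r per state."""
    totals = {}
    for s, r in data:
        n, total = totals.get(s, (0, 0.0))
        totals[s] = (n + 1, total + r)
    return {s: total / n for s, (n, total) in totals.items()}

def chain_step(s, a):
    """Hypothetical toy environment: states 0..4, reward 1.0 only on reaching 4."""
    nxt = s + a
    if nxt <= 0:
        return nxt, True, 0.0
    if nxt >= 4:
        return nxt, True, 1.0
    return nxt, False, 0.0

def make_policy(values, epsilon, rng):
    """Epsilon-greedy policy over the fitted values (assumption)."""
    def policy(s):
        if rng.random() < epsilon:
            return rng.choice((1, -1))
        return max((1, -1), key=lambda a: values.get(s + a, 0.5))
    return policy

rng = random.Random(0)
values = {}
policy = make_policy(values, epsilon=1.0, rng=rng)  # start fully random
for iteration in range(3):
    data = build_training_set(policy, chain_step, start_state=2, n_episodes=200)
    values = fit_values(data)                           # learn new parameters
    policy = make_policy(values, epsilon=0.2, rng=rng)  # then repeat from sampling
```

Every state on a rollout receives the same target r, because the reward is paid only at the end of the episode. Any regressor could replace `fit_values`; the averaging table is used here only to keep the sketch dependency-free.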