__sK__  _RL: learning of v-values_

Build a training set for the ML method from sJ by repeating these steps:
1. Sample trajectories from the current policy.
2. Create a training set and learn new parameters of the ML method (sJ).
   The learning tuples are `(s, r)`, where `r` is the cumulative reward (paid only at the end of the episode) and `s` is any state visited during the policy episode (rollout).
3. Go back to step 1.
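
The loop above can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not in the note: the environment is a hypothetical 7-state chain with a reward of 1 at the right end and 0 at the left end, the policy is epsilon-greedy over the current v-values, and the "ML method" is replaced by a plain per-state average of the observed returns. All names (`sample_trajectory`, `build_training_set`, `fit_values`, `train`) are made up for the example.

```python
import random

N_STATES = 7            # hypothetical chain: states 0..6, both ends are terminal
START = N_STATES // 2   # every episode starts in the middle


def terminal_reward(state):
    """Reward is paid only at the end: 1 at the right end, 0 at the left end."""
    return 1.0 if state == N_STATES - 1 else 0.0


def sample_trajectory(values, rng, epsilon=0.2):
    """Step 1: roll out one episode with an epsilon-greedy policy over v-values."""
    state, visited = START, []
    while 0 < state < N_STATES - 1:
        visited.append(state)
        left, right = state - 1, state + 1
        if rng.random() < epsilon or values[left] == values[right]:
            state = rng.choice((left, right))   # explore, or break a tie
        else:
            state = left if values[left] > values[right] else right
    return visited, terminal_reward(state)


def build_training_set(values, rng, n_episodes=200):
    """Step 2a: one tuple (s, r) per visited state, r = reward paid at the end."""
    data = []
    for _ in range(n_episodes):
        visited, r = sample_trajectory(values, rng)
        data.extend((s, r) for s in visited)
    return data


def fit_values(values, data):
    """Step 2b: 'learn new parameters'; here just the mean return per state."""
    sums, counts = {}, {}
    for s, r in data:
        sums[s] = sums.get(s, 0.0) + r
        counts[s] = counts.get(s, 0) + 1
    new_values = list(values)
    for s in sums:
        new_values[s] = sums[s] / counts[s]
    return new_values


def train(n_iterations=5, seed=0):
    """Step 3: alternate sampling and fitting for a fixed number of rounds."""
    rng = random.Random(seed)
    values = [0.0] * N_STATES
    values[-1] = 1.0    # terminal states keep their known reward as value
    for _ in range(n_iterations):
        data = build_training_set(values, rng)
        values = fit_values(values, data)
    return values


if __name__ == "__main__":
    print([round(v, 2) for v in train()])
```

In a real setup the per-state average would be swapped for whatever regressor the ML method uses, trained on the same `(s, r)` tuples.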

- [ ] [abc](link)
- [ ] [abc](link)
- [ ] [abc](link)

__step: 5__