Decide which part is the chooser and which part is the world it acts in. Locate the policy as the agent’s action-selecting rule, and keep it separate from rewards, observations, and environment dynamics.
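A minimal sketch of that separation, assuming a made-up coin-guessing world. The names `CoinFlipWorld`, `respond`, and `always_heads_policy` are hypothetical and only serve the illustration. The policy is a plain function from observation to action. The dynamics and the reward rule live entirely inside the environment class.

```python
import random


class CoinFlipWorld:
    """Environment side (hypothetical): owns the dynamics and the reward rule."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def respond(self, action):
        # Dynamics: the world decides the outcome, not the agent.
        outcome = self._rng.choice(["heads", "tails"])
        # Reward rule: also part of the world, not part of the policy.
        reward = 1.0 if action == outcome else 0.0
        return outcome, reward


def always_heads_policy(observation):
    """Agent side: the policy only maps an observation to an action."""
    return "heads"


world = CoinFlipWorld(seed=0)
action = always_heads_policy(observation=None)
outcome, reward = world.respond(action)
print(action, outcome, reward)
```

Swapping in a different policy function changes what the agent chooses without touching how the world responds, which is the boundary this lesson asks you to locate.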
Inspect an environment interface to identify what the agent receives and what it is allowed to send back. Distinguish observation from hidden environment state, action from outcome, and valid action choices from impossible moves.
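The distinctions above can be sketched with a small hypothetical interface. `HiddenGoalEnv`, `observe`, and `act` are illustrative names, not part of any library. The agent sees only its position, the goal cell stays hidden inside the environment, and an action outside the allowed set is rejected rather than silently accepted.

```python
class HiddenGoalEnv:
    """Hypothetical 5-cell corridor used only to illustrate the interface."""

    VALID_ACTIONS = (0, 1)  # 0 = move left, 1 = move right

    def __init__(self, goal=3):
        self._goal = goal      # hidden state: never sent to the agent
        self._position = 0     # also internal, but exposed as the observation

    def observe(self):
        # What the agent receives: the position only, not the goal.
        return self._position

    def act(self, action):
        # What the agent may send back: one of VALID_ACTIONS.
        if action not in self.VALID_ACTIONS:
            raise ValueError(f"invalid action: {action!r}")
        move = 1 if action == 1 else -1
        # The outcome can differ from the intent: walls clamp the move.
        self._position = max(0, min(4, self._position + move))
        reward = 1.0 if self._position == self._goal else 0.0
        return self._position, reward


env = HiddenGoalEnv(goal=2)
print(env.observe())   # first observation
print(env.act(0))      # chose "left" at the wall, so the position does not change
print(env.act(1))      # chose "right", position advances by one
```

Note that choosing "left" at cell 0 is a valid action whose outcome is to stay put, while passing something like `5` is an impossible move and raises an error.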
Apply the previous explanations in a guided problem.
Use reset to get the first observation, then parse each step’s reply as next observation, reward, terminated flag, truncated flag, and an info dictionary, following the Gymnasium convention (older OpenAI Gym releases returned a single done flag in place of the two flags). Treat reward as immediate feedback, and treat termination or truncation as the signal to stop acting and reset.
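A minimal sketch of that loop, using a hand-written toy environment so it runs without installing anything. `CountdownEnv` and `run_episode` are hypothetical names. The class only imitates the Gymnasium-style return shapes (`reset` giving an observation and an info dictionary, `step` giving five values) and is not a real Gymnasium environment.

```python
class CountdownEnv:
    """Toy environment imitating Gymnasium-style return shapes (assumption)."""

    def __init__(self, start=3, max_steps=10):
        self._start = start
        self._max_steps = max_steps
        self._remaining = start
        self._steps = 0

    def reset(self, seed=None):
        self._remaining = self._start
        self._steps = 0
        return self._remaining, {}  # (first observation, info)

    def step(self, action):
        self._steps += 1
        if action == 1:  # action 1 counts down, any other action waits
            self._remaining -= 1
        terminated = self._remaining == 0  # the task itself ended
        truncated = (not terminated) and self._steps >= self._max_steps  # step limit hit
        reward = 1.0 if terminated else 0.0  # immediate feedback for this step
        return self._remaining, reward, terminated, truncated, {}


def run_episode(env, policy):
    observation, info = env.reset()
    rewards = []
    while True:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        rewards.append(reward)
        if terminated or truncated:
            break  # stop acting; the caller resets before the next episode
    return rewards, terminated, truncated


print(run_episode(CountdownEnv(start=3), lambda obs: 1))
print(run_episode(CountdownEnv(start=3, max_steps=4), lambda obs: 0))
```

The first call ends by termination because the countdown reaches zero. The second ends by truncation because the policy never counts down and the step limit cuts the episode off.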
Work through a tiny environment row by row with columns for current observation, chosen action, reward, next observation, and done signal. Practice updating the loop correctly without adding returns, neural networks, or training code.
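One way to produce such a table programmatically, as a sketch under assumptions: a made-up 3-cell corridor where reaching cell 2 ends the episode. `tiny_step` and `trace` are hypothetical helpers. Each row records the five columns named above, and the next row always starts from the previous row's next observation.

```python
def tiny_step(position, action):
    """Hypothetical 3-cell corridor (cells 0, 1, 2); cell 2 ends the episode."""
    move = 1 if action == "right" else -1
    next_position = max(0, min(2, position + move))  # walls clamp the move
    done = next_position == 2
    reward = 1.0 if done else 0.0
    return next_position, reward, done


def trace(actions, start=0):
    """Build rows of (observation, action, reward, next observation, done)."""
    rows = []
    observation = start
    for action in actions:
        next_observation, reward, done = tiny_step(observation, action)
        rows.append((observation, action, reward, next_observation, done))
        observation = next_observation  # the next row begins where this one ended
        if done:
            break  # no more rows after the done signal
    return rows


for row in trace(["left", "right", "right", "right"]):
    print(row)
```

The fourth action in the list is never used, because the third row carries the done signal and the loop stops there. No returns are accumulated and nothing is trained; the sketch only tracks the bookkeeping of the loop.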
Check your understanding with a short quiz.
Review this chapter with practice based on your mistakes.