
Example of Infinite Horizon MDP

The Markov Decision Process (MDP) is a fundamental framework in decision theory and artificial intelligence, used to model decision-making problems in situations where outcomes are partially random and partially under the control of the decision-maker. One of the key aspects of MDPs is the horizon, which refers to the number of time steps over which the decision-maker plans. An infinite horizon MDP is a type of MDP where the decision-maker plans over an infinite number of time steps, meaning that the process continues indefinitely.

Infinite Horizon MDP: Definition and Formulation

An infinite horizon MDP is defined by a tuple (S, A, P, R, γ), where S is the set of states, A is the set of actions, P is the transition probability function, R is the reward function, and γ is the discount factor. The discount factor γ is a value in [0, 1); it determines how much weight future rewards receive, and keeping it strictly below 1 ensures that the infinite sum of discounted rewards remains finite. The goal of the decision-maker is to find a policy π, a mapping from states to actions, that maximizes the expected discounted cumulative reward over the infinite horizon.
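
To make the tuple concrete, here is a minimal sketch of how its five components might be represented in Python. The two-state, two-action MDP below is purely hypothetical and chosen for brevity.

```python
# Minimal sketch of an MDP tuple (S, A, P, R, gamma) in Python.
# The two-state, two-action MDP below is hypothetical, for illustration only.
S = ["s0", "s1"]            # state space
A = ["stay", "move"]        # action space
gamma = 0.9                 # discount factor, strictly less than 1

# P[(s, a)] maps each successor state s' to its probability P(s' | s, a)
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the immediate reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): -1.0,
    ("s1", "stay"): 1.0, ("s1", "move"): -1.0,
}
```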

Mathematical Formulation

The value function V(s) represents the expected cumulative reward starting from state s and following policy π. The value function can be written as:

V(s) = E[∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, π]

where s_t is the state at time t, a_t is the action taken at time t, and R(s_t, a_t) is the reward received at time t. The action-value function Q(s, a) represents the expected cumulative reward starting from state s, taking action a, and following policy π. The action-value function can be written as:

Q(s, a) = E[∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a, π]
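
The two functions are closely related: when π is followed, V(s) = Q(s, π(s)), and for the optimal functions V^*(s) = max_a Q^*(s, a); this relationship is what the Bellman equations below rest on.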

The optimal value function V^*(s) and the optimal action-value function Q^*(s, a) satisfy the following Bellman optimality equations:

V^*(s) = max_a [R(s, a) + γ ∑_{s'} P(s'|s, a) V^*(s')]

Q^*(s, a) = R(s, a) + γ ∑_{s'} P(s'|s, a) max_{a'} Q^*(s', a')
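
As a rough sketch, the Bellman optimality backup for V^*(s) can be turned directly into an iterative update; iterating it to convergence is exactly the value iteration algorithm discussed in the next section. The code below assumes the dictionary-based S, A, P, R, gamma layout from the earlier sketch and is illustrative rather than optimized.

```python
def value_iteration(S, A, P, R, gamma, tol=1e-6):
    """Iterate the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # V*(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) * V*(s') ]
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

On the two-state sketch above, value_iteration(S, A, P, R, gamma) would return a dictionary of approximately optimal state values.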

Policy Iteration and Value Iteration

Several algorithms can be used to solve infinite horizon MDPs, the two classical ones being policy iteration and value iteration. Policy iteration alternates between computing the value function of the current policy and updating the policy to be greedy with respect to that value function. Value iteration works on the value function directly: for each state it computes the expected one-step return of every action and sets the state's value to the maximum, repeating until the values converge.

Algorithm | Description
Policy Iteration | Alternate between policy evaluation (compute the value function of the current policy) and policy improvement (make the policy greedy with respect to that value function).
Value Iteration | Repeatedly apply the Bellman optimality backup to the value function; once the values converge, extract the greedy policy.
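
A corresponding sketch of policy iteration, again assuming the same hypothetical dictionary layout for S, A, P, R and gamma; the inner loop performs iterative policy evaluation rather than solving the linear system exactly.

```python
def policy_iteration(S, A, P, R, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy policy improvement."""
    pi = {s: A[0] for s in S}          # arbitrary initial policy
    V = {s: 0.0 for s in S}

    def expected_return(s, a):
        # R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s')
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    while True:
        # Policy evaluation: fix pi and iterate V until it stabilizes
        while True:
            delta = 0.0
            for s in S:
                v = expected_return(s, pi[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: make pi greedy with respect to the current V
        stable = True
        for s in S:
            best_a = max(A, key=lambda a: expected_return(s, a))
            if best_a != pi[s]:
                pi[s] = best_a
                stable = False
        if stable:
            return pi, V
```
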
💡 Infinite horizon MDPs are widely used in many applications, including robotics, finance, and healthcare, where the decision-maker needs to plan over a long-term horizon.

Example of Infinite Horizon MDP: Grid World

A classic example of an infinite horizon MDP is the grid world problem. In this problem, the decision-maker is an agent that moves in a grid world, and the goal is to reach a target location while avoiding obstacles. The state space S consists of the grid cells, the action space A consists of the four possible movements (up, down, left, right), and the reward function R(s, a) is -1 for each step and 10 for reaching the target location. The discount factor γ is 0.9.
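
Below is a minimal sketch of this grid world in Python. It assumes a hypothetical 4x4 grid with the target in the bottom-right corner, deterministic moves that stay in place at the walls, and no obstacles (obstacles could be added by excluding cells from the state space).

```python
# Hypothetical 4x4 grid world: -1 per step, +10 for the step that reaches
# the target, deterministic moves that bounce off the walls. No obstacles here.
ROWS, COLS = 4, 4
TARGET = (3, 3)
gamma = 0.9

S = [(r, c) for r in range(ROWS) for c in range(COLS)]
A = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Deterministic transition: move one cell, staying put at the walls."""
    r, c = s
    dr, dc = A[a]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s

def reward(s, a):
    """-1 per step, +10 when the move reaches the target cell."""
    return 10.0 if step(s, a) == TARGET else -1.0
```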

Grid World Problem Formulation

The grid world problem can be formulated as an infinite horizon MDP, where the goal is to find a policy π that maximizes the expected cumulative reward over an infinite horizon. The value function V(s) represents the expected cumulative reward starting from state s and following policy π, and the action-value function Q(s, a) represents the expected cumulative reward starting from state s, taking action a, and following policy π.

The optimal policy π^* can be computed using policy iteration or value iteration, and the optimal value function V^*(s) and the optimal action-value function Q^*(s, a) can be computed using the Bellman equations.
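
Continuing the grid-world sketch above (S, A, TARGET, gamma, step and reward are assumed from it), value iteration and the greedy policy extraction might look roughly like this, with the target treated as an absorbing state of value zero:

```python
def value_iteration_grid(tol=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            if s == TARGET:            # absorbing target: its value stays 0
                continue
            best = max(reward(s, a) + gamma * V[step(s, a)] for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V_star = value_iteration_grid()
pi_star = {
    s: max(A, key=lambda a: reward(s, a) + gamma * V_star[step(s, a)])
    for s in S if s != TARGET
}
print(pi_star[(0, 0)])   # a move toward the target, e.g. 'down' or 'right'
```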

What is the difference between policy iteration and value iteration?


Policy iteration alternates two steps: policy evaluation, which computes the value function of the current policy, and policy improvement, which makes the policy greedy with respect to that value function; it usually converges in a small number of iterations, but each iteration requires a full policy evaluation. Value iteration never represents the policy explicitly during the iterations; it repeatedly applies the Bellman optimality backup to the value function and extracts the greedy policy only once the values have converged.

What is the discount factor γ in an infinite horizon MDP?


The discount factor γ is a value between 0 and 1 that determines the importance of future rewards. A higher value of γ means that future rewards are more important, while a lower value of γ means that future rewards are less important.
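
Because ∑_{t=0}^∞ γ^t = 1/(1 − γ), a constant per-step reward r can contribute at most r/(1 − γ) to the return, so with γ = 0.9 rewards are effectively weighted over a horizon of roughly 1/(1 − 0.9) = 10 steps.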

Conclusion and Future Work

Infinite horizon MDPs are a powerful framework for modeling decision-making problems in situations where outcomes are partially random and partially under the control of the decision-maker. The grid world problem is a classic example of an infinite horizon MDP, and the optimal policy can be computed using policy iteration or value iteration. Future work includes applying infinite horizon MDPs to more complex problems, such as multi-agent systems and partially observable environments.

Infinite horizon MDPs have many applications in real-world problems, including:

  • Robotics: Infinite horizon MDPs can be used to model the decision-making problem of a robot that needs to navigate in a complex environment.
  • Finance: Infinite horizon MDPs can be used to model the decision-making problem of an investor who needs to allocate assets over a long-term horizon.
  • Healthcare: Infinite horizon MDPs can be used to model the decision-making problem of a healthcare provider who needs to allocate resources over a long-term horizon.
💡 Infinite horizon MDPs are a powerful tool for modeling decision-making problems in complex environments, and they have many applications in real-world problems.
