When you are confronted with a decision, there are a number of different alternatives (actions) you have to choose from. Choosing the best action requires thinking about more than just the immediate effects of your actions. The immediate effects are often easy to see, but the long term effects are not always as transparent. Sometimes actions with poor immediate effects, can have better long term ramifications. You would like to choose the action that makes the right tradeoffs between the immediate rewards and the future gains, to yield the best possible solution. People do this type of reasoning daily: some do it better than others.
What usually makes this particularly difficult is that there is a lot of uncertainty about the future. The outcomes of certain actions in the future are not entirely predicable, and one sometimes doesn't even know if the action will even matter in the future. A Markov decision process is a way to model problems so that we can automate this process of decision making in uncertain environments.
If you can model the problem as an MDP, then there are a number of algorithms that will allow you to automatically solve the decision problem. We will first talk about the components of the model that are required. We follow this by defining what it means to solve the problems and finally, we describe a simple algorithm for finding a solution to the problem given a model of this type.
The four components of an MDP model are: a set of states, a set of actions, the effects of the actions and the immediate value of the actions. We will assume that the set of state and actions is discrete, and allow assume that time passes in uniform, discrete intervals. We now discuss the model components in more detail.
When making a decision, you need to think about how your actions will affect things. The state is the way the world currently exists and an action will have the effect of changing the state of the world. If we think about the set of every possible way the world could be, then this is would be the set of state of the world. Each of these states would be a state in the MDP.
The actions are the set of possible alternative choices you can choose to make. The main problem of solving an MDP is to find the best action to take in each particular state of the world.
When we are deciding between different actions, we have some idea of how they will immediately affect the current state. The transitions specify how each of the actions change the state. Since an action could have different effects, depending upon the state, we need to specify the action's effect for each state in the MDP
The most powerful aspect of the MDP is that the effects of an action can be probabilistic. Imagine we are specifying the effects of doing action 'a1' in state 's1'. We could say that the effect of 'a1' is to leave the process in state 's2', if there was no question about how 'a1' changes the world. However, many decision processes have actions that are not this simple. Sometimes an action usually results in state 's2', but occasionally it might result in state 's3'. MDPs allow you to specify these more complex actions by allowing you to specify a set of resulting states and the probability that each state results.
If we want to automate the decision making process, then we must be able to have some measure of an action's cost or state's value so that we can compare different alternative action policies over the long term. We specify some immediate value for performing each action in each state.
The solution to an MDP
The solution to an MDP is called a policy and it simply specifies the best action to take for each of the states. For the purposes of this tutorial, we will only concern ourselves with the problem of finding the best policy assuming we will have a limited lifetime. For instance, assume that each day that we wake up we must make a decision about something. We want to make our decision assuming that we will only have to make decisions for a fixed number of days. This type of solution is called a finite horizon solution and the number of days we will be making decisions for will be referred to as the horizon length.
Although the policy is what we are after, we will actually compute a value function. Without going into the details, suffice it to say that we can easily derive the policy if we have the value function. Thus, we will focus on finding this value function. A value function is similar to a policy, except that instead of specifying an action for each state, it specifies a numerical value for each state.
Who's this Markov guy?
Who this guy is isn't important now, but here we try to explain what his name means as far as MDPs are concerned.
When we talked about the transitions of the model, we said that we simply needed to specify the resulting next state for each starting state and action. This assumes that the next state depends only upon the current state (and action). There are situations where the effects of an action might depend not only on the current state, but upon the last few states. The MDP model will not let you model these situations directly. The assumption made by the MDP model is that the next state is solely determined by the current state (and current action). This is called the Markov assumption.
We would say that the dynamics of the process are Markovian and this has important ramifications for solving the problems. The most important is that our policies or value functions can have a simple form. The result is that we can choose our action based solely upon the current state.