When you are confronted with a decision, there are a number of different alternatives (actions) you have to choose from. Choosing the best action requires thinking about more than just the immediate effects of your actions. The immediate effects are often easy to see, but the long term effects are not always as transparent. Sometimes actions with poor immediate effects, can have better long term ramifications. You would like to choose the action that makes the right tradeoffs between the immediate rewards and the future gains, to yield the best possible solution. People do this type of reasoning daily: some do it better than others.

What usually makes this particularly difficult is that there is a lot of uncertainty about the future. The outcomes of certain actions in the future are not entirely predicable, and one sometimes doesn't even know if the action will even matter in the future. A Markov decision process is a way to model problems so that we can automate this process of decision making in uncertain environments.

If you can model the problem as an MDP, then there are a number of algorithms that will allow you to automatically solve the decision problem. We will first talk about the components of the model that are required. We follow this by defining what it means to solve the problems and finally, we describe a simple algorithm for finding a solution to the problem given a model of this type.

The four components of an MDP model are: a set of states, a set of actions, the effects of the actions and the immediate value of the actions. We will assume that the set of state and actions is discrete, and allow assume that time passes in uniform, discrete intervals. We now discuss the model components in more detail.

### States

When making a decision, you need to think about how your actions will affect things. The state is the way the world currently exists and an action will have the effect of changing the state of the world. If we think about the set of every possible way the world could be, then this is would be the set of state of the world. Each of these states would be a state in the MDP.

### Actions

The actions are the set of possible alternative choices you can choose to make. The main problem of solving an MDP is to find the best action to take in each particular state of the world.

### Transitions

When we are deciding between different actions, we have some idea of how they will immediately affect the current state. The transitions specify how each of the actions change the state. Since an action could have different effects, depending upon the state, we need to specify the action's effect for each state in the MDP

The most powerful aspect of the MDP is that the effects of an
action can be probabilistic. Imagine we are specifying the effects of
doing action '`a1`' in state '`s1`'. We could say that
the effect of '`a1`' is to leave the process in state
'`s2`', if there was no question about how '`a1`'
changes the world. However, many decision processes have actions that
are not this simple. Sometimes an action usually results in state
'`s2`', but occasionally it might result in state
'`s3`'. MDPs allow you to specify these more complex
actions by allowing you to specify a set of resulting states and the
probability that each state results.

### Immediate Rewards

If we want to automate the decision making process, then we must be able to have some measure of an action's cost or state's value so that we can compare different alternative action policies over the long term. We specify some immediate value for performing each action in each state.

### The solution to an MDP

The solution to an MDP is called a *policy* and it
simply specifies the best action to take for each of the states. For
the purposes of this tutorial, we will only concern ourselves with the
problem of finding the best policy assuming we will have a limited
lifetime. For instance, assume that each day that we wake up we must
make a decision about something. We want to make our decision
assuming that we will only have to make decisions for a fixed number
of days. This type of solution is called a *finite horizon*
solution and the number of days we will be making decisions for will
be referred to as the *horizon length*.

Although the policy is what we are after, we will actually compute a
*value function*. Without going into the details, suffice it
to say that we can easily derive the policy if we have the value
function. Thus, we will focus on finding this value function. A
value function is similar to a policy, except that instead of
specifying an action for each state, it specifies a numerical value
for each state.

### Who's this Markov guy?

Who this guy is isn't important now, but here we try to explain what his name means as far as MDPs are concerned.

When we talked about the transitions of the model, we said that we
simply needed to specify the resulting next state for each starting
state and action. This assumes that the next state depends only upon
the current state (and action). There are situations where the
effects of an action might depend not only on the current state, but
upon the last few states. The MDP model will not let you
model these situations directly. The assumption made by the
MDP model is that the next state is solely determined by the
current state (and current action). This is called the
*Markov* assumption.

We would say that the dynamics of the process are Markovian and this has important ramifications for solving the problems. The most important is that our policies or value functions can have a simple form. The result is that we can choose our action based solely upon the current state.

## Continue

**POMDP Tutorial | Next**