We will now show an example of value iteration proceeding on a problem
with a horizon length of `3`. This example will provide some
useful insights, making the connection between the figures and the
concepts that are needed to explain the general problem. For this
problem, we assume the POMDP has two states, two actions and
three observations.

We start with the first horizon. The value function here will
represent the best we can hope to do (in terms of value) if we are
limited to taking a single action. This is the simplest case;
normally (for horizon length `h`) we need to trade off the
immediate rewards and the future rewards. However, when you have a
horizon of `1`, there is no future and the value function
becomes nothing but the immediate rewards.

Since we have two states and two actions, our POMDP model will include four separate immediate reward values: there is one value for each combination of action and state. These values are defined over the discrete state space of the POMDP, but it becomes easy to get the value of doing a particular action in a particular belief state. To do this we simply use the probabilities in the belief state to weight the value of each state.

As an example: let action `a1` have a value of `0` in
state `s1` and `1` in state `s2` and let action
`a2` have a value of `1.5` in state `s1` and
`0` in state `s2`. If our belief state is `[ 0.75
0.25 ]` then the value of doing action `a1` in this belief state is
`0.75 x 0 + 0.25 x 1 = 0.25`. Similarly, action `a2`
has value `0.75 x 1.5 + 0.25 x 0 = 1.125`. We can display
these values over belief space with the figure below. This is, in
fact, *the* horizon `1` value function. (Note that we
have not violated the "no formula" promise: what preceded were not
*formulas*, they were just *calculations*.)
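
Still no formulas, just calculations: here is the same arithmetic as a tiny Python sketch (the tutorial itself uses no code, so this and the later sketches are only illustrations; numpy and the variable names are our own choices).

```python
import numpy as np

b = np.array([0.75, 0.25])          # belief state: [P(s1), P(s2)]
R = {"a1": np.array([0.0, 1.0]),    # immediate reward of a1 in s1 and s2
     "a2": np.array([1.5, 0.0])}    # immediate reward of a2 in s1 and s2

print(b @ R["a1"])   # 0.75 x 0   + 0.25 x 1 = 0.25
print(b @ R["a2"])   # 0.75 x 1.5 + 0.25 x 0 = 1.125
```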

### Horizon 1 value function

The immediate rewards for each action actually specify a linear
function over belief space. Since we are interested in choosing the
best action, we would choose whichever action gave us the highest
value, which depends on the particular belief state. So we actually
have a PWLC value function for the horizon `1` value function
simply by considering the immediate rewards that come directly from
the model. In the figure above, we also show the partition of belief
space that this value function imposes. Here is where the colors will
start to have some meaning. The blue region is all the belief states
where action `a1` is the best strategy to use, and the green
region is the belief states where action `a2` is the best
strategy.
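
Taking the maximum of the two reward lines at every belief point gives exactly this horizon `1` value function and its partition. A minimal sketch (the `horizon1` helper is ours, not from any library):

```python
import numpy as np

# Rows are actions a1 and a2; columns are states s1 and s2.
R = np.array([[0.0, 1.0],
              [1.5, 0.0]])

def horizon1(b):
    """Horizon-1 value of belief b and the index of the best action:
    the upper surface of the two reward lines."""
    values = R @ b
    return float(values.max()), int(values.argmax())

print(horizon1(np.array([0.75, 0.25])))   # (1.125, 1): action a2 is best here
```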

With the horizon `1` value function we are now ready to
construct the horizon `2` value function. This part of the
tutorial is the most crucial for understanding POMDP
solution procedures. Once you understand how we will build the
horizon `2` value function, you should have the necessary
intuition behind POMDP value functions to understand the
various algorithms.

Our goal in building this new value function is to find the best
action (or highest value) we can achieve using only two actions (i.e.,
the horizon is `2`) for every belief state. To show how to
construct this new value function, we break the problem down into a
series of steps.

- We will first show how to compute the value of a single belief state for a given action and observation.
- Then we show how to compute the value for every belief state for a given action and observation, in a finite amount of time.
- Then we will show how to compute the value of a belief state given only an action.
- Finally, we will show how to compute the actual value for a belief state.

### Computing a Belief State Value from an Action and Observation

We start with the following problem: given a particular belief state
`b`, what is the value of doing action `a1` if, after the action, we
receive observation `z1`? In other words, we want to find the
best value possible for a single belief state when the immediate
action and observation are fixed.

The value of a belief state for horizon `2` is simply the value
of the immediate action plus the value of the next action. In
general, we would like to find the best possible value which would
include considering all possible sequences of two actions. However,
since in our restricted problem our immediate action is fixed, the
immediate value is fully determined. We can use the immediate rewards
for action `a1` to find the value of `b` just like we
did in constructing the horizon `1` value function. Recall
that the horizon `1` value function is nothing but the
immediate reward function.

The only question is what the best achievable value is for the belief
state that results from our initial belief state `b` when we
perform action `a1` and observe `z1`. This isn't really
much of a problem at all, since we know our initial belief state, the
action and the resulting observation. This is all that is required to
transform `b` into the unique resulting next belief state,
which we will call `b'`. This new belief state will be the
belief state we are in when we have one more action to perform; our
horizon length is `2`, but we have just used up one of the two
actions we are allowed. We know what the best values are for every belief state when
there is a single action left to perform; this is exactly what our
horizon `1` value function tells us.

The figure below shows this process. On the left is the immediate
reward function and on the right is the horizon `1` value
function. (Recall that for horizon length 1, the immediate rewards are
the same as the value function.) The immediate rewards for action
`a2` are shown with a dashed line, since they are not of
immediate interest when considering the fixed action `a1`.

### Value of a fixed action and observation

Here we will define `T` as the function that, given a belief state,
action and observation, produces the resulting belief state (the
formulas are hiding in here). Note that from looking at where
`b'` is, we can immediately determine what the best action we
should do after we do the action `a1`. The belief state
`b'` lies in the green region, which means that if we have a
horizon length of `2` and are forced to take action `a1`
first, then the best thing we could do afterwards is action
`a2`.

Recall that what we are concerned with at this point is finding the
value of the belief state `b` with the fixed action and
observation. We have everything we need to calculate this value; we
know what the immediate reward we will get is and we know the best
value for the transformed belief state `b'`. Simply summing
these two values gives us the value of belief state `b` given
that we take action `a1` and observe `z1`. As a side
effect, we also know the best next action to take.
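
Here is a sketch of that computation. The tutorial never specifies the transition or observation probabilities of the example POMDP, so the `T_a1` and `O_a1` numbers below are made up purely so the code runs; only the shape of the calculation matters.

```python
import numpy as np

# Hypothetical model numbers (not given in the tutorial):
T_a1 = np.array([[0.7, 0.3],            # T_a1[s, s'] = P(s' | s, a1)
                 [0.4, 0.6]])
O_a1 = np.array([[0.5, 0.3, 0.2],       # O_a1[s', z] = P(z | s', a1)
                 [0.1, 0.6, 0.3]])
R = np.array([[0.0, 1.0],               # immediate rewards for a1 and a2
              [1.5, 0.0]])

def transform(b, z):
    """T(b, a1, z): the belief state resulting from doing a1 in b and seeing z."""
    unnormalized = O_a1[:, z] * (b @ T_a1)   # P(z|s',a1) * sum_s P(s'|s,a1) b(s)
    return unnormalized / unnormalized.sum()

b = np.array([0.75, 0.25])
b_next = transform(b, 0)                     # observation z1 has index 0

# Value of b with the action fixed to a1 and the observation fixed to z1:
# immediate reward of a1 in b, plus the horizon-1 value of b'.
value = b @ R[0] + (R @ b_next).max()
```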

### Computing All Belief State Values for an Action and Observation

Suppose we want to find the value for another belief state, given the
same action and observation. We simply repeat this process, which
means all that we really need to do is transform the belief state and
use the horizon `1` value function to find what value it has
(the immediate rewards are easy to get). Now suppose we want to find
the value for all the belief points given this fixed action and
observation. This seems a little harder, since there are far too many
points to do this for one at a time. Fear not: this can actually be done
fairly easily.

To compute the horizon `2` value function for a given
initial belief state, action and observation, we would transform the
belief state to a new point in belief space and use the horizon
`1` value function to simply look up the value of this
transformed belief state. Suppose that, for the moment, we ignore the
immediate rewards. Imagine we
plotted this function: for every belief state, transform it (using a
particular action and observation) and then look up the horizon
`1` value of the new belief. This would give you another value
function over belief space, which would be the horizon `1`
value function, but slightly transformed from the original. The
transformation results from factoring in the belief update
calculation.

This imaginary algorithm cannot actually be implemented directly,
since there are uncountably many belief states we would
need to do this for. However, it turns out that we can directly
construct a function over the entire belief space from the horizon
`1` value function that has the belief transformation built in.
This gives us a function which directly tells us the value of each
belief state after the action `a1` is taken and observation
`z1` is seen without having to actually worry about
transforming the belief state. The figure below shows this
transformation.

### Transformed value function

We will use `S()` to represent the transformed value function,
for a particular action and observation. The very nice part of this
is that the transformed value function is also `PWLC`, and the even
nicer part is that this is always the case. (Sorry, the proof requires
formulas and we can't do those here.) Now if we want to find the
value of a belief state for the fixed action and observation, we can
just take the immediate reward for the belief state and add the value
we get directly from the transformed function `S(a1,z1)`. In
fact we could just add these two functions (immediate rewards and the
transformed horizon `1` value function) together to get a
single function for the value of all belief points, given action
`a1` and observation `z1`. (Note: this is a slight lie
and we will explain why a bit later.)
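
Under the hood, each line segment of `S(a1,z1)` can be built directly from a line segment of the horizon `1` value function. Below is a sketch of that construction using the same made-up `T_a1` and `O_a1` numbers as before; as the text explains a bit later, the probability of the observation ends up folded into these segments.

```python
import numpy as np

T_a1 = np.array([[0.7, 0.3], [0.4, 0.6]])            # hypothetical P(s'|s,a1)
O_a1 = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])  # hypothetical P(z|s',a1)

# The horizon-1 value function is just the two immediate-reward segments.
h1_segments = [np.array([0.0, 1.0]),   # segment contributed by a1
               np.array([1.5, 0.0])]   # segment contributed by a2

def S(z, segments):
    """Transformed segments for action a1 and observation z: one per
    horizon-1 segment.  P(z | s', a1) is built in, so b @ segment is
    already weighted by the probability of actually seeing z."""
    return [T_a1 @ (O_a1[:, z] * seg) for seg in segments]

S_a1_z1 = S(0, h1_segments)   # the transformed (still PWLC) function, as segments
```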

### Computing a Belief State Value for a Single Action

We previously decided to solve the simple problem of finding the value of a belief state, given a fixed action and observation. We have shown how to do this and, in fact, demonstrated how to find the value of all belief states given a fixed action and observation. Next we want to show how to compute the value of a belief state given only the action.

When we were looking at individual points, getting their immediate
reward value, transforming them and getting the resulting belief
state's value, we were computing the conditional value. This is the
value *if* we see observation `z1`. However, because
the observations are probabilistic, we are not guaranteed to see
`z1`. In this example, there are three possible observations
and each one can lead to a separate resulting belief state. The
figure below shows the full situation when we fix our first action to
be `a1`.

### Resulting belief states for each observation

Even though we know the action with certainty, the observation we get
is not known in advance. To get the true value of the belief point
`b` we need to account for all the possible observations we
could get. The assumption that we knew the resulting observation was
a convenience to explain how the process works, but to build a value
function for horizon `2` we need to be able to compute the
value of the belief states without prior knowledge of what the outcome
will be. All of this is really not that difficult though. For
a given belief state, each observation has a certain probability
associated with it. If we know the value of the resulting belief
state given the observation, to get the value of the belief state
without knowing the observation is just a matter of weighting each
resulting value by the probability that we will actually get that
observation.

This might still be a bit cloudy, so let us do an example. Suppose
that we compute the values of the resulting belief states for belief
state `b`, action `a1` and all three observations and
find that the values for each resulting belief state are: `z1:0.8,
z2:0.7, z3:1.2`. These are the values we were initially
calculating when we were doing things one belief point at a time. Now
we can also compute the probability of getting each of the three
observations for the given belief state and action and find them to
be: ` z1:0.6, z2:0.25, z3:0.15`. Then the horizon `2`
value of the belief state `b` when we fix the action at
`a1` is `0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835` plus the
immediate reward of doing action `a1` in `b`.
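
The same arithmetic in a couple of lines (the numbers are the ones quoted above, not computed from a model):

```python
# Conditional values of b after action a1, one per observation, and the
# probabilities of those observations.
values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}
probs  = {"z1": 0.6, "z2": 0.25, "z3": 0.15}

expected = sum(probs[z] * values[z] for z in values)
print(round(expected, 3))   # 0.835; add the immediate reward of a1 in b for the horizon-2 value
```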

In fact, the transformed value function `S(a1,z1)` we showed
before actually factors in the probabilities of the observation. So
in reality, the `S()` function is not quite what we claimed; we
claimed that it gave, for each belief state, the value of the next
belief state for the fixed action and *given* the observation. In reality,
the `S()` function already has the probability of the
observation built into it.

Let's look at the situation we currently have with the figure below.

### Transformed value function for all observations

This figure shows the transformation of the horizon `1` value
function for the action `a1` and all three observations.
Notice that the value function is transformed differently for all
three observations and that each of these transformed functions
partitions the belief space differently. How the value function is
transformed depends on the specific model parameters. What all this
implies is that the best next action to perform depends not only upon
the initial belief state, but also upon exactly which observation we
get.

So what is the horizon `2` value of a belief state, given a
particular action `a1`? Well, it depends not only on the value
of doing action `a1` but also upon what action we do next
(where the horizon length will be `1`). However, what we do
next will depend upon what observation we get. For a given belief
state and observation, we can look at the `S()` function
partition to decide what the best next action is. This
is best seen with a figure.

### Partitions for all observations

This figure is just the `S()` partitions from the previous
figure displayed adjacent to each other. The blue regions are the
belief states where action `a1` is the best next action, and
the green regions are where `a2` would be best. Now let's focus
on the problem of finding the best value of a belief state `b`
given that the first action is fixed to be `a1`. We will use
the point shown in the figure below.

### Belief point in transformed value function partitions

What we see from this figure is that if we start at the belief point
`b`, do action `a1`, then the next action to do would be
`a1` if we observe either `z2` or `z3` and
action `a2` if we observe `z1`. The figure above allows
us to easily see what the best strategies are after doing action
`a1`. If you recall that each of the partition regions actually
corresponds to a line in the `S()` function, we can easily get
the value of the belief point `b`. We take the immediate reward
we get from doing action `a1` and add the values of the functions
`S(a1,z1)`, `S(a1,z2)`, `S(a1,z3)` at belief point `b`.

### Computing the Final Belief State Value

In fact, if we fix the action to be `a1` and the future
strategy to be the same as it is at point `b` (namely:
`z1:a2, z2:a1, z3:a1`) we can find the value of every single
belief point for that particular strategy. To do this we simply sum
all of the appropriate line segments. We use the line segment for
the immediate rewards of the `a1` action and the line segments
from the `S()` functions for each observation's future
strategy. This gives us a single linear segment (since adding lines
gives you lines) over all belief space representing the value of
adopting the strategy of doing `a1` and the future strategy of
`(z1:a2, z2:a1, z3:a1)`. The notation for the future strategies
just indicates an action for each observation we can get.
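
That sum can be sketched directly, again with our made-up model numbers. Because each `S()` segment already carries its observation's probability, a plain sum over the three observations is all that is needed.

```python
import numpy as np

T_a1 = np.array([[0.7, 0.3], [0.4, 0.6]])            # hypothetical P(s'|s,a1)
O_a1 = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])  # hypothetical P(z|s',a1)
R = np.array([[0.0, 1.0], [1.5, 0.0]])               # immediate rewards for a1, a2
h1_segments = [R[0], R[1]]                           # horizon-1 segments

S_a1 = [[T_a1 @ (O_a1[:, z] * seg) for seg in h1_segments] for z in range(3)]

# Future strategy (z1:a2, z2:a1, z3:a1): use a2's segment after z1, a1's otherwise.
future = [1, 0, 0]
segment = R[0] + sum(S_a1[z][future[z]] for z in range(3))

# For any belief b, b @ segment is the horizon-2 value of doing a1 first
# and then following this particular future strategy.
```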

We derived this particular future strategy from the belief point
`b` and it is the best future strategy for that belief point.
However, just because we can compute the value of this future strategy
for each belief point, doesn't mean it is the best strategy for all
belief points. So for which belief points is this the best future
strategy? This is actually easy to see from the partition
diagram.

### Belief point in transformed value function partitions

The region indicated with the red arrows shows all the belief points
where the future strategy `(z1:a2, z2:a1, z3:a1)` is best
(given that action `a1` is taken first). Since there are three
observations and two actions, there are a total of `8` possible
different future strategies. However, some of those strategies are
not the best strategy for any belief points. Given the partitioning
figures above, all the useful future strategies are easy to pick out.
For our example, there are only `4` useful future strategies.
The figure below shows these four strategies and the regions of belief
space where each is the best future strategy.

### Partition for action a1

Each one of these regions corresponds to a different line segment in
the value function for the action `a1` and horizon length
`2`. Each of these line segments is constructed as we
indicated before by adding the immediate reward line segment to the
line segments for each future strategy. If we created the line
segment for each of the four future strategies from the figure above,
we would get a `PWLC` function that would impose exactly the
partition shown above. This value function is shown in this next
figure.

### Value function and partition for action a1

Note that each one of these line segments represents a particular two
action strategy. The first action is `a1` for all of these
segments and the second action depends upon the observation. Thus we
have solved our second problem; we now know how to find the value of a
belief state for a fixed action. In fact, as before, we have actually
shown how to find this value for every belief state.
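
In code, those segments come from enumerating every future strategy for the fixed first action, as in this sketch (same hypothetical numbers as before, so the count of useful strategies will not necessarily match the figures):

```python
from itertools import product
import numpy as np

T_a1 = np.array([[0.7, 0.3], [0.4, 0.6]])            # hypothetical P(s'|s,a1)
O_a1 = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])  # hypothetical P(z|s',a1)
R = np.array([[0.0, 1.0], [1.5, 0.0]])               # immediate rewards for a1, a2
h1_segments = [R[0], R[1]]

S_a1 = [[T_a1 @ (O_a1[:, z] * seg) for seg in h1_segments] for z in range(3)]

# One segment per future strategy: 2 horizon-1 segments ** 3 observations = 8.
segments_a1 = [R[0] + sum(S_a1[z][future[z]] for z in range(3))
               for future in product(range(2), repeat=3)]

# The horizon-2 value of belief b, restricted to first action a1, is
# max(b @ seg for seg in segments_a1); the "useful" strategies are the
# segments that achieve this maximum somewhere in belief space.
```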

If there was only the action `a1` in our model, then the value
function shown in the previous figure would be the horizon `2`
value function. However, because there is another action, we must
compare the value of the other action with the value of action
`a1` before we have the true horizon `2` value
function, since we are interested in finding the best value for each
belief state.

We can repeat the whole process we did for action `a1` for the
other action. This includes constructing the `S()` functions
for the action `a2` and all the observations. From these we
can find the value function for that action. Below is the value
function and partition for action `a2`.

### Value function and partition for action a2

In this case there happens to be only two useful future strategies. We can put the value functions for each action together to see where each action gives the highest value.

### Combined a1 and a2 value functions

We can see from this figure that there is only one region where
we would prefer to do action `a2`. Everywhere else, this
action is not as good as action `a1`. We can also see that one of
the line segments from each of the two action value functions is not
needed, since there are no belief points where it yields a higher
value than some other immediate action and future strategy. Here is
the more compact horizon `2` value function.

### Value function for horizon 2

The partition shown below the value function in the figure above shows
the best horizon `2` policy, indicating which action should be
taken in each belief state. Notice that there are three different
regions, two of which are adjacent, where we choose action
`a1`. These regions are distinguished because, although the
initial action is the same, the future action strategies will be
different.
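
The pruning that produced this more compact value function can be sketched very simply. The check below only removes a segment when some other segment is at least as good at every belief state (pointwise dominance); real solvers also use a linear program to remove segments that are never the maximum anywhere, but the idea is the same. The helper name is ours.

```python
import numpy as np

def prune_pointwise_dominated(segments):
    """Keep only segments that are not pointwise dominated by another segment."""
    kept = []
    for i, v in enumerate(segments):
        dominated = any(np.all(w >= v) and np.any(w > v)
                        for j, w in enumerate(segments) if j != i)
        if not dominated:
            kept.append(v)
    return kept

# The middle segment is dominated by the first and is dropped.
print(prune_pointwise_dominated([np.array([1.0, 0.0]),
                                 np.array([0.5, 0.0]),
                                 np.array([0.0, 1.0])]))
```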

This whole process took a long time to explain, but it is not nearly as
complicated as it might seem. We will show how to construct the
horizon `3` value function from the horizon `2` value function in a
slightly accelerated manner. The steps are the same, but we can now
eliminate a lot of the discussion and the intermediate steps which we
used to aid the explanation.

First, we transform the horizon `2` value function for action
`a1` and all the observations.

### Transformed Horizon 2 Value Functions for Action `a1`

Note that each of the colors here corresponds to the same colored line
in the horizon `2` value function. Also note that not all of
the transformed lines become useful in the representation of the
maximal values.

When we were constructing the horizon `2` value function,
these colors corresponded to the next action to take. The reason the
colors corresponded to a single action was that with a horizon length
of `2`, there was only going to be a single action left to take
after taking the first action. However, here and in general, each
color represents a complete future strategy, not just one action. For
instance, the magenta color corresponds to the line in the horizon
`2` value function where we would do the action `a2` and
adopt its future strategy. The future strategy of the magenta line
will depend on the observation we get after doing the `a2`
action.

Next we find the value function by adding the immediate rewards and
the `S()` functions for each of the useful strategies. The
partition that this value function will impose is easy to construct by
simply looking at the partitions of the `S()` functions.

In the figure below, we show the `S()` partitions for action
`a1` and the value function for action `a1` with a
horizon of `3`. The partition this value function imposes is
also shown.

### Value function for action a1 and horizon 3

Note that for this action, there are only 6 useful future strategies, which are represented by the partitions that this value function imposes on the belief space.

We then construct the value function for the other action, put the
two together, and see which line segments we can get rid of. Here are
the `S()` function partitions, the value function, and the value
function's partition for the action `a2`.

### Value function for action a2 and horizon 3

Note that there are only `4` useful future strategies for the
action `a2`. Here are the `a1` and `a2` value
functions superimposed upon each other.

### Value functions for both actions and horizon 3

Note how many line segments get completely dominated by line segments
from the other action's value function. The final horizon `3`
value function looks like this:

### Value function for horizon 3

Note that this value function is much simpler than the individual
action value functions. It is even simpler than the horizon `2`
value function. Whether the resulting function is simpler or more
complex depends upon the particular problem. These examples are meant
to show how you can get either one; i.e., the value functions do not
have to get more complex as we iterate through the horizons.

This concludes our example. The concepts and procedures can be
applied over and over to any horizon length. This is the way we do
value iteration on the `CO-MDP` derived from the
POMDP.
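
To close, here is the whole example compressed into one sketch: a single dynamic-programming backup that turns the horizon-`h` segments into the horizon-`h+1` segments by enumerating every first action and every future strategy, then pruning. This is the naive enumeration approach walked through above, not one of the cleverer algorithms, and the `T` and `O` numbers are again made up since the tutorial does not give them.

```python
from itertools import product
import numpy as np

# Hypothetical two-state, two-action, three-observation model.
# T[a][s, s'] = P(s'|s,a),  O[a][s', z] = P(z|s',a),  R[a][s] = immediate reward.
T = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
O = np.array([[[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]],
              [[0.4, 0.4, 0.2], [0.2, 0.2, 0.6]]])
R = np.array([[0.0, 1.0],
              [1.5, 0.0]])
n_actions, n_obs = 2, 3

def prune(segments):
    """Drop segments that are pointwise dominated by another segment."""
    return [v for i, v in enumerate(segments)
            if not any(np.all(w >= v) and np.any(w > v)
                       for j, w in enumerate(segments) if j != i)]

def backup(segments):
    """One value-iteration step: horizon-h segments in, horizon-(h+1) segments out."""
    new_segments = []
    for a in range(n_actions):
        # S(a, z): one transformed segment per previous segment.
        S = [[T[a] @ (O[a][:, z] * seg) for seg in segments] for z in range(n_obs)]
        # One new segment per (first action, future strategy) pair.
        for future in product(range(len(segments)), repeat=n_obs):
            new_segments.append(R[a] + sum(S[z][future[z]] for z in range(n_obs)))
    return prune(new_segments)

# Horizon 1 is just the immediate rewards; each backup extends the horizon by one.
segments = [R[0], R[1]]
for horizon in (2, 3):
    segments = backup(segments)
    print(horizon, len(segments))
```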
