Tutorial 1: Optimal Control for Discrete States#

Week 3, Day 3: Optimal Control

By Neuromatch Academy

Content creators: Zhengwei Wu, Itzel Olivos Castillo, Shreya Saxena, Xaq Pitkow

Content reviewers: Karolina Stosio, Roozbeh Farhoodi, Saeed Salehi, Ella Batty, Spiros Chavlis, Matt Krause, Michael Waskom, Melisa Maidana Capitan

Production editors: Gagana B, Spiros Chavlis

Tutorial Objectives#

Estimated timing of tutorial: 60 min

In this tutorial, we will implement a binary control task: a Partially Observable Markov Decision Process (POMDP) that describes fishing. The agent (you) seeks reward from two fishing sites without directly observing where the school of fish is (yes, a group of fish is called a school!). This makes the world a Hidden Markov Model (HMM), just like in the Hidden Dynamics day. Based on when and where you catch fish, you keep updating your belief about the fish location, i.e., the posterior probability of the fish location given past observations. Your goal is to control your position to catch the most fish while minimizing the cost of switching sides.

You’ve already learned about stochastic dynamics, latent states, and measurements. These first exercises largely repeat your previous work. Now we introduce actions, based on the new concepts of control, utility, and policy. This general structure provides a foundational model for the brain’s computations because it includes a perception-action loop where the animal can gather information, draw inferences about its environment, and select actions with the greatest benefit. How, mechanistically, the neurons could actually implement these calculations is a separate question we don’t address in this lesson.

In this tutorial, you will:

  • Use the Hidden Markov Models you learned about previously to model the world state.

  • Use the observations (fish caught) to build beliefs (posterior distributions) about the fish location.

  • Evaluate the quality of different control policies for choosing actions.

  • Discover the policy that maximizes utility.


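# Imports
import numpy as np
from math import isclose
import matplotlib.pyplot as plt
import ipywidgets as widgets  # sliders and buttons used by the interactive demos
from IPython.display import HTML, display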

# Figure Settings
# Plotting Functions
# Helper Functions
Video 1: Gone fishing#

Problem Setting

1. State dynamics: There are two possible locations for the fish: Left and Right. Secretly, at each time step, the fish may switch sides with a certain probability \(p_{\rm sw} = 1 - p_{\rm stay}\). This is the binary switching model (Telegraph process) that you’ve seen in the Linear Systems day. The fish location, \(s^{\rm fish}\), is latent; you get measurements about it when you try to catch fish, like in the Hidden Dynamics day. This gives you a belief or posterior probability of the current location given your history of measurements.

2. Actions: Unlike past days, you can now act on the process! You may stay on your current location (Left or Right), or switch to the other side.

3. Rewards and Costs: You get rewarded for each fish you catch (one fish is worth 1 “point”). If you’re on the same side as the fish, you’ll catch more, with probability \(q_{\rm high}\) per discrete time step. Otherwise, you may still catch some fish with probability \(q_{\rm low}\).

You pay a price of \(C\) points for switching to the other side. So you better decide wisely!
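
The state dynamics and the catch probabilities described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the tutorial's `binaryHMM` helper class: the function name `simulate_fishing`, its arguments, and the 0/1 encoding of Left/Right are assumptions made for this example, and the agent simply stays in one place.

```python
import numpy as np

def simulate_fishing(p_stay, q_low, q_high, agent_loc=0, T=1000, seed=0):
  """Simulate the fish location and your catches while you stay in one place.

  Assumed encoding for this sketch: 0 = Left, 1 = Right.
  """
  rng = np.random.default_rng(seed)
  fish = np.zeros(T, dtype=int)   # fish start on the Left
  catch = np.zeros(T, dtype=int)
  for t in range(T):
    if t > 0:
      # Telegraph process: keep the previous side with probability p_stay
      stays = rng.random() < p_stay
      fish[t] = fish[t - 1] if stays else 1 - fish[t - 1]
    # Catch probability depends on whether you share a side with the fish
    q = q_high if fish[t] == agent_loc else q_low
    catch[t] = rng.random() < q
  return fish, catch

fish, catch = simulate_fishing(p_stay=0.95, q_low=0.1, q_high=0.4)
print("fraction of time fish are on your side:", np.mean(fish == 0))
print("catch rate:", catch.mean())
```

Because the agent never moves here, no switching cost is paid; the sketch only shows how the latent fish location shapes what you observe.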

Maximizing Utility

To decide “wisely” and maximize your total utility (total points), you will follow a policy that prescribes what to do in any situation. Here the situation is determined by your location and your belief \(b_t\) (posterior) about the fish location (remember that the fish location is a latent variable).

In optimal control theory, the belief is the posterior probability over the latent variable given all the past measurements. It can be shown that maximizing the expected utility with respect to this posterior is optimal.

In our problem, the belief can be represented by a single number because the fish are either on the left or the right side. So we write:

\[\begin{equation} b_t = p(s^{\rm fish}_t = {\rm Right}\ |\ m_{0:t}, a_{0:t-1}) \end{equation}\]

where \(m_{0:t}\) are the measurements and \(a_{0:t-1}\) are the actions (stay or switch).

Finally, we will parameterize the policy by a simple threshold on beliefs: when your belief that fish are on your current side falls below a threshold \(\theta\), you switch to the other side.

In this tutorial, you will discover that if you pick the right threshold, this simple policy happens to be optimal!

Interactive Demo 1: Examining fish dynamics#

In this demo, we will look at the dynamics of the fish moving from side to side while you stay in one place. Play around with the probability stay_prob of fish staying in the same location, and observe the resulting dynamics of the fish.

Thinking questions:

  • If the fish have already been on one side for a long time, does that change the chances of them switching sides?

  • For what values of p_stay is the fish location most and least predictable?

Execute this cell to enable the demo.

# @markdown Execute this cell to enable the demo.
display(HTML('''<style>.widget-label { min-width: 15ex !important; }</style>'''))

@widgets.interact(p_stay=widgets.FloatSlider(.9, description="stay_prob", min=0., max=1., step=0.01))
def update_ex_1(p_stay):
  """
    T: Length of timeline
    p_stay: probability that the fish do not swim to the other side at time t
  """
  # only p_stay matters for the fish dynamics; the other entries are unused here
  params = [p_stay, None, None, None]

  # initial condition: fish [fish_initial] start at the right location (1)
  binaryHMM_test = binaryHMM(params=params, fish_initial=1, T=100)

  fish_state = binaryHMM_test.fish_dynamics()
# Explanation
In Interactive Demo 1, you should see the school of fish switch sides less often when `stay_prob` is high.

* If the fish have already been on one side for a long time, does that change the chances of them switching sides?

  No. The telegraph process or binary switching process is Markovian.
  That means that the probabilities of changes depend only on the *current* state.
  States from further in the past do not matter for the chances of switching sides.
  Staying longer in one side is not a statement about the current state, but rather about the past,
  so it's irrelevant for the chances of switching.

* For what values of `p_stay` is the fish location most and least predictable?

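  When `p_stay` is 1, the fish never move. When `p_stay` is 0, the fish *always* move,
  oscillating back and forth deterministically every discrete time step.
  Both extremes are perfectly predictable. The location is least predictable when `p_stay` is 0.5,
  because then every time step is a fair coin flip.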
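
One way to put a number on "predictable" is the entropy of the next fish location given the current one. The sketch below is an illustration only; the function name `next_step_entropy` is an assumption and is not part of the tutorial's helper code.

```python
import numpy as np

def next_step_entropy(p_stay):
  """Entropy (in bits) of the fish's next location given its current location."""
  # Clip to avoid log(0) at the deterministic extremes p_stay = 0 and p_stay = 1
  p = np.clip(np.asarray(p_stay, dtype=float), 1e-12, 1 - 1e-12)
  return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in [0.0, 0.5, 0.9, 1.0]:
  print(f"p_stay = {p:.1f}: {float(next_step_entropy(p)):.3f} bits")
```

The entropy is essentially zero at both extremes and peaks at `p_stay = 0.5`, which matches the answer above.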

Section 2: Catching fish#

Video 2: Catch some fish#

Interactive Demo 2: Examining the reward function#

In this second demo, you control your location by a button, but we fix the fish’s location by setting stay_prob = 1. Now that the fish are serenely swimming in one location, we can visually inspect the rewards when you’re on the same side as the fish or on the other side.

When you’re on the same side as the fish, you should have a higher probability of catching them (but watch out, since technically, you are allowed to adjust the sliders to other conditions!).

Play around with the sliders high_rew_prob (high reward probability when you’re on the fish’s side) and low_rew_prob (low reward probability when you’re on the other side). The button (same location vs. different location) determines which probability describes how often you catch fish.

Thinking questions:

  • What happens when the fish and the agent (you!) are on the same or different locations?

  • Where do you catch the most fish?

  • Why isn’t low_rew_prob + high_rew_prob = 1? What do these probabilities mean in the fishing story?

  • You can move the sliders so low_rew_prob > high_rew_prob. This doesn’t change the math, but it can change whether the math is a reasonable model of the physical problem. Why?

@widgets.interact(locs=widgets.RadioButtons(options=['same location', 'different locations'],
                                            description='Fish and agent:',
                                            layout={'width': 'max-content'}),
                  p_low_rwd=widgets.FloatSlider(.1, description="low_rew_prob:",
                                                min=0., max=1.),
                  p_high_rwd=widgets.FloatSlider(.9, description="high_rew_prob:",
                                                 min=0., max=1.))
def update_ex_2(locs, p_low_rwd, p_high_rwd):
  """
    p_stay: probability of fish staying at current side at time t
    p_low_rwd: probability of catching fish when you're NOT on the side where the fish are swimming
    p_high_rwd: probability of catching fish when you're on the side where the fish are swimming
    fish_initial: initial side of fish (0 left, 1 right)
    loc_initial: initial side of the agent (YOU!) (0 left, 1 right)
  """
  p_stay = 1
  # the last entry (belief threshold) is unused by the lazy policy
  params = [p_stay, p_low_rwd, p_high_rwd, None]

  # initial condition for fish [fish_initial] and you [loc_initial]
  if locs == 'same location':
    binaryHMM_test = binaryHMM(params, fish_initial=0, loc_initial=0, T=100)
  else:
    binaryHMM_test = binaryHMM(params, fish_initial=1, loc_initial=0, T=100)

  fish_state, loc, measurement = binaryHMM_test.generate_process_lazy()
# Explanation

* What happens when the fish and the agent (you!) are on the same or different locations?
  You catch fish with different probabilities.

* Where do you catch the most fish?
  When you're on the same side as the fish -- as long as high_rew_prob > low_rew_prob.

* Why isn't low_rew_prob + high_rew_prob = 1? What do these probabilities mean in the fishing story?
  These are not probabilities of mutually exclusive events. They are chances of one event (you catch fish)
  under two different conditions (you and the school of fish are on the same side or different sides).

* You _can_ move the sliders so `low_rew_prob > high_rew_prob`. This doesn't change the math,
  but it can change whether the math is a reasonable model of the physical problem. Why?
  It would be weird if you caught fewer fish when you're on the same side as the fish.
  But hey, maybe the fish warn each other when they're in a school together! Then they'd be harder to catch...

Section 3: Belief dynamics and belief distributions#

Video 3: Where are the fish?#

Interactive Demo 3: Examining the beliefs#

Now it’s time to get an intuition for how beliefs are calculated. Your belief about the fish location is just the posterior probability of that location given your measurements, \(p(s_t|m_{0:t})\). Note that this is just what you did in the day covering Hidden Dynamics!

In this exercise, you’ll always stay on the LEFT side, but the fish will move around. They’ll stay on the same side with probability stay_prob. You only get to see fish you catch, not where the school of fish is. You have to use those measurements to infer the location of the school.
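
The inference described here is a two-step recursion: first predict where the fish might have moved, then reweight by how likely the latest measurement is under each hypothesis. The sketch below is a minimal version of that update under stated assumptions; the function name `update_belief`, the 0/1 encoding of Left/Right, and the argument names are chosen for this example and may differ from the tutorial's `binaryHMM_belief` helper.

```python
import numpy as np

def update_belief(b_right, caught, agent_loc, p_stay, q_low, q_high):
  """One step of the belief update for P(fish on Right).

  Assumed encoding for this sketch: agent_loc is 0 (Left) or 1 (Right),
  and caught is 1 if you caught a fish on this time step, else 0.
  """
  # Predict: the fish may have switched sides since the last step
  prior_right = p_stay * b_right + (1 - p_stay) * (1 - b_right)

  # Likelihood of the measurement under each hypothesis about the fish
  q_if_right = q_high if agent_loc == 1 else q_low
  q_if_left = q_high if agent_loc == 0 else q_low
  like_right = q_if_right if caught else 1 - q_if_right
  like_left = q_if_left if caught else 1 - q_if_left

  # Update: Bayes' rule, normalized over the two locations
  post_right = like_right * prior_right
  post_left = like_left * (1 - prior_right)
  return post_right / (post_right + post_left)

# You sit on the Left: three empty casts, then one catch
b = 0.5
for caught in [0, 0, 0, 1]:
  b = update_belief(b, caught, agent_loc=0, p_stay=0.96, q_low=0.1, q_high=0.3)
  print(f"caught={caught}  P(fish on Right)={b:.3f}")
```

Each empty cast nudges the belief toward the Right, while a single catch pulls it sharply back toward the Left. The sketch does not guard against the degenerate case where both hypotheses assign zero probability to a measurement.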

In this demo, play around with the sliders high_rew_prob and low_rew_prob, and stay_prob.

Thinking questions:

  • Manipulate the slider for stay_prob. How well does the belief explain the dynamics of the fish as you adjust the probability of the fish staying in one location (stay_prob)?

  • Explore the extreme case where high_rew_prob = 1 and low_rew_prob = 0. How accurate is the belief as these parameters change?

  • Under what conditions is it informative to catch a fish? What about not catching a fish?

@widgets.interact(p_stay=widgets.FloatSlider(.96, description="stay_prob",
                                             min=.8, max=1., step=.01),
                  p_low_rwd=widgets.FloatSlider(.1, description="low_rew_prob",
                                                min=0., max=1., step=.01),
                  p_high_rwd=widgets.FloatSlider(.3, description="high_rew_prob",
                                                 min=0., max=1., step=.01))
def update_ex_3(p_stay, p_low_rwd, p_high_rwd):
  """
    T: Length of timeline
    p_stay: probability of fish staying at current side at time t
    p_high_rwd: probability of catching fish when you're on the side where the fish are swimming
    p_low_rwd: probability of catching fish when you're NOT on the side where the fish are swimming
    fish_initial: initial side of fish (0 left, 1 right)
    loc_initial: initial side of the agent (YOU!) (0 left, 1 right)
    threshold: threshold of belief below which the action is switching
  """
  threshold = 0.2
  params = [p_stay, p_low_rwd, p_high_rwd, threshold]

  binaryHMM_test = binaryHMM_belief(params, choose_policy="lazy",
                                    fish_initial=0, loc_initial=0, T=100)

  belief, loc, act, measurement, fish_state = binaryHMM_test.generate_process()
  plot_dynamics(belief, loc, act, measurement, fish_state,
                binaryHMM_test.choose_policy)
# Explanation

* Manipulate the slider for `stay_prob`. How well does the belief explain the dynamics of the fish as
  you adjust the probability of the fish staying in one location (`stay_prob`)?

  The parameter `stay_prob` determines the fish dynamics. If it is low, the fish move fast
  and you don't have much time to collect observations that might decrease your uncertainty about
  the actual location of the school. If it is high, you have more time to integrate evidence
  and the belief tracks the dynamics of the fish better.

* Explore the extreme case where `high_rew_prob = 1` and `low_rew_prob = 0`.
  Now play around with these sliders. How accurate is the belief as these parameters change?

  In the extreme case, the belief explains the dynamics of the fish perfectly because
  our observations are perfect, i.e., catching a fish indicates with certainty the presence of the school.
  If the chances of catching a fish are very different between the two sides, then you get a lot of information
  for each fish you catch. The belief will then rise and fall steeply with each observation.
  If the two probabilities are similar, then the belief will change slowly even if the fish move quickly.

* Under what conditions is it informative to catch a fish? What about to *not* catch a fish?

  The bigger the difference in the two probabilities, the more information you get from measurements.
  If both probabilities are low (and different), then you learn a lot from catching a fish.
  But you still learn a little if you don't catch anything, particularly when catching a fish is probable in one case.
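
A common way to quantify how much a single measurement tells you is the log-likelihood ratio between the two hypotheses (fish on your side vs. fish on the other side). The sketch below is illustrative; the function name `evidence_per_measurement` is an assumption, and it requires both probabilities to lie strictly between 0 and 1.

```python
import numpy as np

def evidence_per_measurement(q_low, q_high):
  """Log-likelihood ratios (fish on your side vs. other side) for each outcome."""
  llr_catch = np.log(q_high / q_low)
  llr_no_catch = np.log((1 - q_high) / (1 - q_low))
  return llr_catch, llr_no_catch

for q_low, q_high in [(0.1, 0.3), (0.1, 0.9), (0.4, 0.5)]:
  llr_catch, llr_no_catch = evidence_per_measurement(q_low, q_high)
  print(f"q_low={q_low}, q_high={q_high}: "
        f"catch {llr_catch:+.2f}, no catch {llr_no_catch:+.2f}")
```

A value far from zero means the outcome shifts your belief a lot. With two low probabilities, a catch carries much more evidence than an empty cast; when one probability is high, an empty cast becomes informative too.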

Section 4: Implementing a threshold policy#

Video 4: How should you act?#

Coding Exercise 4: dynamics following a threshold-based policy#

Now we’ll switch from the ‘lazy’ policy used above to a threshold policy that you need to write. You’ll change your location whenever your belief that you’re on the best side is low enough. To do this, you’ll complete the function policy_threshold(threshold, belief, loc). This policy takes three inputs:

  1. The belief about the fish state. For convenience, we will represent the belief at time t using a 2-dimensional vector. The first element is the belief that the fish are on the left, and the second element is the belief the fish are on the right. At every time step, these elements sum to 1.

  2. Your location loc, represented as “Left” = -1 and “Right” = 1.

  3. A belief threshold that determines when to switch. When your belief that you are on the same side as the fish drops below this threshold, you should move to the other location, and otherwise stay.

Your function should return an action for each time t, which takes the value of “stay” or “switch”.

def policy_threshold(threshold, belief, loc):
  """
  chooses whether to switch side based on whether the belief
      on the current site drops below the threshold

  Args:
    threshold (float): the threshold of belief on the current site,
                        when the belief is lower than the threshold, switch side
    belief (numpy array of float, 2-dimensional): the belief on the
                                                  two sites at a certain time
    loc (int) : the location of the agent at a certain time
                -1 for left side, 1 for right side

  Returns:
    act (string): "stay" or "switch"
  """
  ## 1. Modify the code below to generate actions (stay or switch)
  ##    for current belief and location
  ## Belief is a 2d vector: first element = Prob(fish on Left | measurements)
  ##                       second element = Prob(fish on Right  | measurements)
  ## Returns "switch" if Belief that fish are in your current location < threshold
  ##         "stay" otherwise
  ## Hint: use loc value to determine which element of belief you need to use
  ##       see the docstring for more information about loc
  ## 2. After completing the function, comment this line:
  raise NotImplementedError("Student exercise: Please complete the code")
  # Write the if statement
  if ...:
    # action below threshold
    act = ...
  else:
    # action above threshold
    act = ...

  return act

# Next line tests your function

You should see

Well Done!

# Solution code
def policy_threshold(threshold, belief, loc):
  """
  chooses whether to switch side based on whether the belief
      on the current site drops below the threshold

  Args:
    threshold (float): the threshold of belief on the current site,
                        when the belief is lower than the threshold, switch side
    belief (numpy array of float, 2-dimensional): the belief on the
                                                  two sites at a certain time
    loc (int) : the location of the agent at a certain time
                -1 for left side, 1 for right side

  Returns:
    act (string): "stay" or "switch"
  """
  # Write the if statement
  # (loc + 1) // 2 maps loc = -1 to index 0 (Left) and loc = 1 to index 1 (Right)
  if belief[(loc + 1) // 2] <= threshold:
    # action below threshold
    act = "switch"
  else:
    # action above threshold
    act = "stay"

  return act

# Next line tests your function
Well Done!

Interactive Demo 4: Dynamics with different thresholds#

The following demo uses the policy you just built! Play around with the slider and observe the dynamics controlled by your policy.

(The code specifies stay_prob=0.95, high_rew_prob=0.3, and low_rew_prob=0.1. You can change these, but these are reasonable parameters. Note: to see the gradual change with threshold, keep reusing the same random seed; to see different examples, refresh the seed.)

Thinking questions:

  • Qualitatively, how well does this policy follow the fish? What does it miss, and why?

  • How can you characterize the fishing strategy if the threshold is very low, or very high?

@widgets.interact(threshold=widgets.FloatSlider(.2, description="threshold", min=0., max=1., step=.01),
                  new_seed=widgets.ToggleButtons(options=['Reusing', 'Refreshing'],
                                                  description='Random seed:',
                                                  button_style='', # 'success', 'info', 'warning', 'danger' or '',
                                                  icons=['check'] * 2))
def update_ex_4(threshold, new_seed):
  """
    p_stay: probability fish stay
    high_rew_p: p(catch fish) when you're on their side
    low_rew_p : p(catch fish) when you're on other side
    threshold: threshold of belief below which switching is taken
  """

  if new_seed == "Refreshing":


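  # parameter values as stated in the demo description
  stay_prob = 0.95
  high_rew_p = 0.3
  low_rew_p = 0.1
  # same order as in the other cells: [p_stay, p_low, p_high, threshold]
  params = [stay_prob, low_rew_p, high_rew_p, threshold]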

  # initial condition for fish [fish_initial] and you [loc_initial]
  binaryHMM_test = binaryHMM_belief(params, fish_initial=0, loc_initial=0, choose_policy="threshold", T=100)

  belief, loc, act, measurement, fish_state = binaryHMM_test.generate_process()
  plot_dynamics(belief, loc, act, measurement,
                fish_state, binaryHMM_test.choose_policy)
# Explanation

* Qualitatively, how well does this policy follow the fish? What does it miss, and why?

  You generally follow the fish, but your location and theirs can differ for substantial stretches of time.
  The belief is generally not very confident when the probabilities of catching fish on the
  two sides are not very different. Depending on your threshold, you might leave after
  a few unlucky time steps even though you're still on the correct side. Or you might stay even
  though you have not caught many fish, in the hopes that the fish haven't moved.

* How can you characterize the fishing strategy if the threshold is very low, or very high?

  If the threshold is low, you only switch when you have a very low belief that you're on the correct side.
  Then you switch very rarely.
  If the threshold is high, then you switch whenever you're not extremely confident,
  so you change sides all the time.

Section 5: Implementing a value function#

Video 5: Evaluate policy#

Coding Exercise 5.1: Implementing a value function#

Let’s find out how good our threshold is. For that, we will calculate a value function that quantifies our utility (total points). We will use this value to compare different thresholds; remember, our goal is to maximize the amount of fish we catch while minimizing the effort involved in changing locations.

The value is the total expected utility per unit time.

\[\begin{equation} V(\theta) = \frac{1}{T} \sum_t \left( R(s_t) - C(a_t) \right) \end{equation}\]

where \(R(s_t)\) is the instantaneous reward we get at location \(s_t\) and \(C(a_t)\) is the cost we pay for the chosen action. Remember, we receive one point for each fish caught and pay cost_sw points each time we switch to the other location.

We could take this average mathematically over the probabilities of rewards and actions. However, we can get the same answer by simply averaging the actual rewards and costs over a long time. This is what you are going to do.
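
As a small worked example of this time average, take a made-up sequence of \(T = 5\) time steps (the numbers are chosen for illustration only, not taken from a simulation): you catch a fish on three of the steps, and you switch sides once at a cost of 0.1.

```latex
\begin{align}
\text{rewards: } & 1,\ 0,\ 1,\ 1,\ 0 \qquad \text{switches: one, at cost } 0.1 \\
V &= \frac{1}{5}\left[ (1 + 0 + 1 + 1 + 0) - 0.1 \right] = \frac{2.9}{5} = 0.58
\end{align}
```

With only five steps this estimate is very noisy, which is why the exercises below average over a long time horizon.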

Instructions: Fill in the function get_value(rewards, actions, cost_sw).

def get_value(rewards, actions, cost_sw):
  """
  value function

  Args:
    rewards (numpy array of length T): whether a reward is obtained (1) or not (0) at each time step
    actions (numpy array of length T): action, "stay" or "switch", taken at each time step.
    cost_sw (float): the cost of switching to the other location

  Returns:
    value (float): expected utility per unit time
  """
  # 1 where the action is "switch", 0 where it is "stay"
  actions_int = (actions == "switch").astype(int)

  ## 1. Modify the code below to compute the value function (equation V(theta))
  ## 2. After completing the function, comment this line:
  raise NotImplementedError("Student exercise: Please complete the code")
  # Calculate the value function
  value = ...

  return value

# Test your function

You will see

Well Done!

# Solution code
def get_value(rewards, actions, cost_sw):
  """
  value function

  Args:
    rewards (numpy array of length T): whether a reward is obtained (1) or not (0) at each time step
    actions (numpy array of length T): action, "stay" or "switch", taken at each time step.
    cost_sw (float): the cost of switching to the other location

  Returns:
    value (float): expected utility per unit time
  """
  # 1 where the action is "switch", 0 where it is "stay"
  actions_int = (actions == "switch").astype(int)

  # Calculate the value function
  value = np.sum(rewards - actions_int * cost_sw) / len(rewards)

  return value

# Test your function
Well Done!

Coding Exercise 5.2: Run the policy#

Now that you have a mechanism to find out how good a threshold is, we will use a brute force approach to compute the optimal threshold: we’ll just try all thresholds, simulate the value of each, and pick the best one. Complete the function get_optimal_threshold(p_stay, low_rew_p, high_rew_p, cost_sw). We provide the code to visualize the output of your function. Observe on this plot which threshold has maximal utility.

Thinking questions:

  • Try a very high switching cost. What is the best threshold? How does that make sense?

  • Try a zero switching cost. What’s different?

  • Generally, how does the best threshold change with the switching cost?

# Set a large time horizon to calculate meaningful statistics
large_time_horizon = 10000

def run_policy(threshold, p_stay, low_rew_p, high_rew_p):
  """
  This function executes the policy (fully parameterized by the threshold) and
  returns two arrays:
    The sequence of actions taken from time 0 to T
    The sequence of rewards obtained from time 0 to T
  """
  params = [p_stay, low_rew_p, high_rew_p, threshold]
  binaryHMM_test = binaryHMM_belief(params, choose_policy="threshold", T=large_time_horizon)
  _, _, actions, rewards, _ = binaryHMM_test.generate_process()
  return actions, rewards

def get_optimal_threshold(p_stay, low_rew_p, high_rew_p, cost_sw):
  """
  Args:
    p_stay (float): probability of fish staying in their current location
    low_rew_p (float): probability of catching fish when you and the fish are in different locations.
    high_rew_p (float): probability of catching fish when you and the fish are in the same location.
    cost_sw (float): the cost of switching to the other location

  Returns:
    threshold_array (numpy array): candidate thresholds
    value_array (numpy array): expected utility per unit time for each candidate
  """
  ## 1. Modify the code below to find the best threshold using brute force
  ## 2. After completing the function, comment this line:
  raise NotImplementedError("Student exercise: Please complete the code")

  # Create an array of 20 equally spaced candidate thresholds (min = 0., max=1.):
  threshold_array = ...

  # Using the function get_value() that you coded before and
  # the function run_policy() that we provide, compute the value of your
  # candidate thresholds:

  # Create an array to store the value of each of your candidates:
  value_array = ...

  for i in ...:
    actions, rewards = ...
    value_array[i] = ...

  # Return the array of candidate thresholds and their respective values

  return threshold_array, value_array

# Feel free to change these parameters
stay_prob = .9         # Fish stay at current location with probability stay_prob
low_rew_prob = 0.1     # Even if fish are somewhere else, you can catch some fish with probability low_rew_prob
high_rew_prob = 0.7    # When you and the fish are in the same place, you can catch fish with probability high_rew_prob
cost_sw = .1           # When you switch locations, you pay this cost: cost_sw

# Visually determine the threshold that obtains the maximum utility.
# Remember, policies are parameterized by a threshold on beliefs:
# when your belief that the fish are on your current side falls below a threshold 𝜃, you switch to the other side.
threshold_array, value_array = get_optimal_threshold(stay_prob, low_rew_prob, high_rew_prob, cost_sw)
plot_value_threshold(threshold_array, value_array)

# Solution code

# Set a large time horizon to calculate meaningful statistics
large_time_horizon = 10000

def run_policy(threshold, p_stay, low_rew_p, high_rew_p):
  """
  This function executes the policy (fully parameterized by the threshold) and
  returns two arrays:
    The sequence of actions taken from time 0 to T
    The sequence of rewards obtained from time 0 to T
  """
  params = [p_stay, low_rew_p, high_rew_p, threshold]
  binaryHMM_test = binaryHMM_belief(params, choose_policy="threshold", T=large_time_horizon)
  _, _, actions, rewards, _ = binaryHMM_test.generate_process()
  return actions, rewards

def get_optimal_threshold(p_stay, low_rew_p, high_rew_p, cost_sw):
  """
  Args:
    p_stay (float): probability of fish staying in their current location
    low_rew_p (float): probability of catching fish when you and the fish are in different locations.
    high_rew_p (float): probability of catching fish when you and the fish are in the same location.
    cost_sw (float): the cost of switching to the other location

  Returns:
    threshold_array (numpy array): candidate thresholds
    value_array (numpy array): expected utility per unit time for each candidate
  """

  # Create an array of 20 equally spaced candidate thresholds (min = 0., max=1.):
  threshold_array = np.linspace(0., 1., 20)

  # Using the function get_value() that you coded before and
  # the function run_policy() that we provide, compute the value of your
  # candidate thresholds:

  # Create an array to store the value of each of your candidates:
  value_array = np.zeros(len(threshold_array))

  for i in range(len(threshold_array)):
    actions, rewards = run_policy(threshold_array[i], p_stay, low_rew_p, high_rew_p)
    value_array[i] = get_value(rewards, actions, cost_sw)

  # Return the array of candidate thresholds and their respective values

  return threshold_array, value_array

# Feel free to change these parameters
stay_prob = .9         # Fish stay at current location with probability stay_prob
low_rew_prob = 0.1     # Even if fish are somewhere else, you can catch some fish with probability low_rew_prob
high_rew_prob = 0.7    # When you and the fish are in the same place, you can catch fish with probability high_rew_prob
cost_sw = .1           # When you switch locations, you pay this cost: cost_sw

# Visually determine the threshold that obtains the maximum utility.
# Remember, policies are parameterized by a threshold on beliefs:
# when your belief that the fish are on your current side falls below a threshold 𝜃, you switch to the other side.
threshold_array, value_array = get_optimal_threshold(stay_prob, low_rew_prob, high_rew_prob, cost_sw)
with plt.xkcd():
  plot_value_threshold(threshold_array, value_array)
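
Besides reading the best threshold off the plot, you could pick it out numerically. The helper below is a small sketch with an assumed name (`best_threshold`); it is not part of the tutorial's code, and the toy arrays are made-up numbers for illustration rather than simulation output.

```python
import numpy as np

def best_threshold(threshold_array, value_array):
  """Return the candidate threshold with the highest estimated value."""
  best = int(np.argmax(value_array))
  return threshold_array[best], value_array[best]

# Made-up values for illustration only (not simulation output)
toy_thresholds = np.linspace(0., 1., 5)
toy_values = np.array([0.30, 0.38, 0.41, 0.35, 0.20])
theta_star, v_star = best_threshold(toy_thresholds, toy_values)
print(f"best threshold: {theta_star:.2f} (value {v_star:.2f})")
```

Because each value is estimated from a finite simulation, neighboring thresholds with similar values can trade places from run to run, so treat the maximum as approximate.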

In this tutorial, you combined Hidden Markov Models with actions to solve an optimal control problem! This showed us the core formalism of the Partially Observable Markov Decision Process (POMDP).

Using observations (fish caught), you built beliefs (posterior distributions) that helped you estimate where the fish were. Next, you computed a value function that helped you evaluate the quality of different policies. Finally, using a brute force approach, you discovered an optimal policy that allowed you to catch as many fish as possible while minimizing the effort of switching your location.

The following tutorial will use continuous states and actions instead of the binary ones we used here. In continuous control, we can still use a POMDP, but we’ll focus on control in the fully observed case, a Markov Decision Process (MDP), since the policy is still illuminating.

Video 6: From discrete to continuous control#

Bonus Section 1: How does the optimal policy depend on the task?#

Video 7: Sensitivity of optimal policy#

Bonus Interactive Demo 1: Explore task parameters#

In this demo, you can play with various task parameters. Observe how the optimal threshold changes when you adjust:

  • The switching cost

  • The fish dynamics (p(stay))

  • The probability of catching fish on each side, p(high_rwd) and p(low_rwd)

Can you explain why the optimal threshold changes with these parameters:

  • lower/higher switching cost?

  • faster fish dynamics (i.e., low p_stay)?

  • rarer fish caught (i.e., low p(high_rwd) and low p(low_rwd))?

Note that it may require long simulations to see subtle changes in values of different policies, so look for coarse trends first.

@widgets.interact(p_stay=widgets.FloatSlider(.95, description="p(stay)",
                                             min=0., max=1.),
                  p_high_rwd=widgets.FloatSlider(.4, description="p(high_rwd)",
                                                 min=0., max=1.),
                  p_low_rwd=widgets.FloatSlider(.1, description="p(low_rwd)",
                                                min=0., max=1.),
                  cost_sw=widgets.FloatSlider(.2, description="switching cost",
                                              min=0., max=1., step=.01))
def update_ex_bonus(p_stay, p_high_rwd, p_low_rwd, cost_sw):
  """
    p_stay: probability fish stay
    p_high_rwd: p(catch fish) when you're on their side
    p_low_rwd : p(catch fish) when you're on other side
    cost_sw: switching cost
  """
  # Set a large time horizon to calculate meaningful statistics
  large_time_horizon = 10000

  # get_optimal_threshold expects (p_stay, low_rew_p, high_rew_p, cost_sw)
  threshold_array, value_array = get_optimal_threshold(p_stay, p_low_rwd,
                                                       p_high_rwd, cost_sw)
  plot_value_threshold(threshold_array, value_array)
# Explanation

* lower/higher switching cost?

  A high switching cost means that you should be more certain that the other side
  is better before committing to change sides. This means that your belief must fall
  below a lower threshold before you act. Conversely, a lower switching cost allows you
  more flexibility to switch at less stringent thresholds. In the limit of _zero_
  switching cost, you should always switch whenever you think the other side is
  better, even if it's just 51%, and even if you switch every time step.

* faster fish dynamics (i.e., low p_stay)?

  Faster fish dynamics (lower `p_stay`) also promote faster switching because
  you cannot plan as far into the future. In that case, you must base your decisions
  on more immediate evidence, but since you still pay the same switching cost, that
  cost is a higher fraction of your predictable rewards. Thus, you should be more
  conservative and switch only when you are more confident.

* rarer fish caught (i.e., low p(high_rwd) and low p(low_rwd))?

  When `high_rew_p` and/or `low_rew_p` decreases, your predictions become less reliable,
  again encouraging you to require more confidence before committing to a switch.
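
The effect of the switching cost can be made concrete with a one-step comparison. This is only a simplified sketch under stated assumptions, not the optimal policy found by simulation: it looks at the very next catch only, ignores all later consequences of the choice, takes \(b\) to be your belief that the fish are on your current side for that catch, and assumes \(q_{\rm high} > q_{\rm low}\).

```latex
\begin{align}
\mathbb{E}[\text{stay}] &= b\, q_{\rm high} + (1 - b)\, q_{\rm low} \\
\mathbb{E}[\text{switch}] &= (1 - b)\, q_{\rm high} + b\, q_{\rm low} - C \\
\mathbb{E}[\text{switch}] > \mathbb{E}[\text{stay}]
  &\iff (1 - 2b)\,(q_{\rm high} - q_{\rm low}) > C
  \iff b < \frac{1}{2} - \frac{C}{2\,(q_{\rm high} - q_{\rm low})}
\end{align}
```

With \(C = 0\) the one-step rule switches as soon as \(b < 1/2\), matching the 51% remark above, and a larger \(C\) pushes the switching threshold lower. The thresholds you find in the demo can differ from this one-step value because the full problem also accounts for future rewards and costs.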

