Note that there are some explanatory texts on larger screens.

plurals
  1. POMarkov Decision Process: value iteration, how does it work?
    primarykey
    data
    text
    <p>I've been reading a lot about <a href="http://www.cs.ubc.ca/~poole/demos/mdp/vi.html">Markov Decision Processes (using value iteration)</a> lately but I simply can't get my head around them. I've found a lot of resources on the Internet / books, but they all use mathematical formulas that are way too complex for my competencies. </p> <p>Since this is my first year at college, I've found that the explanations and formulas provided on the web use notions / terms that are way too complicated for me and they assume that the reader knows certain things that I've simply never heard of.</p> <p>I want to use it on a 2D grid (filled with walls(unattainable), coins(desirable) and enemies that move(which must be avoided at all costs)). The whole goal is to collect all the coins without touching the enemies, and I want to create an AI for the main player using a Markov Decision Process (<em>MDP</em>). Here is how it partially looks like (note that the game-related aspect is not so much of a concern here. I just really want to understand <em>MDPs</em> in general):</p> <p><img src="https://i.stack.imgur.com/1hrZ5.png" alt="enter image description here"></p> <p>From what I understand, a rude simplification of <em>MDPs</em> is that they can create a grid which holds in which direction we need to go (kind of a grid of "arrows" pointing where we need to go, starting at a certain position on the grid) to get to certain goals and avoid certain obstacles. Specific to my situation, that would mean that it allows the player to know in which direction to go to collect the coins and avoid the enemies.</p> <p>Now, using the <em>MDP</em> terms, it would mean that it creates a collection of states(the grid) which holds certain policies(the action to take -> up, down, right, left) for a certain state(a position on the grid). The policies are determined by the "utility" values of each state, which themselves are calculated by evaluating how much getting there would be beneficial in the short and long term.</p> <p>Is this correct? Or am I completely on the wrong track?</p> <p>I'd at least like to know what the variables from the following equation represent in my situation:</p> <p><img src="https://i.stack.imgur.com/pMoVo.png" alt="U_{i+1}(s) \longleftarrow R(s) + \gamma \max \sum\limits_{s&#39;} T(s,a,s&#39;) U_i (s&#39;) \,."></p> <p>(taken from the book "Artificial Intelligence - A Modern Approach" from Russell &amp; Norvig)</p> <p>I know that <code>s</code> would be a list of all the squares from the grid, <code>a</code> would be a specific action (up / down / right / left), but what about the rest?</p> <p>How would the reward and utility functions be implemented?</p> <p>It would be really great if someone knew a simple link which shows pseudo-code to implement a basic version with similarities to my situation in a very slow way, because I don't even know where to start here.</p> <p>Thank you for your precious time.</p> <p><em>(Note: feel free to add / remove tags or tell me in the comments if I should give more details about something or anything like that.)</em></p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload