I would recommend using Q-learning for your implementation.

Maybe you can use this post I wrote as an inspiration. It is a [Q-learning demo with Java source code](http://kunuk.wordpress.com/2010/09/24/q-learning/). The demo is a map with 6 fields, and the AI learns where it should go from every state to get to the reward.

> Q-learning is a technique for letting the AI learn by itself by giving it reward or punishment.
>
> This example shows Q-learning used for path finding. A robot learns where it should go from any state.
>
> The robot starts at a random place and keeps a memory of the score while it explores the area. Whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values become stationary (convergence).
>
> In this example the action outcome is deterministic (the transition probability is 1) and the action selection is random. The score values are calculated by the Q-learning algorithm Q(s,a).
>
> The image shows the states (A, B, C, D, E, F), the possible actions from each state, and the reward given.
>
> [![q-learn1](https://i.stack.imgur.com/SXoYM.png)](https://i.stack.imgur.com/SXoYM.png)
>
> Result Q\*(s,a)
>
> [![q-learn2](https://i.stack.imgur.com/ApJn2.png)](https://i.stack.imgur.com/ApJn2.png)
>
> Policy Π\*(s)
>
> [![q-learn3](https://i.stack.imgur.com/JAtne.png)](https://i.stack.imgur.com/JAtne.png)

Qlearning.java

```java
import java.text.DecimalFormat;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
    final DecimalFormat df = new DecimalFormat("#.##");

    // path finding
    final double alpha = 0.1;
    final double gamma = 0.9;

    // states A,B,C,D,E,F
    // e.g. from A we can go to B or D
    // from C we can only go to C
    // C is goal state, reward 100 when B->C or F->C
    //
    // _______
    // |A|B|C|
    // |_____|
    // |D|E|F|
    // |_____|
    //
    final int stateA = 0;
    final int stateB = 1;
    final int stateC = 2;
    final int stateD = 3;
    final int stateE = 4;
    final int stateF = 5;

    final int statesCount = 6;
    final int[] states = new int[]{stateA, stateB, stateC, stateD, stateE, stateF};

    // http://en.wikipedia.org/wiki/Q-learning
    // http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm

    // Q(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))
    int[][] R = new int[statesCount][statesCount];       // reward lookup
    double[][] Q = new double[statesCount][statesCount]; // Q learning

    int[] actionsFromA = new int[] { stateB, stateD };
    int[] actionsFromB = new int[] { stateA, stateC, stateE };
    int[] actionsFromC = new int[] { stateC };
    int[] actionsFromD = new int[] { stateA, stateE };
    int[] actionsFromE = new int[] { stateB, stateD, stateF };
    int[] actionsFromF = new int[] { stateC, stateE };
    int[][] actions = new int[][] { actionsFromA, actionsFromB, actionsFromC,
            actionsFromD, actionsFromE, actionsFromF };

    String[] stateNames = new String[] { "A", "B", "C", "D", "E", "F" };

    public Qlearning() {
        init();
    }

    public void init() {
        R[stateB][stateC] = 100; // from b to c
        R[stateF][stateC] = 100; // from f to c
    }

    public static void main(String[] args) {
        long BEGIN = System.currentTimeMillis();

        Qlearning obj = new Qlearning();

        obj.run();
        obj.printResult();
        obj.showPolicy();

        long END = System.currentTimeMillis();
        System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
    }

    void run() {
        /*
         1. Set parameters, and the environment reward matrix R
         2. Initialize matrix Q as zero matrix
         3. For each episode: Select random initial state
            Do while not reach goal state
              o Select one among all possible actions for the current state
              o Using this possible action, consider to go to the next state
              o Get maximum Q value of this next state based on all possible actions
              o Compute Q(state, action) with the update rule
              o Set the next state as the current state
        */

        // For each episode
        Random rand = new Random();
        for (int i = 0; i < 1000; i++) { // train episodes
            // Select random initial state
            int state = rand.nextInt(statesCount);
            while (state != stateC) // goal state
            {
                // Select one among all possible actions for the current state
                int[] actionsFromState = actions[state];

                // Selection strategy is random in this example
                int index = rand.nextInt(actionsFromState.length);
                int action = actionsFromState[index];

                // Action outcome is set to deterministic in this example
                // Transition probability is 1
                int nextState = action; // data structure

                // Using this possible action, consider to go to the next state
                double q = Q(state, action);
                double maxQ = maxQ(nextState);
                int r = R(state, action);

                double value = q + alpha * (r + gamma * maxQ - q);
                setQ(state, action, value);

                // Set the next state as the current state
                state = nextState;
            }
        }
    }

    double maxQ(int s) {
        int[] actionsFromState = actions[s];
        // start from -infinity (Double.MIN_VALUE is the smallest positive double, not the most negative)
        double maxValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[s][nextState];

            if (value > maxValue)
                maxValue = value;
        }
        return maxValue;
    }

    // get policy from state
    int policy(int state) {
        int[] actionsFromState = actions[state];
        double maxValue = Double.NEGATIVE_INFINITY;
        int policyGotoState = state; // default goto self if not found
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[state][nextState];

            if (value > maxValue) {
                maxValue = value;
                policyGotoState = nextState;
            }
        }
        return policyGotoState;
    }

    double Q(int s, int a) {
        return Q[s][a];
    }

    void setQ(int s, int a, double value) {
        Q[s][a] = value;
    }

    int R(int s, int a) {
        return R[s][a];
    }

    void printResult() {
        System.out.println("Print result");
        for (int i = 0; i < Q.length; i++) {
            System.out.print("out from " + stateNames[i] + ": ");
            for (int j = 0; j < Q[i].length; j++) {
                System.out.print(df.format(Q[i][j]) + " ");
            }
            System.out.println();
        }
    }

    // policy is maxQ(states)
    void showPolicy() {
        System.out.println("\nshowPolicy");
        for (int i = 0; i < states.length; i++) {
            int from = states[i];
            int to = policy(from);
            System.out.println("from " + stateNames[from] + " goto " + stateNames[to]);
        }
    }
}
```

Print result

```none
out from A: 0 90 0 72,9 0 0
out from B: 81 0 100 0 81 0
out from C: 0 0 0 0 0 0
out from D: 81 0 0 0 81 0
out from E: 0 90 0 72,9 0 90
out from F: 0 0 100 0 81 0

showPolicy
from a goto B
from b goto C
from c goto C
from d goto A
from e goto B
from f goto C
Time: 0.025 sec.
```
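As a quick sanity check of those printed values (my own arithmetic, not part of the quoted post): because the transitions are deterministic, the converged values satisfy Q\*(s,a) = R(s,a) + γ · max over a' of Q\*(s',a'), with γ = 0.9. For example:

$$
\begin{aligned}
Q^*(B,C) &= R(B,C) + \gamma \max_{a'} Q^*(C,a') = 100 + 0.9 \cdot 0 = 100 \\
Q^*(A,B) &= 0 + \gamma \max_{a'} Q^*(B,a') = 0.9 \cdot 100 = 90 \\
Q^*(A,D) &= 0 + \gamma \max_{a'} Q^*(D,a') = 0.9 \cdot 81 = 72.9
\end{aligned}
$$

which matches the "out from A" row of the output above (0, 90, 0, 72.9, 0, 0).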
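The demo picks actions uniformly at random, which is fine for this tiny state space. If you want the agent to exploit what it has already learned while still exploring, a common variation is epsilon-greedy selection. The sketch below is my own addition (not from the linked post); it assumes it is added as a method of the `Qlearning` class above so it can reuse the `actions` and `Q` fields, and the `epsilon` parameter is a hypothetical knob for how often a random action is taken.

```java
// Hypothetical epsilon-greedy action selection (not part of the original demo).
// With probability epsilon, explore a random legal action; otherwise exploit
// the legal action with the highest Q value learned so far.
int selectActionEpsilonGreedy(int state, Random rand, double epsilon) {
    int[] actionsFromState = actions[state];
    if (rand.nextDouble() < epsilon) {
        // explore: pick any legal action uniformly at random
        return actionsFromState[rand.nextInt(actionsFromState.length)];
    }
    // exploit: pick the legal action with the largest current Q value
    int best = actionsFromState[0];
    for (int a : actionsFromState) {
        if (Q[state][a] > Q[state][best]) {
            best = a;
        }
    }
    return best;
}
```

In `run()` it could replace the two lines that draw a random index, e.g. `int action = selectActionEpsilonGreedy(state, rand, 0.1);`, leaving the rest of the update loop unchanged.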
 
