I am developing an AI that can play TicTacToe using Q-learning (reinforcement learning)
I recently took a broad AI course on Udemy that included a section on reinforcement learning. The course introduced Q-learning (a type of reinforcement learning, under classical Artificial Intelligence), but the implementation was not discussed in any real detail, so I had to go through a book the course provided to understand how it is implemented in practice. YouTube was not helping matters either.
I would appreciate any additional opinions you can give me, thanks.
To keep the implementation simple, I am applying Q-learning to TicTacToe with the AI (the agent) always playing X, i.e. always making the first move.
The Q-learning algorithm is based on the Bellman equation. In essence, it rewards favourable actions taken in each state and punishes unfavourable ones; reward and punishment are the same variable in the equation, just positive or negative. The equation also involves two other parameters: the learning rate and the discount factor.
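For reference, the standard tabular Q-learning update (which is how the Bellman equation is applied here) is:

Q(s, a) ← Q(s, a) + α [ r + γ · max over a' of Q(s', a') − Q(s, a) ]

where s is the current state, a the action taken, r the reward received, s' the resulting state, α the learning rate and γ the discount factor.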
For every action the agent takes in every state, it gets a small reward (positive or negative), and at the end of the game it gets a much larger reward for winning or losing (positive or negative respectively).
So how does the agent remember what it has learnt? By looking up the current state in the Q-table and choosing the action with the highest Q-value. These Q-values are the numbers the Bellman equation pushes up or down after every move.
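As a rough Python sketch of that lookup and update (the table layout and names here are my own assumptions, not the exact code in the notebook linked below):

```python
import random

ALPHA = 0.3   # learning rate (placeholder value)
GAMMA = 0.9   # discount factor (placeholder value)

def best_action(q_table, state):
    """Greedy policy: pick the action with the highest Q-value for this state."""
    actions = q_table[state]                      # dict: cell index -> Q-value
    best_q = max(actions.values())
    # break ties randomly so the agent does not always favour the lowest index
    return random.choice([a for a, q in actions.items() if q == best_q])

def update_q(q_table, state, action, reward, next_state):
    """One Bellman/Q-learning update for the (state, action) pair just played."""
    future = max(q_table[next_state].values()) if next_state in q_table else 0.0
    old = q_table[state][action]
    q_table[state][action] = old + ALPHA * (reward + GAMMA * future - old)
```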
Now, the first challenge is to pair every valid board configuration (the state space) with every action that can be taken in that state, in a Q-table dictionary, and initialize all of those pairs to a Q-value of 0. I will write Python code that generates this mapping and removes all impossible states (for example, any state where the count of X exceeds the count of O by more than one is definitely invalid, since X moves first). I will also set the Q-value of any action on a cell already occupied by X or O to -1.0, to prevent the agent from ever making such moves.
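A minimal sketch of such a generator (the real notebook linked below may filter more cases, e.g. boards where the game is already over):

```python
from itertools import product
import json

def valid_state(board):
    """Keep only boards that could occur in a game where X always moves first."""
    x, o = board.count('X'), board.count('O')
    return x - o in (0, 1)          # X never trails O, and never leads by more than one

def build_q_table():
    q_table = {}
    for cells in product(' XO', repeat=9):      # every 3^9 board layout
        board = ''.join(cells)
        if not valid_state(board):
            continue
        # one entry per cell: 0.0 if the cell is free, -1.0 if it is already taken
        q_table[board] = {i: (0.0 if c == ' ' else -1.0) for i, c in enumerate(board)}
    return q_table

if __name__ == '__main__':
    with open('q_table.json', 'w') as f:
        json.dump(build_q_table(), f)
```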
I will implement 4 different players as separate classes in the game software. They will play against the agent at different stages of its learning, and the Q-value of every action the agent takes will be updated immediately after each move, rather than only once the final result is known (waiting for the final result before updating was my initial mistake when I was still learning about Q-learning).
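A sketch of one training game with those per-move updates, reusing update_q from the earlier snippet; agent_move, opponent_move and reward_for are stand-ins for the actual policy, heuristic player and reward function:

```python
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def game_over(board):
    return winner(board) is not None or ' ' not in board

def place(board, index, mark):
    return board[:index] + mark + board[index + 1:]

def play_training_game(q_table, agent_move, opponent_move, reward_for):
    """One training episode: the agent (X) moves, its Q-value is updated right away,
    then the opponent (O) replies."""
    board = ' ' * 9
    while not game_over(board):
        state = board
        action = agent_move(q_table, state)
        board = place(board, action, 'X')
        if not game_over(board):
            board = place(board, opponent_move(board), 'O')
        # update immediately, using the state the agent will face on its next turn
        update_q(q_table, state, action, reward_for(state, action, board), board)
```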
What follows applies to whichever of the 5 agents (Balanced, Quick Myopic, Quick Overplanner, Slow Myopic and Slow Overplanner) is selected at the start of the training games. These agents have different combinations of hyperparameters (learning rate and discount factor).
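The actual values are not listed here, so the numbers below are purely illustrative: I am assuming "quick" vs "slow" maps to a high vs low learning rate, and "myopic" vs "overplanner" to a low vs high discount factor.

```python
# Illustrative presets only; the real values are whatever the training code uses.
AGENT_PRESETS = {
    'Balanced':          {'alpha': 0.5, 'gamma': 0.5},
    'Quick Myopic':      {'alpha': 0.9, 'gamma': 0.1},   # learns fast, barely looks ahead
    'Quick Overplanner': {'alpha': 0.9, 'gamma': 0.9},   # learns fast, values the future heavily
    'Slow Myopic':       {'alpha': 0.1, 'gamma': 0.1},
    'Slow Overplanner':  {'alpha': 0.1, 'gamma': 0.9},
}
```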
To train the agent, I will make it play against 4 heuristic players (players programmed to follow a fixed strategy), each using a different playing style.
For the starting stage, the agent plays against Player 1, a purely random player, for 2000 games, updating its Q-values for all the state-action pairs it encounters.
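Player 1 is as simple as it sounds, something like (reusing the random import from earlier):

```python
def random_player(board):
    """Player 1: pick any empty cell at random."""
    return random.choice([i for i, c in enumerate(board) if c == ' '])
```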
Then for the next 2000 games, the agent plays against Player 2, which favours the centre if it is available for its first move, otherwise plays any corner, and otherwise plays any random available cell.
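A sketch of Player 2 (here I have it prefer the centre whenever it is free, not only on its first move, then a corner, then anything):

```python
def center_corner_player(board):
    """Player 2: centre first, then any free corner, then any free cell."""
    if board[4] == ' ':
        return 4
    corners = [i for i in (0, 2, 6, 8) if board[i] == ' ']
    if corners:
        return random.choice(corners)
    return random_player(board)
```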
Then for the next 2000 games, the agent plays against Player 3, which plays randomly until the agent is one move away from winning, and then blocks that move. It is not really trying to win itself, only to block the agent's winning moves.
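A sketch of Player 3, reusing LINES and random_player from the snippets above:

```python
def blocking_player(board):
    """Player 3: block any cell where X would complete a line next turn; otherwise play randomly."""
    for a, b, c in LINES:
        trio = [board[a], board[b], board[c]]
        if trio.count('X') == 2 and trio.count(' ') == 1:
            return (a, b, c)[trio.index(' ')]
    return random_player(board)
```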
Then for the next 2000 games, the agent plays against Player 4, which tries to complete a line as quickly as possible by playing the corners in a way the agent cannot fully block: it builds a triangular corner fork that creates two winning threats at once, which leads to a definite win unless the agent wins first.
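A rough sketch of Player 4 (my interpretation of the corner-fork strategy, reusing the helpers above):

```python
def fork_player(board):
    """Player 4: win now if possible, otherwise grab the corner that creates the most
    simultaneous O threats (a fork), otherwise fall back to a random move."""
    empties = [i for i, c in enumerate(board) if c == ' ']
    # 1. complete a line immediately if we can
    for i in empties:
        if winner(place(board, i, 'O')) == 'O':
            return i
    # 2. pick the corner that leaves O with the most two-in-a-row threats
    def threats(b):
        count = 0
        for a, m, c in LINES:
            cells = [b[a], b[m], b[c]]
            if cells.count('O') == 2 and cells.count(' ') == 1:
                count += 1
        return count
    corners = [i for i in (0, 2, 6, 8) if board[i] == ' ']
    if corners:
        return max(corners, key=lambda i: threats(place(board, i, 'O')))
    return random_player(board)
```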
I will then create separate classes (in the .NET MAUI app) for these four players, so that a selected agent can be trained against any of them, with an option to choose how many game iterations to run against that player.
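The actual training classes will live in the C#/.NET MAUI app, but structurally they boil down to something like this Python sketch, reusing the pieces above together with the reward function sketched in the next section:

```python
PLAYERS = {
    'random': random_player,
    'center_corner': center_corner_player,
    'blocker': blocking_player,
    'fork': fork_player,
}

def train(q_table, agent_name, player_name, iterations=2000):
    """Run `iterations` training games for one agent preset against one heuristic player."""
    global ALPHA, GAMMA                 # crude: overwrite the constants used by update_q
    ALPHA = AGENT_PRESETS[agent_name]['alpha']
    GAMMA = AGENT_PRESETS[agent_name]['gamma']
    for _ in range(iterations):
        # best_action is purely greedy; a real trainer would add some exploration (epsilon-greedy)
        play_training_game(q_table, best_action, PLAYERS[player_name], reward_for)
```

So a single stage of the schedule above would be e.g. train(q_table, 'Quick Myopic', 'blocker', iterations=2000).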
The reward system rules are: +1 for completing a line; -1 for allowing the opponent to win instead of blocking it; +0.5 for playing a position that can lead to a win on the agent's next move; -0.5 for playing a position that cannot lead to a win on its next move (a move that does not form part of a line of three whose remaining cells are still empty); and -0.5 for playing along a line the opponent has already played in.
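Here is how I read those rules as a function (my interpretation only, reusing place, winner and LINES from the earlier snippets):

```python
def reward_for(state, action, board_after):
    """`state` is the board before the agent's move, `action` is the cell it played, and
    `board_after` is the board after the opponent replied (or after the agent's move if
    the game ended there)."""
    after_move = place(state, action, 'X')
    if winner(after_move) == 'X':
        return 1.0                                    # completed a line
    if winner(board_after) == 'O':
        return -1.0                                   # let the opponent win instead of blocking
    threat = False
    contested = False
    for a, b, c in LINES:
        if action not in (a, b, c):
            continue
        cells = [after_move[a], after_move[b], after_move[c]]
        if cells.count('X') == 2 and cells.count(' ') == 1:
            threat = True                             # this move sets up a win next turn
        if 'O' in cells:
            contested = True                          # this line is already held by the opponent
    reward = 0.5 if threat else -0.5
    if contested:
        reward -= 0.5
    return reward
```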
The Bellman formula will then be used to update the Q-value of every action taken in every state, in the Q-table already defined in the JSON file belonging to the particular agent being trained. Each update combines the learning rate, the discount factor, and the reward from the reward system for that action in that state.
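Putting it together, each agent's JSON file would be loaded, trained against one player, and written back, roughly like this (the file naming is just an assumption; json is imported in the generator snippet above):

```python
def train_from_json(agent_name, player_name, iterations=2000,
                    path_template='{name}_q_table.json'):
    """Load the agent's own Q-table file, train it against one heuristic player,
    and write the updated table back."""
    path = path_template.format(name=agent_name.lower().replace(' ', '_'))
    with open(path) as f:
        raw = json.load(f)
    # json stores the inner action keys as strings, so convert them back to ints
    q_table = {s: {int(a): q for a, q in actions.items()} for s, actions in raw.items()}
    train(q_table, agent_name, player_name, iterations)
    with open(path, 'w') as f:
        json.dump(q_table, f)
```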
Below is the link to the Q-table (the brain of the AI):
https://www.kaggle.com/datasets/adedapoadeniran/tictactoe-q-learning-table
And below is the link to the code that generated the Q table.
https://www.kaggle.com/code/adedapoadeniran/reinforcement-learning-for-tictactoe-ai/
Thanks for your attention.
This was really mentally taxing to come up with and figure out.