This past week I worked on my implementation of Q-learning, a reinforcement learning method that converges quickly and requires no prior knowledge of the environment. I plan to use Q-learning to model working memory, with an agent that learns from past experience whether to replace or retain the contents of working memory.
On Tuesday, I presented the concept of Q-learning at lab meeting and discussed ways in which Q-learning might help with modeling working memory. One possible drawback of this application is that Q-learning makes no initial assumptions about a system, whereas it seems reasonable that working memory might employ certain assumptions about the environment. Since we have chosen to model the learning process, this could introduce a discrepancy between the behavior of our model and the behavior of working memory.
Before I begin working on the working memory model, I want to ensure that my implementation works properly and to develop my intuition for Q-learning. I finished implementing a more generalized Q-learning algorithm and applied it to a random-walk Markov decision process to confirm the algorithm's accuracy. For an example of a random walk process, take a look at figure 6.5 in this link. In my program there are 5 states, of which the two end states (states 1 and 5) are terminal. The rightmost state (state 5) has a reward of 1; all other states have a reward of 0. The agent begins each episode at state 3 and moves either left (action 1) or right (action 2) until reaching a terminal state. With each action taken, the Q-value for the (state, action) pair is re-evaluated according to the following formula:
Q(s,a) = Q(s,a) + α[r + γ Q(s',a'*) - Q(s,a)]
where a'* represents the best possible action at the next state s', α represents a learning rate, γ represents a discount factor on future reward, r represents the immediate reward upon reaching the next state, and Q(s, a) is the estimated value of a (state, action) pair. The algorithm runs over many episodes, repeatedly updating the Q(s,a) values.
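As a minimal sketch, this update rule translates almost directly into code (the function and variable names here are my own, not taken from my actual implementation):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update on the table Q (a dict keyed by (state, action)).

    Implements Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    For a terminal s_next, the caller should ensure its Q-values are zero so
    that it contributes no future reward.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```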
Given this formula and a high gamma value, we can expect that taking action 1 (left) from state 2 guarantees zero future reward, so Q(state 2, action 1) should equal zero. Similarly, we can expect Q(state 4, action 2) to equal 1, because taking action 2 in state 4 guarantees a reward of 1.
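To check these expectations, here is a self-contained sketch of the random-walk experiment. It is not my exact implementation: the epsilon-greedy exploration, learning rate, and episode count are assumptions I have filled in.

```python
import random

def run_random_walk(episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    states = [1, 2, 3, 4, 5]
    actions = [1, 2]                      # 1 = left, 2 = right
    terminal = {1, 5}
    reward = {s: 1.0 if s == 5 else 0.0 for s in states}
    Q = {(s, a): 0.0 for s in states for a in actions}

    for _ in range(episodes):
        s = 3                             # every episode starts in the middle
        while s not in terminal:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s_next = s - 1 if a == 1 else s + 1
            # a terminal next state contributes no future reward
            best_next = 0.0 if s_next in terminal else \
                max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (reward[s_next] + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

With these settings, Q(2, left) should stay at zero, Q(4, right) should approach 1, and moving right from the start state should come out ahead of moving left.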
After much debugging, my algorithm outputs results that look reasonable:
('s_1', 'a_1') 0
('s_1', 'a_2') 0
('s_2', 'a_1') 0.0
('s_2', 'a_2') 0.809523722886
('s_3', 'a_1') 0.727317242186
('s_3', 'a_2') 0.9
('s_4', 'a_1') 0.809658069053
('s_4', 'a_2') 1.0
('s_5', 'a_1') 0
('s_5', 'a_2') 0
Once I had confirmed the Q-learning algorithm was working, I began work on the first iteration of the working memory (WM) model.
This model is based on the experiments of Collins and Frank (2012)1, in which a subject is presented with a series of stimuli (pictures) and asked to respond by pressing one of three keys for each picture. Upon their response, a tone indicates whether they are correct or incorrect. Stimuli are repeated so that the subject can learn the proper responses.
In our first model, we consider a system with 2 stimuli and a working memory capacity of 1. Alpha represents the probability that the presented stimulus switches between trials. A state is composed of the stimulus presented and the stimulus held in memory. At each time step, the agent selects between 2 actions: maintain the current contents of working memory, or replace them with the just-presented stimulus. The agent will learn an optimal policy depending on the value of alpha.
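As a sketch of how this first iteration might look in code: the reward structure here (reward 1 whenever the stimulus in memory matches the presented one) is a simplifying assumption of mine, and the parameter values and names are placeholders.

```python
import random

def simulate_wm(alpha_switch=0.1, steps=50000, lr=0.1, gamma=0.9,
                epsilon=0.1, seed=0):
    rng = random.Random(seed)
    stimuli = [0, 1]
    actions = ["maintain", "replace"]
    # state = (presented stimulus, stimulus held in memory)
    Q = {((p, m), a): 0.0 for p in stimuli for m in stimuli for a in actions}

    presented, memory = 0, 1
    for _ in range(steps):
        s = (presented, memory)
        # epsilon-greedy choice between maintaining and replacing
        if rng.random() < epsilon:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        if a == "replace":
            memory = presented
        # assumed reward: 1 when memory now matches the presented stimulus
        r = 1.0 if memory == presented else 0.0
        # the presented stimulus switches with probability alpha_switch
        if rng.random() < alpha_switch:
            presented = 1 - presented
        s_next = (presented, memory)
        best_next = max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += lr * (r + gamma * best_next - Q[(s, a)])
    return Q
```

Under this simplified reward, the agent should learn to prefer replacing whenever the memory and the presented stimulus disagree.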
On Friday I met with Paul and the Lifelong Learning team: Mike, Deniz, and Subhash. Subhash presented on System Identification, a field that examines techniques for building models of a system from interaction with its environment. System Identification is particularly useful when dealing with time-variant systems, that is, systems whose output depends explicitly on time. I speculate that System Identification might be an interesting way to approach the question of working memory's versatility. Perhaps system identification techniques could help explain how working memory is able to perform tasks as diverse as reading comprehension, mental math, and list memorization.
Next week I hope to finish this first iteration and create a second iteration with a higher memory capacity and an ability to forget, encoded in the transition probabilities. I plan to analyze the output of my models and produce graphs in an IPython notebook. Until then!