Reinforcement learning is a branch of artificial intelligence in which a computer perceives its current state on its own and, among the available actions, takes the one with the highest predicted reward.
In this post we will practice implementing algorithms such as Q-learning and deep Q-learning (DQN) in the OpenAI Gym and TensorFlow environments.
This post was written with reference to Professor Sung Kim's online lectures.
What is reinforcement learning?
Basic idea: We can learn from past experiences.
Objects
- Environment
- Actor
Basic rules
- The actor's actions can change the environment.
- After an action, the observation (state) changes.
- After taking actions, the actor can receive a reward (see the toy sketch below).
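To make this loop concrete, here is a tiny self-contained toy example (my own illustration, not from the lecture): an actor moves randomly along a line and is rewarded when it reaches position 5.

```python
import random

# Toy illustration of the state -> action -> reward loop (illustrative only).
position = 0                              # current observation (state)
for step in range(20):
    action = random.choice([-1, +1])      # the actor selects an action
    position += action                    # the action changes the environment
    reward = 1 if position == 5 else 0    # the actor receives a reward
    print(step, position, reward)
    if reward == 1:                       # stop once the goal is reached
        break
```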
Environment
- Python
- TensorFlow
    sudo apt-get install python-pip python-dev
    pip install tensorflow
    # or, for GPU support:
    pip install tensorflow-gpu
- OpenAI Gym
    sudo apt install cmake
    sudo apt-get install zlib1g-dev
    sudo -H pip install gym
    sudo -H pip install gym[atari]
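After installation, a quick sanity check (assuming the packages above installed without errors) confirms that both libraries can be imported and that the FrozenLake environment is available:

```python
# Quick check that TensorFlow and Gym are installed correctly.
import gym
import tensorflow as tf

print("gym version:", gym.__version__)
print("tensorflow version:", tf.__version__)

env = gym.make("FrozenLake-v0")                        # the environment used in this post
print("observation space:", env.observation_space)    # Discrete(16)
print("action space:", env.action_space)              # Discrete(4)
```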
OpenAI GYM
A toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go. - OpenAI
import gym

env = gym.make("FrozenLake-v0")
observation = env.reset()  # reset the environment
for _ in range(1000):
    env.render()  # show the environment
    action = env.action_space.sample()  # your agent here (this one acts randomly)
    observation, reward, done, info = env.step(action)
    # observation: the new state
    # reward: reward obtained from the action
    # done: whether the game is over
    # info: additional info
import gym
from gym.envs.registration import register
import sys, tty, termios  # for reading keyboard input (used in the sketch below)

# Register a deterministic (non-slippery) version of FrozenLake
register(id='FrozenLake-v3',
         entry_point='gym.envs.toy_text:FrozenLakeEnv',
         kwargs={'map_name': '4x4', 'is_slippery': False})
env = gym.make('FrozenLake-v3')
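The extra imports (`sys`, `tty`, `termios`) are there to read arrow keys, so the FrozenLake map can be played by hand. Below is a sketch of how that might look, continuing from the snippet above; the raw-terminal reader and the `arrow_keys` mapping are my own additions (assuming a Unix terminal and the standard action encoding LEFT=0, DOWN=1, RIGHT=2, UP=3):

```python
def read_key():
    """Read one keypress in raw mode; arrow keys arrive as a 3-byte escape sequence."""
    fd = sys.stdin.fileno()
    old_settings = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)
        key = sys.stdin.read(3)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
    return key

# Map arrow-key escape sequences to FrozenLake actions (LEFT=0, DOWN=1, RIGHT=2, UP=3).
arrow_keys = {'\x1b[A': 3, '\x1b[B': 1, '\x1b[C': 2, '\x1b[D': 0}

state = env.reset()
env.render()
while True:
    key = read_key()
    if key not in arrow_keys:
        print("Game aborted!")
        break
    action = arrow_keys[key]
    state, reward, done, info = env.step(action)
    env.render()
    print("State:", state, "Action:", action, "Reward:", reward, "Info:", info)
    if done:
        print("Finished with reward", reward)
        break
```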
Q-Learning
The agent does not know which action is right or good for itself. If the agent could ask someone which action to take, that would be helpful. Q can answer this question.
Q function
It is also called the "state-action value function".
Q answers the agent's question: in the present state, if you take a certain action, how much reward (quality) can you expect?
Input parameters
- state
- action
Output
- quality (the expected reward); a small table example follows below
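For a small, discrete environment like FrozenLake, the Q function can simply be stored as a table with one row per state and one column per action. A minimal sketch (the 16-state, 4-action shape matches the 4x4 map used later):

```python
import numpy as np

# Q-table for FrozenLake: 16 states (4x4 grid) x 4 actions, initialized to zero.
n_states, n_actions = 16, 4
Q = np.zeros([n_states, n_actions])

state, action = 0, 2                       # e.g. at the start state, move right
print("Q(s=0, a=2) =", Q[state, action])   # the quality (expected reward) of that pair
```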
Policy
$Q(\text{state}, \text{action})$
Optimal policy with Q
\[\max Q = \max_{a'}Q(s,a')\\ \pi^*(s) = \arg\max_{a}Q(s,a)\]
- Find the maximum reward over the possible actions.
- Select the action (argument) that gives the maximum reward; see the snippet below.
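With the table representation above, the optimal policy is just an argmax over the action dimension (a sketch, assuming the `Q` array from the previous example):

```python
import numpy as np

Q = np.zeros([16, 4])                  # Q-table as above (values would come from learning)
state = 0

best_value = np.max(Q[state, :])       # max_{a'} Q(s, a')
best_action = np.argmax(Q[state, :])   # pi*(s) = argmax_a Q(s, a)
greedy_policy = np.argmax(Q, axis=1)   # greedy action for every state at once
```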
Learning Q
Assume $Q$ at $s'$ already exists:
- I am in state $s$.
- When I take action $a$, I will move to $s'$.
- When I take action $a$, I will receive reward $r$.
- $Q$ at $s'$, i.e. $Q(s',a')$, is known.
Algorithm
For each $s,a$ initialize table entry $\hat Q(s,a) \leftarrow 0$
Observe current state $s$
Do forever:
- Select an action $a$ and execute it
- Receive immediate reward $r$
- Observe the new state $s'$
- Update the table entry for $\hat Q(s,a)$ as follows (one line of code, shown below):
\[\hat Q(s,a) \leftarrow r + \max_{a'}\hat Q(s',a')\]
- $s \leftarrow s'$
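The update rule in the last step corresponds to a single line of code, assuming `Q` is a NumPy array and `s_next`, `r` come from `env.step` (illustrative values below):

```python
import numpy as np

Q = np.zeros([16, 4])              # table from the initialization step
s, a, r, s_next = 0, 1, 0.0, 4     # illustrative values for one transition

# Q(s,a) <- r + max_{a'} Q(s',a')
Q[s, a] = r + np.max(Q[s_next, :])
s = s_next                         # move on to the new state
```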
Example - Frozen lake
A simple game: move from the start to the goal while avoiding the holes.
S | F | F | F |
---|---|---|---|
F | H | F | H |
F | F | F | H |
H | F | F | G |
S = Start point, H = Hole
G = Goal, F = Frozen path (safe to step on)
Initial table (all entries start at 0)
S(A) | 0 | 0 | 0 |
---|---|---|---|
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
Final table (example): the learned best action for each state
S(A) | R | D | L |
---|---|---|---|
D | -1 | D | -1 |
R | D | D | -1 |
-1 | R | R | 1 |
U = Up, D = Down
L = Left, R = Right
Dummy Q-learning (Python)
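A minimal sketch of such a dummy Q-learning loop on the deterministic FrozenLake (it assumes the 'FrozenLake-v3' environment registered earlier; the random tie-breaking helper `rargmax` is my own addition):

```python
import gym
import numpy as np

env = gym.make('FrozenLake-v3')    # deterministic FrozenLake registered earlier (assumed id)

def rargmax(vector):
    """Argmax that breaks ties randomly (helper added here, not from the original text)."""
    m = np.max(vector)
    indices = np.nonzero(vector == m)[0]
    return np.random.choice(indices)

Q = np.zeros([env.observation_space.n, env.action_space.n])
num_episodes = 2000
rewards = []

for i in range(num_episodes):
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = rargmax(Q[state, :])                         # pick the best known action
        new_state, reward, done, _ = env.step(action)
        Q[state, action] = reward + np.max(Q[new_state, :])   # dummy update: no discount
        total_reward += reward
        state = new_state
    rewards.append(total_reward)

print("Success rate:", sum(rewards) / num_episodes)
```

Because every Q-value starts at 0, `rargmax` breaks ties at random so the agent does not always pick action 0 when all values are equal.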
What is the problem in dummy Q-learning?
Exploit vs Exploration
Exploit: follow the way that currently looks best (use what you already know).
Exploration: visit somewhere you have never been before (try new actions).
How to solve the problem?
E-greedy

e = 0.1
if np.random.rand() < e:
    a = env.action_space.sample()  # explore: pick a random action
else:
    a = np.argmax(Q[state, :])     # exploit: pick the best known action
But after many steps, we no longer need to explore as often.
Decaying E-greedy

for i in range(1000):
    e = 0.1 / (i + 1)                     # exploration rate decays over time
    if np.random.rand() < e:
        a = env.action_space.sample()     # explore
    else:
        a = np.argmax(Q[state, :])        # exploit
Add random noise

a = np.argmax(Q[state, :] + np.random.randn(env.action_space.n))

for i in range(1000):  # decaying noise
    a = np.argmax(Q[state, :] + np.random.randn(env.action_space.n) / (i + 1))
Because random noise is added to the values, a different (non-greedy) action is sometimes selected.
The difference between E-greedy and adding random noise
E-greedy picks a completely random action instead of the best one whenever the drawn random number is smaller than e.
The random-noise method is more likely to pick the second- or third-best action when the noise happens to push the best value down, so its exploration stays close to the good actions instead of being uniformly random.
Discounted reward
It is similar to depreciation (the time value of money) in economics: a reward received now is worth more than the same reward received in the future.
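For example, with a discount factor of $\gamma = 0.9$, a reward of 1 that is two steps away is worth only

\[0.9^2 \times 1 = 0.81\]

from the current state, so the agent prefers the path that reaches the reward sooner.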
Q-learning equation
\[\hat Q(s,a) \leftarrow r + \gamma \max_{a'} \hat Q (s',a')\\ \gamma:\text{discount factor}\\ \hat Q : \text{approximation of }Q\]
$\hat Q$ converges to $Q$:
- in deterministic worlds
- with finitely many states
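Putting the pieces together, here is a sketch of tabular Q-learning with the discount factor and decaying random-noise exploration on the deterministic FrozenLake (the constants, e.g. $\gamma = 0.9$ and 2000 episodes, are illustrative choices; 'FrozenLake-v3' is the environment registered earlier):

```python
import gym
import numpy as np

env = gym.make('FrozenLake-v3')    # deterministic FrozenLake registered earlier (assumed id)

Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.9                        # discount factor (illustrative value)
num_episodes = 2000
rewards = []

for i in range(num_episodes):
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        # Explore with decaying random noise added to the Q-values.
        action = np.argmax(Q[state, :] +
                           np.random.randn(env.action_space.n) / (i + 1))
        new_state, reward, done, _ = env.step(action)
        # Q-learning update with discounted future reward.
        Q[state, action] = reward + gamma * np.max(Q[new_state, :])
        total_reward += reward
        state = new_state
    rewards.append(total_reward)

print("Success rate:", sum(rewards) / num_episodes)
```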