Tuesday, October 30, 2018

Let's read: Sutton's RL, week 2, chap 3

For week 2, we will do:
  • Read the definition given for artificial intelligence in Wikipedia and in the Nilsson book on p13; 
  • google for and read “John McCarthy basic questions”, “the intentional stance (dictionary of philosophy of mind)”
  • Sutton & Barto Chapter 3 to Section 3.5

Reading

From Wikipedia
intelligence demonstrated by machines, ... "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, ... a machine [that] mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving".
From The Quest for Artificial Intelligence (2009), Nils Nilsson:
intelligence is that quality that enables an entity to function appropriately and with foresight in its environment... Because “functioning appropriately and with foresight” requires so many different capabilities, depending on the environment, we actually have several continua of intelligences with no particularly sharp discontinuities in any of them. For these reasons, I take a rather generous view of what constitutes AI.
From John McCarthy's "Basic Questions":
Intelligence is the computational part of the ability to achieve goals in the world... we cannot yet characterize in general what kinds of computational procedures we want to call intelligent. 
My own opinion is that the computers of 30 years ago were fast enough if only we knew how to program them.
This is interesting, very interesting... We know that Turing machines of even modest size can behave in extremely complex ways -- such as an 8000-state machine that checks the consistency of ZFC, so that its halting problem is unsolvable in ZFC (assuming ZFC is consistent). Who knows what they are capable of? If the LISP machines of the 1980s were frozen for eternity, maybe artificial intelligence would arise from them after two thousand years of hacking.

Biological beings arose from dense coding on the DNA, done by nature, the stupid and patient hacker.
Unfortunately, the competitive and commercial aspects of making computers play chess have taken precedence over using chess as a scientific domain. It is as if the geneticists after 1910 had organized fruit fly races and concentrated their efforts on breeding fruit flies that could win these races.
I searched and it turned out indeed nobody bothered racing or breeding Drosophila for fun... Good!
Sooner or later, AI research will overcome this scandalous weakness [playing Go]
He should update this now! Oh wait, he died back in 2011.
The philosopher Hubert Dreyfus says that AI is impossible. The computer scientist Joseph Weizenbaum says the idea is obscene, anti-human and immoral.
"Obscene"? Turns out it's from Computer Power and Human Reason (1976), Joseph Weizenbaum, and something that John McCarthy criticized back then. For some context, Weizenbaum wrote ELIZA, a program that does basic Rogerian psychotherapy. He was very disturbed by the fact that it was very popular and people treated ELIZA like a real person, and even more by Kenneth Colby's proposal to make psychotherapist programs.
There are, however, two kinds of computer applications that either ought not to be undertaken at all, or, if they are contemplated, should be approached with utmost caution... The first kind I would call simply obscene. These are ones whose very contemplation ought to give rise to feelings of disgust in every civilized person.
The proposal I have mentioned, that an animal's visual system and brain be coupled to computers, is an example... I would put all projects that propose to substitute a computer system for a human function that involves interpersonal respect, understanding, and love in the same category. I therefore reject Colby's proposal that computers be installed as psychotherapists, not on the grounds that such a project might be technically infeasible, but on the grounds that it is immoral.
From "The Intentional Stance":
According to Daniel Dennett, there are three different strategies that we might use when confronted with objects or systems: the physical stance, the design stance, and the intentional stance... We use them to predict and thereby to explain the behavior of the entity in question... 
Three ways to think about a chess computer:

Physical: a pile of intricate metal and plastic with electricity inside.
Design: a system designed to play chess
Intentional: a player who wants to win in chess games

The three strategies are different in accuracy and ease of use. Intentional is very easy to use but likely to be inaccurate when the lower levels fail; physical is very hard to use but very accurate.
If the computer suddenly started behaving in a manner inconsistent with something a reasonable chess player would do, we might have to adopt the design stance. In other words, we might have to look at the particular chess-playing algorithm implemented by the computer in order to predict what it will subsequently do. And in cases of more extreme malfunction—for example, if the computer screen were suddenly to go blank and the system were to freeze up—we would have to revert to thinking of it as a physical object to explain its behavior adequately.

Basic structure of an RL agent, and the MDP framework


An RL agent receives rewards and observations and emits actions. A Markov Decision Process (MDP) models the environment as a Markov chain. For non-Markov environments (such as when history beyond the current state matters), an MDP can still sometimes be a good approximation.

An MDP is defined by this function:
\[p(s', r | s, a) = p(\text{next state}, \text{reward}| \text{current state}, \text{action})\]
and is usually drawn in a picture like a finite state machine:

An RL agent is drawn like in control theory:

Note that the reward that follows $A_t$ is $R_{t+1}$, which might cause confusion.
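
To make the indexing concrete, here's a minimal Python sketch (the two-state "cool"/"hot" dynamics and all names are my own toy illustration, not from the book) of an MDP given as $p(s', r | s, a)$, together with an agent-environment loop where the reward following $A_t$ is recorded as $R_{t+1}$:

    import random

    # Toy MDP: p(s', r | s, a) stored as dynamics[s][a] = list of (prob, s', r).
    dynamics = {
        "cool": {"wait": [(1.0, "cool", 0.0)],
                 "work": [(0.8, "cool", 2.0), (0.2, "hot", 2.0)]},
        "hot":  {"wait": [(0.9, "cool", 0.0), (0.1, "hot", 0.0)],
                 "work": [(1.0, "hot", -10.0)]},
    }

    def step(s, a):
        """Sample (s', r) from p(s', r | s, a)."""
        probs, outcomes = zip(*[(p, (s2, r)) for p, s2, r in dynamics[s][a]])
        return random.choices(outcomes, weights=probs)[0]

    # Agent-environment loop: the reward produced by A_t is logged as R_{t+1},
    # together with S_{t+1}, matching the book's indexing convention.
    s = "cool"
    for t in range(5):
        a = random.choice(list(dynamics[s]))    # A_t, here from a uniformly random policy
        s_next, r = step(s, a)                  # S_{t+1}, R_{t+1}
        print(f"t={t}: S_t={s}, A_t={a}, R_t+1={r}, S_t+1={s_next}")
        s = s_next
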
It is typical of reinforcement learning tasks to have states and actions with such structured representations [like vectors]. Rewards, on the other hand, are always single numbers.
This caught my eye. Why must rewards be real numbers?

Why must rewards be reals?

  1. Presumably, it's because rewards must be added and ranked. We can try to generalize. Let $R$ be the set of rewards.
  2. To add rewards, $R$ must be a monoid. The unit element $0$ symbolizes the innocence before any reward. 
  3. The commutativity of addition of rewards symbolizes the time-symmetry of desire. A cake eaten in 2001 and a cake eaten in 2002 should be equally good for an agent. Desires change of course, but we'll add the complication later in a more hacky way. Satisfactions are thus formalized as eternal.
  4. The possibility of taking inverse is less easy to justify. Must every reward have a corresponding punishment? It seems possible, at least. For every reward I can think of, there is a thinkable opposite that is just as bad, such that their combination is precisely worthless. A heaven and a hell are imagined together like a shadow and a light. So, we get a commutative group.
  5. Some rewards are better than others, any two rewards are comparable, and there are no strictly "cyclic" rewards. So we get a linear preorder on the rewards. As usual, we take the quotient to get a linear order.
  6. The number of rewards we need to consider is countable. Reasonable, since nothing needs to be uncountable in this universe.
And finally by a standard theorem in order theory:

Theorem: The real line is the largest order-separable linearly ordered set. That is, if $(L,\preceq)$ is a linearly ordered set such that a countable set $C\subseteq L$ exists satisfying that for all $x,y\in L$ with $x\prec y$ there is $c\in C$ with $x\preceq c\preceq y$, then $L$ is embedded in $\mathbb{R}$. (The existence of such a set $C$ is also necessary.)

Which gives the easy Corollary: Any countable linearly ordered set is embedded in the reals.

Thus we have proven that any "reasonable" reward set is embedded in the reals, and thus there's no need to consider other reward sets unless you want to break one of the assumptions.
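
As a sanity check of the corollary, here's a small Python sketch (the function name and the greedy strategy are mine, not a standard library routine) that embeds any countable linearly ordered set into the rationals, and hence into the reals, by placing each new element strictly between the images of its already-placed neighbours:

    from fractions import Fraction

    def embed_in_rationals(elements, precedes):
        """Order-embed a countable linear order into Q.

        elements: the (distinct) elements, listed in any order.
        precedes(a, b): True iff a strictly precedes b in the linear order.
        Returns a dict mapping each element to a Fraction, preserving order.
        """
        image = {}
        for x in elements:
            below = [q for y, q in image.items() if precedes(y, x)]
            above = [q for y, q in image.items() if precedes(x, y)]
            if not below and not above:
                image[x] = Fraction(0)
            elif not below:
                image[x] = min(above) - 1
            elif not above:
                image[x] = max(below) + 1
            else:
                # By induction the map so far is order-preserving, so max(below) < min(above).
                image[x] = (max(below) + min(above)) / 2
        return image

    # Example: a finite "reward set" fed in scrambled order.
    print(embed_in_rationals([1, -2, 0, 2, -1], lambda a, b: a < b))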

Back from the madness of set theory

Ex 3.1
  • Racing car. 
  • States: position, velocity, direction, engine speed, fuel, etc. 
  • Reward: ranked as the final rank. 
  • Actions: wheel angle, gear ratio, fuel power, etc.

  • Pregnancy, as viewed by the hypothalamus. 
  • States: health of the fetus sensed by the nerves in the uterus. 
  • Reward: a healthy living fetus = 1, a sick living fetus = 0.5, a dead fetus = 0. 
  • Actions: uterine temperature, blood sugar, blood flow through the placenta, concentration of leptin, oxytocin, feelings of fear in the host, etc.

  • Theorem proving. 
  • States: the proof so far. 
  • Reward: prove the theorem = 1, not prove the theorem = 0. 
  • Actions: add new implications to the proof.

Ex 3.2: No. For example, an environment that is a Markov process but whose transition probabilities change over time, with states appearing and disappearing; or playing games against an intelligent opponent (instead of against nature).

Ex. 3.3: Philosophically, they are all possible; some would just lead to harder programs. For a human trying to drive, the most natural and productive way is to divide between the body and the car, so the actions are the motions of the arms and legs and such. The reason this is useful is that these concepts are the most natural and controllable parts of the world, from the point of view of a human.

Ex. 3.4: Easy.

Rewards, Returns, Episodes

The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved. 
Another warning about how to prevent wireheading and other perverse instantiation possibilities.

The discounted return at time $t$ is
\[G_t = \sum_{i=0}^{\infty} \gamma^i R_{t+1 + i}\]
where $\gamma \in [0, 1]$ is the discount rate.

It's easier to consider continuing tasks since there's no summation upper bound to bother with. 
For episodes, instead of terminating, make the terminal state ("death") have only one action, "stay dead", which leads back to "death" with reward 0. Thus episodic tasks are just a special kind of continuing task.

Alternatively, we can write 
\[G_t = \sum_{i=t+1}^{T} \gamma^{i-t-1} R_{i}\]
with $T \in \mathbb{N}$ the length of the episode, and continuing tasks considered as episodes with $T = \infty$.
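
A quick sketch of computing these returns by the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$ with $G_T = 0$ (my own code; the sample rewards and $\gamma = 0.5$ are chosen so the output matches the Ex. 3.8 answer below):

    def returns(rewards, gamma):
        """Given an episode's rewards [R_1, ..., R_T], return [G_0, ..., G_T]
        via the backward recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
        G = [0.0] * (len(rewards) + 1)
        for t in reversed(range(len(rewards))):
            G[t] = rewards[t] + gamma * G[t + 1]   # rewards[t] holds R_{t+1}
        return G

    print(returns([-1, 2, 6, 3, 2], gamma=0.5))    # -> [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]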

Ex. 3.5: ??? I don't see any need for modification.

Ex. 3.6: In the continuing, discounted case, with $K$ the number of steps until the pole first falls, we have
\[G_t = -\gamma^K -\gamma^{K+1} - \cdots  = -\frac{\gamma^K}{1-\gamma}\]
while in the episodic, discounted case we stop as soon as the pole drops:
\[G_t = -\gamma^K \]
There's no real difference, just a scaling.

Ex. 3.7: The robot is rewarded for escaping, not for speed. To make it go faster, inflict PAIN on it for every second it wastes in the maze. Lash at the worthless creature and make it wish it was never born just to suffer--
I mean, use negative reward to encourage it to go faster. For every step in the maze, give reward $-1$.
Alternatively, discounting works too. The agent would prefer to get the reward earlier rather than later, because it finds a far-off reward less rewarding than an immediate one, even though it would feel the same once it finally gets there.

Ex. 3.8: $G_0, \dots, G_5 = 2, 6, 8, 4, 2, 0$

Ex. 3.9: $G_0 = 65, G_1 = 70$

Ex. 3.10: Obvious.

Policies and Value functions

A policy is of the form $\pi (a | s) = Pr(\text{next action} | \text{current state})$. It obviously has to satisfy $\forall s, \sum_a \pi(a | s) = 1$.

Ex. 3.11:
\[\mathbb{E}_\pi [R_{t+1} | S_t = s] = \sum_a \pi(a|s) \sum_{s', r} r\cdot p(s', r | s, a)\]

Given a policy $\pi$, we can define the state-value function and the related state-action-value function for the policy:
\[v_\pi(s) = \mathbb{E}_\pi [G_t | S_t = s], \quad q_\pi(s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a]\]
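
The definition of $v_\pi$ can be read directly as a sampling procedure: roll out episodes under $\pi$ starting from $s$ and average the observed returns (the Monte Carlo idea). A minimal sketch, assuming hypothetical sample_action and step callables like the toy ones above:

    def estimate_v(s0, sample_action, step, is_terminal, gamma=0.9,
                   n_episodes=1000, horizon=100):
        """Estimate v_pi(s0) = E_pi[G_t | S_t = s0] by averaging sampled returns."""
        total = 0.0
        for _ in range(n_episodes):
            s, g, discount = s0, 0.0, 1.0
            for _ in range(horizon):        # truncate long or continuing episodes
                if is_terminal(s):
                    break
                a = sample_action(s)        # A_t ~ pi(. | S_t)
                s, r = step(s, a)           # S_{t+1}, R_{t+1} ~ p(., . | S_t, A_t)
                g += discount * r
                discount *= gamma
            total += g
        return total / n_episodes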


Ex. 3.12, 3.13:
\[v_\pi(s) = \sum_a \pi(a | s) q_\pi (s, a)\]
\[q_\pi(s, a) = \sum_{s', r} p(s', r | s, a)\left[ r + \gamma \cdot v_\pi (s') \right] \]

Combining them, we obtain the Bellman Equation:
\[v_\pi(s) = \sum_{a, s', r}  \pi(a | s) p(s', r | s, a)\left[ r + \gamma \cdot v_\pi (s') \right] \]

Given a policy $\pi$, solving for the value function is a matter of solving a system of linear equations. If it's too hard to calculate explicitly, it's possible to "back up" the values by repeatedly updating $v_\pi$ using the Bellman equation until it converges. In other words, the true $v_\pi$ is a fixed point of the Bellman equation.

Similarly, we have the Bellman equations for $q_\pi$:
\[q_\pi(s, a) = \sum_{s', r} p(s', r | s, a)\left[ r + \gamma \cdot \sum_{a'} \pi(a' | s') q_\pi (s', a') \right] \]
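
A minimal sketch of the "keep backing up $v_\pi$ with the Bellman equation until it converges" idea, reusing the toy dynamics dictionary format from the earlier sketch (again my own illustration, not the book's code):

    def policy_evaluation(dynamics, policy, gamma=0.9, tol=1e-10):
        """Iterate the Bellman equation for v_pi until the values stop moving.

        dynamics[s][a] = list of (prob, s', r) triples, i.e. p(s', r | s, a).
        policy[s][a]   = pi(a | s).
        """
        v = {s: 0.0 for s in dynamics}
        while True:
            delta = 0.0
            for s in dynamics:
                new_v = sum(policy[s][a] * p * (r + gamma * v[s2])
                            for a, outcomes in dynamics[s].items()
                            for p, s2, r in outcomes)
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < tol:
                return v

    # E.g. evaluate the uniformly random policy on the toy MDP above:
    # policy = {s: {a: 1.0 / len(acts) for a in acts} for s, acts in dynamics.items()}
    # print(policy_evaluation(dynamics, policy))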

Ex. 3.14: Boring.
Ex. 3.15: $v_c = \frac{c}{1-\gamma}$
Ex. 3.16: It would profoundly change the task. If $c > 0$, it would make the maze a nice place to stay and the robot would simply sit tight and reap the rewards. If $c < 0$, the robot would try to get out as fast as possible. The problem is that, unlike in Ex. 3.15, the total added reward depends on the length of the episode, so it doesn't just shift all values by a constant.
Ex. 3.17, 18, 19: see before.
