Saturday, October 27, 2018

Let's read: Sutton's RL, week 1, chap 1

This is the first post in a series written as I read through Richard Sutton's Reinforcement Learning: An Introduction (2nd edition, 2018), which is freely available on Sutton's site, thanks to his GNU-style philosophy of free distribution.

We will follow the textbook and do some of the assignments from the course folder, following the course schedule. The schedule lays it out as a 10-week course, though we won't go through all of it.

And a warning: I wrote this as a review, so you'd better read the book yourself; I will only write the bare minimum and not give further explanations. I will also post my exercise solutions.

Week 1 tasks




Readings

Wikipedia entry

Basic stuff. Side note: Yudkowsky proposed three related but different meanings of the word "singularity": 
  1. exponential growth: technology grows exponentially, and by fitting a curve we can predict that AI will exceed humans at a predictable date, around 2050. 
  2. incomprehensible future: a point beyond which we cannot predict what will happen, a wall we can't see past when we look into the future. Vinge thinks it would most likely be caused by superhuman AGI.
  3. intelligence explosion: sufficiently intelligent beings can improve their own intelligence, or create more intelligent beings, which creates a rapid increase in intelligence.

Vernor Vinge

Vinge popularized the word "singularity" in his 1993 essay, and predicted that the Singularity should happen in 2005-2030. In 2010, he still predicted it would happen before 2030. Now it's 2018 and the prediction looks too optimistic, but I do believe it will happen before humans go extinct, and I would bet it happens before 2050.

Vinge gave 4 paths to singularity:
  1. Artificial Intelligence (AI).
  2. Intelligence Amplification (IA): humans become enhanced digitally and biologically, and thus get smarter and more capable of making smarter beings.
  3. Computer Networks plus Humanity: like IA, but the enhancement is more "external", in the Internet [and other nets].
  4. Digital Gaia: Fine-grained distributed systems: Earth becomes so full of computers in an Internet of Things, that it comes alive like a Gaia intelligence, an Earth-brain.
He noted that the singularity is like the Cambrian explosion, and felt confident that humans would still have a place in the future, like "subroutine-threaded code", or like the mitochondria in our cells. Well, good for him. Maybe. If the world is fundamentally messy, then messy humans might just survive like that. If the world can be made neat, then the future AIs would likely neaten humans out of existence.

Hans Moravec

Machines with human-like performance will make economic sense only when they cost less than humans, say when their "brains" cost about $1,000. When will that day arrive? ...computer performance in the late 1990s seems to be doubling every 12 months. 
Moravec (1997) calculated that human-brain-level computation will be available in cheap machines in the 2020s. To get this estimate, he graphed the growth of MIPS per $1,000 [million instructions per second per $1,000, adjusted for inflation] over time and drew a trend line.

He then told a story about "landscape of human competence":
... having lowlands with labels like "arithmetic" and "rote memorization", foothills like "theorem proving" and "chess playing," and high mountain peaks labeled "locomotion," "hand-eye coordination" and "social interaction." We all live in the solid mountaintops, but it takes great effort to reach the rest of the terrain, and only a few of us work each patch.
Advancing computer performance is like water slowly flooding the landscape. A half century ago it began to drown the lowlands, driving out human calculators and record clerks, but leaving most of us dry. Now the flood has reached the foothills, and our outposts there are contemplating retreat. We feel safe on our peaks, but, at the present rate, those too will be submerged within another half century. I propose that we build Arks as that day nears, and adopt a seafaring life!
No word as to what "seafaring" could possibly mean in this analogy, but it seems to mean something like "control the robots so they do humans' bidding". Good luck with that, it'd require solving the AI-human alignment problem.

Chapter 1


This chapter introduces RL, and describes the basic architecture of an RL agent. An agent contains these parts:
  • policy [aka stimulus-response rules]: what to do next, based on the observation so far?
  • value function: how much reward can be expected in the future?
  • [optional] environment model: what is the outside world like?
  • inputs: observation, and reward signal.
  • output: action.
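As a minimal sketch, this anatomy could be written as a Python interface like the following (the class and method names are my own, not the book's):

```python
from abc import ABC, abstractmethod
from typing import Any

class RLAgent(ABC):
    """Skeleton of the parts listed above: policy, value function, optional model."""

    @abstractmethod
    def act(self, observation: Any) -> Any:
        """Policy: choose the next action, given the observations so far."""

    @abstractmethod
    def value(self, state: Any) -> float:
        """Value function: how much reward can be expected from this state."""

    @abstractmethod
    def learn(self, observation: Any, reward: float) -> None:
        """Update value estimates (and the optional environment model) from the reward signal."""
```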
1.6 gives an extended example of a tictactoe RL agent. It uses: 
  • Game value chart: start with a chart. On the left are all the possible board positions, on the right are the board values. 1 means "win", 0 means "lose" (a draw also counts as 0), and 0.5 means "unsure".
  • Move policy: by the ε-greedy algorithm. ε chance of exploring a random move, and 1-ε chance of using the current best move.
  • Update policy is a temporal-difference learning method: $$V(S_t) \leftarrow (1-\alpha)V(S_t) + \alpha V(S_{t+1})$$
  • $\alpha$ is the step-size parameter.
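Here is a minimal sketch of this Section 1.6 agent in Python. The encoding is my own assumption (a board is a 9-character string of 'X', 'O' and ' ', and values are learned for the X player only), so treat it as an illustration rather than the book's implementation:

```python
import random

EPS, ALPHA = 0.1, 0.5      # exploration rate and step size
V = {}                     # game value chart: board string -> estimated value for X

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def winner(b):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for i, j, k in LINES:
        if b[i] == b[j] == b[k] != ' ':
            return b[i]
    return None

def value(b):
    """1 if X has won, 0 if X has lost or drawn, 0.5 (unsure) otherwise."""
    if b not in V:
        w = winner(b)
        V[b] = 1.0 if w == 'X' else 0.0 if (w == 'O' or ' ' not in b) else 0.5
    return V[b]

def moves(b, p):
    """All board positions reachable when player p moves on board b."""
    return [b[:i] + p + b[i + 1:] for i in range(9) if b[i] == ' ']

def choose(b, p):
    """Epsilon-greedy move policy: explore with probability EPS, else play greedily."""
    nxt = moves(b, p)
    if random.random() < EPS:
        return random.choice(nxt), True        # exploratory move
    pick = max if p == 'X' else min            # O tries to minimise X's value
    return pick(nxt, key=value), False

def play_one_game():
    """Self-play one game, TD-updating X's values after each greedy X move."""
    b, p, prev = ' ' * 9, 'X', None
    while winner(b) is None and ' ' in b:
        nb, exploratory = choose(b, p)
        if p == 'X':
            if prev is not None and not exploratory:
                # V(S_t) <- (1 - alpha) V(S_t) + alpha V(S_{t+1})
                V[prev] = (1 - ALPHA) * value(prev) + ALPHA * value(nb)
            prev = nb
        b, p = nb, ('O' if p == 'X' else 'X')
    if prev is not None:                       # back up the final outcome
        V[prev] = (1 - ALPHA) * value(prev) + ALPHA * value(b)

for _ in range(10000):
    play_one_game()
```

With enough self-play the chart entries should drift from the initial 0.5 toward the true chances of winning from each position under this policy.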

1.7 is a lengthy history that I skipped entirely.

Exercises of chap 1

Ex. 1.1

It won't learn to exploit any particular opponent, since it only ever plays against itself. It would end up very hard to exploit, essentially minimax play.

Ex. 1.2

Process the observation signal so that symmetric positions are considered identical. This makes the program run faster and the policy/value table smaller.
We shouldn't rely on symmetry if the opponent doesn't, since the opponent might have some asymmetric weakness, which we can exploit only if we don't identify symmetric positions. For example, if the opponent is mortally afraid of playing in the top-left square, then that can be exploited.
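A small sketch of the first part (the canonicalisation helper is my own, not from the book): map every board to a canonical representative under the 8 symmetries of the square, and use that as the key into the value table.

```python
def symmetries(b):
    """Yield the 8 rotations/reflections of a 9-character board string."""
    grid = [list(b[0:3]), list(b[3:6]), list(b[6:9])]
    for _ in range(4):
        grid = [list(row) for row in zip(*grid[::-1])]        # rotate 90 degrees
        yield ''.join(''.join(row) for row in grid)
        yield ''.join(''.join(row[::-1]) for row in grid)     # mirrored copy

def canonical(b):
    """The lexicographically smallest symmetric variant: use this as the table key."""
    return min(symmetries(b))

# e.g. a corner opening is the same canonical position no matter which corner:
assert canonical('X' + ' ' * 8) == canonical(' ' * 8 + 'X')
```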

Ex 1.3

If the value estimates are accurate enough, then greediness is not a problem. Otherwise the greedy player never explores, so its wrong estimates may never get corrected, and it can be beaten by a less greedy player whose value estimates are better.
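A toy illustration of that failure mode, in a simpler setting than tictactoe (two moves with fixed win probabilities; entirely my own example, not from the book): a purely greedy learner can get an unlucky early result from the better move and abandon it forever, while an ε-greedy learner keeps sampling it.

```python
import random

def final_preference(eps, pulls=200, seed=0):
    """Learn running-average values for two moves; return the move preferred at the end."""
    rng = random.Random(seed)
    win_prob = {'A': 0.6, 'B': 0.4}            # move A is truly better
    value = {'A': 0.5, 'B': 0.5}               # initial estimates
    count = {'A': 0, 'B': 0}
    for _ in range(pulls):
        if rng.random() < eps:
            move = rng.choice(['A', 'B'])      # exploratory move
        else:
            move = max(value, key=value.get)   # greedy move
        reward = 1.0 if rng.random() < win_prob[move] else 0.0
        count[move] += 1
        value[move] += (reward - value[move]) / count[move]   # running mean
    return max(value, key=value.get)

for eps in (0.0, 0.1):
    wrong = sum(final_preference(eps, seed=s) == 'B' for s in range(1000))
    print(f"eps={eps}: ends up preferring the worse move in {wrong} of 1000 runs")
```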

Ex 1.4

It "smears" out the value. Imagine an old man who has a perfect brain, but such bad shaky hands that with probability ε he plays a move completely randomly. This old man would then play with awareness of his trembling hand. In particular, he would not play long exact sequences, but instead short and secure sequences that are robust under perturbation. Basically, not stunning tricks that need perfect execution, but good tricks that can be salvaged even if you make a wrong move.

If we keep exploring forever, then we should learn values that include the effect of the tremble. If we intend to stop exploring eventually and then keep playing without exploring, we should not learn the tremble.

The convergent value function, with trembling hand, is
$$V(S_t) = (1-\epsilon)\,V(S_{t+1}) + \frac{\epsilon}{N-1} \sum_{S' \neq S_{t+1}} V(S')$$
where $N$ is the number of states reachable from $S_t$, $S_{t+1}$ is the best next state according to the value function, and $S'$ ranges over all the other states reachable from $S_t$, that is,
$$S_{t+1} = \underset{S\text{ is reachable from } S_t}{\operatorname{argmax}} V(S)$$
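A tiny numeric sketch of this backup (my own example): the trembling-hand value of a position, computed from the values of its successor positions, prefers "good tricks" that are robust to a slip over "stunning tricks" that need exact follow-up.

```python
def trembling_value(successor_values, eps=0.1):
    """Expected value when the best move is played with probability 1 - eps,
    and one of the remaining moves is played uniformly at random otherwise."""
    best = max(successor_values)
    others = list(successor_values)
    others.remove(best)                        # drop one copy of the best move
    if not others:                             # only one move available
        return best
    return (1 - eps) * best + eps * sum(others) / len(others)

# A line worth 1.0 only with exact play, 0.0 after any slip...
print(trembling_value([1.0, 0.0, 0.0]))        # ~0.9
# ...is worth less than a solid line that is 0.95 no matter what:
print(trembling_value([0.95, 0.95, 0.95]))     # ~0.95
```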

Ex. 1.5

Use a giant lookup table and be done with tictactoe forever.
