Chapters 22-25 are about interactions with the environment:
- 22 is about reading,
- 23 is about writing,
- 24 is about feeling (perception), and
- 25 is about moving (robotics).
Probabilistic Language
Natural languages are ambiguous in meaning, spelling, and so on, so everything should be treated probabilistically.
N-Grams
Given a corpus (a very large sequence of characters), $P(c_{1:n})$ is the frequency of the character sequence $c_{1:n}$ in the corpus. An N-gram model is a Markov chain of order $N-1$: each character depends only on the $N-1$ characters before it.\[P(c_{i} = c \mid c_{1:i-1}) = P(c_{i} = c \mid c_{i-N+1:i-1})\]
which, by the Markov property and the chain rule, gives
\[P(c_{1:n}) = \prod_{i=1}^{n} P(c_{i} \mid c_{i-N+1:i-1}) \](We brush aside the edge case of what to do with $c_i$ when $i < 1$; a cheap solution is to pad with whitespace.)
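Here is a minimal Python sketch of the above: a character N-gram model with whitespace padding on the left edge. The toy corpus and the small probability floor for unseen N-grams are my own assumptions, not anything from the book.

```python
from collections import Counter

def ngram_counts(corpus, N):
    """Count length-N character windows of the corpus, padding the left edge with spaces."""
    padded = " " * (N - 1) + corpus
    grams = Counter(padded[i:i + N] for i in range(len(corpus)))
    contexts = Counter(g[:-1] for g in grams.elements())   # counts of the (N-1)-char contexts
    return grams, contexts

def cond_prob(c, context, grams, contexts):
    """Relative-frequency estimate of P(c | context), floored so unseen events are not zero."""
    if contexts[context] == 0:
        return 1e-12                                        # arbitrary floor (an assumption)
    return max(grams[context + c] / contexts[context], 1e-12)

def sequence_prob(text, grams, contexts, N):
    """P(c_{1:n}) = prod_i P(c_i | c_{i-N+1:i-1}), padding the left edge with spaces."""
    padded = " " * (N - 1) + text
    p = 1.0
    for i in range(len(text)):
        context, c = padded[i:i + N - 1], padded[i + N - 1]
        p *= cond_prob(c, context, grams, contexts)
    return p

grams, contexts = ngram_counts("the rain in spain stays mainly in the plain", N=3)
print(sequence_prob("rain in spain", grams, contexts, N=3))
```

The snippets further down reuse `ngram_counts` and `cond_prob` from this sketch.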
Bayesian language classification / text categorization
\[l^* = \underset{l}{\operatorname{argmax}}P(l)P(c_{1:n} | l) \]\[P(c_{1:n} | l) = \prod_{i=1}^{n} P(c_{i} | c_{i-N+1:i-1}; l)\]
This can be used for spam filtering, language identification (hard for Nynorsk vs Bokmål, though), etc.
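A hedged sketch of the argmax rule above, reusing `ngram_counts` and `cond_prob` from the previous snippet and working in log space to avoid underflow. The toy spam/ham corpora and the corpus-size priors are made up for illustration.

```python
import math

def log_likelihood(text, grams, contexts, N):
    """log P(c_{1:n} | l) under one class's N-gram model."""
    padded = " " * (N - 1) + text
    return sum(math.log(cond_prob(padded[i + N - 1], padded[i:i + N - 1], grams, contexts))
               for i in range(len(text)))

def classify(text, corpora, N=2):
    """l* = argmax_l  log P(l) + log P(c_{1:n} | l); priors from relative corpus sizes (an assumption)."""
    total = sum(len(c) for c in corpora.values())
    models = {l: ngram_counts(c, N) for l, c in corpora.items()}
    def score(l):
        grams, contexts = models[l]
        return math.log(len(corpora[l]) / total) + log_likelihood(text, grams, contexts, N)
    return max(corpora, key=score)

corpora = {"spam": "buy now cheap pills buy now cheap prize",
           "ham":  "see you at the meeting tomorrow about the report"}
print(classify("cheap pills now", corpora))   # should come out as "spam"
```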
Perplexity
Given a sequence of characters/words $c_{1:n}$, the raw number $P(c_{1:n})$ (the probability that this particular sequence is the one you get if you pick at random among all sequences of $n$ characters/words) is usually too small to be numerically stable, so we define perplexity instead.\[\operatorname{Perp}(c_{1:n}) = P(c_{1:n})^{-\frac{1}{n}}\]
Its log is, in a way, the average entropy per symbol of the sequence. High entropy means high randomness (Pinkie Pie speech); low entropy means low randomness (Applejack speech).
\[\log(\operatorname{Perp}(c_{1:n})) = -\frac{1}{n}\sum_{i=1}^{n} \log P(c_i \mid c_{i-N+1: i-1})\]
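The same helpers give the log-perplexity directly; the example strings below are arbitrary, and the "high" case mostly measures the unseen-N-gram floor rather than a real entropy estimate.

```python
import math

def log_perplexity(text, grams, contexts, N):
    """-(1/n) * sum_i log P(c_i | c_{i-N+1:i-1}); lower means the model is less surprised."""
    padded = " " * (N - 1) + text
    total = sum(math.log(cond_prob(padded[i + N - 1], padded[i:i + N - 1], grams, contexts))
                for i in range(len(text)))
    return -total / len(text)

grams, contexts = ngram_counts("the rain in spain stays mainly in the plain", N=3)
print(log_perplexity("rain in spain", grams, contexts, N=3))   # low: in-domain text
print(log_perplexity("zqxv wkj qpf", grams, contexts, N=3))    # high: mostly unseen trigrams
```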
Compression Classification
If a message is spam, it should compress well when appended to a corpus of spam. This is the idea behind Spam Filtering Using Statistical Data Compression Models (2006). It works quite well even with standard general-purpose compression algorithms, suggesting a relation between compression and intelligence.
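A quick sketch of the idea with a general-purpose compressor (zlib); the toy corpora below are my own, not data from the paper, and note that zlib's 32 KB window only lets the appended message "see" the tail of a large corpus.

```python
import zlib

def compressed_size(text):
    return len(zlib.compress(text.encode("utf-8"), level=9))

def classify_by_compression(message, corpora):
    """Pick the label whose corpus grows the least (in compressed bytes) when the message is appended."""
    def extra_bytes(label):
        return compressed_size(corpora[label] + " " + message) - compressed_size(corpora[label])
    return min(corpora, key=extra_bytes)

corpora = {
    "spam": "win money now click here to claim your free prize limited offer act now",
    "ham":  "the meeting is moved to thursday please review the attached quarterly report",
}
print(classify_by_compression("claim your free prize now", corpora))   # should come out as "spam"
```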
Information Retrieval
Use a query language to get results from a corpus. Examples: the Boolean keyword model, BM25, PageRank, HITS, IBM's Watson on Jeopardy!...
\[\text{Precision} = \frac{\text{relevant retrieved}}{\text{retrieved}}\]
\[\text{Recall} = \frac{\text{relevant retrieved}}{\text{relevant}}\]
Note the similarity with type 1 and type 2 errors...
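In code, with `retrieved` and `relevant` represented as sets of document ids (my assumption about the representation):

```python
def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)                       # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"}))   # (0.5, 0.666...)
```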
Machine reading: too much technical stuff I won't need, skipped.
A cool quote
Natural languages abhor absolute synonyms just as nature abhors a vacuum. ---- Lexical Semantics (1986), D. Alan Cruse
Chapter 23 is about speaking and writing
*nods nods*
The horse raced past the barn fell. ---- Last words of the horse raced past the barn.
Grammar, syntax, semantics, etc. All of them are computational linguistics stuff that I won't need. Skipped.
S Bitch wreck ignition
Or speech recognition. Back in 2009, when this book was written, an error rate of 10% was standard. Now there is human parity, as achieved by Microsoft in 2017.