Writing a Haskell version of Peter Norvig's spelling corrector is now considered as a standard Haskell exercise. There has been several solutions, of Bryan O'Sullivan, Tim Robinson, or Marco Sero.
My solution gives a slightly different perspective. The emphasis is on
- The list monad
- Writing type classes
- Polymorphic functions
The type class and monad ideas are already present in Norvig's code!
Indeed, list comprehension in Python is essentially using the monadic structure of lists.
The type class (admittedly here completely overkill) simulates the or
function in Python.
Finally, the code is written with polymorphic functions. The reason is that there is less possible mistakes when writing such a code. In most of what follows, the code works not only for strings, but for lists of any type.
The code is on GitHub.
Design choices
Some comments about some design choices. First, I don't think that sets are necessary here. Everything is much simpler with lists, although we'll have to carry around duplicates. This is a price to pay, the alternative being the costly procedure of eliminating all duplicates.
The other choice is to use the module Data.Text
instead of Data.ByteString
, as this seems to be more suitable for, well, text.
This is only relevant in the construction of the dictionary, as the rest of the code, being polymorphic, has nothing to do with strings.
The code
At this point, you should take another look at Norvig's Python implementation, to understand the basic principle of how this all works.
The editors
First, prepare all the “editors”:
transpose :: [a] -> [a]
transpose [] = []
transpose [_] = []
transpose (a:b:xs) = b:a:xs
delete :: [a] -> [a]
delete [] = []
delete (_:xs) = xs
insert :: [a] -> [a] -> [[a]]
insert letters word =
do -- List
l <- letters
return (l:word)
replace :: [a] -> [a] -> [[a]]
replace _ [] = []
replace letters (_:xs) = insert letters xs
Gather all the editors
allEditors :: [a] -> [[a] -> [[a]]]
allEditors letters = [return . transpose, replace letters, return . delete, insert letters]
Gather all the possible editing (one edit)
We need to define all the possible splits of a word:
splits :: [a] -> [([a],[a])]
splits word = zip (inits word) (tails word)
Note how this method avoids using splitAt
and a length
calculation, which is error-prone.
Finally, we can define the editOnce
function:
editsOnceWith :: [[a]->[[a]]] -> [a] -> [[a]]
editsOnceWith editors word = do -- List
(begin,end) <- splits word
editor <- editors
endedit <- editor end
return (begin ++ endedit)
I really like the do
notation for the list monad, but you could also write it in a big list comprehension:
[begin ++ endedit |
(begin,end) <- splits word,
editor <- editors,
endedit <- editor end]
Read from the dictionary of known words
We are now ready to gather all the known elements from the dictionary of known words:
inDict :: (Eq k, Ord k) => M.Map k v -> [k] -> [(k,v)]
inDict dict = mapMaybe myLookup
where
myLookup w = -- returns either Nothing or Just (w,f) if w is found in dict (f is the frequency)
do -- Maybe
f <- M.lookup w dict
return (w,f)
Python's or
function
Before building the list of known edited words, we implement a Haskell version of Python's or
function.
We need to define what it means to be “Bool-like”:
class BoolLike a where
falthy :: a -> Bool
bempty :: a
We are going to use that a list is “truthy“ if it is not empty:
instance BoolLike [a] where
bempty = []
falthy = null
Now we define a monoid structure which selects the first truthy element:
newtype First a = MkFirst { getFirst :: a } deriving (Show, Eq)
instance BoolLike a => Monoid (First a) where
mempty = MkFirst bempty
mappend (MkFirst l) (MkFirst r) = MkFirst (if falthy l then r else l)
List monadic composition
Notice how editOnce
has (essentially) the type signature
[a] -> [[a]]
It means that in order to edit a word twice, we need to compose that function with itself.
This is a textbook example of a monadic composition (here with the list monad).
Monadic (or Kleisli) composition is obtained with (<=<)
in the following code snippet.
Minimal number of edits to get some known words
We now use the First
monoid structure to select the first group of known words which is not empty:
allChoices :: ([a] -> [(a,b)]) -> (a -> [a]) -> a -> [(a,b)]
allChoices inDict' edits1 word = getFirst possibilities
where
possibilities =
mkFirst return
<> mkFirst edits1
<> mkFirst (edits1 <=< edits1)
mkFirst edit = (MkFirst . inDict' . edit) word
Choose the best candidate
In order to choose the best candidate, we need a function which is not in the Haskell library, so we implement it!
It is simply a variant of maximumBy
which works on empty lists as well.
In order to work, such a function needs a default value.
The implementation is straightforward:
maxByOrDefault :: (Foldable t) => (a -> a -> Ordering) -> a -> t a -> a
maxByOrDefault comp = foldl' (\ e e' -> if comp e' e == GT then e' else e)
And we now choose the best of all the candidates:
chooseBest :: (Ord k, Ord v, Num v) => k -> [(k,v)] -> k
chooseBest nothing choices' = fst bestPair
where bestPair = maxByOrDefault (comparing snd) (nothing, 0) choices'
Piecing all together
Piecing all together (and specializing to the case of strings) is now very simple:
alphabet :: String
alphabet = ['a' .. 'z']
correct :: TrainingDict -> String -> String -> String
correct dict notfound word = chooseBest notfound choices
where
choices = allChoices (inDict dict) (editsOnceWith (allEditors alphabet)) word
The training dictionary
We need to creating a big hash map from a text file.
I have hardly changed anything from the original code except that I use Text.unpack
and Data.Text.IO.readfile
to read the file (ByteString
does not sound like such a good idea).
nWords :: IO TrainingDict
nWords = do -- IO
fileName <- getDataFileName "big.txt"
ws <- readFile fileName
return ((train . lowerWords . unpack) ws)
lowerWords :: String -> [String]
lowerWords = words . fmap normalize
where normalize c = if isAlpha c then toLower c else ' '
train :: (Ord k, Num v) => [k] -> M.Map k v
train trainWords = foldl' increment M.empty trainWords
where
increment dict x = M.insertWith (+) x 1 dict
Things to do
One remaining thing to do is to write proper tests, using a dummy dictionary.
Edit: The code now follows the excellent readability advice of Gabriel Gonzalez.