Implementation of Norvig's spell checker in Haskell

Writing a Haskell version of Peter Norvig’s spelling corrector is now considered as a standard Haskell exercise. There has been several solutions, of Bryan O’Sullivan, Tim Robinson, or Marco Sero.

My solution gives a slightly different perspective. The emphasis is on

The list monad
Writing type classes
Polymorphic functions

The type class and monad ideas are already present in Norvig’s code! Indeed, list comprehension in Python is essentially using the monadic structure of lists. The type class (admittedly here completely overkill) simulates the or function in Python.

Finally, the code is written with polymorphic functions. The reason is that there is less possible mistakes when writing such a code. In most of what follows, the code works not only for strings, but for lists of any type.

The code is on GitHub.

Design choices
The code
Piecing all together
- The training dictionary
Things to do

Design choices

Some comments about some design choices. First, I don’t think that sets are necessary here. Everything is much simpler with lists, although we’ll have to carry around duplicates. This is a price to pay, the alternative being the costly procedure of eliminating all duplicates.

The other choice is to use the module Data.Text instead of Data.ByteString, as this seems to be more suitable for, well, text. This is only relevant in the construction of the dictionary, as the rest of the code, being polymorphic, has nothing to do with strings.

The code

At this point, you should take another look at Norvig’s Python implementation, to understand the basic principle of how this all works.

The editors

First, prepare all the “editors”:

transpose :: [a] -> [a]
transpose [] = []
transpose [_] = []
transpose (a:b:xs) = b:a:xs

delete :: [a] -> [a]
delete [] = []
delete (_:xs) = xs

insert :: [a] -> [a] -> [[a]]
insert letters word = 
    do -- List
      l <- letters
      return (l:word)

replace :: [a] -> [a] -> [[a]]
replace _ [] = []
replace letters (_:xs) = insert letters xs

Gather all the editors

allEditors :: [a] -> [[a] -> [[a]]]
allEditors letters =  [return . transpose, replace letters, return . delete, insert letters]

Gather all the possible editing (one edit)

We need to define all the possible splits of a word:

splits :: [a] -> [([a],[a])]
splits word = zip (inits word) (tails word)

Note how this method avoids using splitAt and a length calculation, which is error-prone.

Finally, we can define the editOnce function:

editsOnceWith :: [[a]->[[a]]] -> [a] -> [[a]]
editsOnceWith editors word = do -- List
    (begin,end) <- splits word
    editor <- editors
    endedit <- editor end
    return (begin ++ endedit)

I really like the do notation for the list monad, but you could also write it in a big list comprehension:

[begin ++ endedit |
    (begin,end) <- splits word,
    editor <- editors,
    endedit <- editor end]

Read from the dictionary of known words

We are now ready to gather all the known elements from the dictionary of known words:

inDict :: (Eq k, Ord k) => M.Map k v -> [k] -> [(k,v)]
inDict dict = mapMaybe myLookup
    where
      myLookup w = -- returns either Nothing or Just (w,f) if w is found in dict (f is the frequency)
        do -- Maybe
          f <- M.lookup w dict
          return (w,f)

Python’s `or` function

Before building the list of known edited words, we implement a Haskell version of Python’s or function.

We need to define what it means to be “Bool-like”:

class BoolLike a where
    falthy :: a -> Bool
    bempty :: a

We are going to use that a list is “truthy“ if it is not empty:

instance BoolLike [a] where
    bempty = []
    falthy = null

Now we define a monoid structure which selects the first truthy element:

newtype First a = MkFirst { getFirst :: a } deriving (Show, Eq)

instance BoolLike a => Monoid (First a) where
    mempty = MkFirst bempty
    mappend (MkFirst l) (MkFirst r) = MkFirst (if falthy l then r else l)

List monadic composition

Notice how editOnce has (essentially) the type signature

[a] -> [[a]]

It means that in order to edit a word twice, we need to compose that function with itself. This is a textbook example of a monadic composition (here with the list monad). Monadic (or Kleisli) composition is obtained with (<=<) in the following code snippet.

Minimal number of edits to get some known words

We now use the First monoid structure to select the first group of known words which is not empty:

allChoices :: ([a] -> [(a,b)]) -> (a -> [a]) -> a -> [(a,b)]
allChoices inDict' edits1 word = getFirst possibilities
  where
    possibilities = 
      mkFirst return
      <>  mkFirst edits1 
      <>  mkFirst (edits1 <=< edits1) 
    mkFirst edit = (MkFirst . inDict' . edit) word

Choose the best candidate

In order to choose the best candidate, we need a function which is not in the Haskell library, so we implement it! It is simply a variant of maximumBy which works on empty lists as well. In order to work, such a function needs a default value. The implementation is straightforward:

maxByOrDefault :: (Foldable t) => (a -> a -> Ordering) -> a -> t a -> a
maxByOrDefault comp = foldl' (\ e e' -> if comp e' e == GT then e' else e)

And we now choose the best of all the candidates:

chooseBest :: (Ord k, Ord v, Num v) => k -> [(k,v)] -> k
chooseBest nothing choices' = fst bestPair
    where bestPair = maxByOrDefault (comparing snd) (nothing, 0) choices'

Piecing all together

Piecing all together (and specializing to the case of strings) is now very simple:

alphabet :: String
alphabet = ['a' .. 'z']


correct :: TrainingDict -> String -> String -> String
correct dict notfound word = chooseBest notfound choices
  where
    choices = allChoices (inDict dict) (editsOnceWith (allEditors alphabet)) word

The training dictionary

We need to creating a big hash map from a text file. I have hardly changed anything from the original code except that I use Text.unpack and Data.Text.IO.readfile to read the file (ByteString does not sound like such a good idea).

nWords :: IO TrainingDict
nWords = do -- IO
  fileName <- getDataFileName "big.txt"
  ws <- readFile fileName
  return ((train . lowerWords . unpack) ws)

lowerWords :: String -> [String]
lowerWords = words . fmap normalize
  where normalize c = if isAlpha c then toLower c else ' '

train :: (Ord k, Num v) => [k] -> M.Map k v
train trainWords = foldl' increment M.empty trainWords
  where
    increment dict x = M.insertWith (+) x 1 dict

Things to do

One remaining thing to do is to write proper tests, using a dummy dictionary.

Edit: The code now follows the excellent readability advice of Gabriel Gonzalez.