Spelling corrector in Haskell

18 Dec 2014

There are very few articles that every now and then keep coming up on Hacker News. One of them is Peter Norvig's How to Write a Spelling Corrector. For those of you who don't know Peter Norvig, he's a really smart guy who also happens to be Director of Research at Google.

Having recently started learning Haskell, I decided to take over the challenge of re-implementing his spelling corrector.
One imporant thing to say is that, since there seems to be a leaderboard on Peter Norvig's website for the shortest implementations, I wrote this code trying to come up with the possible shortest version, putting brevity over readability, which is something I usually never do. But hey, sometimes we don't want to be too serious and it's just FUN!
Now, let me say that Python is a very nice language and a lot of people love it because it looks like pseudo-code and it's really easy to prototype in it. That's probably the main reason I still believe his version is better than mine, more readable yet even shorter.

Here's my take:

 1 import           Data.Char (toLower, isAlpha)
 2 import           Data.List (sortBy)
 3 import qualified Data.ByteString.Char8 as B
 4 import qualified Data.Map.Lazy as M
 5 import qualified Data.Set as S
 6 
 7 alphabet = "abcdefghijklmnopqrstuvwxyz"
 8 nWords = B.readFile "big.txt" >>= \ws -> return (train (lowerWords (B.unpack ws)))
 9 lowerWords = filter (not . null) . map (map toLower . filter (isAlpha )) . words
10 train = foldr (\ x acc -> M.insertWith (+) x 1 acc) M.empty
11 
12 edits1 w = S.fromList $ deletes w ++ transposes w ++ replaces w ++ inserts w
13   where splits s = [ splitAt n s | n <- [0..((length s) - 1)] ]
14         deletes = (map (\(a, b) -> a ++ (tail b))) . splits
15         transposes w = [ a ++ [b !! 1] ++ [b !! 0] ++ (drop 2 b) | (a,b) <- splits w , length b > 1 ]
16         replaces w = [a ++ [c] ++ (tail b) | (a,b) <- splits w , c <- alphabet]
17         inserts w = [a ++ [c] ++ b | (a,b) <- splits w , c <- alphabet]
18 edits2 w = S.foldr (S.union) S.empty (S.map edits1 (edits1 w))
19 knownEdits2 w nwords = (edits2 w) `S.intersection` (M.keysSet nwords)
20 known inputSet nwords = inputSet `S.intersection` (M.keysSet nwords)
21 
22 choices w ws = (known (S.singleton w) ws) `orNextIfEmpty` (known (edits1 w) ws)  `orNextIfEmpty` (knownEdits2 w ws) `orNextIfEmpty` (S.singleton w)
23   where orNextIfEmpty x y = if S.null x then y else x
24 
25 chooseBest ch ws = chooseBest' (ws `M.intersection` (M.fromList (map (\x -> (x,0)) (S.toList ch))))
26   where chooseBest' bestChs = head $ (map fst (sortCandidates bestChs))
27         sortCandidates = (sortBy (\(_,c1)(_,c2) -> c2 `compare` c1)) . M.toList
28 
29 correct w = nWords >>= \ws -> return (chooseBest (choices w ws) ws)

I am not going to explain the algorithm, since you can read it up in Norvig's article. I am just going to explain what I like and I don't like in my Haskell version, adding type annotations previously omitted.


Let's start off with reading the file and training the probability model. Nothing special here. Just some mapping and filtering to create the Map of known words, our probability model.

nWords :: Num a => IO (M.Map [Char] a)
nWords = B.readFile "big.txt" >>= \ws -> return (train (lowerWords (B.unpack ws)))

lowerWords :: String -> [String]
lowerWords = filter (not . null) . map (map toLower . filter (isAlpha )) . words

train :: (Ord k, Num a) => [k] -> M.Map k a
train = foldr (\ x acc -> M.insertWith (+) x 1 acc) M.empty


Things start getting interesting creating all the possible edits at distance 1.
Haskell's list comprehensions are just amazing. Look for example at the inserts function: instead of 2 horrible for loops you can just read "combine all the splits for each letter of the alphabet". This is where Haskell really shines.

edits1 :: String -> S.Set String
edits1 w = S.fromList $ deletes w ++ transposes w ++ replaces w ++ inserts w
  where splits :: [a] -> [([a], [a])]
        splits s = [ splitAt n s | n <- [0..((length s) - 1)] ]
        deletes :: [a] -> [[a]]
        deletes = (map (\(a, b) -> a ++ (tail b))) . splits
        transposes :: [a] -> [[a]]
        transposes w = [ a ++ [b !! 1] ++ [b !! 0] ++ (drop 2 b) | (a,b) <- splits w , length b > 1 ]
        replaces :: String -> [String]
        replaces w = [a ++ [c] ++ (tail b) | (a,b) <- splits w , c <- alphabet]
        inserts :: String -> [String]
        inserts w = [a ++ [c] ++ b | (a,b) <- splits w , c <- alphabet]


Creating all the edits at distance 2 is trivial applying edits1 twice with map and then reducing the sets using foldr. Classic map-reduce right here. Creating the sets of known edits at distance 2 and the set of known words is just set interesection.

edits2 :: String -> S.Set String
edits2 w = S.foldr (S.union) S.empty (S.map edits1 (edits1 w))

knownEdits2 :: String -> M.Map String a -> S.Set String
knownEdits2 w nwords = (edits2 w) `S.intersection` (M.keysSet nwords)

known :: Ord a => S.Set a -> M.Map a b -> S.Set a
known inputSet nwords = inputSet `S.intersection` (M.keysSet nwords)


Now the ugliest part. I couldn't find nothing as neat as Python's or operator so I had to create a little function to do that for me. That's not too bad. What I don't really like is the function chooseBest. Having to jump between sets and maps passing from lists is horrible. Probably even more terrible is having to manually sort the map based on the values.
I am sure there is a better way since Haskell is so powerful, I just couldn't come up with it.

choices :: String -> M.Map String a -> S.Set String
choices w ws = (known (S.singleton w) ws) `orNextIfEmpty` (known (edits1 w) ws)  `orNextIfEmpty` (knownEdits2 w ws) `orNextIfEmpty` (S.singleton w)
  where orNextIfEmpty x y = if S.null x then y else x

chooseBest :: (Ord b, Ord a) => S.Set a -> M.Map a b -> a
chooseBest ch ws = chooseBest' (ws `M.intersection` (M.fromList (map (\x -> (x,0)) (S.toList ch))))
  where chooseBest' bestChs = head $ (map fst (sortCandidates bestChs))
        sortCandidates = (sortBy (\(_,c1)(_,c2) -> c2 `compare` c1)) . M.toList


Finally the correct function glues everything together:

correct :: String -> IO String
correct w = nWords >>= \ws -> return (chooseBest (choices w ws) ws)


Trying it out:

λ> correct "sometihng"    # distance 1
"something"
λ> correct "soomehting"   # distance 2
"something"


I am sure I have done something completely wrong and there's something that could have been done in a better way. Feedback is really appreciated. The code is on GitHub.

UPDATE

A lot of people pointed out that this code is not readable and it goes against a lot of Haskell best practices. I know that, in fact writing the most elegant Haskell code wasn't my objective for this article. I just wanted to come up with the shortest implementation.

Some very kind people helped me out on GitHub by making the code cleaner and neater (gotta :heart: open source) and now the final implementation looks something like this:

 1 module Spelling (TrainingDict, nWords, correct) where
 2 
 3 import Paths_Norvigs_Spelling_Corrector (getDataFileName)
 4 import           Data.Char (toLower, isAlpha)
 5 import           Data.List (sortBy, foldl')
 6 import qualified Data.ByteString.Char8 as B
 7 import qualified Data.Map.Strict as M
 8 import qualified Data.Set as S
 9 import           Data.Ord (comparing)
10 
11 type WordSet = S.Set String
12 type TrainingDict = M.Map String Int
13 
14 alphabet :: String
15 alphabet = ['a' .. 'z']
16 
17 nWords :: IO TrainingDict
18 nWords = do
19   ws <- getDataFileName "big.txt" >>= B.readFile
20   return (train . lowerWords . B.unpack $ ws)
21 
22 lowerWords :: String -> [String]
23 lowerWords = words . map normalize
24   where normalize c = if isAlpha c then toLower c else ' '
25 
26 train :: [String] -> TrainingDict
27 train = foldl' (\acc x -> M.insertWith (+) x 1 acc) M.empty
28 
29 edits1 :: String -> WordSet
30 edits1 w = S.fromList $ deletes ++ transposes ++ replaces ++ inserts
31   where
32     splits = [ splitAt n w | n <- [0 .. length w - 1] ]
33     deletes = map (\(a, b) -> a ++ tail b) splits
34     transposes = [ a ++ [b !! 1] ++ [b !! 0] ++ drop 2 b
35                  | (a,b) <- splits, length (take 2 b) > 1 ]
36     replaces = [ a ++ [c] ++ tail b
37                | (a,b) <- splits, c <- alphabet]
38     inserts = [ a ++ [c] ++ b
39               | (a,b) <- splits, c <- alphabet]
40 
41 edits2 :: String -> WordSet
42 edits2 = S.foldl' S.union S.empty . S.map edits1 . edits1
43 
44 knownEdits2 :: String -> TrainingDict -> WordSet
45 knownEdits2 w nwords = edits2 w `S.intersection` M.keysSet nwords
46 
47 known :: WordSet -> TrainingDict -> WordSet
48 known inputSet nwords = inputSet `S.intersection` M.keysSet nwords
49 
50 choices :: String -> TrainingDict -> WordSet
51 choices w ws = foldr orNextIfEmpty (S.singleton w)
52   [ known (S.singleton w) ws
53   , known (edits1 w) ws
54   , knownEdits2 w ws
55   ]
56   where orNextIfEmpty x y = if S.null x then y else x
57 
58 chooseBest :: WordSet -> TrainingDict -> B.ByteString
59 chooseBest ch ws = maximumBy (compare `on` (\w -> M.findWithDefault 0 w ws)) (S.toList ch)
60 
61 correct :: TrainingDict -> String -> String
62 correct ws w = chooseBest (choices w ws) ws