Tuesday, June 30, 2009

Anagram Code

Finally, here is the anagram code. It worked on my tests, but failed to build the anagram trie with the enable word list (which I can no longer find a direct link to online). I'm getting a stack overflow error, which suggests a space leak, and a bit of investigation turns up that it is somewhere in building the trie. I've tried different ways to force strictness, but (so far) to no avail. However, if I run it with a smaller dictionary, or with +RTS -K16M, it works.
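
Something along these lines is the kind of forcing I mean; this is only a sketch built on the definitions in the code below, and I haven't verified that it actually cures the problem with the enable list (forcing after every insert is also expensive):

-- force every node of a trie (the enders only to weak head normal form)
forceTrie :: Trie a b -> ()
forceTrie Empty = ()
forceTrie Node{entries = es, ender = en} =
    en `seq` foldr (\t r -> forceTrie t `seq` r) () (M.elems es)

-- build the helper, but force the whole trie after every insert so thunks
-- cannot pile up in the nested maps
makeAnagramHelperStrictly :: [String] -> AnagramHelper
makeAnagramHelperStrictly = foldl' step emptyHelper
    where
      step ah w = let ah' = ahInsert ah w
                  in forceTrie (trie ah') `seq` ah'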

Notice that, to keep from getting duplicates, the function "anagramsFor" will only produce a new word if it follows (or is equal to) "starter", which starts out empty and is then set to the last word generated. This should eliminate duplicates where anagramsFor would otherwise generate both a list like ["aaa", "bbb", "ccc"] and one like ["aaa", "ccc", "bbb"].
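
As a throwaway illustration of that ordering constraint (this isn't part of the program below, just the same idea on plain lists):

import Data.List (delete)

-- every chosen word must be >= the one before it, so each set of words is
-- produced in exactly one order
orderedPicks :: [String] -> [[String]]
orderedPicks = go ""
    where
      go starter pool =
          [ w : rest
          | w <- pool
          , w >= starter
          , rest <- [] : go w (delete w pool) ]

So ["aaa", "bbb", "ccc"] shows up once, and ["aaa", "ccc", "bbb"] never does.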

It runs my (simple) test cases (using small dictionaries).

Here's the code:


{-# LANGUAGE BangPatterns #-}

{- run with +RTS -K16M for the enable word list -}

import Char
import Data.List
import System
import qualified Data.Map as M
import Control.Monad.State
import Control.Monad.List

data Trie a b = Empty
              | Node { entries :: M.Map a (Trie a b), ender :: Maybe b }
              deriving (Eq, Show)

type AnagramTrie = Trie Char String

data AnagramHelper = Ah { trie :: AnagramTrie, strings :: M.Map String [String] }
                     deriving Show

type AnagramTree = Trie String [String]

emptyHelper = Ah { trie = Empty, strings = M.empty }

-- insert a key sequence (here: the characters of a sorted word) and its value
trieInsert :: (Ord a, Show a) => [a] -> b -> Trie a b -> Trie a b

trieInsert [] w Empty = Node {entries = M.empty, ender = Just w}
trieInsert [] w n     = n -- should never happen

trieInsert !l1@(x:xs) !l Empty =
    Node {entries = M.singleton x (trieInsert xs l Empty), ender = Nothing}

trieInsert !l1@(x:xs) !al !Node{entries = es, ender = en}
    | M.member x es = let !st = trieInsert xs al       in Node {entries = M.adjust st x es, ender = en}
    | otherwise     = let !st = trieInsert xs al Empty in Node {entries = M.insert x st es, ender = en}

findInTrie Empty _ = Nothing
findInTrie Node{entries = es, ender = en} [] = en
findInTrie Node{entries = es, ender = en} (c:cs)
    | M.member c es = findInTrie (fromJust $ M.lookup c es) cs
    | otherwise     = Nothing

-- a crude pretty-printer, used for the debugging output
toString :: (Show a, Show b) => Trie a b -> String -> [String]
toString Empty indent = []
toString Node{entries = es, ender = en} indent = enderLine : subtrees
    where
      enderLine = case en of
                    Nothing -> []
                    Just l  -> indent ++ "-" ++ show l
      doEntry k t = (indent ++ " " ++ show [k]) : toString t (" " ++ indent)
      subtrees = concat $ M.elems $ M.mapWithKey doEntry es

-- extract the paths in a trie (ignoring the enders, if any)
pathsThrough :: Trie a b -> [[a]]
pathsThrough Empty = [[]]
pathsThrough Node{entries = es, ender = en} = l
    where
      l = concatMap (\(x, y) -> map (x:) (pathsThrough y)) $ M.assocs es

allWhitespace s = and $ map isSpace s

makeAnagramHelperFromList l = foldl' ahInsert emptyHelper l

-- add one word: file it under its sorted spelling in the map, and put the
-- sorted spelling into the trie the first time it is seen
ahInsert !a@Ah{trie = t, strings = s} !word
    | M.member sw s = Ah {trie = t, strings = M.insertWith' insertUnique sw [word] s}
    | otherwise     = Ah {trie = trieInsert sw sw t, strings = M.insert sw [word] s}
    where
      sw = sort word
      insertUnique [w] l = if w `elem` l then l else w : l

clean x = filter (not . null) (nub x) -- profiling shows this is a bit faster

-- These are some common lists of words - one per line used here
-- these (obviously) live in my filesystem - you might use /usr/share/dict/words if
-- you're on a unix system

-- wordfile = "words/american-english-large-no-quotes"
-- wordfile = "words/american-english-smaller"
-- wordfile = "words/shorter-wordlist"
-- wordfile = "words/random-words"
-- wordfile = "words/seethe"

strip s = reverse $ dropWhile isSpace $ reverse $ dropWhile isSpace s

allchars w = (not . or) $ map (\x -> x < 'a' || x > 'z') w

-- most dictionaries include all the letters on their own, so
-- filter out all one-letter words, but toss "a" and "i" back in
-- also strip the strings as there are some cases where they have trailing spaces
-- sometimes there seem to be duplicates (probably because there are caps in some versions)
-- in this version duplicates are handled by the builder

getWordsFromList l = ("a" :) $ ("i" :) $ filter (\x -> length x > 1) $ filter allchars $ map strip $ lines $ map toLower l

makeAnagramHelperFromFile fn = do
    rawWords <- readFile fn
    let wordList = getWordsFromList rawWords
    return $! makeAnagramHelperFromList wordList

-- all words stored in the trie that can be spelled from a sub-multiset of the
-- (sorted) string and that use its first character
getStringsFromTopWithString :: AnagramTrie -> String -> [String]

getStringsFromTopWithString Empty _  = []
getStringsFromTopWithString _     [] = []
getStringsFromTopWithString topTrie@Node{entries = es, ender = en} s@(c:cs) =
    let maybeSubTree = M.lookup c es
    in case maybeSubTree of
         Nothing          -> []
         Just thisSubTree -> maybePrepend en $ findAllSubsetStringsInSubTree thisSubTree cs

findAllSubsetStringsInSubTree Empty _ = []
findAllSubsetStringsInSubTree Node{ender = en} [] = maybePrepend en []
findAllSubsetStringsInSubTree t@Node{entries = es, ender = en} (c:cs) =
    let next = M.lookup c es                                         -- does this node have a subtree for this character?
        subs = findAllSubsetStringsInSubTree t $ dropWhile (== c) cs -- no: skip all copies of it and try the rest
    in case next of
         Nothing -> subs
         Just t' -> maybePrepend en $ findAllSubsetStringsInSubTree t' cs ++ subs

maybePrepend Nothing  l = l
maybePrepend (Just x) l = x : l

fromJust (Just l) = l

testWords = ["at", "tear", "rat", "ate", "sea", "teas", "tea", "erase", "tar", "art", "eat", "tears", "stare", "reset", "stares", "star", "tartar", "seatea", "erasetar", "tee", "see"]

getTestTrie = trie getTestAH

getTestAH = makeAnagramHelperFromList testWords

-- build the tree of anagrams for s, or Nothing if there are none.
-- 'starter' is the previous word on this path; only words >= starter are kept,
-- so each multiset of words is produced in exactly one order.
anagramsFor starter ah@Ah{trie = t, strings = m} [] = Just Empty
anagramsFor starter ah@Ah{trie = t, strings = m} s  = anatree
    where
      s' = sort s
      slist = getStringsFromTopWithString t s'
      subtreeskeys  = filter (\(x, y) -> y /= Nothing) $ map (\x -> (x, anagramsFor x ah (s' \\ x))) slist
      subtreeskeys1 = filter (\(x, y) -> x >= starter) subtreeskeys
      mkAnaTrieEntries m (x, Nothing) = m
      mkAnaTrieEntries m (x, Just y)  = M.insert x y m

      anatree = if subtreeskeys1 == []
                  then Nothing
                  else Just Node {entries = foldl mkAnaTrieEntries M.empty subtreeskeys1, ender = Just s}

simpleTest s = do
    let t  = anagramsFor [] getTestAH s
        t' = fromJust t
    mapM_ putStrLn $ toString t' ""
    mapM_ putStrLn $ map show $ pathsThrough t'

main = do
    (wl, strings) <- getCommandLineArgs
    putStrLn $ "Using dictionary : " ++ wl
    ah <- makeAnagramHelperFromFile wl
    mapM_ (getAnagrams ah) strings

-- a first argument beginning with '-' names the dictionary file;
-- everything else is a string to find anagrams for
getCommandLineArgs = do
    al <- getArgs
    if length al < 1
      then error "Args : [-dictionary] strings "
      else let (s:xs) = al !! 0
           in if s == '-'
                then return (xs, drop 1 al)
                else return ("words/enable.list", al)

-- expand a path of sorted words into the lists of real words behind each of them
listVariants :: AnagramHelper -> [String] -> [[String]]
listVariants Ah{strings = m} sl = map getWords sl
    where
      getWords s = fromJust $ M.lookup s m

-- print the anagram tree, the paths through it, and the word alternatives for each path
getAnagrams ah s = do
    putStrLn $ "getting anagrams for " ++ s
    let at = anagramsFor [] ah s
    case at of
      Nothing -> putStrLn $ "No anagrams for: " ++ s
      Just t  -> do
        mapM_ putStrLn $ toString t ""
        let p = pathsThrough t
        mapM_ putStrLn $ map show p
        let v = map (listVariants ah) p
        mapM_ putStrLn $ map show v

Monday, June 29, 2009

ICFP programming contest

Last Friday at 11:16 (local time), the 2009 ICFP programming contest started. One of our students and I gave it a try. Currently we're about 190/317 (with about 30 minutes to go), which isn't great. But it was fun, and I spent much of my time between the start of the contest and about 5 pm yesterday working on it.

Lots of things I did wrong: our virtual machine ran nicely (once a problem with the specification was fixed and announced), but our interface to it started out all wrong. Once that was fixed there were a bunch of bugs (as usual), many of them dealing with the signs of quantities. Anyway, we solved all the first-level problems and most of the second-level problems, and I had figured out the basics of the third-level problems (I think).

The biggest thing that would have helped us was better communication. We were working remotely from each other, and being in one place, where we could share a whiteboard, talk more quickly about what was wrong, and draw pictures, would have helped immensely.

Sunday, June 21, 2009

Reading with the Flu

Of course, I was wrong in my previous post about not having to worry about multiplicity (and indeed my code doesn't do that). But the code needs to be cleaned up a bit before posting.

Over the last four days I've had some kind of mini-flu. Temperature in the 100.5 range and a complete inability to feel either warm enough or cool enough for more than about an hour. Not important as such, and it seems to be gone now, but it did lead to my spending quite a bit of time reading. This has included samples from the most recent "New Yorker", CACM, and IEEE Computer, but relatively little online stuff (sitting up was just too tiring).

Most of the reading was science fiction/fantasy and not overly demanding (the kind of density that works well for airline travel and mild flu). I would note that I don't usually read anything like this much paper in most four day periods.

The City and the City by China Miéville. Miéville is a fascinating writer. I've read his three books set in Bas-Lag, a very strange world but one that hangs together nicely. These are Perdido Street Station, The Scar, and The Iron Council; they are often characterized as "new weird", but while they are certainly weird enough in setting, the stories and themes are very human. His newest, The City and the City, is a murder mystery and in many ways fits the conventions of that genre pretty nicely. Except that instead of being set somewhere we might know, it is set in two twin but separate cities, Bezel (there is an accent over the "z" that I don't know how to generate) and Ul Qoma. What makes this interesting is that the cities overlap in some undefined way, and residents learn to "unsee" things in the other city that may be occurring across the street (or closer). This prompted me (I hesitate to say "the reader") to contemplate just this kind of overlap occurring for other reasons - perhaps Jerusalem, or some city in Europe where the Jews were forced to live apart (not necessarily in ghettos - just "apart"), or even a city in the US where classes may intermix without intermixing. Think of the oft-mentioned example of a black man failing to get a cab in New York while a white man just down the block succeeds.

I have not yet read Un Lun Dun by Miéville but look forward to it. I also added more than a few books to my (far too large and unlikely ever to even begin to empty significantly) Amazon wishlist after reading a list of book suggestions by him.

Candle by John Barnes. Just as I was ready to put this down thinking I'd figured it all out, he twists things just enough, and then it happens again. I'm not sure it is worth a re-read, but it was a good read-once.

Against a Dark Background by Iain M. Banks. I've read a couple of his "Culture" novels and liked them well enough. This was a good read, but in a kind of unrelenting-sameness sort of way. Lots of imaginative touches, but tossed in almost at random. I must also admit that I find books and movies where everyone gets killed off (one by one by one) to be less than rewarding in general. There are exceptions, but there need to be pretty substantial other rewards.

Killing Time by Caleb Carr. I've enjoyed Carr's 19th century mysteries, but not enough to seek out any more. This one fell into my hands (as did several of the others I read in the last couple days) at a yard sale where someone who had a serious interest in science fiction was selling things off. I wanted to like it, but it just didn't work.

I finished Before and After by Rosellen Brown. This was made into a movie a while back. The book is pretty good and well worth reading. The narration from the viewpoint of several characters reads a bit false, though, as only one of the characters writes in the first person - is this narrator also the writer of the other viewpoints? Certainly a possibility, but not one that seemed to me to add much to the story.

I also read Slant by Greg Bear and Beowulf's Children by Niven, Pournelle and Barnes, on which I will not comment, and started A Distant Mirror by Barbara Tuchman and Snowqueen by Joan Vinge.

Wednesday, June 17, 2009

One of the advantages of doing things this way is that we can now build the tree containing the pieces of the anagram in such a way as to guarantee that there are no duplicates. Consider "staringstaring" again. This is sorted before searching for anagrams, giving "aaggiinnrrsstt". Now look up the first character of that string in the trie. Any anagram of the string must contain that first character, so if there is no entry for "a" then there are no anagrams at all. Otherwise, we find all the words in the "a" subtrie that can be spelled from the remaining characters of the string; for each of those words we remove its letters and look up the anagrams of what is left, starting again at the top of the trie, using the same procedure and building subtrees along the way. These then get stitched together into the complete tree. Even better, we don't even need to bother with the second "a" when we start at the top again, because we already have all the anagrams that use an "a" and the multiplicity doesn't matter.
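
To pin down the shape of that recursion, here is the same idea written against a plain list of (sorted) dictionary words instead of the trie; it is far slower, and the names are only illustrative, but the structure is the one described above:

import Data.List (nub, sort, (\\))

anagramSets :: [String] -> String -> [[String]]
anagramSets dict s0 = go "" (sort s0)
    where
      sdict = sort (nub (map sort dict))    -- canonical, duplicate-free word forms
      go _ [] = [[]]                        -- nothing left to use: one (empty) way to finish
      go starter s@(c:_) =
          [ w : rest
          | w <- sdict
          , w >= starter                    -- each set of words comes out in only one order
          , head w == c                     -- the mandatory first letter must be used
          , null (w \\ s)                   -- w's letters form a sub-multiset of s
          , rest <- go w (s \\ w) ]         -- recurse on whatever letters remain

With a dictionary containing "art", "rat", "tar", "gins" and "sing", anagramSets applied to "staring" produces just [["art","gins"]].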

Even better, if you look at the structure of the tree of anagrams we're building, it is (drum roll, please) just a trie again, but with the indices being strings, so if we parameterize the Trie we get the second tree essentially for free. In this trie, though, we don't need to remember values at the end points (though having something there doesn't hurt); we use the path down through the trie as the information we need. Indeed, if we construct it correctly, every node in the trie corresponds to a partial anagram, so the path to every leaf node describes a complete anagram (of sorted words, remember).
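
Concretely, the reuse amounts to a second instantiation of the one parameterized type (a sketch; the names match the ones in the full program):

import qualified Data.Map as M

-- one Trie shape, two instantiations
data Trie a b = Empty
              | Node { entries :: M.Map a (Trie a b), ender :: Maybe b }

type AnagramTrie = Trie Char   String    -- dictionary: characters index the way down, sorted word at the end
type AnagramTree = Trie String [String]  -- results: sorted words index the way down, the path itself is the anagram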

Of course, we're not using this trie the same way: the first trie is used as a search device, the second as a storage device, but the shape is the same in both cases. Another difference is that the first trie is built from the top down, starting with an empty top node and adding things to it as we go, while the second is built from the bottom up. In the latter case we also need to prune empty subtries (which correspond to "failure to anagram"). A sketch of that bottom-up step follows.
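
Using the Trie type (and Data.Map import) from the sketch above, and with purely illustrative names, the pruning looks something like this: a node is kept only if at least one of its candidate sub-anagrams actually completed.

assembleNode :: Ord a => b -> [(a, Maybe (Trie a b))] -> Maybe (Trie a b)
assembleNode label subs
    | null kept = Nothing    -- nothing below survived: prune this branch
    | otherwise = Just Node { entries = M.fromList kept, ender = Just label }
    where
      kept = [ (k, t) | (k, Just t) <- subs ]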

Friday, June 12, 2009

Anagrams

I do (sporadically) cryptic crosswords. I used to do regular crosswords, but after a while they became less challenging - when I could do the NY Times Sunday Crossword in an hour or so (with perhaps five or six squares open), I started paying more attention to cryptics.

I'd been exposed to cryptics when I was in the Peace Corps and we could, from time to time, get the Zambian newspaper, which had a cryptic, although those were mostly beyond my skills. When I returned to the US, I found Stephen Sondheim's book of cryptics and was hooked - though they were difficult indeed, I could at least make headway with them (and now I wish I'd xeroxed the pages so I could do them again).

In any case, cryptics often use anagrams, so one day a while ago I wrote an anagram generator in Haskell. You gave it a dictionary and a string and it would generate all the anagrams from that string consisting of words in the dictionary. It was messy and didn't cope with repeated anagrams well.

Just a short time ago I decided to redo this and try to make it cleaner. I decided to reuse one of the main parts of the previous version, that of using a trie to store the words from the dictionary, but to change it around a bit. A trie is a kind of tree that stores subtrees indexed by prefix elements (in this case characters) of the objects stored in it. (Wikipedia has a much better picture than I can manage, I suspect). In the previous version, the nodes at the end of each string contained the words that ended up there. Words? Yes, I sorted the characters in the word before entering it in the trie, so multiple words ("tar", "rat", "art") could map to a single node. In this version, I'm still going to use a trie in essentially the same way, but instead of storing all the words that sort to the same string in the nodes, I'm going to store only the sorted string and provide for a map (Data.Map) to associate the sorted string with the list of words that are anagrams.
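
The map half of that, as a small sketch (the type and function names here are just illustrative, not necessarily what the final code will use):

import qualified Data.Map as M
import Data.List (sort)

type AnagramMap = M.Map String [String]   -- e.g. "art" -> ["art", "rat", "tar"]

-- file each word under its sorted spelling, so mutual anagrams share one entry
addWord :: AnagramMap -> String -> AnagramMap
addWord m w = M.insertWith (++) (sort w) [w] m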

The main problem with the previous version was that a word could have multiple anagrams, and these might (given the way I searched the trie) pop up in different orders. The duplicates were always generated (which was both space and time costly) and then eliminated (time costly again). For example, "staring" would have ["rat", "sing"], ["tar", "gins"], ["art", "sing"], ["sing", "art"] all appear in the list of anagrams, but these are really all more or less the same and could have been better represented as ["art", "gins"], where each word is replaced by its sorted spelling and the words themselves are listed in sorted order. Then a final step could generate all the possibilities from this canonical form by looking at the mapped values of "art" and "gins" and making all the combinations.
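
That final step is just a Cartesian product over the looked-up alternatives; a sketch (expand is only my name for it here):

import qualified Data.Map as M

-- turn one canonical anagram (a list of sorted words) back into all of its spellings
expand :: M.Map String [String] -> [String] -> [[String]]
expand m = mapM (\k -> M.findWithDefault [] k m)

With "art" mapped to ["art","rat","tar"] and "gins" mapped to ["gins","sign","sing"], expand applied to ["art","gins"] yields all nine two-word combinations.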

But there's another step. Consider "staringstaring". This sorts to "aaggiinnrrsstt". Suppose we determine that "agg" is (I think it is) the first word alphabetically in the string (generated by the English word "gag"). Then we can drop those characters from the string and find all the anagrams for "aiinnrrsstt" (assuming there are any). If we build a tree in which each node holds a string and its subnodes are a list of similar trees, arranged so that any path from the root to a leaf is a list of strings whose sorted concatenation is the string we're looking for anagrams for, and if we do it correctly, then it becomes easy to produce all the anagrams for the string.
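
For concreteness, the sort-and-subtract step from that example (remaining is just an illustrative name):

import Data.List (sort, (\\))

-- what is left of a string once the letters of one chosen word are removed
remaining :: String -> String -> String
remaining s word = sort s \\ sort word

-- remaining "staringstaring" "gag" == "aiinnrrsstt"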

More in a bit.