Friday, June 12, 2009

Anagrams

I do (sporadically) cryptic crosswords. I used to do regular crosswords, but after a while they became less challenging - when I could do the NY Times Sunday Crossword in an hour or so (with perhaps five or six squares open), I started paying more attention to cryptics.

I'd been exposed to cryptics when I was in the Peace Corps and we could, from time to time, get the Zambian newspaper which had a cryptic although those were mostly beyond my skills. When I returned to the US, I found Stephen Sondheim's book of cryptics and was hooked - though they were difficult indeed, I could at least make headway with them (and now I wish I'd xeroxed the pages so I could do them again).

In any case, cryptics often use anagrams, so one day a while ago I wrote an anagram generator in Haskell. You gave it a dictionary and a string and it would generate all the anagrams from that string consisting of words in the dictionary. It was messy and didn't cope with repeated anagrams well.

Just a short time ago I decided to redo this and try to make it cleaner. I decided to reuse one of the main parts of the previous version, that of using a trie to store the words from the dictionary, but to change it around a bit. A trie is a kind of tree that stores subtrees indexed by prefix elements (in this case characters) of the objects stored in it. (Wikipedia has a much better picture than I can manage, I suspect). In the previous version, the nodes at the end of each string contained the words that ended up there. Words? Yes, I sorted the characters in the word before entering it in the trie, so multiple words ("tar", "rat", "art") could map to a single node. In this version, I'm still going to use a trie in essentially the same way, but instead of storing all the words that sort to the same string in the nodes, I'm going to store only the sorted string and provide for a map (Data.Map) to associate the sorted string with the list of words that are anagrams.

The main problem with the previous version was that a word could have multiple anagrams and these might (given the way I searched the trie) pop up in different orders. But the duplicates were always generated (which was both space and time costly) and then duplicates were eliminated (time costly). For example, "staring" would have ["rat", "sing"], ["tar", "gins"], ["art", "sing"], ["sing", "art"] appear in the list of anagrams, but these are really all more or less the same and could have been better represented as ["art", "igns"] where each anagram is in sorted order, and the anagrams themselves are sorted. Then, a final step could generate all the possibilities from this string by looking at the mapped values of "art" and "igns" and making all the combinations.

But there's another step. Consider "staringstaring". This sorts to "aaggiinnrrsstt". Suppose that we determine that "agg" is (I think it is) the first word alphabetically in the string (generated by the english word "gag"). Then we can drop those characters from the string and find all the anagrams for "aiinnrrsstt" (assuming there are any). If we build a tree that has a node with a string in it and subnodes a list of similar trees so that any path from the root to a leaf consists of a list of strings whose sorted concatenation is the string we're looking for anagrams for, and if we do it correctly, then it becomes easy to produce all the anagrams for the string.

More in a bit.

No comments: