After a fairly large number of emails, I've started working on my type-ahead, predictive text editor project. In support of this effort I'm looking at different algorithms to best predict the word the user wants to type next. The first part of this is looking at documents I've written in the past and analyzing the frequency of word occurrences within those documents. (I'll skip the details of my theory here, but it involves looking at how often we re-type words compared to how often we use new dictionary-based words beginning with the same characters.)
I wrote a little Ruby program to help me analyze this word frequency. The program opens a file, then adds each word in the file to a hash, where the word is the key and the number of occurrences of the word is the value. At the end of the program I print out the hash information, sorted by the hash value (from least frequent to most frequent).
Without further ado, here is the Ruby "word frequency" program:
    # from devdaily.com
    #
    # a sample ruby program that determines the number of times each word in
    # a file appears in that file (i.e., the frequency of that word).
    #

    the_file='/Users/Al/DD/Ruby/GettysburgAddress.txt'
    h = Hash.new
    f = File.open(the_file, "r")
    f.each_line { |line|
      words = line.split
      words.each { |w|
        if h.has_key?(w)
          h[w] = h[w] + 1
        else
          h[w] = 1
        end
      }
    }

    # sort the hash by value, and then print it in this sorted order
    h.sort{|a,b| a[1]<=>b[1]}.each { |elem|
      puts "\"#{elem[0]}\" has #{elem[1]} occurrences"
    }
Here's a little more discussion of the program:
- Create a String to store the file name
- Create a new Hash
- Open the file in read-only mode
- Read each line in the file, one line at a time
- Split each line into words (words separated by whitespace, which is what `split` does when called with no arguments)
- Put the word and the word frequency into the Hash (the word is the key, the frequency is the value)
- Print the hash, with the results sorted by the hash value
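The steps above can also be written more compactly. The following is only a minimal sketch, not the program shown earlier: `word_frequencies` and `sorted_by_count` are hypothetical helper names, and the method accepts anything that responds to `each_line` (a String, a File, and so on) so it can be tried without the Gettysburg Address file.

```ruby
# A compact sketch of the same counting steps.
# `word_frequencies` and `sorted_by_count` are hypothetical helper names,
# not part of the original program.
def word_frequencies(source)
  counts = Hash.new(0)  # missing keys default to 0, so no has_key? check is needed
  source.each_line do |line|
    line.split.each { |word| counts[word] += 1 }
  end
  counts
end

# Sort by count in ascending order, as the original program does.
def sorted_by_count(counts)
  counts.sort_by { |_word, count| count }
end

sample = "the cat and the hat\nthe end\n"
sorted_by_count(word_frequencies(sample)).each do |word, count|
  puts "\"#{word}\" has #{count} occurrences"
end
```

To run this against a real file, something like `File.open(path) { |f| word_frequencies(f) }` should work, and the block form of `File.open` also closes the file when the block finishes.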
Seeing the output may help in understanding the program, so here are the last 10 lines of it:
    "have" has 5 occurrences
    "not" has 5 occurrences
    "can" has 5 occurrences
    "and" has 6 occurrences
    "--" has 7 occurrences
    "a" has 7 occurrences
    "we" has 8 occurrences
    "to" has 8 occurrences
    "the" has 9 occurrences
    "that" has 13 occurrences
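Note that `--` shows up in this output as a "word". That is because `split` treats anything between whitespace as a word, and by the same logic `We` and `we` would be counted under separate keys. If that isn't wanted, one possible adjustment (an assumption here, not something the program above does) is to lowercase each line and keep only runs of letters and apostrophes. `normalized_words` is a hypothetical helper name, and the sample text is made up for illustration.

```ruby
# Hypothetical helper: lowercase the line and keep only runs of
# letters and apostrophes, dropping punctuation such as "--".
def normalized_words(line)
  line.downcase.scan(/[a-z']+/)
end

counts = Hash.new(0)
text = "We can not dedicate -- we can not consecrate"
text.each_line do |line|
  normalized_words(line).each { |w| counts[w] += 1 }
end

puts counts["we"]        # 2 ("We" and "we" now share one key)
puts counts.key?("--")   # false
```

Whether this kind of normalization is the right choice depends on what is being predicted; for a type-ahead editor, capitalization may well be worth keeping.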
I'm not going to comment on my findings regarding the predictive text editor yet, but this has helped me understand the predictive issue better than ever before.