I was just reading the book Hadoop in Action, and came across a nice, simple way to use the Java StringTokenizer class to break a sentence (String) into words, taking into account many standard punctuation marks. Before looking at their solution, first take a look at the code they used to break a String into words using only whitespace (with the one-argument constructor, the default delimiters are the space, tab, newline, carriage return, and form feed characters):
StringTokenizer tokenizer = new StringTokenizer(sentence);
Unfortunately that doesn't account for punctuation characters, so strings like "but" and "but," are not seen as the same word. Here's the improved StringTokenizer call that does account for the comma, and many other standard punctuation marks:
StringTokenizer tokenizer = new StringTokenizer(sentence, " \t\n\r\f,.:;?!'");
As you can tell from the characters passed into the StringTokenizer, this approach treats the space, tab, newline, carriage return, and form feed characters as delimiters, along with the comma, period, colon, semicolon, question mark, exclamation mark, and single quote.
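To see the difference between the two calls side by side, here is a minimal sketch that runs the same sentence through both delimiter sets. The class name TokenizerDemo, the tokenize helper, the DELIMITERS constant, and the sample sentence are my own illustrative choices, not something taken from the book:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; the names here are for illustration only.
class TokenizerDemo {
    // Same delimiter string as the improved call above:
    // whitespace characters plus common punctuation marks.
    static final String DELIMITERS = " \t\n\r\f,.:;?!'";

    // Collects every token StringTokenizer produces for the given delimiters.
    static List<String> tokenize(String sentence, String delimiters) {
        StringTokenizer tokenizer = new StringTokenizer(sentence, delimiters);
        List<String> words = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "I tried, but failed; but why?";

        // Whitespace only: punctuation stays attached to the words.
        System.out.println(tokenize(sentence, " \t\n\r\f"));
        // prints [I, tried,, but, failed;, but, why?]

        // Whitespace plus punctuation: only the bare words remain.
        System.out.println(tokenize(sentence, DELIMITERS));
        // prints [I, tried, but, failed, but, why]
    }
}
```

One thing to keep in mind: because the single quote is in the delimiter list, a contraction like "don't" comes back as two tokens, "don" and "t". If that matters for your text, you could leave the single quote out of the delimiter string.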
If you're trying to break a text document or string down into words, this is a much more accurate approach than just using whitespace to separate words.