Java StringTokenizer - strings, words, and punctuation marks

I was just reading the book Hadoop in Action and came across a nice, simple way to use the Java StringTokenizer class to break a sentence (String) into words, taking into account many standard punctuation marks. Before looking at their solution, first take a look at the code they used to break a String into words using whitespace (a blank space):
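The listing itself isn't included in this excerpt, so here's a minimal sketch of the idea rather than the book's actual code. The class name, method names, and the exact set of punctuation delimiters are my own assumptions; the only API used is the two-argument StringTokenizer constructor, which treats every character in its second argument as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, named here only for illustration.
class TokenizerSketch {

    // Split on a blank only, as in the whitespace example described above.
    static List<String> splitOnBlank(String line) {
        return collect(new StringTokenizer(line, " "));
    }

    // Split on whitespace plus an assumed set of common punctuation marks.
    static List<String> splitOnPunctuation(String line) {
        return collect(new StringTokenizer(line, " \t\n\r\f,.:;?![]'\"()"));
    }

    // Drain a tokenizer into a list so the result is easy to inspect.
    private static List<String> collect(StringTokenizer tokenizer) {
        List<String> words = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "Hello, world! Is this (really) working?";
        System.out.println(splitOnBlank(sentence));       // punctuation stays attached to the words
        System.out.println(splitOnPunctuation(sentence)); // punctuation is dropped
    }
}
```

One trade-off to keep in mind: because the apostrophe is in the delimiter set, a word like "don't" would come back as two tokens, "don" and "t". Remove it from the delimiter string if that matters for your text.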

The Linux wc command (word count)

The Linux word count command is named wc. The wc command counts the number of characters, words, and lines that are contained in a text stream. If that sounds simple or boring, it's anything but; the wc command can be used in Linux command pipelines to do all sorts of interesting things.

Let's take a look at some Linux wc command examples to show the power of this terrific little command.

Java - strip unwanted characters from a string

Here's a quick line of Java code that takes a given input string, strips all the characters from that string other than lowercase and uppercase letters, and returns whatever is left:
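The line of code isn't shown in this excerpt, so the following is a sketch of one common way to do it, not necessarily the original. It assumes String.replaceAll with a negated character class is an acceptable approach; the class and method names are hypothetical.

```java
// Hypothetical helper class, named here only for illustration.
class StripSketch {

    // Remove every character that is not an ASCII letter (a-z or A-Z).
    static String lettersOnly(String input) {
        return input.replaceAll("[^a-zA-Z]", "");
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("Hello, World 123!")); // prints HelloWorld
    }
}
```

Note that this pattern only keeps unaccented ASCII letters, so a character like "é" would be stripped as well.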

A Ruby script to remove binary (garbage) characters from a text file

Problem: You have a file that should be a plain text file, but for some reason it has a bunch of non-printable binary characters (also known as garbage characters) in it, and you'd like a Ruby script that can create a clean version of the file.

Solution: I've demonstrated how to do this in another blog post by using the Unix tr command, but in case you'd like a Ruby script to clean up a file like this, I thought I'd write up a quick program and share it here.

A sed command to display non-visible characters in a text file

I just needed to see which non-printable (non-visible?) characters were embedded in a text file on a Unix system, and I remembered this old sed command:

sed -n 'l' myfile.txt

Note that the character in that sed command is a lower-case letter "L", and not the number one ("1").

This command prints the contents of your file, showing nonprintable characters as backslash escapes, most of them as three-digit octal values, and marking the end of each line with a "$". On some systems tab characters may also be shown as ">" characters.