I was just reading the book Hadoop in Action and came across a nice, simple way to use the Java StringTokenizer class to break a sentence (String) into words, taking into account many standard punctuation marks. Before looking at their solution, first take a look at the code they used to break a String into words using whitespace (a blank):
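The book's listing isn't reproduced in this excerpt, so what follows is only a rough sketch of the idea, not the authors' code. It assumes the one-argument StringTokenizer constructor for the whitespace case and the two-argument constructor for the punctuation case. The class name, method names, and the exact delimiter set are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class WordSplitSketch {

    // Hypothetical delimiter set: whitespace plus some common punctuation.
    // The set used in the book may differ.
    private static final String DELIMITERS = " \t\n\r\f,.:;?![]\"";

    // Splits on whitespace only (StringTokenizer's default delimiters),
    // so punctuation stays attached to the words.
    public static List<String> splitOnWhitespace(String sentence) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(sentence);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    // Splits on whitespace and the punctuation marks listed in DELIMITERS.
    public static List<String> splitOnPunctuation(String sentence) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(sentence, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "Hello, world! How are you?";
        System.out.println(splitOnWhitespace(sentence));  // [Hello,, world!, How, are, you?]
        System.out.println(splitOnPunctuation(sentence)); // [Hello, world, How, are, you]
    }
}
```

The apostrophe is deliberately left out of the delimiter set in this sketch so that contractions such as "don't" stay in one piece.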
The Linux word count command is named wc. The wc command counts the number of characters, words, and lines that are contained in a text stream. If that sounds simple or boring, it's anything but; the wc command can be used in Linux command pipelines to do all sorts of interesting things.
Let's take a look at some Linux wc command examples to show the power of this terrific little command.
Here's a quick line of Java code that takes a given input string, strips all the characters from that string other than lowercase and uppercase letters, and returns whatever is left:
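The original one-liner isn't included in this excerpt. One common way to do this, shown here only as a sketch, is String.replaceAll with a negated character class. The class and method names are made up for the example, and "letters" is assumed to mean the ASCII ranges a-z and A-Z.

```java
public class LettersOnly {

    // Removes every character that is not an ASCII letter (a-z, A-Z).
    // Assumption: accented and other non-ASCII letters should be stripped too.
    public static String lettersOnly(String input) {
        return input.replaceAll("[^a-zA-Z]", "");
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("Hello, World! 123")); // HelloWorld
    }
}
```

If non-ASCII letters should be kept, the pattern "[^\\p{L}]" is one alternative to consider.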
Problem: You have a file that should be a plain text file, but for some reason it has a bunch of non-printable binary characters (also known as garbage characters) in it, and you'd like a Ruby script that can create a clean version of the file.
Solution: I've demonstrated how to do this in another blog post by using the Unix tr command, but in case you'd like a Ruby script to clean up a file like this, I thought I'd write up a quick program and share it here.
For a variety of reasons you can end up with text files on your Unix filesystem that have binary characters in them. In fact, I showed you how to do this to yourself in my blog post about the Unix script command. (There’s nothing wrong with this approach; it’s just a by-product of using the script command.)
I just needed to see which non-printable (non-visible?) characters were embedded in a text file on a Unix system, and I remembered this old sed command:
sed -n 'l' myfile.txt
Note that the character in that sed command is a lower-case letter "L", and not the number one ("1").
This command shows the contents of your file, and displays some of the nonprintable characters as their octal values. On some systems tab characters may also be shown as ">" characters.