When working with text files on a Unix/Linux system, you'll occasionally run into a situation where a file will contain extended ASCII characters. These extended characters will generally appear to begin with ^ or [ characters in your text files. For instance, the vi/vim editor will show ^M characters in DOS text files when they are transferred to Unix systems, such as when using the ftp command in binary transfer mode. Oftentimes, you'll want to easily delete these characters from your files.
Having run into this problem throughout the years, I created a simple little script to remove these extended characters from my text files. The guts of the program is a one-line tr command that prints only the characters I allow it to print, and removes all other characters.
The Unix/Linux “tr” command
If you haven't used the Unix tr command before, you'll find that it’s an interesting utility that lets you translate characters in the tr standard input stream into different characters in the tr standard output stream. The tr command is one of the true “filters” in the Unix operating system, because it works only on input/output streams, and not on files.
A simple example of the tr command is shown below. This example converts the phrase hello world into jello world by replacing the letter h in the input stream with the letter j in the output stream:
$ echo "hello world" | tr h j
jello world
As a second example, the tr command can also be used to delete characters as they are read in from the input stream and written to the output stream. For instance, the following command converts the word fred in the input stream into the word red in the output stream, by deleting the letter f in the translation process:
$ echo "fred" | tr -d f
red
The -d flag is what tells tr to delete the characters you supply.
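The same -d flag can handle the ^M case mentioned earlier, since ^M is how vi displays a carriage return (octal 15). The following is a minimal sketch; the sample text is made up with printf rather than read from a real DOS file, and the variable names are only for illustration.

```shell
# Build a two-line sample with DOS (CRLF) line endings.
# The text and the variable names are hypothetical, chosen for this example.
dos_text=$(printf 'line one\r\nline two\r\n')

# Delete every carriage return (octal 15); linefeeds are left alone.
unix_text=$(printf '%s' "$dos_text" | tr -d '\15')

printf '%s\n' "$unix_text"
```

Because tr reads only from standard input, the same command works on a file by redirecting it in with < and redirecting the result out with >.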
Removing all undesirable characters at once
In the shell script I use to remove all non-printable ASCII characters from a text file, I tell the tr command that in its translation process it should delete every character in the input stream except for the characters I specify. In essence, I filter out the undesirable characters. The tr command I use in that script is shown below:
tr -cd '\11\12\40-\176' < "$INPUT_FILE" > "$OUTPUT_FILE"
In this command, the variable INPUT_FILE must contain the name of the Unix file you’re reading from, and OUTPUT_FILE must contain the name of the output file you’re writing to. The -c option tells tr to use the complement of the character set you supply, so when the -c and -d options of the tr command are used in combination like this, tr deletes everything outside that set, and the only characters tr writes to the standard output stream are the characters you specify on the command line.
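The full script is not shown here, so the following is only a sketch of how such a wrapper might look. The function name strip_nonprintable, the temporary directory, and the sample file names are all hypothetical; only the tr command itself comes from the text above.

```shell
#!/bin/sh
# Hypothetical wrapper around the tr command above; not the author's actual script.
strip_nonprintable() {
    # $1 = file to read from, $2 = file to write to
    INPUT_FILE=$1
    OUTPUT_FILE=$2
    # Keep tab, linefeed, and space through tilde; delete everything else.
    tr -cd '\11\12\40-\176' < "$INPUT_FILE" > "$OUTPUT_FILE"
}

# Try it on a throwaway file that contains a carriage return and a bell character.
tmpdir=$(mktemp -d)
printf 'clean\r\n\007text\n' > "$tmpdir/sample_in.txt"
strip_nonprintable "$tmpdir/sample_in.txt" "$tmpdir/sample_out.txt"

result=$(cat "$tmpdir/sample_out.txt")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Writing to a separate output file matters here: redirecting the output back onto the input file would truncate it before tr has a chance to read it.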
Using octal characters with ‘tr’
Although it may not look very attractive, I’m using octal characters in the tr command to make the programming job easier and more efficient. This command tells tr to retain only the octal characters (a) 11, (b) 12, and (c) 40-176 when writing to standard output. Octal character 11 corresponds to the [Tab] character, and octal 12 corresponds to the [Linefeed] character. The octal characters 40-176 correspond to the standard visible keyboard characters, beginning with the [Space] character (octal 40) through the ~ character (octal 176). These are the only characters retained by tr; the rest are filtered out, leaving you with a clean ASCII file.
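To see the effect of those octal ranges, the sketch below feeds tr a made-up sample string containing a tab, an accented character encoded as two bytes (octal 303 251), a carriage return, and a linefeed. The LC_ALL=C setting is an addition that is not part of the original command; it is included on the assumption that some tr implementations need it to treat the input as raw bytes.

```shell
# Sample input: "caf" + two-byte e-acute + tab + "ok" + carriage return + linefeed.
# The sample text and variable names are hypothetical.
sample='caf\303\251\tok\r\n'

# Apply the filter from the article. LC_ALL=C is an assumption added here so
# that tr works byte by byte regardless of the current locale.
filtered=$(printf "$sample" | LC_ALL=C tr -cd '\11\12\40-\176')

# Dump the surviving bytes to confirm that only tab, linefeed, and
# printable ASCII characters got through.
printf '%s\n' "$filtered" | od -c
```

Note that the accented character is dropped entirely rather than converted to a plain e, because tr -d only deletes bytes and never substitutes for them.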
If you haven’t used the tr command before, I hope this tutorial has been helpful. As mentioned, the tr command is a filter that helps you transform input streams into desired output streams. While this is good, for other, more complicated scenarios you may need to use the Unix/Linux sed command. sed stands for “stream editor,” and also lets you convert/transform input streams to desired output streams. For more information on sed, see the Unix sed tutorials on this website.