How to remove extended ASCII characters from Unix files with the 'tr' command

When working with text files on a Unix/Linux system, you'll occasionally run into a file that contains control characters or extended ASCII characters. In an editor, these characters generally appear as sequences that begin with ^ or [ characters. For instance, the vi/vim editor shows ^M (carriage return) characters in DOS text files that were transferred to a Unix system without line-ending conversion, such as when using the ftp command in binary transfer mode. Oftentimes, you'll want an easy way to delete these characters from your files.

Having run into this problem throughout the years, I created a simple little script to remove these extended characters from my text files. The guts of the program is a one-line tr command that prints only the characters I allow it to print, and removes all other characters.

The Unix/Linux “tr” command

If you haven't used the Unix tr command before, you'll find that it’s an interesting utility that lets you translate characters read from its standard input into different characters written to its standard output. The tr command is one of the true “filters” in the Unix operating system: it takes no filename arguments and works only on input/output streams, so you feed it files with shell redirection or pipes.

A simple example of the tr command is shown below. This example converts the phrase hello world into jello world by replacing the letter h in the input stream with the letter j in the output stream:

$ echo "hello world" | tr h j
jello world

As a second example, the tr command can also be used to delete characters as they are read in from the input stream and written to the output stream. For instance, the following command converts the word fred in the input stream into the word red in the output stream, by deleting the letter f in the translation process:

$ echo "fred" | tr -d f
red

The -d flag is what tells tr to delete the characters you supply.
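The same delete option handles the ^M problem mentioned earlier. As a minimal sketch (the function name strip_cr is just an illustrative name, not part of my script), this deletes every carriage return, which is octal 15, from the input stream:

```shell
# Hypothetical helper: delete carriage returns (octal 15, shown as ^M in vi)
# from standard input and write the result to standard output.
strip_cr() {
  tr -d '\015'
}

# Simulate a two-line DOS file, whose lines end in CR+LF, and clean it up.
printf 'line one\r\nline two\r\n' | strip_cr
```

Because tr is a filter, you would run it against a real file with redirection, for example strip_cr < dosfile.txt > unixfile.txt, where both filenames are placeholders.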

Removing all undesirable characters at once

In the shell script I use to remove all non-printable ASCII characters from a text file, I tell the tr command that in its translation process it should delete every character in the input stream except for the characters I specify. In essence, I filter out the undesirable characters. The tr command I use in that script is shown below:

tr -cd '\11\12\40-\176' < "$INPUT_FILE" > "$OUTPUT_FILE"

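In this command, the variable INPUT_FILE must contain the name of the Unix file you’re reading from, and OUTPUT_FILE must contain the name of the output file you’re writing to. The two must be different files, because the shell truncates the output file before tr reads anything. The -c option tells tr to use the complement of the set you supply, and -d tells it to delete. When the two options are used in combination like this, tr deletes everything that is not in the set, so the only characters it writes to the standard output stream are the characters you specify on the command line.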
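The full script isn't reproduced in this article, so the following is only a sketch of how such a wrapper might look. The function name clean_ascii, the usage message, and the exit codes are my own illustrative choices. I also set LC_ALL=C on the assumption that you want tr to treat the input strictly as single bytes, which some tr implementations need when the input is not valid text in the current locale.

```shell
#!/bin/sh
# Hypothetical wrapper around the tr command shown above.
# Usage: clean_ascii INPUT_FILE OUTPUT_FILE
clean_ascii() {
  if [ "$#" -ne 2 ]; then
    echo "usage: clean_ascii INPUT_FILE OUTPUT_FILE" >&2
    return 2
  fi
  if [ ! -r "$1" ]; then
    echo "clean_ascii: cannot read $1" >&2
    return 1
  fi
  # Keep only Tab (11), Linefeed (12), and Space through ~ (40-176).
  # LC_ALL=C (an assumption, see above) makes tr work byte by byte.
  LC_ALL=C tr -cd '\11\12\40-\176' < "$1" > "$2"
}
```

After defining the function you could call it as clean_ascii dirty.txt clean.txt, where both filenames are placeholders.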

Using octal characters with ‘tr’

Although it may not look very attractive, I’m using octal character codes in the tr command because they let me name non-printing characters and describe a whole range of characters compactly. This command tells tr to retain only the characters with octal codes (a) 11, (b) 12, and (c) 40 through 176 when writing to standard output. Octal 11 corresponds to the [Tab] character, and octal 12 corresponds to the [Linefeed] character. Octal codes 40 through 176 correspond to the standard visible keyboard characters, beginning with the [Space] character (octal 40) and ending with the ~ character (octal 176). These are the only characters retained by tr. Everything else, including the carriage return (octal 15) that shows up as ^M, is filtered out, leaving you with a clean ASCII file.
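If you'd like to check the octal codes for yourself, here is a small sketch that uses printf to generate bytes from octal escapes and od -c to display them. The function name keep_printable is hypothetical, and LC_ALL=C is my assumption for forcing byte-by-byte handling of the high-bit test byte.

```shell
# Hypothetical helper wrapping the filter discussed above.
keep_printable() {
  LC_ALL=C tr -cd '\11\12\40-\176'
}

# Show what the boundary codes stand for: Tab, Linefeed, Space, and ~.
printf '\11\12\40\176' | od -An -c

# Feed in a carriage return (15), an escape (33), and a high-bit byte (351)
# mixed in with ordinary letters. Only the letters and the final newline
# survive, so this prints: abcd
printf 'a\15b\33c\351d\n' | keep_printable
```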

Remove unprintable character sequences with this Perl command

I recently had a file whose content looked like this:

^[[33mpackage ^[[0m<empty> {
  ^[[33mimport ^[[0msys.process.*

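Because each of those ^[[ control sequences is made from multiple characters, my tr command approach won’t work: tr would delete the escape character (shown as ^[) but leave the printable remainder of each sequence, such as [33m, in the file. Fortunately, I found this Perl command that does remove those character sequences: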

perl -pe 's/\x1b\[[0-9;]*[mG]//g' INFILE > OUTFILE

I found that Perl command on this page.
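If Perl isn't available, roughly the same substitution can be sketched with sed. To be clear, this is my own adaptation rather than the command from that page. It builds the escape character with printf because not every sed understands \x1b, and the function name strip_ansi is hypothetical.

```shell
# Hypothetical helper: remove ESC [ ... m and ESC [ ... G sequences
# from standard input, mirroring the pattern used in the Perl command.
strip_ansi() {
  # Store a literal escape character (octal 33) for use in the pattern.
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*[mG]//g"
}

# Recreate the first line of the sample above and strip its color codes.
# This prints: package <empty> {
printf '\033[33mpackage \033[0m<empty> {\n' | strip_ansi
```

Like the Perl version, this only matches sequences that end in m or G, so other kinds of terminal control sequences would pass through untouched.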

Final thoughts

If you haven’t used the tr command before, I hope this tutorial has been helpful. As mentioned, the tr command is a filter that helps you transform input streams into desired output streams. It works on one character at a time, though, so for more complicated scenarios you may need to use the Unix/Linux sed command. sed stands for “stream editor,” and it also lets you convert/transform input streams to desired output streams. For more information on sed, see the Unix sed tutorials on this website.