When working with text files on a Unix/Linux system, you'll occasionally run into a file that contains control characters or extended ASCII characters. In an editor, control characters generally appear as a ^ followed by another character, such as ^M or ^[. For instance, the vi/vim editor shows ^M characters in DOS text files that were transferred to a Unix system without line-ending conversion, such as when using the ftp command in binary transfer mode. Often you'll want an easy way to delete these characters from your files.
Having run into this problem over the years, I created a simple little script to remove these characters from my text files. The guts of the program is a one-line tr command that prints only the characters I allow it to print, and removes all other characters.
The Unix/Linux “tr” command
If you haven't used the Unix tr command before, you'll find that it’s an interesting utility that lets you translate characters in the standard input stream into different characters in the standard output stream. The tr command is one of the true “filters” in the Unix operating system, because it works only on input/output streams, and not on files.
A simple example of the tr command is shown below. This example converts the phrase hello world into jello world by replacing the letter h in the input stream with the letter j in the output stream:
$ echo "hello world" | tr h j
jello world
As a second example, the tr command can also be used to delete characters as they are read in from the input stream and written to the output stream. For instance, the following command converts the word fred in the input stream into the word red in the output stream, by deleting the letter f in the translation process:
$ echo "fred" | tr -d f
red
The -d flag is what tells tr to delete the characters you supply.
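As a small sketch of the same two ideas applied to sets of characters rather than single letters, the commands below translate a range and delete a carriage return. The sample strings and the variable names upper and cleaned are made up for illustration.

```shell
# Translate every lowercase letter to its uppercase counterpart.
upper=$(echo "hello world" | tr 'a-z' 'A-Z')
echo "$upper"

# Delete carriage returns (the ^M that vi shows) from DOS-style lines.
cleaned=$(printf 'line one\r\nline two\r\n' | tr -d '\r')
echo "$cleaned"
```

The first command should print HELLO WORLD, and the second should print the two lines without their trailing carriage returns.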
Removing all undesirable characters at once
In the shell script I use to remove all non-printable ASCII characters from a text file, I tell the tr command that in its translation process it should delete every character in the input stream except for the characters I specify. In essence, I filter out the undesirable characters. The tr command I use in that script is shown below:
tr -cd '\11\12\40-\176' < "$INPUT_FILE" > "$OUTPUT_FILE"
In this command, the variable INPUT_FILE must contain the name of the Unix file you’re reading from, and OUTPUT_FILE must contain the name of the output file you’re writing to. The -c option tells tr to use the complement of the character set you list, and -d tells it to delete. Used in combination like this, tr deletes everything that is not in the list, so the only characters it writes to the standard output stream are the characters you specify on the command line.
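To show how that one-liner might sit inside a script, here is a minimal sketch. It is not the original script: the function name strip_nonprintable, the temporary files, and the sample text are all assumptions made for this example.

```shell
# Hypothetical helper: $1 is the input file, $2 is the output file.
strip_nonprintable() {
    tr -cd '\11\12\40-\176' < "$1" > "$2"
}

# Build a small sample file that contains a carriage return and an ESC byte.
INPUT_FILE=$(mktemp)
OUTPUT_FILE=$(mktemp)
printf 'hello\r\n\033world\n' > "$INPUT_FILE"

strip_nonprintable "$INPUT_FILE" "$OUTPUT_FILE"

# Read the cleaned text back, then remove the temporary files.
result=$(cat "$OUTPUT_FILE")
echo "$result"
rm -f "$INPUT_FILE" "$OUTPUT_FILE"
```

Quoting the file name variables keeps the redirections working when a name contains spaces.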
Using octal characters with ‘tr’
Although it may not look very attractive, I’m using octal character codes in the tr command to make the programming job easier and more efficient. This command tells tr to retain only the characters with octal codes (a) 11, (b) 12, and (c) 40 through 176 when writing to standard output. Octal 11 corresponds to the [Tab] character, and octal 12 corresponds to the [Linefeed] character. Octal 40 through 176 correspond to the standard visible keyboard characters, beginning with the [Space] character (octal 40) and ending with the ~ character (octal 176). These are the only characters retained by tr. The rest are filtered out, including the carriage return (octal 15) that shows up as ^M, leaving you with a clean ASCII file.
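If the octal codes are hard to read, a character-class form can express the same set. The sketch below assumes the C locale, where the [:print:] class covers exactly octal 40 through 176; the sample input and the variable names octal_out and class_out are invented for the comparison.

```shell
# Inspect the bytes behind a few of the octal codes (tab, linefeed, space, ~).
printf '\11\12\40\176' | od -c

# Octal form, as used in the command above.
octal_out=$(printf 'a\tb\r\n\007c~\n' | LC_ALL=C tr -cd '\11\12\40-\176')

# Character-class form: printable characters plus tab and newline.
class_out=$(printf 'a\tb\r\n\007c~\n' | LC_ALL=C tr -cd '[:print:]\t\n')

# Both commands should keep the same characters.
[ "$octal_out" = "$class_out" ] && echo "same result"
```

In other locales [:print:] may match more than the 7-bit ASCII range, which is one reason to prefer the explicit octal range when a strictly ASCII result is the goal.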
Remove unprintable character sequences with this Perl command
I recently had a file whose content looked like this:
^[[33mpackage ^[[0m<empty> { ^[[33mimport ^[[0msys.process.*
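Because each of those ^[[ control sequences is made up of multiple characters, my tr command approach won’t work: it would delete only the unprintable escape character and leave the printable remainder (such as [33m) behind. Fortunately I found this Perl command that does work to remove those character sequences: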
perl -pe 's/\x1b\[[0-9;]*[mG]//g' INFILE > OUTFILE
I found that Perl command on this superuser.com page.
Final thoughts
If you haven’t used the tr command before, I hope this tutorial has been helpful. As mentioned, the tr command is a filter that helps you transform input streams into desired output streams. For other, more complicated scenarios you may need to use the Unix/Linux sed command. sed stands for “stream editor,” and also lets you convert/transform input streams to desired output streams. For more information on sed, see the Unix sed tutorials on this website.
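As one example of such a scenario, the same kind of substitution used in the Perl command can probably be written with sed. This is a sketch, not a tested replacement for every case: it captures the escape byte in a shell variable (the names esc and sed_out are assumptions) because writing \x1b inside a sed pattern is not portable across sed implementations.

```shell
# Capture a literal ESC byte (octal 33) in a variable.
esc=$(printf '\033')

# Remove ESC[...m and ESC[...G sequences from a sample line.
sed_out=$(printf '\033[33mimport \033[0msys.process.*\n' \
    | sed "s/${esc}\[[0-9;]*[mG]//g")
echo "$sed_out"
```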