Developer's Daily Unix by Example

How to remove extended ASCII characters from Unix files

When working with text files on a Solaris system, you'll occasionally find that a file contains control characters or extended ASCII characters.  In an editor, control characters typically show up as sequences beginning with ^, and escape sequences show up beginning with ^[.  For instance, the vi editor shows a ^M (a carriage return) at the end of each line of a DOS text file that was transferred to a Solaris system using the ftp command in binary transfer mode.  Often you'll want an easy way to delete these characters from your files.

Having run into this problem over the years, we created a simple little program to remove these characters from our text files.  The heart of the program is a one-line tr command that passes through only the characters we tell it to keep, and removes all other characters.
 

The tr utility

If you haven't used the tr utility before, you'll find that it's an interesting utility that lets you translate characters read from its standard input stream into different characters written to its standard output stream.  The tr command is one of the true "filters" in the Solaris operating system, because it works only on input/output streams: it does not accept file names as arguments, so files must be supplied through redirection or a pipe.

A simple example of the tr command is shown below.  This example converts the phrase "hello world" into "jello world", by replacing the letter 'h' in the input stream with the letter 'j' in the output stream:

 $  echo "hello world" | tr h j
 jello world

As a second example, the tr utility can also be used to delete characters as they are read in from the input stream and written to the output stream.  For instance, the following command converts the word fred in the input stream into the word red in the output stream, by deleting the letter 'f' in the translation process:

 $  echo "fred" | tr -d f
 red
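
The same -d option is enough to strip the ^M characters mentioned at the start of this article.  The following is a minimal sketch, not part of our original program: the sample text and the variable names dos_text and unix_text are made up for illustration.  It deletes every carriage return (octal 15) and leaves everything else alone:

```shell
# Build a two-line sample with DOS line endings (carriage return + linefeed).
# The variable names and sample text are illustrative assumptions.
dos_text=$(printf 'line one\r\nline two\r\n')

# Delete every carriage return (octal 15); all other characters pass through.
unix_text=$(printf '%s\n' "$dos_text" | tr -d '\015')

printf '%s\n' "$unix_text"
```

Writing the carriage return as octal \015 keeps the example consistent with the octal notation used later in this article.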
 

Removing all undesirable characters at once

In the shell program we use to remove all non-printable ASCII characters from a text file, we tell the tr command to delete every character in the translation process except for the specific characters we specify.  In essence, we filter out the undesirable characters.  The tr command we use in our program is shown below:

 tr -cd '\11\12\40-\176' < "$INPUT_FILE" > "$OUTPUT_FILE"

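In this command, the variable INPUT_FILE must contain the name of the Solaris file you'll be reading from, and OUTPUT_FILE must contain the name of the output file you'll be writing to.  The two must be different files, because the shell truncates the output file before tr reads any input.  The -d option tells tr to delete the characters in the given set, and the -c option complements that set.  When the two options are used in combination like this, tr deletes every character we have not listed, so the only characters it writes to the standard output stream are the characters we've specified on the command line.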
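
For readers who want to see how the one-line command might sit inside a complete script, here is one possible wrapper.  It is a sketch under stated assumptions rather than our actual program: the function name strip_nonprintable, the argument order, and the error messages are all hypothetical, and the LC_ALL=C setting is an added precaution so that tr treats the input as single bytes regardless of the current locale.

```shell
#!/bin/sh
# Hypothetical wrapper around the tr command; the function name, argument
# order, and messages are assumptions made for this sketch.
strip_nonprintable() {
    if [ "$#" -ne 2 ]; then
        echo "usage: strip_nonprintable INPUT_FILE OUTPUT_FILE" >&2
        return 1
    fi

    INPUT_FILE=$1
    OUTPUT_FILE=$2

    if [ ! -r "$INPUT_FILE" ]; then
        echo "cannot read $INPUT_FILE" >&2
        return 1
    fi

    # Keep tab (11), linefeed (12), and space through tilde (40-176);
    # delete every other character. LC_ALL=C is an added precaution,
    # not part of the original command.
    LC_ALL=C tr -cd '\11\12\40-\176' < "$INPUT_FILE" > "$OUTPUT_FILE"
}
```

It would be called as strip_nonprintable dirty.txt clean.txt, where both file names are placeholders.  The variables are quoted so that file names containing spaces are handled correctly.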

Although it may not look very attractive, we're using octal character codes in our tr command because they let us name non-printing characters and a whole range of characters compactly.  Our command tells tr to retain only the characters with octal codes 11, 12, and 40 through 176 when writing to standard output.  Octal 11 corresponds to the [TAB] character, and octal 12 corresponds to the [LINEFEED] character.  Octal 40 through 176 correspond to the standard printable keyboard characters, beginning with the [Space] character (octal 40) and ending with the ~ character (octal 176).  These are the only characters retained by tr.  Everything else is filtered out, including the carriage return (octal 15) that vi displays as ^M, leaving us with a clean ASCII file.
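
To see the filter at work on a few specific bytes, the following sketch feeds tr a short hand-built sample.  The sample string and the variable name cleaned are made up for this illustration, and LC_ALL=C is an added precaution so the extended byte is treated as a single character:

```shell
# Sample input: 'A', tab, 'B', bell (octal 7), carriage return (octal 15),
# an extended character (octal 351), 'C', and a linefeed.
# The variable name "cleaned" is an assumption for this sketch.
cleaned=$(printf 'A\tB\007\015\351C\n' | LC_ALL=C tr -cd '\11\12\40-\176')

# Dump the surviving characters one by one so non-printing ones are visible.
printf '%s\n' "$cleaned" | od -c
```

The tab survives because octal 11 is in the retained set, while the bell, the carriage return, and the extended character fall outside it and are dropped.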
 

Final thoughts

Depending on the file types and terminals you work with, you may need to use more advanced sed commands to filter out undesirable sequences of characters.  However, when you only need to weed out individual characters rather than multi-character sequences, tr is a great little utility to have around.
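
As one illustration of where sed becomes necessary, consider terminal color codes, which consist of an escape character followed by printable characters such as [31m.  The tr command above would delete only the escape character and leave the rest of the sequence behind.  The sketch below is an assumption about the kind of sequence you might meet, not a command from our program; the variable names esc, colored, and plain and the sample text are hypothetical:

```shell
# ESC is octal 33. Store it in a variable, since not every sed
# understands a \033 escape inside its own expressions.
esc=$(printf '\033')

# Hypothetical sample: the word "Error:" wrapped in color codes.
colored=$(printf '\033[31mError:\033[0m disk full\n')

# Remove each whole sequence: ESC, '[', digits or semicolons, then 'm'.
plain=$(printf '%s\n' "$colored" | sed "s/${esc}\[[0-9;]*m//g")

printf '%s\n' "$plain"
```

This pattern covers only color and text-attribute sequences ending in m.  Other terminal sequences would need their own patterns.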
 


copyright 1998-2009, devdaily.com, all rights reserved.
devdaily.com, an alvin alexander production.