A Ruby script to remove binary (garbage) characters from a text file

Problem: You have a file that should be a plain text file, but for some reason it has a bunch of non-printable binary characters (also known as garbage characters) in it, and you'd like a Ruby script that can create a clean version of the file.

Solution: I've demonstrated how to do this in another blog post by using the Unix tr command, but in case you'd like a Ruby script to clean up a file like this, I thought I'd write up a quick program and share it here.

To that end, here's the source code for a Ruby script that reads a given input file, goes through each character in the file, and only outputs valid, printable, ASCII characters to standard output:

#!/usr/bin/ruby

#-------------------------------------------------------------------
#
# Program: PrintableCharsOnly.rb
#
# Purpose: A Ruby script that takes a file as input, and strips out 
#          all the undesirable characters from that file, and prints 
#          out only "good" ASCII characters, i.e., more or less all 
#          the keyboard characters, including TAB, newline, and 
#          carriage return.
#
# Author:  alvin alexander, devdaily.com
#
#-------------------------------------------------------------------

# bail out unless we get the right number of command line arguments
unless ARGV.length == 1
  # write the messages to stderr so they don't end up in a redirected output file
  warn "Dude, not the right number of arguments."
  warn "Usage: ruby PrintableCharsOnly.rb YourInputFile > YourOutputFile"
  exit 1
end

# get the input filename from the command line
file = ARGV[0]

# open the file in binary mode and look at it one byte at a time
File.open(file, "rb") do |f|
  f.each_byte do |c|
    # only print the ascii characters we want to allow
    print c.chr if c == 9 || c == 10 || c == 13 || (c > 31 && c < 127)
  end
end
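
If you'd rather call the same filter from other Ruby code instead of running it as a script, here is a minimal sketch of one way to do it. The method name strip_unprintable is my own invention for illustration (it is not part of the script above), and it uses String#delete with a negated character set, which is roughly Ruby's counterpart to the tr approach mentioned earlier:

```ruby
# Hypothetical helper (not part of the original script): returns a copy of
# text that keeps only TAB, LF, CR, and the bytes 32 through 126.
def strip_unprintable(text)
  # String#b returns a binary (ASCII-8BIT) copy, so every byte is handled on
  # its own and invalid or multi-byte sequences can't raise encoding errors.
  # The leading ^ negates the set, so delete removes everything NOT listed.
  text.b.delete("^\t\n\r\x20-\x7E")
end

# sample input with a UTF-8 "é", a NUL byte, and a BEL byte mixed in
sample = "caf\xC3\xA9\x00 ok\t\x07done\r\n"
p strip_unprintable(sample)   # prints "caf ok\tdone\r\n"
```

Note that this treats every non-ASCII byte as garbage, just like the script does, so accented characters stored as UTF-8 are dropped too.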

Discussion

As you can see from the source code shown above, this Ruby script prints only the following characters (listed here by their byte values) to standard output:

byte/decimal value 9: tab character
byte/decimal value 10: linefeed
byte/decimal value 13: carriage return
byte/decimal value 32 through 126 (octal 040 through 176): all the "good" keyboard characters, from the space character through the tilde
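
To double-check the list above, here is a small sketch that restates the rule as a predicate and counts how many of the 256 possible byte values pass it. The names ALLOWED_CONTROL_BYTES and allowed_byte? are assumptions made up for this example; the script itself just uses the inline condition:

```ruby
# Hypothetical constant and helper, named here only for illustration.
ALLOWED_CONTROL_BYTES = [9, 10, 13].freeze   # TAB, linefeed, carriage return

# true when the byte is one the script would print
def allowed_byte?(byte)
  ALLOWED_CONTROL_BYTES.include?(byte) || (32..126).cover?(byte)
end

allowed = (0..255).select { |b| allowed_byte?(b) }
puts allowed.size                               # 98 (3 control bytes + 95 printable characters)
puts format("%d is octal %o", 32, 32)           # 32 is octal 40
puts format("%d is octal %o", 126, 126)         # 126 is octal 176
```

Byte 127 (DEL) is left out on purpose, which matches the c < 127 test in the script.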

For more information on ASCII characters

For more information on ASCII characters check out the ASCII character tables at either of these sites:
