Linux sed command: Use sed and wc to count leading blanks in a file

Way back in the day (pre-2007) I used JSPs and servlets to generate a lot of the pages around here, and today I looked at how many blank spaces and blank lines are generated by the JSPs. I don't think I can do much about the blank lines (actually, I just haven't looked into it yet), but about those blank spaces ...

Out of curiosity I decided to look into this -- how many blank spaces are there at the beginning of lines that I could delete just through formatting? Would deleting those characters help reduce my bandwidth costs (at the expense of slightly uglier JSPs)?

I thought about writing a Ruby script to get it right, but I've been working with sed so much lately that I thought I'd just give it a try. So, without any further introduction, I think this sed script is very close to giving me what I want -- a count of the number of blank spaces at the beginning of all lines in a sample HTML file:

# 1. delete blank lines
/^$/d

# 2. delete lines beginning with a tab (\t works in GNU sed; with other
#    seds, type a literal Tab character in its place)
/^\t/d

# 3. delete lines beginning w/ any alpha characters, <, or %
#    (no backslashes needed inside the brackets)
/^[a-zA-Z<%]/d

# 4. find lines beginning w/ one or more blanks, then print only
#    the blanks
/^  */ {
        s/^\(  *\).*/\1/
}

# 5. delete all lines that just have ^M, a carriage return
#    (type it with the ^V ^M trick; it is one character, not ^ and M)
/^^M$/d

As you can see from the five comments, these commands will (1) delete all completely blank lines from the output stream; (2) delete all lines beginning with a [Tab] character; (3) delete all lines beginning with alpha characters, the '<' character, or '%'; (4) find all lines beginning with one or more blanks and print only the leading blanks from each of those lines; and (5) delete any lines that contain nothing but a '^M' (carriage return) character.
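For comparison, the same idea can probably be squeezed into a single substitution by turning off sed's default printing. This is only a sketch of an alternative, not the script above, and the sample input is made up for illustration:

```shell
# Hypothetical sample input: a line with 4 leading blanks, an empty line,
# a tab-indented line, a line starting with '<', and a line with 2 leading
# blanks that ends in a carriage return.
# -n suppresses automatic printing; the p flag prints a line only when the
# substitution succeeded, i.e. only when the line began with a blank.
blanks_only=$(printf '    <p>four</p>\n\n\t<p>tab</p>\n<html>\n  <b>two</b>\r\n' |
  sed -n 's/^\(  *\).*/\1/p')

# Count the lines that had leading blanks (tr strips the padding that
# some wc implementations add to their output).
line_count=$(printf '%s\n' "$blanks_only" | wc -l | tr -d ' ')
echo "$line_count"
```

Because nothing is printed unless the substitution matches, the delete rules for blank lines, tabs, and tags are not needed in this form. Any trailing ^M is swallowed by the `.*` along with the rest of the line.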

Naming this file "leadingblanks.sed", I run it like this, piping the output into the wc command:

sed -f leadingblanks.sed < mySampleFile.html | wc

which leads to output like this:

358       0    2967

This output means that wc found 358 lines in the stream from sed, zero words (so nothing but whitespace made it through), and 2,967 characters. Note that wc counts the newline at the end of each line as a character, so the number of actual blank spaces is 2,967 minus 358, or 2,609. I really don't care about the difference; this is close enough for today.
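Whether wc includes newlines in its character count can be checked on a small stream. A minimal sketch, where emit_blanks is a hypothetical stand-in for the output of the sed script:

```shell
# Hypothetical stand-in for the sed output: three lines holding
# 4, 2, and 3 blanks respectively.
emit_blanks() { printf '    \n  \n   \n'; }

# wc -c counts every byte, newlines included.
with_newlines=$(emit_blanks | wc -c | tr -d ' ')
# wc -l counts the newlines themselves.
lines=$(emit_blanks | wc -l | tr -d ' ')
# Deleting the newlines first leaves only the blanks to count.
blanks=$(emit_blanks | tr -d '\n' | wc -c | tr -d ' ')

echo "$with_newlines - $lines = $blanks"
```

Against the real file, the same approach would be to pipe the sed output through tr -d '\n' before wc -c instead of running a bare wc.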

So, as a quick summary, the JSP file that I looked at is printing as many as 2,967 extra blank spaces at the beginning of the lines it outputs -- meaning my HTML files are that much larger than they have to be -- and I can easily delete most of these characters if I want to use this as a means of reducing my bandwidth bill (and arguably making your page load that much faster).

Assuming an average page has 10,000 characters (which is close), I can reduce the size of this particular page by as much as 29.7% (or about 26.1% once the 358 newline characters counted by wc are subtracted).
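The percentage arithmetic can be reproduced with awk. This sketch only plugs in the figures from the wc output and the assumed 10,000-character page; the variable names are mine:

```shell
# Figures from the wc output above; page_size is the assumed
# 10,000-character average page.
chars=2967
lines=358
page_size=10000

# Upper bound: every character wc reported, newlines included.
upper=$(LC_ALL=C awk -v c="$chars" -v p="$page_size" \
  'BEGIN { printf "%.1f", c / p * 100 }')

# Blanks only: subtract one newline per line before dividing.
blanks_pct=$(LC_ALL=C awk -v c="$chars" -v l="$lines" -v p="$page_size" \
  'BEGIN { printf "%.1f", (c - l) / p * 100 }')

echo "upper bound: ${upper}%  blanks only: ${blanks_pct}%"
```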