Posts in the “linux-unix” category

Sorting Unix 'ls' command output by filesize

I just noticed that some of the MySQL files on this website had grown very large, so I wanted to be able to list all of the files in the MySQL data directory and sort them by filesize, with the largest files shown at the end of the listing. This ls command did the trick:

ls -Slhr

The -S option is the key, telling the ls command to sort the file listing by size. The -h option tells ls to make the output human readable, and -r tells it to reverse the output, so in this case the largest files are shown at the end of the output.
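For example, assuming MySQL's common default data directory location (your path may differ):

ls -Slhr /var/lib/mysql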

Linux shell script date formatting

Unix/Linux date FAQ: How do I create a formatted date in Linux? (Or, “How do I create a formatted date I can use in a Linux shell script?”)

I just ran into a case where I needed to create a formatted date in a Linux shell script, where the desired date format looks like this:

2010-07-11

To create this formatted date string, I use the Linux date command, adding the + symbol to specify that I want to use the date formatting option, like this:
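date +%Y-%m-%d

In a shell script you can then capture that output in a variable, like this:

today=$(date +%Y-%m-%d)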

Linux: How to find multiple filenames with the ‘find’ command

Unix/Linux find command FAQ: How can I write one Unix find command to find multiple filenames (or filename patterns)? For example, I want to find all the files beneath the current directory that end with the file extensions ".class" and ".sh".

You can use the Linux find command to find multiple filename patterns at one time, but the syntax probably isn't familiar to most of us. In short, the solution is to use the find command's "or" option, with a little shell escape magic. Let's take a look at several examples.
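For instance, this command finds all files beneath the current directory that end in ".class" or ".sh"; note that the parentheses must be escaped (or quoted) so the shell passes them through to find:

find . -type f \( -name "*.class" -o -name "*.sh" \)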

Linux: How to get the basename from the full filename

As a quick note today, if you’re ever writing a Unix/Linux shell script and need to get the filename from a complete (canonical) directory/file path, you can use the Linux basename command like this:

$ basename /foo/bar/baz/foo.txt
foo.txt

How to make an offline mirror copy of a website with wget

As a short note today, if you want to make an offline copy/mirror of a website using the GNU/Linux wget command, a command like this will do the trick for you:

wget --mirror            \
     --convert-links     \
     --html-extension    \
     --wait=2            \
     -o log              \
     http://howisoldmybusiness.com

Update: One thing I learned about this command is that it doesn’t make a copy of “rollover” images, i.e., images that are changed by JavaScript when the user rolls over them. I haven’t investigated how to fix this yet, but the easiest thing to do is to copy the /images directory from the server, assuming that you’re making a static copy of your own website, as I am doing. Another thing you can do is manually download the rollover images.

Why I did this

In my case I used this command because I don’t want to use Drupal to serve that website any more, so I used wget to convert the original Drupal website into a series of static HTML files that can be served by Nginx or Apache. (There’s no need to use Drupal here, as I no longer update that website, and I don’t accept comments there.) I just did the same thing with my alaskasquirrel.com website, which is basically an online version of a children’s book that I haven’t modified in many years.

Why use the --html-extension option?

Note that you won’t always need to use the --html-extension option with wget, but because the original version of my How I Sold My Business website did not use any extensions at the end of the URLs, it was necessary in this case.

What I mean by that is that the original version of my website had URLs like this:

http://howisoldmybusiness.com/content/friday-october-18-2002

Notice that there is no .html extension at the end of that URL. Therefore, what happens if you use wget without the --html-extension option is that you end up with a file on your local computer with this name:

content/friday-october-18-2002

Even if you use MAMP or WAMP to serve this file from your local filesystem, they aren’t going to know that this is an HTML file, so essentially what you end up with is a worthless file.

Conversely, when you do use the --html-extension option, you end up with this file on your local filesystem:

content/friday-october-18-2002.html

On a Mac, that file is easily opened in a browser, and you don’t even need MAMP. wget is also smart enough to change all the links within the offline version of the website to refer to the new filenames, so everything works.

Explanation of the wget options used

Here’s a short explanation of the options I used in that wget command:

--mirror
    Turn on options suitable for mirroring. This option turns on 
    recursion and time-stamping, sets infinite recursion depth,
    and keeps FTP directory listings. It is currently equivalent to 
    ‘-r -N -l inf --no-remove-listing’. 

--convert-links
    After the download is complete, convert the links in the document
    to make them suitable for local viewing.

--html-extension
    Save downloaded documents of type text/html with an .html
    file extension, as discussed above.

-o log
    Write log output to a file named "log" instead of printing
    it to the console.

--wait=seconds
    Wait the specified number of seconds between the retrievals.
    Use of this option is recommended, as it lightens the server load 
    by making the requests less frequent.

Depending on the web server settings of the website you’re copying, you may also need to use the -U option, which works something like this:

-U Mozilla
   masquerade as a Mozilla browser

That option lets you set the wget user agent. (I suspect that the string you use may need to be a little more complicated than that, but I didn’t need it, and didn’t investigate it further.)

I got most of these settings from the GNU wget manual.

Update

An alternative approach is to use httrack, like this:

httrack --footer "" http://mywebsite:8888/

I’m currently experimenting to see which works better.

Summary

I’ll write more about wget and its options in a future blog post, but for now, if you want to make an offline mirror copy of a website, the wget command I showed should work.

Linux sed command: Use sed and wc to count leading blanks in a file

Way back in the day (pre-2007) I used JSPs and servlets to generate a lot of the pages around here, and today I looked at how many blank spaces and blank lines are generated by the JSPs. I don't think I can do much about the blank lines (actually, I just haven't looked into it yet), but about those blank spaces ...

Out of curiosity I decided to look at this -- how many blank spaces are there at the beginning of lines that I could delete just through formatting? Would deleting those characters help reduce my bandwidth costs (at the expense of slightly uglier JSPs)?

I thought about writing a Ruby script to get it right, but I've been working with sed so much lately that I thought I'd just give it a try. So, without any further introduction, I think this sed script is very close to giving me what I want -- a count of the number of blank spaces at the beginning of all lines in a sample HTML file:

# 1. delete blank lines
/^$/d

# 2. delete lines beginning with a tab (the character after
#    the '^' below is a literal Tab)
/^	/d

# 3. delete lines beginning w/ any alpha characters, <, or %
/^[a-zA-Z<%]/d

# 4. find lines beginning w/ one or more blanks, then print only
#    the blanks
/^  */ {
        s/^\(  *\).*/\1/
}

# 5. delete all lines that just have ^M (need to do ^V ^M trick here)
/^^M$/d

As you can see from the five comments, these commands will (1) delete all completely blank lines from the output stream; (2) delete all lines beginning with a [Tab] character; (3) delete all lines beginning with alpha characters, the '<' character, or '%'; (4) find all lines beginning with one or more blanks, and print only the blanks from those lines; and (5) delete lines that contain only a '^M' (carriage return) character.

Naming this file "leadingblanks.sed", I run it like this, piping the output into the wc command:

sed -f leadingblanks.sed < mySampleFile.html | wc

which leads to output like this:

358       0    2967

This output means that wc found 358 lines in the stream from sed, and 2,967 characters in that stream. Because each of those 358 lines ends in a newline character, the actual number of blank spaces is 2,967 minus 358, or 2,609; that's close enough for today.

So, as a quick summary, the JSP file that I looked at is printing as many as 2,967 extra blank spaces at the beginning of the lines it outputs -- meaning my HTML files are that much larger than they have to be -- and I can easily delete most of these characters if I want to use this as a means of reducing my bandwidth bill (and arguably making these pages load that much faster).

Assuming an average page has 10,000 characters (which is close), I can reduce the bandwidth of this particular page by as much as 29.7%.
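For what it's worth, awk can also compute this count directly. Here's a one-liner sketch that wasn't part of the original approach; it sums the lengths of the leading runs of spaces on every line:

awk '{ match($0, /^ +/); if (RLENGTH > 0) total += RLENGTH } END { print total }' mySampleFile.html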

vim tip: How to configure vim autoindent

vim autoindent FAQ: How do I configure vim to automatically indent newlines? That is, if my current line is indented three spaces, and I hit [Enter], I want the next line to automatically be indented three spaces as well.

To configure vim autoindent, just use this vim command:
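:set autoindent

To make the setting permanent, add the same command (without the leading colon) to your ~/.vimrc configuration file:

set autoindent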

The vim “delete line” command

vim delete FAQ: How do I delete a line in vim? (Also, how do I delete multiple lines in vim?)

To delete the current line in your vim editor, use this command:

dd

You can use this same command to delete multiple lines in vim. Just precede the command by the number of lines you want to delete. For instance, to delete five lines in vim, use this command:
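5dd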

vi/vim video tutorials

Woo-hoo, I've always wanted to create a vim video tutorial series, and now that I have the software to do it, I'm finally embarking on this adventure.

My vi/vim editor video tutorial - Lesson 1, Introduction

Installing Wiki.js on Ubuntu 20.04, with PostgreSQL

This probably won’t make sense to anyone else, but these are my notes related to installing Wiki.js and PostgreSQL on an Ubuntu 20.04 system. Everything here is related to setting up a new Ubuntu system and then running Wiki.js:

A Linux crontab mail command example

Linux crontab mail FAQ: Can you share an example of a Linux crontab entry you use to send email on a regular basis?

Solution: Here’s the source code for a really simple Linux mail script that I used to send an email message to one of my co-workers every month. This script used the Unix or Linux mail command to email a file to her that showed a list of all the websites on our server that she needed to bill our customers for.
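The script itself isn’t shown in this excerpt, but the corresponding crontab entry looked something like this (the email address and filename here are hypothetical):

# at 6 a.m. on the first day of each month, mail the billing report
0 6 1 * * mail -s "Monthly billing list" mary@example.com < /var/reports/billing.txt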

An example Linux crontab file

Linux crontab format FAQ: Do you have an example of a Unix/Linux crontab file format?

I have a hard time remembering the crontab file format, so I thought I’d share an example crontab file here today. The following file is the root crontab file from a CentOS Linux server I use in a test environment.
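That file isn’t reproduced in this excerpt, but a minimal sketch of the crontab format looks like this (the commands and schedules shown are hypothetical):

# minute hour day-of-month month day-of-week command
30 2 * * * /usr/local/bin/backup.sh
0 */4 * * * /usr/bin/uptime >> /var/log/uptime.log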

A sed command to display non-visible characters in a text file

I just ran into a need to see what non-printable (non-visible?) characters were embedded in a text file in a Unix system, when I remembered this old sed command:

sed -n 'l' myfile.txt

Note that the character in that sed command is a lower-case letter "L", and not the number one ("1").

This command shows the contents of your file, and displays some of the nonprintable characters as their octal values. On some systems tab characters may also be shown as ">" characters.
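For example, with GNU sed, a line containing a tab character and a trailing carriage return is displayed something like this (the $ marks the end of the line):

hello\tworld\r$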

Shell script error: bad interpreter - No such file or directory

Sometimes when you take a file from a DOS/Windows system and move it to a Linux or Unix system you'll have problems with the dreaded ^M character. This happened recently when I moved an Ant script from a Windows system to my Mac OS X system. When I tried to run the shell script in the Mac Terminal, I got this "bad interpreter" error message:
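The exact text varies with the script and shell involved, but the message looks something like this (the script name here is hypothetical):

./build.sh: /bin/sh^M: bad interpreter: No such file or directory

One quick fix is to strip the carriage-return characters with the tr command, like this:

tr -d '\r' < build.sh > build-fixed.sh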

How to remove extended ASCII characters from Unix files with the 'tr' command

When working with text files on a Unix/Linux system, you'll occasionally run into a situation where a file will contain extended ASCII characters. These extended characters will generally appear to begin with '^' or '[' characters in your text files. For instance, the vi/vim editor will show ^M characters in DOS text files when they are transferred to Unix systems, such as when using the ftp command in binary transfer mode. Oftentimes you'll want to easily delete these characters from your files.
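One way to do that is to use tr's "complement" and "delete" options together, keeping only tab, newline, carriage return, and the printable ASCII range, and deleting everything else. Here's a sketch, with hypothetical filenames:

tr -cd '\11\12\15\40-\176' < input.txt > output.txt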

Unix/Linux: Find all files that contain multiple strings/patterns

When using Unix or Linux, if you ever need to find all files that contain multiple strings/patterns, such as finding all Scala files that contain 'try', 'catch', and 'finally', this find/awk command seems to do the trick:

find . -type f -name "*.scala" -exec awk 'BEGIN {RS=""; FS="\n"} /try/ && /catch/ && /finally/ {print FILENAME}' {} \;

All of the matching filenames are printed out. As Monk says, you’ll thank me later. :)

(I should mention that I got part of the solution from this gnu.org page.)

Update: My File Find utility

For a potentially better solution, see my File Find utility, which lets you search for multiple regex patterns in files.

An awk script to extract source code blocks from Markdown files

I just wrote this awk script to extract all of the Scala source code examples out of a Markdown file. It can easily be converted to extract all of the source code examples out of an Asciidoc file, which is something else I will do with it eventually.

Here’s the awk script:

BEGIN {
    # awk doesn’t have true/false variables, so
    # create our own, and initialize our variable.
    true = 1
    false = 0
    printLine = false
}

{
    # look for ```scala to start a block and ``` to stop a block.
    # the `[[:space:]]*` used below means “zero or more whitespace
    # characters”.
    if ($0 ~ /^```scala/) {
        printLine = true
        print ""
    } else if ($0 ~ /^```[[:space:]]*$/) {
        # if printLine was true, we were in a ```scala block,
        # so print the end matter, then make printLine false
        # so printing will stop
        if (printLine == true) {
            print "```"
        }
        printLine = false
    }
 
    if (printLine) print $0
}
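Assuming you save that script in a file named extract.awk and your Markdown source is in a file named book.md (both names are hypothetical), you run it like this:

awk -f extract.awk book.md

The extracted Scala blocks, fences included, are printed to standard output, so you can redirect them to a file if you want to keep them.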