A Drupal 8 XML Sitemap generating PHP script

I’m not going to comment on the following code too much or provide support for it, but (a) if you need to create an XML Sitemap for a Drupal 8 website, and (b) you don’t like the Drupal 8 sitemap modules that are available, then (c) this PHP script can serve as a starting point for you.

I wrote an earlier version of this script for Drupal 6 when I didn’t like the sitemap modules that were available at that time, but it looks like there are sitemap modules for Drupal 8 these days. Probably the only benefit of this script is that you can see how to generate sitemap files from the Drupal 8 database tables, and you can customize it as desired, assuming that you’re comfortable with PHP and SQL queries.

Warnings about the code

As one warning, the code is crappy. I wrote it in a hurry, and because I don’t make it available via any external URLs — I don’t keep it under any DOCROOT — I don’t concern myself with security issues. I run this via a crontab entry a couple of times a day.

As another warning, you do need to know PHP and a little SQL because I do a couple of things to weed out content I don’t want appearing in my sitemap.xml files. For instance, I have content types named blog, photod8, text, etc., and I DO want those to appear in the output files, but I DON’T want other things to appear in the sitemap files, such as “category” and “taxonomy” nodes.
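As a minimal sketch of that filter, the check can be written as a lookup against a list of allowed content types. The type names below come from the script's own checks; the `is_sitemap_type` function name is made up for this example and does not appear in the script.

```php
<?php
// Content types that should appear in the sitemap (taken from the script's checks).
$allowed_types = array('blog', 'bookmark', 'photod8', 'source_code',
                       'text', 'misc', 'inspirational_quote');

// Hypothetical helper: true only for published nodes (status 1) of an allowed type.
function is_sitemap_type($type, $status, $allowed_types) {
    return $status == 1 && in_array($type, $allowed_types, true);
}

var_dump(is_sitemap_type('blog', 1, $allowed_types));      // an allowed, published type
var_dump(is_sitemap_type('category', 1, $allowed_types));  // a type that is filtered out
```

To allow or block a content type with this approach, you would only edit the `$allowed_types` list.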

Also, I left a lot of debug print statements in the code. Just uncomment those lines to see some intermediate output as you’re trying to get the script to work for your needs.
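One more thing to be aware of: the script talks to MySQL through the old `mysql_*` functions, which were removed in PHP 7, so on a newer PHP version the database calls need to be ported. As a rough sketch only, here is how the node query from step 3 of the script might look with PDO. The `fetch_nodes` function name is made up for this example, and the connection shown in the comments reuses the database name and placeholder credentials from the script.

```php
<?php
// Hypothetical helper: runs the script's D8 node query through PDO
// and returns each row as an associative array.
function fetch_nodes(PDO $pdo) {
    $sql = "SELECT nid, type, status, changed FROM node_field_data ORDER BY changed DESC";
    $stmt = $pdo->query($sql);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Example connection (assumes the pdo_mysql extension is installed):
// $pdo = new PDO('mysql:host=localhost;dbname=aa_d8', 'd8_user', 'd8_pass');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// foreach (fetch_nodes($pdo) as $row) {
//     printf("%s, %s, %s, %s\n", $row['nid'], $row['type'], $row['status'], $row['changed']);
// }
```

The other two queries in the script could be ported the same way.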

It creates a series of sitemap.xml files

When I run this script from the command line, it creates a series of files in the current directory that are named like this:

sitemap1.xml
sitemap2.xml
sitemap3.xml
...

It puts the first 1,000 records in the first file, the next 1,000 in the second file, and so on, so every file except possibly the last one holds exactly 1,000 records. If you want more or fewer records per file, change the two `1000` values in the file-writing code at the end of the script.
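The splitting logic boils down to cutting the list of URLs into groups of 1,000 and numbering the files from 1. Here is a minimal sketch of that idea using PHP's `array_chunk`. The script itself uses a counter and a modulo test instead, and the `plan_sitemap_files` name and the fake URLs are made up for this example.

```php
<?php
// Hypothetical helper: maps each output filename to the URLs it should hold.
function plan_sitemap_files(array $urls, $per_file = 1000) {
    $plan = array();
    foreach (array_chunk($urls, $per_file) as $i => $chunk) {
        $plan['sitemap' . ($i + 1) . '.xml'] = $chunk;   // sitemap1.xml, sitemap2.xml, ...
    }
    return $plan;
}

// 2,500 fake URLs end up in three files: 1,000 + 1,000 + 500.
$urls = array();
for ($n = 1; $n <= 2500; $n++) {
    $urls[] = "/node/$n";
}
foreach (plan_sitemap_files($urls) as $filename => $chunk) {
    echo $filename . ': ' . count($chunk) . "\n";
}
```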

A master sitemap.xml file

Because of the way this works, you’ll need a “master” sitemap file that includes/references these additional files. That master sitemap file should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
  http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd">
  <sitemap>
    <loc>http://alvinalexander.com/sitemap0.xml</loc>
    <lastmod>2016-05-23</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://alvinalexander.com/sitemap1.xml</loc>
    <lastmod>2016-05-23</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://alvinalexander.com/sitemap2.xml</loc>
    <lastmod>2016-05-23</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://alvinalexander.com/sitemap3.xml</loc>
    <lastmod>2016-05-23</lastmod>
  </sitemap>
</sitemapindex>

I use another script to generate that master file. (Sorry, I don’t include that script here.)
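For reference, a master file in that format can be built with a few lines of PHP. This is only a minimal sketch of the idea, not the script mentioned above. The `build_sitemap_index` function name is made up for this example, and it assumes the numbered sitemap files sit in the current directory.

```php
<?php
// Hypothetical helper: builds a sitemap index that references the given files.
// $base_url and $lastmod are supplied by the caller.
function build_sitemap_index(array $filenames, $base_url, $lastmod) {
    $xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $xml .= "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    foreach ($filenames as $filename) {
        // escape the url so characters like & can't break the xml
        $loc = htmlspecialchars($base_url . '/' . $filename, ENT_QUOTES, 'UTF-8');
        $xml .= "  <sitemap>\n";
        $xml .= "    <loc>$loc</loc>\n";
        $xml .= "    <lastmod>$lastmod</lastmod>\n";
        $xml .= "  </sitemap>\n";
    }
    $xml .= "</sitemapindex>\n";
    return $xml;
}

// Index every numbered sitemap file found in the current directory.
$files = glob('sitemap[0-9]*.xml') ?: array();
sort($files, SORT_NATURAL);   // natural order, so sitemap10.xml sorts after sitemap9.xml
echo build_sitemap_index($files, 'http://alvinalexander.com', date('Y-m-d'));
```

Redirecting that output to a file gives you the master sitemap.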

My Drupal 8 sitemap generator

Without any further ado or explanation, here’s the source code for the PHP script I use to generate the sitemap.xml files for this website.

#!/usr/bin/php -q
<?php

    class Comment {

        var $cid;      // id
        var $nid;      // nid this comment belongs to
        var $time;     // time comment was created or last edited
        var $status;   // 1 = published in d8 (d6 used 0 = published)
  
        // use __construct: php 8 no longer treats a method named after the class as a constructor
        function __construct($cid, $nid, $time, $status) {
            $this->cid = $cid;
            $this->nid = $nid;
            $this->time = $time;
            $this->status = $status;
        }
    
        function print_as_string() {
            printf("%s, %s, %s, %s\n", $this->cid, $this->nid, $this->time, $this->status);
        }
  
    }


    class Blog {
  
        // from url_alias table
        var $src;
        var $dst;
        var $node_id;
    
        // from node table
        var $type;
        var $status;
        var $last_mod;
      
        function __construct($src, $dst) {
            $this->src = $src;
            $this->dst = $dst;
      
            # d8: need to change this b/c this field now begins with a '/'
            #     (did not begin with one before)
            # populate `node_id` from src (`src` format is now "/node/55")
            $srcTmp = substr($src, 1);
            $parts = explode("/", $srcTmp);
            $this->node_id = $parts[1];
        }
    
        function print_as_string() {
            printf("%s, %s, %s\n", $this->node_id, $this->src, $this->dst);
        }
    
        # d8: updated to account for photod8, text, and misc
        function print_rec($fp) {
            #echo "TYPE = " . $this->type . "\n";
            if ($this->status != 1) return;
            if ($this->type != 'blog'        &&
                $this->type != 'bookmark'    && 
                $this->type != 'photod8'     && 
                $this->type != 'source_code' && 
                $this->type != 'text'        &&
                $this->type != 'misc'        &&
                $this->type != 'inspirational_quote') return;
      
            # desired date format: 2009-08-28
            $the_date = date('Y-m-d', $this->last_mod);
            # escape &, <, >, and quotes so the alias can't break the xml
            $loc = htmlspecialchars($this->dst, ENT_QUOTES, 'UTF-8');
            fwrite($fp, "  <url>\n");
            fwrite($fp, sprintf("    <loc>http://alvinalexander.com%s</loc>\n", $loc) );
            fwrite($fp, sprintf("    <lastmod>%s</lastmod>\n", $the_date) );
            fwrite($fp, "    <changefreq>weekly</changefreq>\n");
            fwrite($fp, "    <priority>0.6</priority>\n");
            fwrite($fp, "  </url>\n");
        }
    }


    class Node {

        var $nid;
        var $type;
        var $status;
        var $changed;
    
        function __construct($nid, $type, $status, $changed) {
            $this->nid = $nid;
            $this->type = $type;
            $this->status = $status;
            $this->changed = $changed;
        }

        function print_as_string() {
            printf("%s, %s, %s, %s\n", $this->nid, $this->type, $this->status, $this->changed);
        }

    }


    //----------------------------------------------
    // returns a newer timestamp, if one is found;
    // otherwise, it returns $curr_date.
    //
    // requires $comments to be global
    //----------------------------------------------
    function get_most_recent_date_for_node($node_id, $curr_date) {
        // must declare that i want to access $comments from the enclosing script
        global $comments;
      
        $num_comments = count($comments);
        $new_date = $curr_date;
      
        # loop thru each comment, looking for a newer date
        for ($k = 0; $k < $num_comments; $k++) {
            $curr_comment = $comments[$k];
        
            # this test works b/c the query is ordered by nid asc
            if ($curr_comment->nid > $node_id) break;
        
            # if we're looking at the same node-id, and the comment timestamp
            # is newer, use it
            if ($curr_comment->nid == $node_id && $curr_comment->time > $new_date) {
                $new_date = $curr_comment->time;
            }
        }
      
        return $new_date;
    }


$header = <<<HEADER
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
  http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

HEADER;

$footer = <<<FOOTER
</urlset>

FOOTER;


    #--------#
    #  MAIN  #
    #--------#
  

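    #----------------------------
    # (1) connect to the database
    # note: the mysql_* functions were removed in php 7, so this
    #       needs php 5.x with the old mysql extension
    #----------------------------

    $link = mysql_connect('localhost', 'd8_user', 'd8_pass');
    if (!$link) {
        die('Could not connect: ' . mysql_error());
    }
    if (!mysql_select_db('aa_d8')) {
        die('Could not select database: ' . mysql_error());
    }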
  



    #-----------------------------------------------------------
    # (2) create a list of blog objects from the url_alias table
    # field names changed from d6 to d8 (old = (src,dst))
    # `alias` field begins with `/`. `dst` did not.
    # note: drupal 8.8 and later store aliases in the `path_alias`
    #       table (columns `path`, `alias`) instead of `url_alias`.
    #-----------------------------------------------------------
    $query = "select source, alias from url_alias";
    $result = mysql_query($query);
  
    $blogs = array();
    while ($row = mysql_fetch_array($result, MYSQL_BOTH)) {
        $b = new Blog($row[0], $row[1]);
        #$b->print_as_string();
    
        # skip entries where the alias begins with these strings.
        # it's important that these strings match position-0 of the dst/alias.
        # `0` is expected on matches, and is used in the loop below.
        $pos1 = strpos($b->dst, '/users/');
        $pos2 = strpos($b->dst, '/taxonomy/');
        $pos3 = strpos($b->dst, '/category/');

        #echo "\n";
        #echo "DEST: $b->dst\n";
        #echo "POS1: $pos1,  POS2: $pos2,  POS3: $pos3\n";

        # strpos returns false when there's no match, and false == 0 in php,
        # so use === to test for a match at position 0
        if ($pos1 === 0 || $pos2 === 0 || $pos3 === 0) { 
            #echo "skipping $b->dst\n";
        } else {
            array_push($blogs, $b);
        }
    }
    


    #---------------------------------------------
    # (3) get all the nodes
    # `status` and `changed` are no longer in node
    #---------------------------------------------
  
    $nodes = array();
    # $q = "select nid, type, status, changed from node order by changed desc";            # d6
    $q = "SELECT nid, type, status, changed FROM node_field_data ORDER BY changed DESC";   # d8
    $result = mysql_query($q);
    while ($row = mysql_fetch_array($result, MYSQL_BOTH)) {
        $n = new Node($row[0], $row[1], $row[2], $row[3]);
        array_push($nodes, $n);
    }

    #foreach ($nodes as $n) {
    #    $n->print_as_string();
    #}

    
  
    #-----------------------------------------------
    # (4) get all the published comments
    # d6 used status = 0 for published; d7 and later use status = 1
    #-----------------------------------------------
    #$cquery = "select cid, nid, timestamp, status from comments where status = 0 order by nid asc";                       # d6
    $cquery = "SELECT cid, entity_id, changed, status FROM comment_field_data WHERE status = 1 ORDER BY entity_id ASC";    # d8
    $result = mysql_query($cquery);
  
    $comments = array();
    while ($row = mysql_fetch_array($result, MYSQL_BOTH)) {
        $c = new Comment($row[0], $row[1], $row[2], $row[3]);
        array_push($comments,$c);
    }

    #foreach ($comments as $c) {
    #    $c->print_as_string();
    #}



    # ---------------------------------------------------------------------
    # (5) create a new blogs array that is in the sorted order
    # ---------------------------------------------------------------------
    # we now have these arrays: $blogs, $nodes, $comments.
    # i can control the order of the nodes array, so use it to
    # create a new blogs array we can work with.
    # ---------------------------------------------------------------------

    $nodecount = count($nodes);
    $blogcount = count($blogs);

    #echo "nodecount = $nodecount\n";    #  7215
    #echo "blogcount = $blogcount\n";    # 11294

    # the new array this loop populates
    $sorted_blogs = array();
  
    for ($i = 0; $i < $nodecount; $i++) {
        $curr_node = $nodes[$i];
        # TODO this loop is slow. can i do the same thing with a sql query?
        for ($j = 0; $j < $blogcount; $j++) {
            $b = $blogs[$j];
            # echo "  blogNid = $b->node_id\n";
            if ($curr_node->nid == $b->node_id) {
                #echo "   found a match at $curr_node->nid\n";
                $b->type     = $curr_node->type;
                $b->status   = $curr_node->status;
                $b->last_mod = $curr_node->changed;

                // new: get the newest possible date for the node by looking at the comment dates
                $b->last_mod = get_most_recent_date_for_node($curr_node->nid, $b->last_mod);
        
                array_push($sorted_blogs, $b);
                break;
            }
        } // end blogcount loop
    } // end nodecount loop
    
    
    #foreach($sorted_blogs as $b) {
    #    $b->print_as_string();
    #}
    #exit;    


    # don't need these any more, might help to free up their ram
    unset($blogs);
    unset($nodes);
    unset($comments);
  
  
    # ---------------------------------------------------------------------
    # need to sort the array here, b/c the comment dates need to be
    # factored in as well
    # ---------------------------------------------------------------------
  
    function cmp($a, $b) {
        if ($a->last_mod == $b->last_mod) {
            return 0;
        }
        return ($a->last_mod < $b->last_mod) ? 1 : -1;
    }
  
    usort($sorted_blogs, "cmp");
  
    
    # ---------------------------------------------------------------------
    # print the records
    # ---------------------------------------------------------------------
    # now have a new $sorted_blogs array, hopefully in the correct, sorted order.
    # now print the $sorted_blogs records.
    # ---------------------------------------------------------------------
  
    # start printing to the first file, then switch to a different file
    # when you put as many records as you want into this one
    # (typically 1,000 records per file).
  
    $count = count($sorted_blogs);
    $recs_printed = 0;
  
    $fp = fopen("sitemap1.xml", 'w');
    fwrite($fp, $header);
  
    for ($i = 0; $i < $count; $i++) {
        # D8: changed 'photo' to 'photod8'. also added 'text'.
        $b = $sorted_blogs[$i];
        if ($b->status == 1 && ($b->type == 'blog'        || 
                                $b->type == 'bookmark'    ||
                                $b->type == 'photod8'     ||
                                $b->type == 'source_code' ||
                                $b->type == 'text'        ||
                                $b->type == 'misc'        ||
                                $b->type == 'inspirational_quote')) {
            # added if condition on 2012/07/16 to avoid printing bad urls to sitemap files
            # returns false if $b->dst contains one or more non-ascii characters
            if (mb_detect_encoding($b->dst, 'ASCII', true)) {
                # switch files only when a record is about to be written, so
                # skipped records can't close and reopen the same file
                if ($recs_printed > 0 && $recs_printed % 1000 == 0) {
                    // close out the old file
                    fwrite($fp, $footer);
                    fclose($fp);
                    // create the new filename
                    $res = $recs_printed / 1000;             // ex: 1000/1000 = 1
                    $res++;                                  // ex: 2
                    $filename = 'sitemap' . $res . '.xml';   // ex: sitemap2.xml
                    // open and write to the new file
                    $fp = fopen($filename, 'w');
                    fwrite($fp, $header);
                }
                $b->print_rec($fp);
                $recs_printed++;
            } else {
                #echo "bad dst, not used: " . $b->dst . "\n";
            }
        }
  
    } # end blog for loop
  
    fwrite($fp, $footer);
    fclose($fp);
  
  
    // (X) free up the database resources
    mysql_free_result($result);
    mysql_close($link);
  
    #print_r($a);

?>
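
On the TODO about the slow nested loop: short of rewriting it as a SQL join, one option is to index the aliases by node id first, so each node is matched with a single array lookup instead of a scan of the whole list. Here is a minimal sketch of that idea, using plain arrays rather than the script's classes. The function names and the sample data are made up for this example.

```php
<?php
// Hypothetical helper: maps node id => alias, given rows of (source, alias)
// where source looks like "/node/55".
function index_aliases_by_nid(array $alias_rows) {
    $by_nid = array();
    foreach ($alias_rows as $row) {
        list($source, $alias) = $row;
        if (preg_match('#^/node/(\d+)$#', $source, $m)) {
            $nid = (int) $m[1];
            // keep the first alias seen for a node, like the `break` in the script
            if (!isset($by_nid[$nid])) {
                $by_nid[$nid] = $alias;
            }
        }
    }
    return $by_nid;
}

// Hypothetical helper: maps node id => newest comment timestamp,
// given rows of (nid, time).
function newest_comment_by_nid(array $comment_rows) {
    $newest = array();
    foreach ($comment_rows as $row) {
        list($nid, $time) = $row;
        if (!isset($newest[$nid]) || $time > $newest[$nid]) {
            $newest[$nid] = $time;
        }
    }
    return $newest;
}

// Sample data standing in for the three query results.
$aliases  = index_aliases_by_nid(array(array('/node/1', '/blog/first'),
                                       array('/node/2', '/blog/second')));
$comments = newest_comment_by_nid(array(array(1, 500), array(1, 300)));
$nodes    = array(array(1, 100), array(2, 200));   // (nid, changed)

foreach ($nodes as $node) {
    list($nid, $changed) = $node;
    if (!isset($aliases[$nid])) continue;          // no alias, so no sitemap entry
    $last_mod = isset($comments[$nid]) ? max($changed, $comments[$nid]) : $changed;
    echo $aliases[$nid] . ' ' . $last_mod . "\n";
}
```

With both lookups built up front, the matching work grows with the number of nodes plus the number of aliases and comments, rather than with their product.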

Summary

As mentioned, the code is pretty crappy. I only write about 1,000 lines of PHP a year, and most of it is in a rush, so it is what it is.