Friday, December 16, 2011

RPM Changelogs for Recent Updates 10x Faster

In my previous post I presented a generalised rpm changelog summary script. I've now tidied up the implementation and added a couple of new options.

One thing bugging me was that the script exec'ed rpm once for each package. Even though UNIX process creation is relatively inexpensive, the programs being exec'ed take time to initialise themselves: they have to open files, read configs, create internal structures, and so on. The cumulative initialisation costs can be substantial. For example, the old makewhatis script that used to ship with many Linux distros exec'ed gawk for every manual page, which took 30 minutes on a 486DX66. It was so annoying that I rewrote it to exec gawk less often, and the run time dropped to 1.5 minutes. The improved version is still included in man-1.6g. Given how many machines were once running this script, the reduction in carbon emissions may have been significant ;-)

By taking advantage of rpm's --queryformat option, which lets the script emit a marker line at the start of each package's output, I've changed the rpmChangelogs script to exec rpm once for every 100 package arguments. This is about 10 times faster for large runs. For example, when I generated a summary dating back to my upgrade from OpenSUSE 11.4 to 12.1, the run time dropped from about 50 seconds to 5 seconds.
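The batching idea can be sketched roughly as follows. This is a minimal illustration in Python 3 syntax rather than the script itself: the helper names `chunks` and `build_rpm_command` and the placeholder package names are made up for the example, and nothing here actually runs rpm.

```python
MAX_ARGS_PER_COMMAND = 100

def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_rpm_command(packages, query_format):
    """Build one rpm argument list that queries a whole batch of packages."""
    return ['rpm', '-q', '--queryformat=' + query_format, '--changelog'] + list(packages)

# Placeholder package names, purely for illustration.
packages = ['pkg%d' % n for n in range(250)]
query_format = '+Package: %{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}\n'

batches = list(chunks(packages, MAX_ARGS_PER_COMMAND))
commands = [build_rpm_command(batch, query_format) for batch in batches]

print([len(batch) for batch in batches])  # [100, 100, 50]
print(len(commands))                      # 3 rpm invocations instead of 250
```

Each command list could then be handed to subprocess.Popen, so 250 packages cost three rpm start-ups rather than 250.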

I've added an option to include the description of the package. I've also added an option to accept the rpm names from the command line instead of just reporting on the most recently installed ones.

Here is the syntax summary for the new version:

python rpmChangelogs.py -h
Usage: rpmChangelogs.py [options] [rpm...] 

Report change log entries for recently installed (-i) rpm's or for the rpm's
specified on the command line.

Options:
  -h, --help            show this help message and exit
  -i INSTALLDAYS, --installed-since=INSTALLDAYS
                        Include anything installed up to INSTALLDAYS days ago.
  -c CHANGEDAYS, --changed-since=CHANGEDAYS
                        Report change log entries from up to CHANGEDAYS days
                        ago.
  -d, --description     Include each rpm's description in the output.

Except for the optional addition of the description, the output is the same as the previous OpenSUSE-only script.
My Python is a little rusty - I just spent months doing Java - so I've also gone back over it and tried to tidy up the code.

The code


#!/usr/bin/env python
#
# rpmChangelogs.py 
#
# Copyright (C) 2011: Michael Hamilton
# The code is GPL 3.0(GNU General Public License) ( http://www.gnu.org/copyleft/gpl.html )
# 
# Updated 2013/03/18: now uses seconds from 1970 to avoid localisation issues with the dates output by rpm.
#
import subprocess
from datetime import datetime,  timedelta
from optparse import OptionParser

maxArgsPerCommand=100

optParser = OptionParser(
            usage='Usage: %prog [options] [rpm...] ', 
            description="Report change log entries for recently installed (-i) rpm's or for the rpm's specified on the command line.")
optParser.add_option('-i',  '--installed-since',  dest='INSTALLDAYS', type='int', default=1,  help='Include anything installed up to INSTALLDAYS days ago.')
optParser.add_option('-c',  '--changed-since',  dest='CHANGEDAYS', type='int', default=60,  help='Report change log entries from up to CHANGEDAYS days ago.')
optParser.add_option('-d',  '--description',  dest='DESC', action='store_true', default=False,  help="Include each rpm's description in the output.")
(options, args) = optParser.parse_args()

installedSince = datetime.now() - timedelta(days=options.INSTALLDAYS)
changedSince = datetime.now() - timedelta(days=options.CHANGEDAYS)
showDesc = options.DESC

if len(args) > 0:
    recentPackages = args
else:
    queryProcess = subprocess.Popen(['rpm', '-q', '-a', '--last'], shell=False, stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
    recentPackages = []
    for queryLine in queryProcess.stdout:
        (name, dateStr) = queryLine.split(' ', 1)
        installDatetime = datetime.strptime(dateStr.strip(), '%a %d %b %Y %H:%M:%S %Z')
        if installDatetime < installedSince:
            break
        recentPackages.append(name)
    queryProcess.stdout.close()
    queryProcess.wait()
    if queryProcess.returncode != 0:
        print '*** ERROR (return code was ', queryProcess.returncode,  ')'
    for line in queryProcess.stderr:
        print line, 
    queryProcess.stderr.close()

# Use one rpm exec to query multiple packages - 10x faster than an exec for each one
marker = '+Package: '
markerLen = len(marker)
for subset in [recentPackages[i:i+maxArgsPerCommand] for i in range(0, len(recentPackages), maxArgsPerCommand)]:
    format = marker + '%{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}\n' + ('%{DESCRIPTION}\n\n+Changelog:\n' if showDesc else '')
    rpmProcess = subprocess.Popen(['rpm', '-q', '--queryformat=' + format, '--changelog'] + subset, shell=False, stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
    tooOld = False
    for line in rpmProcess.stdout:
        if line.startswith(marker):
            installedDate = datetime.fromtimestamp(float(line[markerLen:line.rfind(' ')]))
            name = line.rsplit(' ',  1)[1]
            print '=================================================='
            print marker,  installedDate, name, 
            print '------------------------------'
            tooOld = False
        else:
            if line.startswith('* ') and len(line) > 17:
                try:
                    changeDate = datetime.strptime(line[:line.rfind(' ')], '* %a %b %d %Y')
                    tooOld = changeDate < changedSince
                except ValueError:
                    pass # not a date - move on
            if not tooOld: 
                print line, 
    rpmProcess.stdout.close()
    rpmProcess.wait()
    if rpmProcess.returncode != 0:
        print '*** ERROR (return code was ', rpmProcess.returncode,  ')'
    for line in rpmProcess.stderr:
        print line, 
    rpmProcess.stderr.close()
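
The changelog date filter in the loop above can be tried out without rpm. The following is a rough sketch in Python 3 syntax, not the script's own code: the function name `recent_changelog_lines`, the sample entries, and the example.com address are invented for illustration, and it parses the first four fields after the `*` instead of trimming the last word as the script does.

```python
from datetime import datetime

def recent_changelog_lines(lines, changed_since):
    """Yield changelog lines, skipping entries dated before changed_since."""
    too_old = False
    for line in lines:
        if line.startswith('* '):
            fields = line.split()
            try:
                # Expected header fields: '*', weekday, month, day, year, author...
                change_date = datetime.strptime(' '.join(fields[1:5]), '%a %b %d %Y')
                too_old = change_date < changed_since
            except ValueError:
                pass  # not an entry header, keep the current state
        if not too_old:
            yield line

# Invented sample data in the usual rpm changelog layout.
sample = [
    '* Wed Jan 02 2013 dev@example.com\n',
    '- fix crash on startup\n',
    '* Fri Dec 16 2011 dev@example.com\n',
    '- initial package\n',
]

kept = list(recent_changelog_lines(sample, datetime(2012, 1, 1)))
print(''.join(kept), end='')  # only the 2013 entry and its body line
```

Once an entry header is older than the cutoff, its body lines are suppressed until the next header that parses as a date.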

3 comments:

  1. In OpenSUSE 12.2, on launching the above script using the command line:

    ./rpmChangelogs.py -i 2 xorg-x11

    I get:

    Traceback (most recent call last):
    File "./rpmChangelogs.py", line 53, in <module>
    installedDate = datetime.strptime(line[markerLen:line.rfind(' ')], '%a %d %b %Y %H:%M:%S %Z')
    File "/usr/lib64/python2.7/_strptime.py", line 325, in _strptime
    (data_string, format))
    ValueError: time data 'Wed 02 Jan 2013 11:10:40 AM CET' does not match format '%a %d %b %Y %H:%M:%S %Z'

    Replies
    1. The date format for the time string is slightly off; it should look like
      '%a %d %b %Y %H:%M:%S %p %Z'

  2. Sorry about the delay in replying - I was away for the past few weeks. Joel Matz has zeroed in on the cause of the problem. It looks like your 12.2 is formatting dates with AM/PM, whereas my 12.2 uses the 24-hour format and omits the AM/PM. I guess this is controlled by an operating system localisation setting. I have come up with a more portable solution that ignores date localisation.

    Change the script's reference to INSTALLTIME:date to just INSTALLTIME, which makes the rpm command return seconds since 1970. So format becomes:

    format = marker + '%{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}\n' + ('%{DESCRIPTION}\n\n+Changelog:\n' if showDesc else '')

    Then the installed date can be parsed by changing the installedDate assignment to

    installedDate = datetime.fromtimestamp(float(line[markerLen:line.rfind(' ')]))

    This works on my install of 12.2. I will update the script in the blog post shortly.
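
As a quick check of the idea, here is a small sketch in Python 3 syntax that parses a marker line carrying epoch seconds. The function name `parse_marker_line` and the package string `example-1.0-1` are hypothetical, and the sketch converts to UTC so the result is the same on every machine, whereas the script itself converts to local time.

```python
from datetime import datetime, timezone

MARKER = '+Package: '

def parse_marker_line(line):
    """Split a '+Package: <epoch seconds> <name>' line into (datetime, name)."""
    seconds, name = line[len(MARKER):].strip().split(' ', 1)
    # Epoch seconds carry no weekday, month name or AM/PM text, so the
    # system's date localisation cannot affect the parsing.
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc), name

installed, name = parse_marker_line('+Package: 1357121440 example-1.0-1\n')
print(installed.isoformat(), name)  # 2013-01-02T10:10:40+00:00 example-1.0-1
```

The sample value 1357121440 corresponds to 11:10:40 CET on 2 January 2013, the time shown in the traceback above.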
