Roy Schestowitz

HTML Syndication

Introduction

Many of us have used or heard of feeds, also referred to as RSS or XML. Content can be delivered in large batches from various sources and then fed, in digestible form, to the user who wishes to keep track of changes across the Internet. The ability to track changes on (or syndicate) a Web site has thus far been limited to sites that provide feeds. Although feeds are a rising trend, many small-scale, simple sites cannot afford the extra complexity. The only apparent alternative is to repeatedly navigate to the same site and attempt, hard as it may be, to identify changes, additions and news.

This page introduces a way of converting pages into pseudo-feeds, all for the convenience of the user.

Examples of Practical Uses

  • Keeping track of on-line seminar schedules that progressively extend without notification
  • Checking the number of downloads, which can then be broken down to give a daily count
  • Checking for feedback or replies

All of the above is, in principle, very reminiscent of the advantages of using feeds (RSS being the technical term).

Tools

The method that follows allows any static Web page (or a page that is generated on-the-fly but cannot be syndicated) to be tracked. The idea is a rather simple one: get a copy of the page every day (or hour, or week) and compare it against a previously saved copy. The power of this method comes from two powerful *NIX tools:

  • wget: downloads pages from a command-line interface
  • diff: checks, in a rather sophisticated way, for differences between two files

With the report of changes -- that is, the output of the diff command -- appearing on the Desktop (or whichever location is more suitable) every morning, the user remains in 'passive mode', waiting for changes in sites to be flagged rather than re-visiting them.
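As a minimal sketch of this idea, using a placeholder address and file names rather than anything from the actual script, the core of the process can be expressed as:

wget -q -O new_copy.html http://www.example.com/news/
diff old_copy.html new_copy.html
mv new_copy.html old_copy.html

The first command fetches the current copy of the page, the second prints any lines that differ from the copy saved on the previous run, and the third makes the fresh copy the reference for the next comparison.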

Code

As an example, a script is included which scans the Manchester University Web site for news developments.

Here is a step-by-step explanation of what the script does (also see the notes within the file):

cd /home/roy/Main/Programs/Syndication/

     Explanation: Go to the location of the script, where much of the file handling will be done

echo '' > /home/roy/Desktop/NEW.txt

     Explanation: Clean up the file that summarises previous changes

OLD=mu_old

     Explanation: The file name of the old version of the page

NEW=mu.txt

     Explanation: The file name of the newer version of the page

SITE=http://www.manchester.ac.uk/press

     Explanation: The full address of the page to syndicate

FILENAME=index.html

     Explanation: The file name that is expected to be downloaded from the address above

wget -l0 -H -t0 -nd -N -np $SITE

     Explanation: Download the page (file) from the supplied address; among the options, -t0 allows unlimited retries, -nd avoids creating a directory hierarchy and -N fetches the file only if it is newer than the local copy

mv $FILENAME $NEW

     Explanation: Rename the downloaded file so that its name becomes more meaningful (and avoids name clashes on subsequent runs)

diff $OLD $NEW >/home/roy/Desktop/$NEW

     Explanation: Check for differences between the old and the new version and output the result to the Desktop

mv $NEW $OLD

     Explanation: The version of the page that has just been fetched is now considered old. This ensures that the next run of the script will have a notion of 'old' data.

echo $NEW':' >> /home/roy/Desktop/NEW.txt

     Explanation: Write the name of the file in the summary (changes index)

test -s /home/roy/Desktop/$NEW

     Explanation: Test whether the diff output has any content; the exit status is 0 if changes were found and 1 if the file is empty (no change)

echo $? >> /home/roy/Desktop/NEW.txt

     Explanation: Record the exit status of the test in the summary (0 indicates that changes were found)

echo '' >> /home/roy/Desktop/NEW.txt
echo '' >> /home/roy/Desktop/NEW.txt
echo '' >> /home/roy/Desktop/NEW.txt

     Explanation: Insert a few blank lines to separate the results of different syndicated pages

Further installation and customisation instructions are contained in the example file (see the top part). By default, files will be delivered to the Desktop and empty files (indicating no changes) will be erased. They are accompanied by an index file which indicates (in Boolean form) where changes have occurred, thereby providing some overview.
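Assembled into a single file, the commands above amount to a script along the following lines (a sketch only; the actual example file contains further notes and may differ in details):

#!/bin/sh
# Move to the directory where old copies of the page are kept
cd /home/roy/Main/Programs/Syndication/
# Start a fresh summary of changes
echo '' > /home/roy/Desktop/NEW.txt

# Old copy, new copy, page address and the name of the downloaded file
OLD=mu_old
NEW=mu.txt
SITE=http://www.manchester.ac.uk/press
FILENAME=index.html

# Fetch the page and give it a more meaningful name
wget -l0 -H -t0 -nd -N -np $SITE
mv $FILENAME $NEW

# Report differences to the Desktop, then treat the new copy as the old one
diff $OLD $NEW > /home/roy/Desktop/$NEW
mv $NEW $OLD

# Record the page name and whether changes were found (0 means yes)
echo $NEW':' >> /home/roy/Desktop/NEW.txt
test -s /home/roy/Desktop/$NEW
echo $? >> /home/roy/Desktop/NEW.txt
echo '' >> /home/roy/Desktop/NEW.txt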

Setting up the Cron Job

Once the script has been customised, its invocation needs to become part of a cron job so that it runs repeatedly at a pre-set interval.

For example, take the following job:

39 22 * * * /[path]/get_pages_changes.sh

This will check for differences every night at 10:39 PM. The differences in the pages of interest will then appear on the Desktop the following morning. Once read, they can be discarded (read: deleted) to avoid desktop clutter.
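For reference, the five leading fields of a crontab entry specify the minute, hour, day of month, month and day of week at which the command runs, with an asterisk meaning 'any':

# m   h   dom mon dow  command
39   22   *   *   *    /[path]/get_pages_changes.sh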

The process of setting up a cron job, if you have not done so before, comprises the following simple steps:

  • Set up a file named .cron in your home directory
  • Add the following code to the file:
SHELL=/bin/sh
PATH=/usr/local/bin:/bin:/usr/bin

39 22 * * * /[path]/get_pages_changes.sh
  • Enable the cron job using the following command: crontab ~/.cron
  • To verify that the job has been added to the list of jobs, type crontab -l and identify the relevant shell script in the output.

If you are still not comfortable with cron jobs and could not set up the job above, see one of the many tutorials available. The last step listed above confirms that the job will run overnight. As the example (template) script states, make sure the script is executable and accessible to cron; the permissions on its path must not be too restrictive, and the defaults will usually be fine.

Note: The cron job will overwrite the reports on each run, i.e. every 24 hours in the example above. It is therefore desirable not to set the interval too low, e.g. to 1 hour. There are ways of extending this script to retain a whole stack of changes over the course of the day, but these are harder to keep track of as they become scattered. Frequent runs also hammer the sites, which may not welcome wget access.
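One possible extension, sketched here only as an illustration, is to append a timestamp to the name of each report so that earlier reports are kept rather than overwritten:

diff $OLD $NEW > /home/roy/Desktop/$NEW.$(date +%Y%m%d-%H%M)

As noted above, the resulting pile of files is harder to keep track of, so this is a trade-off rather than an outright improvement.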

Links

Also see the Google cron page. It provides tools and details on how to keep track of positions in Google, PageRank, etc.


This page was last modified on June 28th, 2005. Maintained by Roy Schestowitz.