Roy Schestowitz

Google Cron

What is it for?

Rather than running numerous Google searches and browsing around to check PageRank, let Perl collect the figures into a single text file. Setting up a cron job repeats the collection on a schedule.

Installation

Two Perl scripts are involved, the Google Suggest script (gsuggest.pl) and the PageRank script (gpr.pl); whether you need both depends on which figures you wish to collect from Google.

For the latter, one has to install the WWW::Google::PageRank module. Open a shell as root and type in:

perl -MCPAN -e shell

This starts the CPAN shell. On first use it walks you through configuring CPAN, after which new modules can be installed quickly. At the CPAN prompt, simply type in:

install WWW::Google::PageRank
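
Before going further it can help to confirm that the module actually loads. The following is a minimal sketch, not part of the original setup; the function name have_perl_module is made up for this illustration. It relies only on perl's -M switch, which loads a module before running the (empty) program given with -e.

```shell
#!/bin/sh
# have_perl_module NAME: succeed if perl can load the module NAME.
# "perl -MNAME -e 1" loads the module and then runs a do-nothing program,
# so the exit status tells us whether the module is available.
have_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}

if have_perl_module WWW::Google::PageRank; then
    echo "WWW::Google::PageRank is installed"
else
    echo "WWW::Google::PageRank is missing; install it from the CPAN shell"
fi
```

The same check works for any other module a script depends on.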

Once you have the PageRank script (gpr.pl) and the Google Suggest script (gsuggest.pl), create a file named, for example, google and make it executable:

chmod 700 google

The file should contain a list of tasks to run. In my case, for instance:

echo 'Schestowitz' :
perl gsuggest.pl schestow

echo 'Roy Schestowitz' :
perl gsuggest.pl roy sches



echo Schestowitz.com :
perl gpr.pl http://schestowitz.com

echo Schestowitz.com/Weblog :
perl gpr.pl http://schestowitz.com/Weblog

echo Schestowitz.com/Gallery :
perl gpr.pl http://schestowitz.com/Gallery

echo Othellomaster.com :
perl gpr.pl http://othellomaster.com

echo Harvey Tobkes :
perl gpr.pl http://tobkes.othellomaster.com

echo Daniel Sorogon :
perl gpr.pl http://www.danielsorogon.com

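Note that perl looks for gsuggest.pl and gpr.pl relative to the current directory, not along the PATH, so either run google from the directory that holds them or give their full paths in the file (perl -S is the option that makes perl search the PATH). The google file itself must be on the PATH or be invoked by its path.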
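
The task file repeats the same echo/perl pair for every site. A minimal sketch of the same idea as a loop follows, assuming gpr.pl prints its result to standard output. The function names report_rank and report_all and the RANK_CMD variable are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical variable: the command that prints the PageRank of the URL
# passed as its last argument. Defaults to the gpr.pl call used in the task file.
RANK_CMD=${RANK_CMD:-"perl gpr.pl"}

# report_rank LABEL URL: print "LABEL :" and then whatever the rank command prints.
report_rank() {
    echo "$1 :"
    $RANK_CMD "$2"    # unquoted on purpose so "perl gpr.pl" splits into words
}

# report_all: read "LABEL URL" lines from standard input, one pair per line.
# The URL is the last word on the line; everything before it is the label.
report_all() {
    while read -r line; do
        [ -n "$line" ] || continue    # skip blank lines
        url=${line##* }
        label=${line% *}
        report_rank "$label" "$url"
    done
}
```

Fed with lines such as "Harvey Tobkes http://tobkes.othellomaster.com", for instance from a sites file redirected into report_all, it prints the same label-then-value pairs as the hand-written list, and adding a site becomes a one-line change.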

Usage

Then run:

google >output.txt

to collect all results in a single text file.

Automation

A cron job such as:

50 20 * * * [YOUR_PATH]/google >[YOUR_PATH]/output.txt

will update the file every evening at 20:50. If you are not familiar with cron jobs, now is a good time to read up on them. To see only what has changed between runs, i.e. which numbers have moved, one can do the following:

30 22 * * * [YOUR_PATH]/google >[YOUR_PATH]/GoogleCron/new
31 22 * * * diff [YOUR_PATH]/GoogleCron/old [YOUR_PATH]/GoogleCron/new >[YOUR_PATH]/output.txt
32 22 * * * mv [YOUR_PATH]/GoogleCron/new [YOUR_PATH]/GoogleCron/old


[YOUR_PATH]/GoogleCron/old must exist before the first run, as the 'template' (base) file to compare against; running google once and saving its output there is enough. There is plenty of room to extend this idea.
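
The three cron entries rely on each step finishing within its minute. A minimal sketch of a single script that runs the steps strictly in order instead is shown here. The function name report_changes and the file name changes are assumptions of this sketch, not part of the original setup.

```shell
#!/bin/sh
# report_changes DIR COMMAND [ARGS...]
# Runs COMMAND, saves its output as DIR/new, writes the differences from the
# previous run (DIR/old) to DIR/changes, then makes the new output the baseline.
report_changes() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    [ -f "$dir/old" ] || : > "$dir/old"    # first run: start from an empty baseline
    "$@" > "$dir/new" || return 1          # keep the old baseline if the command fails
    diff "$dir/old" "$dir/new" > "$dir/changes"
    status=$?                              # diff: 0 = identical, 1 = differences, 2 or more = trouble
    [ "$status" -le 1 ] || return "$status"
    mv "$dir/new" "$dir/old"
}
```

Called as report_changes [YOUR_PATH]/GoogleCron [YOUR_PATH]/google from a script that a single cron entry runs, it does the work of the three timed entries, and a failed run leaves the previous baseline untouched.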

Greedy Queries

If the above does not provide sufficient information or is too static, follow the steps below. There is a way of keeping track of SERPs, links, and the corresponding PageRank. The Perl code is very bandwidth-greedy, so it should be used with great restraint.

  • Get prog.pl from John Bokma
  • Tailor the cron job script along the lines that suit you (see example file), but please do not abuse the network capacity

Detecting Changes in Regular Pages

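You can in principle fetch a Web page every night and have changes flagged to you. The approach is similar to the one above, but it relies on wget for obtaining content. Each cron entry runs in its own shell, so a cd on a line of its own has no effect on the entries that follow; every entry below therefore changes directory itself.

36 22 * * * cd [YOUR_PATH]/Syndication/ && wget -l0 -H -t0 -nd -N -np -erobots=off http://[THE PAGE TO SYNDICATE]
37 22 * * * cd [YOUR_PATH]/Syndication/ && diff old index.html >~/Desktop/difference
38 22 * * * cd [YOUR_PATH]/Syndication/ && cp index.html old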

The file named difference on your Desktop will now contain all changes in your page(s) of choice. Simple, yet very powerful.

Links

See the HTML syndication page, which provides a more systematic way of syndicating standard, static Web pages.


This page was last modified on April 23rd, 2005 Maintained by Roy Schestowitz