Roy Schestowitz

Google Cron

What is it for?

Rather than running numerous Google searches by hand and browsing around to look up PageRank, let Perl summarise the figures in a single text file. A cron job can repeat the collection on a regular schedule.

Installation

Up to two Perl scripts are needed, depending on which figures you wish to collect from Google:

The Google Suggest script from John Bokma
The Google PageRank script by Igor Chudov of Algebra Homework Help (algebra.com), posted in alt.internet.search-engines

For the latter, one has to install the WWW::Google::PageRank module. Open a shell as root and type in:

perl -MCPAN -e shell

This starts the CPAN shell. On first use it configures the CPAN module, which then enables quick installation of new modules. Once CPAN is set up, simply type in at its prompt:

install WWW::Google::PageRank

Having got the PageRank script (gpr.pl) and the Google Suggest script (gsuggest.pl), set up a file named google, for example, and set its permissions to be executable:

chmod 700 google

The file should contain a list of tasks to run. In my case, for instance:


echo 'Schestowitz' :
perl gsuggest.pl schestow

echo 'Roy Schestowitz' :
perl gsuggest.pl roy sches

echo Schestowitz.com :
perl gpr.pl http://schestowitz.com

echo Schestowitz.com/Weblog :
perl gpr.pl http://schestowitz.com/Weblog

echo Schestowitz.com/Gallery :
perl gpr.pl http://schestowitz.com/Gallery

echo Othellomaster.com :
perl gpr.pl http://othellomaster.com

echo Harvey Tobkes :
perl gpr.pl http://tobkes.othellomaster.com

echo Daniel Sorogon :
perl gpr.pl http://www.danielsorogon.com

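Needless to mention, all files must be reachable. The google file must be on your PATH (or be invoked with its full path), and because the tasks call gsuggest.pl and gpr.pl by bare name, google must be run from the directory that holds those two scripts, or the tasks must name them by full path.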

Usage

Then, by running:

google >output.txt

all results should be put in a single text file.

Automation

A cron job such as:

50 20 * * * [YOUR_PATH]/google >[YOUR_PATH]/output.txt

will update the file every night at 20:50. If you are not familiar with cron jobs, now is a good time to learn about them. To see only what has changed between runs, i.e. the differences in the figures, one can do the following:


30 22 * * * [YOUR_PATH]/google >[YOUR_PATH]/GoogleCron/new

31 22 * * * diff [YOUR_PATH]/GoogleCron/new [YOUR_PATH]/GoogleCron/old >[YOUR_PATH]/output.txt

32 22 * * * mv [YOUR_PATH]/GoogleCron/new [YOUR_PATH]/GoogleCron/old

[YOUR_PATH]/GoogleCron/old must exist before the first run, as it is the 'template' (base) file that the first diff compares against. There is plenty of room for extending this idea.
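The three timed entries rely on each step finishing within its minute. As one possible extension, the steps can be folded into a single shell function so that they always run in order. This is a sketch only: snapshot_diff is a hypothetical name, it creates an empty base file on the first run, and it passes old before new to diff (the reverse of the entries above), so that lines marked > are the new figures.

```shell
#!/bin/sh
# snapshot_diff DIR COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, save its output as DIR/new, print the
# differences against DIR/old, then make the new snapshot the base file.
snapshot_diff() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    # First run: start from an empty base file so diff has something to read.
    [ -f "$dir/old" ] || : > "$dir/old"
    "$@" > "$dir/new" || return 1
    status=0
    diff "$dir/old" "$dir/new" || status=$?
    # diff exits with 1 when the files differ; 2 or more is a real error.
    [ "$status" -le 1 ] || return "$status"
    mv "$dir/new" "$dir/old"
}
```

If the function were saved in a file, say a hypothetical [YOUR_PATH]/snapshot.sh, a single cron entry could source it and call snapshot_diff [YOUR_PATH]/GoogleCron [YOUR_PATH]/google >[YOUR_PATH]/output.txt.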

Greedy Queries

If the above does not provide sufficient information or is too static, follow the steps below. There is a way of keeping track of SERPs, links, and the corresponding PageRank. The Perl code is very bandwidth-greedy, so it should be used with great restraint.

Get prog.pl from John Bokma
Tailor the cron job script along the lines that suit you (see example file), but please do not abuse the network capacity
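One simple way of showing restraint is to space the queries out rather than fire them back to back. The sketch below illustrates the idea only. each_with_pause is a hypothetical name, and it does not assume anything about how prog.pl is invoked: it runs whatever command or shell function it is given once per item, sleeping between runs.

```shell
#!/bin/sh
# each_with_pause SECONDS COMMAND ITEM...
# Hypothetical helper: run COMMAND once for every ITEM, pausing SECONDS
# between consecutive runs (no pause before the first or after the last).
each_with_pause() {
    delay=$1
    command=$2
    shift 2
    first=1
    for item in "$@"; do
        [ "$first" -eq 1 ] || sleep "$delay"
        first=0
        "$command" "$item"
    done
}
```

For instance, after defining rank() { perl gpr.pl "$1"; }, the call each_with_pause 60 rank http://schestowitz.com http://othellomaster.com would look up the two sites a minute apart.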

Detecting Changes in Regular Pages

You can in principle fetch a Web page every night and have changes flagged to you. The approach is similar to the one above, but it relies on wget for obtaining content.


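# cron runs each entry in its own shell, so every entry changes directory itself

36 22 * * * cd [YOUR_PATH]/Syndication/ && wget -l0 -H -t0 -nd -N -np -erobots=off http://[THE PAGE TO SYNDICATE]

37 22 * * * cd [YOUR_PATH]/Syndication/ && diff old index.html >~/Desktop/difference

38 22 * * * cd [YOUR_PATH]/Syndication/ && cp index.html old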

The file named difference on your Desktop will now contain all changes in your page(s) of choice. Simple, yet very powerful.
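A possible refinement is to touch the difference file only when the page has actually changed, so that its modification time itself tells you when something happened. The following sketch assumes wget has already fetched the page. flag_changes is a hypothetical name, and the file names passed to it are up to you.

```shell
#!/bin/sh
# flag_changes OLD NEW REPORT
# Hypothetical helper: if NEW differs from OLD, write the diff to REPORT,
# copy NEW over OLD and return 0. If nothing changed, leave REPORT untouched
# and return 1. Returns 2 if diff itself fails.
flag_changes() {
    old=$1
    new=$2
    report=$3
    # First run: compare against an empty file so the whole page is reported.
    [ -f "$old" ] || : > "$old"
    if cmp -s "$old" "$new"; then
        return 1
    fi
    # diff exits with 1 when the files differ, which is expected here.
    diff "$old" "$new" > "$report" || [ $? -eq 1 ] || return 2
    cp "$new" "$old"
}
```

Run from the Syndication directory, flag_changes old index.html ~/Desktop/difference would stand in for the separate diff and cp steps.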

Links

See the HTML syndication page, which provides a more systematic way of syndicating standard, static Web pages.

This page was last modified on April 23rd, 2005

Maintained by Roy Schestowitz