__/ [David Dyer-Bennet] on Saturday 17 September 2005 19:00 \__
> I don't mean content, I mean what technical things about the way an
> image is presented on a page will prevent them from indexing them?
> And I mean in images.google.com, NOT the basic web search.
You can use metadata to prevent crawling. You must ensure that crawlers have
no path by which they can descend to the images. Image files themselves have
little or no information associated with them (magic number et cetera?).
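A related way to cut off the crawler's path is a robots.txt rule aimed at the
image crawler, rather than page metadata. Below is a minimal sketch using
Python's standard urllib.robotparser to check what such a rule would block.
The robots.txt content, the /gallery/full/ path and the example.com URLs are
assumptions made up for illustration; Googlebot-Image is the user-agent token
of Google's image crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the image crawler out of one directory
# (the /gallery/full/ path is an assumption) and allow everything else.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /gallery/full/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot-Image", "Googlebot"):
        for path in ("/gallery/full/img1.jpg", "/gallery/thumbs/img1.jpg"):
            url = "http://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Running the same check against your own robots.txt is a quick way to confirm
that nothing there blocks the full-size images by accident.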
> No money rides on this (for me, I mean; I'm sure some people have
> business models that depend on google indexing their images), I'm just
> My online images are mostly presented in thumbnail pages (static html;
> generated from a script just once, not on-demand), with the thumbnails
> linking to a script that produces a page with the full-size image and
> other associated information (caption, tech info, navigation links to
> walk through the gallery without returning to the thumbnail;
> conventional stuff for a photo gallery).
Thumbnails will often get indexed before the full-sized image equivalents. I
think that's rather intuitive. Just as a viewer sees thumbnails first, so
will a crawler (bot).
Wait until more crawling is completed. That would be my advice. The Google
Images index gets refreshed 2 or 3 times a year, but crawling remains
persistent.
> And the full-size images mostly don't get indexed. In a few places
> where I've put up a static page with inline images, they *do* get
> indexed -- suggesting that Google is willing to index images on my
> site. And the logs show that google does spider me pretty regularly.
How big are the full-sized images? I have seen galleries with 4 MB JPEGs
that are barely compressed, if at all.
> I've been poking slowly at my problem here for several years now. I
> suspect that google is unhappy with some of the headers my script for
> the full-size image page produces. I've been playing with those,
> trying to get them as vanilla as possible...
I suggest that you don't intervene until you get definite answers. You might
be wasting valuable time and damaging your site in the process.
What is the nature of these scripts?
> ...In particular I've made
> sure that I generate a reasonable "last-modified:", and I've put in a
> "cache-control: public" (I have no reason to believe google pays
> attention to that, but a number of cache strategies don't cache
> dynamic pages unless explicitly told to, so it's the right thing for
> me to do on these pages). And that the script responds correctly to
> an if-modified-since header (and in fact I've got in the log a recent
> example of googlebot receiving a 304 response on the URL of one of
> these pages, so googlebot does send if-modified-since).
> And google clearly indexes stuff in dynamic photo albums -- it's easy
> to find examples. I've looked at the headers that Gallery, for
> example, generates, and I don't see anything "wrong" with mine, but
> there are certainly differences.
I use Gallery too and Google appears to index it properly. The mistake I
once made was to change album names (slugs). Even 4 months later, I still
get many 404s as a result. I don't think there is anything inherently wrong
with Gallery and its interaction with bots.
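On the headers mentioned in the quoted text (Last-Modified, Cache-Control:
public, and a 304 in reply to If-Modified-Since), here is a minimal sketch of
that conditional-GET logic. It is not the script under discussion, which I
have not seen; the function name and the return shape are assumptions chosen
for illustration, and it expects a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at last_modified.

    last_modified must be a timezone-aware datetime. if_modified_since is
    the raw request header value, or None when the client sent none.
    """
    # HTTP dates are expressed in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Cache-Control": "public",
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # unchanged: send no body
    return 200, headers


if __name__ == "__main__":
    changed = datetime(2005, 9, 17, 19, 0, 0, tzinfo=timezone.utc)
    print(conditional_response(changed, "Sat, 17 Sep 2005 19:00:00 GMT"))
    print(conditional_response(changed, "Fri, 16 Sep 2005 19:00:00 GMT"))
```

The unparseable-header branch deliberately falls back to a normal 200, so a
malformed request cannot hide the page from a crawler.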
> Has anybody worked through this issue and has a clear characterization
> of what Google images (images; not the main google web search) will
> and won't index? And is willing to share?
Here are a few observations:
-Google binds image descriptions (keywords), or vice versa, to the page title
and headers, probably using some word density tests and finding captions.
They also appear to be using the name (filename) of the image. I have no
evidence to suggest that the alt attribute gets used much, if at all, in
the assignment of keywords.
-Google Images prefers to return large and clear images to its users. It
does not return thumbnails too often. Moreover, it might be able to detect
the presence of smaller/larger versions and make a sensible choice,
removing duplicates in the process.
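To see which words a page offers under the first observation, here is a small
sketch that counts words from the title, the headers and each image's
filename. It is only an illustration of the observation, not Google's
algorithm, which I do not know. Reading "headers" as h1 to h3, the sample
page and all names in the code are my assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse

TEXT_TAGS = {"title", "h1", "h2", "h3"}  # assumption: "headers" means h1-h3

# Hypothetical page used only for the demonstration below.
SAMPLE_PAGE = """<html><head><title>Sunset over the lake</title></head>
<body><h1>Lake sunset gallery</h1>
<p>Body text is ignored here.</p>
<img src="/photos/lake-sunset.jpg" alt="x"></body></html>"""


class TitleAndHeadings(HTMLParser):
    """Collect text inside title/heading tags and the src of every img."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.text = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text.append(data)


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_keywords(html):
    """Map each image src to a Counter of title, heading and filename words."""
    parser = TitleAndHeadings()
    parser.feed(html)
    parser.close()
    page_words = tokenize(" ".join(parser.text))
    result = {}
    for src in parser.images:
        stem = PurePosixPath(urlparse(src).path).stem  # filename, no extension
        result[src] = Counter(page_words + tokenize(stem))
    return result


if __name__ == "__main__":
    for src, counts in candidate_keywords(SAMPLE_PAGE).items():
        print(src, counts.most_common(3))
```

The alt attribute is left out on purpose, in line with the observation that
it does not seem to count for much.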
> There's some slight indications that my latest round of script changes
> is, maybe, now something they're willing to index, but it's only been
> up a few days so no results are in the actual index yet.
Experimentation like that needs to account for many more factors. You cannot
just isolate one. It's a high-dimensional problem with many parameters
(PageRank, algorithm changes and so forth). At best, you can fall under a
self-imposed illusion.
As I said before, it may take quite a few months for the index to get
modified. The last Google Images update was around 1-2 months ago. I can
clearly remember announcing it in alt.internet.search-engines.
Hope it helps,
Roy S. Schestowitz | Proprietary cripples communication
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
1:10pm up 24 days 1:24, 3 users, load average: 0.31, 0.38, 0.72