Received: with LISTAR (v1.0.0; list gopher); Mon, 15 Jan 2001 11:15:29 -0600 (CST)
Return-Path:
Delivered-To: gopher@complete.org
Received: from stockholm.ptloma.edu (stockholm.ptloma.edu [199.106.86.50])
	by pi.glockenspiel.complete.org (Postfix) with ESMTP id 332173B949
	for ; Mon, 15 Jan 2001 11:15:28 -0600 (CST)
Received: (from spectre@localhost)
	by stockholm.ptloma.edu (8.9.1/8.9.1) id JAA08378
	for gopher@complete.org; Mon, 15 Jan 2001 09:13:34 -0800
From: Cameron Kaiser
Message-Id: <200101151713.JAA08378@stockholm.ptloma.edu>
Subject: [gopher] Re: Gopher "robots.txt" (was Re: New V-2 WAIS database)
In-Reply-To: <87lmscq5op.fsf@complete.org> from John Goerzen at "Jan 15, 1 08:57:10 am"
To: gopher@complete.org
Date: Mon, 15 Jan 2001 09:13:34 -0800 (PST)
X-Mailer: ELM [version 2.4ME+ PL39 (25)]
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-archive-position: 105
X-listar-version: Listar v1.0.0
Sender: gopher-bounce@complete.org
Errors-to: gopher-bounce@complete.org
X-original-sender: spectre@stockholm.ptloma.edu
Precedence: bulk
Reply-to: gopher@complete.org
X-list: gopher

> > Mind you, I'd be happy with any approach that works on a per-menu level,
> > just as long as the bot doesn't have to cache every server's particular
> > robot policy and can determine the policy for a selector from the menu(s)
> > that reference that selector. This is just one way I can think of.
>
> Well, let's get back to this point. I'm not sure that you must cache
> it. Assuming that a robot will traverse an entire server at once (am
> I wrong with that assumption?), it would involve only one extra
> request to ask for robots.txt before traversal. If, OTOH, servers are
> hit in a more random pattern, I can see there would be a problem.

No, they are hit randomly, so that I can nail the largest number of
selectors in a short period of time without "courtesy" waits. While a
server will get bursts of activity (as I'm sure you've noticed), after a
while the prospects list gets reshuffled and the robot works on another
host for a bit. If it perseverated on a single server, I think that
server's administrator would be entirely justified in demanding that it
pause between bursts, and that would drop efficiency.

The other problem is that the robot is not running all the time, but
only after hours, so that its small but constant network usage won't
affect other people. This is a courtesy, since the University is letting
me keep a private system on their tab. So when the robot starts back up
again, I'll either need to refetch every server's policy or have the
process cache it to disk when it dies -- again, the two things I'm
trying to avoid.

> Is it practical to cache this data then? I might be inclined to
> suggest that it is. Given that only a few of the gopher servers out
> there are both actively maintained and have a situation warranting
> this sort of treatment, perhaps it is not so problematic a situation
> after all?

True, there are very few, but I don't want to see it scale out of
control (speaking as the only current bot operator :-). I'm still
waiting for Marco to tell me those regexes so I can hard-code them.
(Psst!)

--
----------------------------- personal page: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Point Loma Nazarene University * ckaiser@stockholm.ptloma.edu
-- Generating random numbers is too important to be left to chance. -----------
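
[For concreteness, a minimal Python sketch of the one-extra-request
approach John describes: fetch a server's robots.txt once per crawl
session, before traversal, and keep the parsed rules in memory. The
selector name "robots.txt" and the Disallow: syntax are assumptions
here; the list had not settled on either at this point.]

    import socket

    _policy_cache = {}  # host -> list of disallowed selector prefixes

    def fetch_selector(host, selector, port=70, timeout=30):
        """Send one gopher request (selector + CRLF) and return the
        raw response bytes, reading until the server closes."""
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(selector.encode("ascii") + b"\r\n")
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks)

    def disallowed_prefixes(host):
        """Fetch and cache the robots policy for one host: one extra
        request per server per crawl, as John suggests."""
        if host not in _policy_cache:
            try:
                text = fetch_selector(host, "robots.txt").decode("ascii", "replace")
            except OSError:
                text = ""  # no policy served -> nothing disallowed
            rules = []
            for line in text.splitlines():
                if line.lower().startswith("disallow:"):
                    rules.append(line.split(":", 1)[1].strip())
            _policy_cache[host] = rules
        return _policy_cache[host]

    def allowed(host, selector):
        """True unless some cached Disallow prefix matches."""
        return not any(selector.startswith(p)
                       for p in disallowed_prefixes(host))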
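
[And a toy version of the burst-and-reshuffle scheduling Cameron
describes, again in Python. The dict-of-pending-selectors shape of
"prospects" and the burst size of 25 are made-up details; only the
behavior (short bursts against a random host, then move on) comes from
the message.]

    import random

    def crawl_schedule(prospects, burst=25):
        """Yield (host, selector) pairs: a short burst of work against
        one randomly chosen host, then a reshuffle, so the robot never
        perseverates on a single server."""
        hosts = [h for h in prospects if prospects[h]]
        while hosts:
            random.shuffle(hosts)
            host = hosts[0]
            for _ in range(min(burst, len(prospects[host]))):
                yield host, prospects[host].pop()
            if not prospects[host]:
                hosts.remove(host)

[A caller would combine the two sketches, skipping anything the cached
policy disallows: for host, sel in crawl_schedule(prospects): if
allowed(host, sel): fetch_selector(host, sel). The cache above is
memory-only, which is exactly the restart problem the message raises.]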