Received: with LISTAR (v1.0.0; list gopher); Mon, 15 Jan 2001 11:15:29 -0600 (CST)
Return-Path:
Delivered-To: gopher@complete.org
Received: from stockholm.ptloma.edu (stockholm.ptloma.edu [199.106.86.50])
	by pi.glockenspiel.complete.org (Postfix) with ESMTP id 332173B949
	for ; Mon, 15 Jan 2001 11:15:28 -0600 (CST)
Received: (from spectre@localhost)
	by stockholm.ptloma.edu (8.9.1/8.9.1) id JAA08378
	for gopher@complete.org; Mon, 15 Jan 2001 09:13:34 -0800
From: Cameron Kaiser
Message-Id: <200101151713.JAA08378@stockholm.ptloma.edu>
Subject: [gopher] Re: Gopher "robots.txt" (was Re: New V-2 WAIS database)
In-Reply-To: <87lmscq5op.fsf@complete.org> from John Goerzen at "Jan 15, 1 08:57:10 am"
To: gopher@complete.org
Date: Mon, 15 Jan 2001 09:13:34 -0800 (PST)
X-Mailer: ELM [version 2.4ME+ PL39 (25)]
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-archive-position: 105
X-listar-version: Listar v1.0.0
Sender: gopher-bounce@complete.org
Errors-to: gopher-bounce@complete.org
X-original-sender: spectre@stockholm.ptloma.edu
Precedence: bulk
Reply-to: gopher@complete.org
X-list: gopher

> > Mind you, I'd be happy with any approach that works on a per-menu level,
> > just as long as the bot doesn't have to cache every server's particular
> > robot policy and can determine the policy for a selector from the menu(s)
> > that reference that selector. This is just one way I can think of.
>
> Well, let's get back to this point. I'm not sure that you must cache
> it. Assuming that a robot will traverse an entire server at once (am
> I wrong with that assumption?), it would involve only one extra
> request to ask for robots.txt before traversal. If, OTOH, servers are
> hit in a more random pattern, I can see there would be a problem.

No, they are hit randomly, so that I can nail the largest number of
selectors in a short period of time without "courtesy" waits. While a
server will get bursts of activity (as I'm sure you've noticed), after a
while the prospects list gets reshuffled and the robot works on another
host for a bit. If it perseverated on a single server, I think that
server's administrator would be entirely justified in demanding that it
pause between bursts, and that would drop efficiency.

The other problem is that the robot is not running all the time, but
only after hours, so that its small but constant network usage won't
affect other people. This is a courtesy, since the University is letting
me keep a private system on their tab. So when the robot starts back up
again, I'll either need to refetch every server's policy or have the
process cache it to disk when it dies -- again, the two things I'm
trying to avoid.

> Is it practical to cache this data then? I might be inclined to
> suggest that it is. Given that only a few of the gopher servers out
> there are both actively maintained and have a situation warranting
> this sort of treatment, perhaps it is not so problematic a situation
> after all?

True, there are very few, but I don't want to see it scale out of
control (speaking as the only current bot operator :-). I'm still
waiting for Marco to tell me those regexes so I can hard-code them.
(Psst!)

--
----------------------------- personal page: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Point Loma Nazarene University * ckaiser@stockholm.ptloma.edu
-- Generating random numbers is too important to be left to chance. -----------
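
[For concreteness, a minimal Python sketch of the one-extra-request
approach John describes: fetch a server's robots.txt once per crawl
session, before traversal, and keep the parsed rules in memory. The
selector name "robots.txt" and the Disallow: syntax are assumptions
here; the list had not settled on either at this point.]

    import socket

    _policy_cache = {}  # host -> list of disallowed selector prefixes

    def fetch_selector(host, selector, port=70, timeout=30):
        """Send one gopher request (selector + CRLF) and return the
        raw response bytes, reading until the server closes."""
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(selector.encode("ascii") + b"\r\n")
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks)

    def disallowed_prefixes(host):
        """Fetch and cache the robots policy for one host: one extra
        request per server per crawl, as John suggests."""
        if host not in _policy_cache:
            try:
                text = fetch_selector(host, "robots.txt").decode("ascii", "replace")
            except OSError:
                text = ""  # no policy served -> nothing disallowed
            rules = []
            for line in text.splitlines():
                if line.lower().startswith("disallow:"):
                    rules.append(line.split(":", 1)[1].strip())
            _policy_cache[host] = rules
        return _policy_cache[host]

    def allowed(host, selector):
        """True unless some cached Disallow prefix matches."""
        return not any(selector.startswith(p)
                       for p in disallowed_prefixes(host))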
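
[And a toy version of the burst-and-reshuffle scheduling Cameron
describes, again in Python. The dict-of-pending-selectors shape of
"prospects" and the burst size of 25 are made-up details; only the
behavior (short bursts against a random host, then move on) comes from
the message.]

    import random

    def crawl_schedule(prospects, burst=25):
        """Yield (host, selector) pairs: a short burst of work against
        one randomly chosen host, then a reshuffle, so the robot never
        perseverates on a single server."""
        hosts = [h for h in prospects if prospects[h]]
        while hosts:
            random.shuffle(hosts)
            host = hosts[0]
            for _ in range(min(burst, len(prospects[host]))):
                yield host, prospects[host].pop()
            if not prospects[host]:
                hosts.remove(host)

[A caller would combine the two sketches, skipping anything the cached
policy disallows: for host, sel in crawl_schedule(prospects): if
allowed(host, sel): fetch_selector(host, sel). The cache above is
memory-only, which is exactly the restart problem the message raises.]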