Received: with ECARTIS (v1.0.0; list gopher); Wed, 18 Jun 2003 08:14:34 -0500 (CDT)
From: Cameron Kaiser
Message-Id: <200306181324.GAA12136@floodgap.com>
Subject: [gopher] Veronica-2 and robot exclusion
To: gopher@complete.org
Date: Wed, 18 Jun 2003 06:24:18 -0700 (PDT)
Reply-to: gopher@complete.org
X-original-sender: spectre@floodgap.com
X-archive-position: 756
List-Id: Gopher

Let's revisit an old issue, since we're not talking about anything right now :-)

Veronica-2 is moving right along. The client libraries are now on the new server, and I'm setting up a test network of three simulated gopher servers to bounce requests off of and study autoprune and performance before unleashing it on an unsuspecting populace. I'm rewriting the crawler from scratch to get rid of some old inefficiencies in the previous version (and also to leverage better database support, which will further improve its indexing capabilities). Now's the time to get your requests in.
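For anyone following along at home, the round trip a crawler like this makes per selector is trivially small. Below is a hedged sketch (host and selector are placeholders, and the function names are mine, not Veronica-2's): a Gopher request is just the selector followed by CRLF on port 70, after which the server sends its reply and closes the connection.

```python
# Sketch of one Gopher transaction, per RFC 1436: connect to port 70,
# send the selector terminated by CRLF, read the reply until EOF.
# Host, selector, and function names here are illustrative only.
import socket

def build_gopher_request(selector: str) -> bytes:
    """A Gopher request is simply the selector string followed by CRLF."""
    return selector.encode("ascii") + b"\r\n"

def gopher_fetch(host: str, selector: str, port: int = 70,
                 timeout: float = 10.0) -> bytes:
    """Fetch one selector from a Gopher server and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_gopher_request(selector))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:        # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)

# e.g. gopher_fetch("gopher.example.org", "/robots.txt")
```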
I re-read the long thread back in January '01 where we went over how to implement a robot exclusion standard for Gopherspace, and the solution with the widest support (except from yours truly at the time) was to re-use the existing robots.txt (for your convenience, the current 1994 standard is available at gopher://gopher.floodgap.com/0/v2/robotstxt.txt ).

Now that the crawler is off the University network and on my dime, I see no reason not to let it shuffle around 24/7/365. That limits the re-fetch penalty of pulling a global robots.txt file from each gopher, and pretty much eliminates my only major complaint.

Reviewing the robots.txt spec, it seems it can be adapted for Gopherspace essentially as-is. The Veronica-2 robot's User-Agent (not that it matters much, since it's the only crawler I know of) will be "veronica-2".

A few things will be interpreted differently by this robot, and they're worth discussing. If you specify a Disallow of "/", not only will V-2 not index your gopher, it will not even mark you as an active server, and you will not appear in its server statistics list (a consequence of where robot-exclusion filtering is done when selecting new prospects and updating the global statistics table). Is this desirable/correct behaviour?

The current timeout on such files is twenty-four hours from first server access (they may be fetched slightly more frequently under extraordinary circumstances).

Also, I was thinking of naming the file ".robots.txt", since many Unix gophers don't show dot-files in menus (though they will still serve them on direct request). But there is a growing number of Windows-hosted gophers, and I don't know whether the leading dot will break them (I don't do x86 myself).

Comments requested, since I'm writing this Right This Minute (tm).

-- 
---------------------------------------- personal: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@floodgap.com
-- The early bird may get the worm, but the second mouse gets the cheese. -----------
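P.S. For concreteness, the exclusion rule discussed above — a server shutting out "veronica-2" entirely with a Disallow of "/" — looks like this under the 1994 spec, and could be interpreted by a minimal parser along these lines. This is only a sketch of how I read the standard; the function names are mine and not anything Veronica-2 actually ships.

```python
# Minimal robots.txt interpretation, assuming the 1994 spec: records are
# separated by blank lines, each holding one or more User-agent lines
# followed by Disallow lines. Disallow values are path prefixes; an empty
# Disallow permits everything. Function names are illustrative.
def parse_robots(text: str):
    """Return a list of (agents, disallows) records."""
    records, agents, disallows = [], [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        if not line:
            if agents:                            # blank line ends a record
                records.append((agents, disallows))
                agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value.lower())
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records

def allowed(text: str, agent: str, selector: str) -> bool:
    """True if `agent` may fetch `selector` under these rules."""
    for agents, disallows in parse_robots(text):
        if agent.lower() in agents or "*" in agents:
            return not any(d and selector.startswith(d) for d in disallows)
    return True    # no matching record: no restrictions apply

# The "shut out Veronica-2 completely" case from the post:
sample = """User-agent: veronica-2
Disallow: /
"""
```

With this file in place, every selector on the server is off-limits to veronica-2, while any other agent is unaffected.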