From: Cameron Kaiser
Message-Id: <200510130523.WAA09612@floodgap.com>
Subject: [gopher] Re: New Gopher Wayback Machine Bot
In-Reply-To: <20051013025233.GA26984@katherina.lan.complete.org> from John Goerzen at "Oct 12, 5 09:52:33 pm"
To: gopher@complete.org
Date: Wed, 12 Oct 2005 22:23:09 -0700 (PDT)

> > > Cameron, floodgap.com seems to have some sort of rate limiting and keeps
> > > giving me a Connection refused error after a certain number of documents
> > > have been spidered.
> >
> > I'm a little concerned about your project since I do host a number of large
> > subparts which are actually proxied services, and I think even a gentle bot
> > going methodically through them would not be pleasant for the other side
> > (especially if you mean to regularly update your snapshot).
>
> Valid concern.
> I had actually already marked your site off-limits
> because I noticed that.  Incidentally, your robots.txt doesn't seem to
> disallow anything -- might want to take a look at that ;-)

I know ;) It's because Veronica-2 won't harm the proxied services due to the
way it operates. However, I should be able to accommodate other bots that may
be around or come on board, so I'll rectify this.

> > I do support robots.txt, see
> >
> > gopher.floodgap.com/0/v2/help/indexer
>
> Do you happen to have the source code for that available?  I've got
> some questions for you that it could explain (or you could), such as:
>
> 1. Which would you use?  (Do you expect URLs to be HTTP-escaped?)
>
> Disallow: /Applications and Games
> Disallow: /Applications%20and%20Games
>
> 2. Do you assume that all Disallow patterns begin with a slash as they
> do in HTML, even if the Gopher selector doesn't?
>
> 3. Do you have any special code to handle the UMN case where
> 1/foo, /foo, and foo all refer to the same document?
>
> I will be adding robots.txt support to my bot and restarting it shortly.

It does not understand URL escaping, only literal selectors, so of the two
lines in #1 it is the unescaped one that would match. In the case of #2/#3,
well, maybe it would be better just to post the relevant code. It should be
relatively easy to understand (in Perl, from the V-2 iteration library).
$psr is the persistent state hash reference, and key "xcnd" holds the
selectors (as hash keys) generated from Disallow: lines with User-agent:
veronica or *.

	# filter on exclusions
	my %excludes = %{ $psr->{"$host:$port"}->{"xcnd"} };
	my $key;
	# shortest patterns first
	foreach $key (sort { length($a) <=> length($b) } keys %excludes) {
		# excluded on an exact match, a match once a trailing slash is
		# added to the selector, or a prefix match against a pattern
		# that itself ends in a slash
		return (undef, undef, undef, undef, undef,
			'excluded by robots.txt', 1)
				if ($key eq $sel || $key eq "$sel/" ||
					($key =~ m#/$# &&
					 substr($sel, 0, length($key)) eq $key));
	}

As you can see from here, the variants in #3 (1/foo, /foo and foo) would need
to be specified separately, since other servers might not treat them the same.
Likewise for #2, patterns are compared against the selector as-is, with no
leading slash assumed.
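
For anyone who wants to experiment with the rule, here is a minimal
self-contained sketch of it. This is not the V-2 library itself: the sub names
parse_disallows and is_excluded, the simplified robots.txt parsing and the
sample data are all assumptions made up for illustration. Only the comparison
inside is_excluded mirrors the snippet above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from V-2): collect the Disallow selectors that
# apply to $agent or to "*" from the text of a robots.txt file.
# Selectors are kept literally; no URL unescaping is attempted.
sub parse_disallows {
    my ($text, $agent) = @_;
    my %excludes;
    my $applies   = 0;    # does the current record apply to us?
    my $in_agents = 0;    # are we inside a run of User-agent: lines?
    foreach my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comment and blank lines
        if ($line =~ /^User-agent:\s*(.*?)\s*$/i) {
            my $name = lc($1);
            $applies   = 0 unless $in_agents;    # a new record starts here
            $in_agents = 1;
            $applies   = 1 if $name eq '*' || $name eq lc($agent);
        }
        elsif ($line =~ /^Disallow:\s*(.*?)\s*$/i) {
            $in_agents = 0;
            $excludes{$1} = 1 if $applies && length($1);
        }
        else {
            $in_agents = 0;
        }
    }
    return \%excludes;
}

# Same comparison as the posted snippet: exact match, match after adding a
# trailing slash to the selector, or prefix match against a pattern that
# itself ends in a slash.
sub is_excluded {
    my ($sel, $excludes) = @_;
    foreach my $key (sort { length($a) <=> length($b) } keys %$excludes) {
        return 1
            if $key eq $sel
            || $key eq "$sel/"
            || ($key =~ m{/$} && substr($sel, 0, length($key)) eq $key);
    }
    return 0;
}

# Made-up sample data, for illustration only.
my $robots = <<'END';
User-agent: *
Disallow: /Applications and Games
Disallow: /proxy/
END

my $ex = parse_disallows($robots, 'veronica');
foreach my $sel ('/Applications and Games', '/Applications%20and%20Games',
                 '/proxy/news/today', '/proxy', 'proxy/news') {
    printf "%-30s %s\n", $sel, is_excluded($sel, $ex) ? 'excluded' : 'allowed';
}
```

With this sample, the escaped form and the slashless "proxy/news" come back as
allowed, which is the point of #1 through #3: each variant a server accepts
needs its own Disallow: line. Note also that a pattern without a trailing slash
only matches exactly, so "/Applications and Games" does not cover anything
beneath it.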
-- 
---------------------------------- personal: http://www.armory.com/~spectre/ --
  Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@floodgap.com
-- An apple every eight hours will keep three doctors away. -------------------