Received: with ECARTIS (v1.0.0; list gopher); Tue, 29 Nov 2005 21:07:07 -0600 (CST)
Date: Tue, 29 Nov 2005 21:03:35 -0600
From: Chris <chris@hal3000.cx>
To: gopher@complete.org
Subject: [gopher] Re: Bot update
Message-Id: <20051129210335.18b61281@work1.hal3000.cx>
In-Reply-To: <20051129232006.GP19727@complete.org>
References: <20051031034851.GA30223@katherina.lan.complete.org> <20051129232006.GP19727@complete.org>
List-Id: Gopher <gopher@complete.org>

Hi John and all,

I had thought putting it on CD for starters was a good idea, and I would like to reserve one myself if you do. I will host it here on the gopher in its entirety, but, as we spoke briefly before, it's a big chunk to search through... Perhaps Floodgap's Veronica can handle such a large database, I don't know. I cannot handle that much information in a single jughead index, and am looking at a better method to let users search through the entire database.
Some other possibilities came to mind, such as breaking it up into datasets and having various boxen here, as well as at other gophers, each maintain a dataset or sets. These could be broken up in various ways: locale, alpha/numerical, domain, or just from start to finish in chunks of a certain size. I don't know if that's the best method or not, just bouncing some ideas around.

Another point, out of concern about various spidering methods and data retrievals: if anyone is placing such a large database on a gopher, could we perhaps all make sure to put it in the same /bigdatabase dir? This way we can skip it on jughead multi-site searches or whatever, as well as putting it in robots.txt to try and minimize accidentally grabbing it on a crawl?

These were just some thoughts I had. Thanks, John, for getting it; I think it's awesome and am excited to see what we can all do with it.

Chris
gopher://hal3000.cx

On Tue, 29 Nov 2005 17:20:06 -0600
John Goerzen wrote:

> On Wed, Nov 16, 2005 at 10:04:17PM -0600, Jeff wrote:
> > On Sun, 30 Oct 2005 21:48:51 -0600, John Goerzen wrote:
> >
> > > Here's an update on the gopher bot:
> > >
> > > There is currently 28G of data archived representing 386,315
> > > documents.  1.3 million documents remain to be visited, from
> > > approximately 20 very large Gopher servers.  I believe, then, that the
> > > majority of gopher servers have been cached by this point.  3,987
> > > different servers are presently represented in the archive.
> >
> > Any news?
>
> Not really.  The bot hit a point where its algorithm for storing page
> information was getting to be too slow, and there was also a problem
> with the database layer I'm using segfaulting.  When I get some time, I
> will write a new layer.
>
> In the meantime, I'd like to talk about how to get this data to others
> that might be willing to host it, as well as how to store it out there
> for the public.  Any ideas?
>
> -- Join FSF as an Associate Member at:
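[P.S. The "start to finish in chunks of a certain size" idea above could be sketched roughly like this. This is only a hypothetical illustration, not anything from John's actual bot: the selector names, sizes, and the size cap are made-up examples.]

```python
def chunk_by_size(docs, cap):
    """Greedily split a list of (selector, size_in_bytes) pairs into
    consecutive chunks whose total size stays at or under `cap`.
    A single document larger than `cap` still gets its own chunk."""
    chunks, current, total = [], [], 0
    for selector, size in docs:
        # Start a new chunk if adding this doc would overflow the cap.
        if current and total + size > cap:
            chunks.append(current)
            current, total = [], 0
        current.append(selector)
        total += size
    if current:
        chunks.append(current)
    return chunks

# Made-up example: four documents, capped at 1000 bytes per chunk.
docs = [("0/a.txt", 700), ("0/b.txt", 200), ("0/c.txt", 900), ("0/d.txt", 50)]
print(chunk_by_size(docs, 1000))
# -> [['0/a.txt', '0/b.txt'], ['0/c.txt', '0/d.txt']]
```

Each resulting chunk could then go to a different box or gopher site, each running its own jughead index over just its share.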