Date: Wed, 21 Dec 2005 12:40:58 -0600
From: Chris
To: gopher@complete.org
Subject: [gopher] Re: Whats all this talk about?

Ok my Data is set up differently than yours...

128.112.128.152  text/plain  39 Kb       Dec 16 20:13
128.118.88.200   text/plain  154 Kb      Dec 16 20:13
128.138.77.16    text/plain  15 Kb       Dec 16 20:13
128.143.22.55    text/plain  92 Kb       Dec 16 20:13
128.228.1.2      text/plain  74 Kb       Dec 16 20:13
128.32.112.200   text/plain  641 Kb      Dec 16 20:13
129.79.225.200   text/plain  4165 Kb     Dec 16 20:13
130.149.17.12    text/plain  8412 Kb     Dec 16 20:13
132.239.50.108   text/plain  4841 bytes  Dec 16 20:13
132.248.10.7     text/plain  3024 Kb     Dec 16 20:13
134.124.15.133   text/plain  818 Kb      Dec 16 20:13
138.247.32.12    text/plain  4603 bytes  Dec 16 20:13
140.113.209.66   unknown     1375 Kb     Dec 16 20:13
140.198.26.4     text/plain  478 bytes   Dec 16 20:13
142.32.161.61    text/plain  442 bytes   Dec 16 20:13
150.201.32.1     text/plain  1504 bytes  Dec 16 20:13
158.182.4.1      text/plain  53 Kb       Dec 16 20:13
160.94.23.22     text/plain  546 bytes   Dec 16 20:13
169.226.140.28   text/plain  3202 bytes  Dec 16 20:13
192.58.246.4     text/plain  1085 Kb     Dec 15 03:28
192.98.80.1      text/plain  4643 bytes  Dec 16 20:13
193.225.12.73    text/plain  1148 Kb     Dec 16 20:13
193.225.12.74    text/plain  209 bytes   Dec 16 20:13
198.108.1.48     text/plain  78 Kb       Dec 16 20:13
198.135.224.143  text/plain  14 Kb       Dec 16 20:13
198.151.172.33   text/plain  1413 bytes  Dec 16 20:13
198.161.91.194   text/plain  2074 bytes  Dec 15 03:28
198.30.120.11    text/plain  9656 bytes  Dec 16 20:13
199.125.85.11    text/plain  7418 bytes  Dec 16 20:13
206.80.4.10      text/plain  200 Kb      Dec 16 20:13
209.113.213.86   text/plain  92 Kb       Dec 16 20:13
209.216.94.5     text/plain  107 Kb      Dec 15 03:28
212.68.221.103   text/plain  0 bytes     Dec 15 03:28
213.237.16.246   text/plain  84 Kb       Dec 15 03:28
216.138.233.67   text/plain  8158 bytes  Dec 15 03:28
216.143.130.27   text/plain  404 bytes   Dec 16 20:13
216.99.211.113   text/plain  38 Kb       Dec 15 03:28
217.215.6.225    text/plain  1250 Kb     Dec 15 03:28
24.185.18.37     text/plain  0 bytes     Dec 15 03:28
66.159.214.138   text/plain  247 Kb      Dec 15 03:28
66.18.231.71     text/plain  330 Kb      Dec 15 03:28
67.18.92.178     text/plain  3319 bytes  Dec 15 03:28
69.21.205.10     text/plain  1967 Kb     Dec 15 03:29
69.217.43.23     text/plain  50 Kb       Dec 15 03:29
72.192.21.54     text/plain  6045 bytes  Dec 15 03:29
80.68.194.26     text/plain  64 Kb       Dec 15 03:29
80.89.239.61     text/plain  8569 bytes  Dec 15 03:29
84.139.112.5     text/plain  4064 bytes  Dec 15 03:29
Data             text/plain  11 Mb       Dec 16 20:13
Data.offset.0.5  text/plain  1225 Kb     Dec 16 20:13
Other            text/plain  5269 Kb     Dec 16 20:13

Or ll -h :

total 43486
 38K  Dec 16 20:13  128.112.128.152
153K  Dec 16 20:13  128.118.88.200
 14K  Dec 16 20:13  128.138.77.16
 92K  Dec 16 20:13  128.143.22.55
 74K  Dec 16 20:13  128.228.1.2
640K  Dec 16 20:13  128.32.112.200
  4M  Dec 16 20:13  129.79.225.200
  8M  Dec 16 20:13  130.149.17.12
  4K  Dec 16 20:13  132.239.50.108
  2M  Dec 16 20:13  132.248.10.7
817K  Dec 16 20:13  134.124.15.133
  4K  Dec 16 20:13  138.247.32.12
  1M  Dec 16 20:13  140.113.209.66
478B  Dec 16 20:13  140.198.26.4
442B  Dec 16 20:13  142.32.161.61
  1K  Dec 16 20:13  150.201.32.1
 53K  Dec 16 20:13  158.182.4.1
546B  Dec 16 20:13  160.94.23.22
  3K  Dec 16 20:13  169.226.140.28
  1M  Dec 15 03:28  192.58.246.4
  4K  Dec 16 20:13  192.98.80.1
  1M  Dec 16 20:13  193.225.12.73
209B  Dec 16 20:13  193.225.12.74
 77K  Dec 16 20:13  198.108.1.48
 14K  Dec 16 20:13  198.135.224.143
  1K  Dec 16 20:13  198.151.172.33
  2K  Dec 15 03:28  198.161.91.194
  9K  Dec 16 20:13  198.30.120.11
  7K  Dec 16 20:13  199.125.85.11
200K  Dec 16 20:13  206.80.4.10
 91K  Dec 16 20:13  209.113.213.86
106K  Dec 15 03:28  209.216.94.5
  0B  Dec 15 03:28  212.68.221.103
 83K  Dec 15 03:28  213.237.16.246
  7K  Dec 15 03:28  216.138.233.67

There are:

92219 lines of individual words attributed to which gopher they came from.
These words are derived from file names as well, _not_ full text. This is in the Data file.

62434 lines of text attributed to which gopher they came from are in the "Other" file.

95171 words are in the offset file and are only given a number; I believe this is how the partial words are assigned numeric values.

The first few times, I manually indexed each gopher one by one, based on the spiderings of my gspider (author Tim Fraser, from the list). I will be making a script to index as I do on Jughead. Each site is crawled to a set depth of directories, but not off the site (yet). So Veronica-local crawls the site but not the web; Gspider crawls link to link finding sites; and I use a simple script to hit sites, taking the lists in the About files darned near as they are.

I am using Jughead and Veronica in the same way as far as crawling goes. Of course this is for now; eventually I want Veronica to crawl the web on her own, and Jughead will be pointed by a script derived from Gspider or another spider, or my whim I guess ;).

By the way, my last Jughead data file ended up just a tiny bit smaller than Veronica's. This is because I have pulled some sites from Jughead and put them on Veronica and balanced them, so that together they should eventually cover a very large portion of gopherspace. I am right around 200,000 selectors (selectors being words attributed to a gopher or gophers that are shown as results when searched) with both. Jughead on my system can handle that alone, but since it builds the database in RAM I can't go too much farther... maybe 425-440,000 selectors. Veronica doesn't have that issue, so it looks very promising.

I am unsure if my terminology and yours are exactly the same re: selectors. In my mind a word indexed on a gopher then becomes a selector, just to clarify.
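As a rough sketch of the idea above, where words pulled from file names (not full text) are attributed to the gopher they came from, something like the following Python would do. This is not how Veronica or Jughead actually store their data; the function names, hosts and paths are all made up for illustration.

```python
import re
from collections import defaultdict


def words_from_selector(selector):
    """Split a selector path into lowercase words from its file and directory names."""
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", selector)]


def build_index(entries):
    """Map each word to the set of (host, selector) pairs it was found in.

    `entries` is an iterable of (host, selector) pairs, e.g. collected by a crawl.
    """
    index = defaultdict(set)
    for host, selector in entries:
        for word in words_from_selector(selector):
            index[word].add((host, selector))
    return index


# Hypothetical crawl results: two made-up gophers, names only, no file contents.
entries = [
    ("gopher.example.org", "/pics/Suburban/house4.jpg"),
    ("gopher.example.net", "/photos/suburban_street.gif"),
]
index = build_index(entries)

# Every gopher on which the word "suburban" turned up.
hosts = sorted({host for host, _ in index["suburban"]})
print(hosts)
```

Looking up a word in `index` lists every gopher and path where that word appeared in a name, so a single word can point at more than one gopher.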
In Veronica there is a word list nearly 100,000 lines long; each line records where the word came from. In Jughead there are about the same number of lines, but each word is from just that site.

Jughead:
ISuburban4 /Audio_Visual/Images/Suburban/Suburban4.jpg hal3000.cx

Veronica:
suburban:4 207 68200|11 125 694148|31 328 124692|

(See: one word at more than one gopher, vs. Jughead, which is word and gopher site.)

On Tue, 20 Dec 2005 16:30:50 -0800 (PST)
Cameron Kaiser wrote:

> > The box Veronica is on is a p200mmx
> > Jughead is on another p200mmx
> > Freebsd for both.
> > The list of sites is included with the
> > About_Veronica_Search text and
> > About_Multi_Search talks of Jughead
>
> Ah, thanks (*reads it*).
>
> > I am having problems with the .tree script in
> > that there is not any decent fall backs for things
> > like high latency or lost connection, there is
> > an "Alarm" sent in text and that ends the "tree-ing"
> > for that site. This may be why the results are so far
> > differing at times with yours Cameron.
>
> Is the set up actually a crawler? It's not clear to me if you're using a
> predigested index the outside sites provide, or if you're crawling it
> yourself. I'm assuming based on
>
> > I have shown which sites "Alarmed" and therefore
> > are incomplete.
> > For instance:
> > gopher.semo.edu #alarm long way in
> > that is to say after a long time and quite far
> > in the tree I received an alarm which indicates
> > one of several things, timeout, loss of connection,
> > exceeded "depth" etc.
>
> that you are crawling it yourself.
>
> > Cameron I think you are indexing more than I atm as
> > well, with my raw data being about 20M and the data
> > file being 10M with a 1M offset file and a 5M "other"
> > file.
>
> How many selectors does that translate to?
> For the record,
>
> gopher% ls -sk # in kilobytes
> total 956399
> 146408 history.MYD      3664 prospects.MYI      12 stats.frm
> 105496 history.MYI        12 prospects.frm  304968 textil.MYD
>     12 history.frm         6 stats.MYD      391104 textil.MYI
>   4696 prospects.MYD       9 stats.MYI          12 textil.frm
>
> so not quite a gig so far. Note it is not full-text.
>
> textil is the keyword and relevancy table, history is the selector/display
> string database, prospects is the workspace table and stats is cached
> precomputed statistics used for /world. This is with 1.1 million selectors,
> give or take a couple thousand, using my regular "stupid" crawler library.
>
> Mind you, this is not a competition :) I'm just curious about how you're
> getting things up and running. So far you seem to be getting pretty good
> results for an early effort, so you are to be congratulated.
>
> --
> --------------------------------- personal: http://www.armory.com/~spectre/ ---
> Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@floodgap.com
> -- Hi! I am a .signature virus. Copy me into your .signature to join in! -----
>
>

--
Join FSF as an Associate Member at: