guile-user

Re: What's next with culturia search engine? (and guile-wiredtiger)


From: Amirouche Boubekki
Subject: Re: What's next with culturia search engine? (and guile-wiredtiger)
Date: Sun, 14 Jan 2018 11:05:29 +0100
User-agent: Roundcube Webmail/1.1.2

On 2018-01-14 09:12, Catonano wrote:
2017-11-26 23:33 GMT+01:00 Amirouche Boubekki
<address@hidden>:

The querying engine will first compute the frequency of both
keywords and then look up the inverted index for the least
frequent keyword.

The least frequent keyword ?

Not the most frequent keyword ?

Yes. Imagine you search for serif+font: the most common
word, and so the least discriminating one, is "font", because
there are (I think) more pages containing "font".

The result of the inverted index lookup above is used as the seed
for the rest of the algorithm, which is O(n), so I need to
minimize 'n', i.e. the count of initial documents.
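
To make the seed selection concrete, here is a minimal sketch in Guile.
It is not the culturia code: the in-memory alist INDEX and the procedures
'frequency', 'order-by-frequency' and 'seed-documents' are hypothetical
names made up for illustration, and the real index lives in wiredtiger.

```scheme
;; Hypothetical in-memory inverted index: keyword -> list of document ids.
;; In this toy data "font" appears in more documents than "serif".
(define index
  '(("font"  . (1 2 3 4 5))
    ("serif" . (2 5))))

;; Number of documents that contain KEYWORD (0 when the keyword is unknown).
(define (frequency index keyword)
  (length (or (assoc-ref index keyword) '())))

;; Return QUERY sorted so that the least frequent keyword comes first.
(define (order-by-frequency index query)
  (sort query
        (lambda (a b)
          (< (frequency index a) (frequency index b)))))

;; The seed is the list of documents containing the least frequent keyword.
(define (seed-documents index query)
  (let ((ordered (order-by-frequency index query)))
    (or (assoc-ref index (car ordered)) '())))

(seed-documents index '("font" "serif")) ; => (2 5)
```

Starting from "serif" gives a seed of two documents instead of five, so
the O(n) pass that follows has less work to do.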


That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
rest of the query for instance. Something like:

(filter (hit? (cdr query)) seed)

Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure at the bottom,
called 'search' [4], is where the code starts.

This is badly explained.  At this point SEED contains the unique
identifiers of the documents that contain the least frequent word.
We remove that word from the query, hence the (cdr query), and filter
the SEED with the rest of the query. This is a small optimization:
we already know that the least frequent word is in every
document found in the SEED, so we do not need to check its
presence again. 'hit?' will return some kind
of state machine that checks whether a given document matches
the QUERY passed as argument.

That is what I mean to do; the (cdr query) step that removes the
most discriminating query term is not implemented yet.
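
As an illustration of the intended filtering step, here is a minimal
sketch in Guile. It assumes a hypothetical in-memory alist DOCUMENTS
mapping document identifiers to word lists, and it models 'hit?' as a
plain closure rather than a state machine; the procedure 'search' below
is a toy stand-in, not the one in ix.scm.

```scheme
(use-modules (srfi srfi-1)) ; for 'every'

;; Hypothetical document store: document id -> list of words.
(define documents
  '((2 . ("a" "serif" "font" "guide"))
    (5 . ("serif" "font" "bold"))))

;; Return a predicate that is true when the document identified by UID
;; contains every keyword in KEYWORDS.
(define (hit? keywords)
  (lambda (uid)
    (let ((words (or (assoc-ref documents uid) '())))
      (every (lambda (keyword) (member keyword words)) keywords))))

;; QUERY is assumed to be sorted with the least frequent keyword first,
;; and SEED to hold the documents that contain that keyword, so only
;; the remaining keywords, (cdr query), need to be checked.
(define (search query seed)
  (filter (hit? (cdr query)) seed))

(search '("serif" "bold" "font") '(2 5)) ; => (5)
```

Document 2 is dropped because it lacks "bold"; "serif" itself is never
re-checked, which is the small optimization described above.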


[4]

https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
[8]

file not found


It's here: https://github.com/a-guile-mind/culturia.one/blob/master/src/ix.scm#L439

I reworked the thing to use the grf3 graph abstraction to store
the documents.

Also guile-wiredtiger 0.6.4 is in guix.



All this looks pretty interesting but I have to say that I prefer the
work you're doing on GNUNet ;-)

Thanks for your interest!


