guile-user

Re: [HELP] a search engine in GNU Guile


From: Amirouche Boubekki
Subject: Re: [HELP] a search engine in GNU Guile
Date: Sun, 04 Sep 2016 15:35:39 +0200
User-agent: Roundcube Webmail/1.1.2

Héllo,


I'd like to share with you a mini-project on the road to Culturia 0.1
[0]: a boolean keyword search engine, similar in principle to
xapian, lucene, and whoosh (with less polish and fewer features).

[0] https://github.com/amirouche/Culturia

The dependencies are wiredtiger's develop branch and the html2text CLI utility.

The main design choice in this search engine is the use of an *inverted
index* to store documents. It only works with languages that have spaces
between words, like English and French.

The procedure used to index documents is `index`. It takes a `url` as
first argument and `html` as second. In practice, `url` can be
any string reference and `html` can be plain text.

(I am not happy with the naming in `wsh.scm` in general.)

Here is the wiredtiger table definition:

```scheme
(define-public *wsh* '((urls
                        ((uid . record))
                        ((url . string))
                        ())
                       (terms
                        ((uid . record))
                        ((string . string))
                        ((inverse (string) (uid))))
                       (inverted-index
                        ((term . unsigned-integer)
                         (url-uid . unsigned-integer)
                         (position . unsigned-integer))
                        ((nothing . bytes))
                        ((positions (url-uid position) (term))))))
```

This describes the tables used to store the information the search
needs to execute its algorithm (somewhat) efficiently.

The first table, `urls`, associates a unique identifier with each URL.

`terms` associates a unique identifier with each term found in the
whole database.

When a document is indexed via `(index url html)`, `html` is
preprocessed through `html2text` and `string->tokens`. In particular,
`string->tokens` will:

- downcase everything
- remove single char words
- remove words found in the stopwords.txt file
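As a rough sketch, the steps above could look like this (this is not
the actual `string->tokens` from the repository; the stopword list here
is a tiny stand-in for the stopwords.txt file):

```scheme
(use-modules (srfi srfi-1))  ; filter

;; Stand-in for the stopwords.txt file.
(define *stopwords* '("the" "is" "a" "of" "and"))

;; Sketch of the tokenization steps: split on non-letters,
;; downcase, drop single-char words and stopwords.
(define (tokens text)
  (filter (lambda (word)
            (and (> (string-length word) 1)
                 (not (member word *stopwords*))))
          (map string-downcase
               (string-tokenize text char-set:letter))))

(tokens "The GNU Guile is a Scheme implementation")
;; → ("gnu" "guile" "scheme" "implementation")
```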

`inverted-index` is the gist of this way of building a search engine:
it associates each term with the documents where it is found, along
with its position in each of them.

Here is an example run:

```scheme
(let ((query (search/and (search/term "database")
                         (search/or (search/term "postgresql")
                                    (search/term "pgsql")))))
   (search* query))
```

This is the Scheme expression of `database AND (postgresql OR pgsql)`.
The search VM also supports negation via `search/not`.
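For instance, a query for documents mentioning "database" but not
"mysql" might look like this (a sketch reusing the same `wsh.scm`
procedures as the example above; "mysql" is just an illustrative term):

```scheme
;; Hypothetical query: database AND NOT mysql.
(let ((query (search/and (search/term "database")
                         (search/not (search/term "mysql")))))
  (search* query))
```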

Scoring uses only term frequency in the document, not tf-idf, because
that keeps the implementation simpler.
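Term-frequency scoring boils down to counting occurrences of the
matched term per document; a minimal sketch (hypothetical helper, not
the actual wsh.scm code):

```scheme
(use-modules (srfi srfi-1))  ; count

;; Term frequency: how many times `term` occurs in a document's
;; token list.
(define (tf term tokens)
  (count (lambda (token) (string=? token term)) tokens))

(tf "guile" '("gnu" "guile" "scheme" "guile"))  ;; → 2
```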

If you are interested in how search queries are translated into
database procedure calls, the entry point is `search*` obviously, but
the gist of the algorithm happens in `search/vm` [1]. Basically it
boils down to doing single-term searches via the `search` procedure
[2] and performing list-set operations like intersection and
difference. I think there is an optimisation to do here, but I did not
bother.
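Those list-set operations over the uid lists returned by single-term
search can be sketched with SRFI-1 (the helper names here are
illustrative, not the actual wsh.scm procedures):

```scheme
(use-modules (srfi srfi-1))  ; lset-intersection, lset-union, lset-difference

;; AND keeps uids present in both result lists.
(define (uids-and a b) (lset-intersection equal? a b))
;; OR merges the two result lists.
(define (uids-or a b) (lset-union equal? a b))
;; NOT removes uids matched by the negated sub-query.
(define (uids-not a b) (lset-difference equal? a b))

(uids-and '(1 2 3) '(2 3 4))  ;; → (2 3)
(uids-not '(1 2 3) '(2))      ;; → (1 3)
```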

[1] https://github.com/amirouche/Culturia/blob/wsh-inverted-index/src/wsh.scm#L106
[2] https://github.com/amirouche/Culturia/blob/wsh-inverted-index/src/wsh.scm#L66

Have a look at `wsh.scm`; at the bottom there are a few tests.

Also, the `inverted-index` table has a `positions` index on `url-uid`
and `position` to make "phrase search" possible. Even though it is not
part of the current implementation, phrase search can be implemented
thanks to this. Phrase searches are queries surrounded by double
quotes, like "what is gnu". One can implement phrase search as
follows:

- split the phrase into words

- search for the first word in the `inverted-index` table using
  `search`; this gives the `uid` of each document containing the
  first word of the phrase, along with that word's position in the
  document.

- look up the `positions` index with `uid` and `position + 1`, and
  keep the uids for which `position + 1` is the next word of the
  phrase. Repeat this step until all words of the phrase are consumed.

And you have phrase search.
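Here is a self-contained sketch of that loop over an in-memory
stand-in for the `inverted-index` table; the row layout
`(term uid position)` and the helper names are illustrative, not the
actual wsh.scm code:

```scheme
(use-modules (srfi srfi-1))  ; filter-map, delete-duplicates

;; Stand-in for the inverted-index table: (term uid position) rows.
;; Document 1 contains "what is gnu"; document 2 contains "what gnu".
(define *index*
  '(("what" 1 0) ("is" 1 1) ("gnu" 1 2)
    ("what" 2 5) ("gnu" 2 6)))

;; Single-term lookup: (uid . position) pairs for a term.
(define (hits term)
  (filter-map (lambda (row)
                (and (string=? (car row) term)
                     (cons (cadr row) (caddr row))))
              *index*))

;; Phrase search: keep only candidates where each next word of the
;; phrase appears at position + 1 in the same document.
(define (phrase-search words)
  (let loop ((candidates (hits (car words)))
             (rest (cdr words)))
    (if (null? rest)
        (delete-duplicates (map car candidates))
        (let ((next (hits (car rest))))
          (loop (filter (lambda (hit) (member hit next))
                        ;; advance each candidate to position + 1
                        (map (lambda (hit)
                               (cons (car hit) (+ (cdr hit) 1)))
                             candidates))
                (cdr rest))))))

(phrase-search '("what" "is" "gnu"))  ;; → (1)
```

Only document 1 survives, because document 2 lacks "is" between
"what" and "gnu".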

There is another thing that doesn't work right now: if you look up a
term that is not in the database, it will fail. There is a test case
for it in `wsh.scm`.

I pushed this to `wsh-inverted-index` branch:

git clone https://github.com/amirouche/Culturia --branch wsh-inverted-index

And attached the interesting files to this mail.

Regarding **Culturia**, I wish to re-code this on top of a graph
datastructure to have more flexibility in how documents are related
to terms and how terms are related to each other. Right now that is
not possible without using another table to relate terms
together. Culturia will rely on a graphdb mainly to make the
implementation simpler. The graph datastructure `grf3` [3] is
implemented on top of the ukv tuple space [4], which is a new
implementation of the uav database that makes use of wiredtigerz's
new `env` multithread support.

I also updated the documentation for `wiredtiger` and `wiredtigerz`.
Have a look at it online on GitHub [5][6].

[3] https://github.com/amirouche/Culturia/blob/474b5910a6f1a8541280f53bc6ae6b8458929e7f/src/grf3.scm
[4] https://github.com/amirouche/Culturia/blob/474b5910a6f1a8541280f53bc6ae6b8458929e7f/src/ukv.scm
[5] https://amirouche.github.io/Culturia/doc/wiredtiger.html
[6] https://amirouche.github.io/Culturia/doc/wiredtigerz.html

Happy hacking!

On 2016-08-13 17:25, Amirouche Boubekki wrote:
Héllo,


The goal of Culturia is to create a framework that makes it easy
to tap into Natural Language Understanding algorithms (and NLP)
and to provide an interface for common tasks.

Culturia is an intelligence augmentation software.

Its primary interface is a search engine. Another important aspect
of the project is that it aims to be usable offline; as such, it will
come with infrastructure to dump, load, and store datasets for offline use.

The current state of the project can be described as a big ball of mud.
There is a tiny search engine with crawling skills and that's basically
all of it.

The immediate changes that should happen are in order of preference:

- offline stackoverflow (cf. sotoki.scm) and use the generated
  website to create a zim for kiwix [0]. This is a great occasion
  to show how great GNU Guile is!
- port whoosh/lucene to guile to improve text search
- offline hackernews, wikidata, wikipedia, wiktionary
- implement BM25f

Culturia is a reference to _Culture and Empire_ by Pieter Hintjens.

Its sparse documentation is available online [1].
It's hosted on GitHub [2] (this can change if contributors
don't want to use github).

The TODO list is big, here is some stuff that needs to be done:

- finish GrammarLink bindings
- create sophia [3] bindings
- implement TextRank
- implement PageRank
- create a GUI using sly or html
- explore ways to easily share the database among several processes

And many other things! Newbies are accepted obviously!

Send me a mail or use #guile @ irc.freenode.net, I am amz3.


Happy hacking!


[0] http://www.kiwix.org/wiki/Main_Page
[1] https://amirouche.github.io/Culturia/doc/
[2] https://github.com/amirouche/Culturia
[3] http://sophia.systems/

--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

Attachment: stopwords.en.txt
Description: Text document

Attachment: text.scm
Description: Text document

Attachment: wiredtiger.scm
Description: Text document

Attachment: wiredtigerz.scm
Description: Text document

Attachment: wsh.scm
Description: Text document

