From: Amirouche Boubekki
Subject: Re: [HELP] a search engine in GNU Guile
Date: Sun, 04 Sep 2016 15:35:39 +0200
User-agent: Roundcube Webmail/1.1.2
Héllo,

I'd like to share with you a mini-project on the road to Culturia 0.1 [0], a boolean keyword search engine (similar in principle to xapian, lucene and whoosh, with less polish and fewer features).

[0] https://github.com/amirouche/Culturia

The dependencies are the wiredtiger develop branch and the html2text cli utility.
The main design choice in this search engine is the use of an *inverted index* to store documents. It only works with languages that have spaces between words, like English and French.

The procedure used to index documents is `index`. It takes a `url` as first argument and `html` as second argument. Actually `url` can be any string reference and `html` can be plain text. (I am not happy with the naming in `wsh.scm` in general.)

Here is the wiredtiger table definition:

```scheme
(define-public *wsh*
  '((urls
     ((uid . record))
     ((url . string))
     ())
    (terms
     ((uid . record))
     ((string . string))
     ((inverse (string) (uid))))
    (inverted-index
     ((term . unsigned-integer)
      (url-uid . unsigned-integer)
      (position . unsigned-integer))
     ((nothing . bytes))
     ((positions (url-uid position) (term))))))
```

This describes the tables used to store the information the search needs to execute its algorithm (somewhat) efficiently. The first table, `urls`, associates a unique identifier with each url. `terms` associates a unique identifier with each term found in the whole database.

When a document is indexed via `(index url html)`, `html` is preprocessed through `html2text` and `string->tokens`. In particular, `string->tokens` will:

- downcase everything
- remove single-char words
- remove words found in the stopwords.txt file

`inverted-index` is the gist of this way of building a search engine: it associates each term with the documents where it's found, along with its position.

Here is an example run:

```scheme
(let ((query (search/and (search/term "database")
                         (search/or (search/term "postgresql")
                                    (search/term "pgsql")))))
  (search* query))
```

Which is the scheme expression of `database AND (postgresql OR pgsql)`. The search vm also supports negation via `search/not`. Scoring is done using only term frequency in the document, not tf-idf, because it makes the implementation simpler.
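The tokenization steps described above can be sketched like this (a minimal sketch: the stopword list here is a tiny illustrative excerpt, and the exact splitting in `wsh.scm` may differ):

```scheme
;; Sketch of the preprocessing: downcase, drop single-char words,
;; drop stopwords. The stopword list below is made up for illustration;
;; the real list lives in stopwords.txt.
(define stopwords '("the" "and" "of" "to" "is"))

(define (string->tokens text)
  (filter (lambda (token)
            (and (> (string-length token) 1)       ; remove single-char words
                 (not (member token stopwords))))  ; remove stopwords
          (map string-downcase                     ; downcase everything
               (string-split text #\space))))

;; (string->tokens "The quick and the Dead") => ("quick" "dead")
```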
If you are interested in how search queries are translated to database procedure calls, the entry point is `search*` obviously, but the gist of the algorithm happens in `search/vm` [1]. Basically it boils down to doing single-term searches via the `search` procedure [2] and doing list set operations like intersection and difference. I think there is an optimisation to do here but I did not bother.

[1] https://github.com/amirouche/Culturia/blob/wsh-inverted-index/src/wsh.scm#L106
[2] https://github.com/amirouche/Culturia/blob/wsh-inverted-index/src/wsh.scm#L66
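To make the set-operation idea concrete, here is a minimal sketch using SRFI-1 list operations; the uid lists are made-up values for illustration and the code does not reuse the actual `search/vm` internals:

```scheme
(use-modules (srfi srfi-1))

;; Pretend these are the document uids returned by single-term
;; `search` calls; the values are made up.
(define database-uids   '(1 2 3 5))
(define postgresql-uids '(2 3 7))
(define pgsql-uids      '(5 9))

;; database AND (postgresql OR pgsql)
(define hits
  (lset-intersection =
                     database-uids
                     (lset-union = postgresql-uids pgsql-uids)))

;; hits => (2 3 5)
```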
Have a look at `wsh.scm`: at the bottom there are a few tests.

Also, the `inverted-index` table has a `positions` index on `url-uid` and `position` to make it possible to do "phrase search". Even if it's not part of the current implementation, it is possible to implement phrase search thanks to this. Phrase searches are queries surrounded by double quotes, like "what is gnu". One can implement phrase search as follows:

- split the phrase into words
- search the first word in the `inverted-index` table using `search`; this will give the `uid` of each document containing the first word of the phrase, along with its `position` in that document
- look up the `positions` index with `uid` and `position + 1`, and keep the uids for which `position + 1` is the next word of the phrase. Repeat that step until all words of the phrase are consumed.

And you have phrase search.

There is another thing that doesn't work right now: if you look up a term that is not in the database, it will fail. There is a test case for it in `wsh.scm`.

I pushed this to the `wsh-inverted-index` branch:

```
git clone https://github.com/amirouche/Culturia --branch wsh-inverted-index
```
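The phrase-search steps above can be sketched as follows; `term-positions` is a hypothetical helper returning `(uid . position)` pairs for a term, and is not part of `wsh.scm`:

```scheme
(use-modules (srfi srfi-1))

;; term-positions is assumed to return a list of (uid . position)
;; pairs for a given term; it is hypothetical.
(define (phrase-search words)
  (let loop ((candidates (term-positions (car words)))  ; seed with first word
             (rest (cdr words)))
    (if (null? rest)
        (delete-duplicates (map car candidates))
        (loop
         ;; keep a candidate only if the next word of the phrase occurs
         ;; at position + 1 in the same document
         (filter-map
          (lambda (candidate)
            (let ((uid (car candidate))
                  (next (+ (cdr candidate) 1)))
              (and (member (cons uid next)
                           (term-positions (car rest)))
                   (cons uid next))))
          candidates)
         (cdr rest)))))
```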
And attached the interesting files to this mail.

Regarding **Culturia**, I wish to re-code this on top of a graph datastructure to have more flexibility regarding how documents are related to terms and how terms are related together. Right now that's not possible without using another table to relate terms together. Culturia will rely on a graphdb mainly to make the implementation simpler. The graph datastructure `grf3` [3] is implemented on top of the ukv tuple space [4], which is a new implementation of the uav database that makes use of wiredtigerz's new `env` multithread support.

I also updated the documentation regarding `wiredtiger` and `wiredtigerz`.
Have a look at it online on github [5][6].

[3] https://github.com/amirouche/Culturia/blob/474b5910a6f1a8541280f53bc6ae6b8458929e7f/src/grf3.scm
[4] https://github.com/amirouche/Culturia/blob/474b5910a6f1a8541280f53bc6ae6b8458929e7f/src/ukv.scm
[5] https://amirouche.github.io/Culturia/doc/wiredtiger.html
[6] https://amirouche.github.io/Culturia/doc/wiredtigerz.html

Happy hacking!

On 2016-08-13 17:25, Amirouche Boubekki wrote:
Héllo,

The goal of Culturia is to create a framework that makes it easy to tap into Natural Language Understanding (and NLP) algorithms and provide an interface for common tasks. Culturia is an intelligence augmentation software. Its primary interface is a search engine. Another important aspect of the project is that it wants to be usable offline; as such it will come with infrastructure to dump, load and store datasets for offline use.

The current state of the project can be described as a big ball of mud. There is a tiny search engine with crawling skills and that's basically all of it. The immediate changes that should happen are, in order of preference:

- offline stackoverflow (cf. sotoki.scm) and use the generated website to create a zim for kiwix [0]. This is a great occasion to show how great GNU Guile is!
- port whoosh/lucene to guile to improve text search
- offline hackernews, wikidata, wikipedia, wiktionary
- implement BM25f

Culturia is a reference to _Culture and Empire_ by Pieter Hintjens. Sparse documentation is available online [1]. It's hosted on github [2] (this can change if contributors don't want to use github).

The TODO list is big; here is some stuff that needs to be done:

- finish GrammarLink bindings
- create sophia [3] bindings
- implement TextRank
- implement PageRank
- create a GUI using sly or html
- explore ways to easily share the database among several processes

And many other things! Newbies are accepted obviously! Send me a mail or use #guile @ irc.freenode.net, I am amz3.

Happy hacking!

[0] http://www.kiwix.org/wiki/Main_Page
[1] https://amirouche.github.io/Culturia/doc/
[2] https://github.com/amirouche/Culturia
[3] http://sophia.systems/
-- Amirouche ~ amz3 ~ http://www.hyperdev.fr
stopwords.en.txt
Description: Text document
text.scm
Description: Text document
wiredtiger.scm
Description: Text document
wiredtigerz.scm
Description: Text document
wsh.scm
Description: Text document