guile-user

[ANN] hyper search engine


From: Amirouche Boubekki
Subject: [ANN] hyper search engine
Date: Sun, 08 May 2016 21:01:31 +0200
User-agent: Roundcube Webmail/1.1.2

Héllo all!

# What?

I'm happy to share with you the code of my **search engine** project,
named **hyper**. This is a developer preview; it is not even user
friendly yet.

Basically, it implements the following:

- support for HTTP and HTTPS
- extracting words as tokens from HTML pages
- extracting links and crawling websites up to a certain depth
- indexing URLs by their extracted tokens
- searching URLs by multiple tokens

There is no ranking and no web UI yet.

# Why?

It's an interesting project!

# How?

It uses Guile all over the place!

## Database

It is built on the Guile bindings of the wiredtiger database storage engine.

On top of wiredtiger sits a tuple space (arguably an RDF database)
called UAV, which has only 3 or 4 primitives. It has a SPARQL-like
query engine (if I'm not mistaken about SPARQL) written on top of
miniKanren. This is syntactic sugar, but it's fun enough to be
mentioned; querying happens mostly through the SPARQL-like syntax
because it's so much fun.
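To give a feel for the tuple-space idea, here is a minimal sketch in
plain Guile. This is **not** UAV's actual API; the tuples, procedure
names, and data are illustrative only:

```scheme
;; Sketch only: not UAV's real primitives.
;; A tuple space stores 3-item lists: (subject attribute value).
(define tuples
  '((url1 url "http://gnu.org")
    (url1 token "guile")
    (url2 url "http://hypermove.net")
    (url2 token "hypermove")))

;; Find every subject carrying a given attribute/value pair;
;; a real query engine would bind logic variables across tuples instead.
(define (subjects-with attribute value)
  (map car
       (filter (lambda (tuple)
                 (and (eq? (cadr tuple) attribute)
                      (equal? (caddr tuple) value)))
               tuples)))

(subjects-with 'token "guile")  ;; => (url1)
```

The miniKanren layer generalizes this kind of lookup by letting several
such patterns share variables, which is what makes the SPARQL-like
syntax possible.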

The UAV database is exposed over a socket using a read-eval-return
loop, which means that on the client side you query the database
almost as if it were local (look for the `db` procedure calls in the
code). It's almost as convenient, but not perfect. Stored procedures
are probably the future, but I want to delay the switch to them as
much as possible (you can still argue!).
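The read-eval-return pattern itself is simple. The sketch below is not
hyper's actual code, just the general shape of such a client call over
a socket port:

```scheme
;; Sketch of a read-eval-return client (not hyper's real `db` procedure).
;; The client writes the call as an s-expression; the server reads it,
;; evaluates it against the database, and writes the result back.
(define (db sock proc . args)
  (write (cons proc args) sock)  ; send e.g. (add-url "http://gnu.org")
  (force-output sock)            ; flush the port
  (read sock))                   ; read the server's returned datum
```

Because both sides speak s-expressions via `read` and `write`, the
remote call looks almost like a local procedure call.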

The UAV database is schema-less, and I'd like to keep it that way
until a better idea comes along. I chose this design because I feel
more comfortable with it than with the hypergraph or graph databases
that I also developed. I also think that a tuple database is more
schemey. That's also why I did not choose a fixed schema.

You can think of hyper's database as a Redis with Guile RPC, backed by
a real storage engine (with MVCC, transactions, and so on).

Even though the database stores only a set of 3-item lists, there are
links between tuples, so there is an underlying graph structure.
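For instance (with illustrative identifiers, not hyper's real schema),
a tuple's value can itself be another subject, which is exactly what
turns a flat set of tuples into a graph:

```scheme
;; Sketch: edges between subjects encoded as 3-item tuples.
(define graph
  '((page1 links-to page2)
    (page2 links-to page3)))

;; Follow links-to edges out of a node.
(define (neighbors node)
  (map caddr
       (filter (lambda (tuple)
                 (and (eq? (car tuple) node)
                      (eq? (cadr tuple) 'links-to)))
               graph)))

(neighbors 'page1)  ;; => (page2)
```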

It's single threaded.

## What else?

- It requires wiredtiger **2.6.1**
- It requires guile-gnutls
- It requires html2text.

It also uses the htmlprag HTML parser from guile-lib, but that one is
embedded in the source tree.
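htmlprag parses (possibly malformed) HTML into SXML. A minimal use
looks like the following; the exact module path may differ in hyper's
embedded copy:

```scheme
(use-modules (htmlprag))  ; module path may differ in hyper's tree

;; html->sxml is htmlprag's permissive HTML parser.
(html->sxml "<p>Hello <b>Guile</b></p>")
;; => roughly (*TOP* (p "Hello " (b "Guile")))
```

Tokens and links are then easy to extract by walking the resulting
SXML tree with ordinary list operations.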

# Getting started

Start by cloning the code and switching to the created directory:

```
$ git clone https://framagit.org/a-guile-mind/hyper.git
$ cd hyper
```

Create the `db` directory and start the database server from within
the `hyper` directory:

```
hyper $ mkdir db
hyper $ guile -L . hyper-server.scm
```

This simply starts the UAV database server listening on a socket.

Now that the database server is running, you can add some seed URLs
for crawling. Fire up a Guile REPL using the following command:

```
hyper $ guile -L .
```

And try to reproduce the following REPL run:

```scheme
scheme@(guile-user)> (use-modules (hyper))
scheme@(guile-user)> (add-url "http://hypermove.net")
$1 = #t
scheme@(guile-user)> (add-url "http://gnu.org")
$2 = #t
```

Keep the REPL around.

Now you are ready to crawl, if you are lucky... Try the following in
another terminal:

```
hyper $ guile -L . hyper-worker.scm
```

This should start spitting out messages showing that it is crawling,
extracting, indexing (and sometimes failing on) the web.

Now you probably want to **search** the web! There is no easier way
than using a Guile REPL. Maybe you still have one around; then try the
following:

```scheme
scheme@(guile-user)> (use-modules (hyper))
scheme@(guile-user)> (map pk (search* "guile"))
...
scheme@(guile-user)> (map pk (search* "hypermove" "guile"))
...
```

Hurrah!! You (probably) made your first really **free search** of the
web using Guile!

# Where to go from here

I am really interested in the project, but I had to fight to get this
done, as there are various issues... coding, you know...

Like I said earlier, this is a developer preview, mostly to try to get
people interested in the project and hopefully attract some
contributions (patches, jokes, reviews, comments, snacks...).

I warmly recommend that you fork the project and goof around.

Here is my hyper todo list, in no particular order:

- improve URL retrieval; right now there is an issue where some
  relative hrefs are not properly interpreted
- switch to some async framework
- add support for multiple threads in hyper-server.scm (the UAV server, actually)
- add support for multiple workers
- fix uav.scm to spawn a server when run
- improve indexing to store the title of the page
- improve indexing to skip URLs that are not text, based on their mime type
- improve indexing to add support for pdf/ps/odt files
- investigate why some URLs can not be indexed
- improve crawling to override the crawling height with the biggest height (hyper-worker.scm)
- improve crawling to make use of a crawling date
- improve crawling to make use of a freshness/liveness logic
- improve crawling to support robots.txt
- improve crawling to follow redirections
- improve crawling and indexing to make use of websites' REST APIs
  (instead of crawling HTML, we crawl the REST API; useful for websites
  like StackOverflow, Quora, and Hacker News)
- improve search with ranking (PageRank, TF-IDF, BM25...)
- update the wiredtiger bindings to the latest version of wiredtiger
- create some kind of UI that is not the REPL


Feel free to contact me in private, if you wish to discuss whatever you want!


Happy hacking!


--
Amirouche ~ amz3 ~ http://www.hyper{dev.fr,move.net}


