guix-devel

Re: File search progress: database review and question on triggers


From: zimoun
Subject: Re: File search progress: database review and question on triggers
Date: Sun, 11 Oct 2020 15:02:52 +0200

Hi Pierre,

I am trying to resume the work on improving "guix search" (making it
faster), which is why I am asking for the details.  :-)
With the introduction of this database, as mentioned earlier, two
annoyances could be fixed at once.


On Sun, 11 Oct 2020 at 13:19, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> > --8<---------------cut here---------------start------------->8---
> > echo 3 > /proc/sys/vm/drop_caches
> > time updatedb --output=/tmp/store.db --database-root=/gnu/store/
> >
> > real    0m19.903s
> > user    0m1.549s
> > sys     0m4.500s
>
> I don't know the size of your store nor your hardware.  Could you
> benchmark against my filesearch implementation?

30G as I reported in my previous email. ;-)


> > From my point of view, yes.  Somehow “filesearch” is a subpart of
> > “search”.  So it should be the machinery.
>
> I'll work on it.  I'll try to make the code flexible enough so that it
> can be moved to another command easily, should we decide that "search"
> is not the right fit.

The UI does not matter much at this point, I guess.  But ideally the
final UI should be:

   guix search --file=


> > For example, I just did “guix pull” and “–list-generation” says from
> > f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10)::
> >
> >    39.9 MB will be download
> >
> > more the tiny bits before “Computing Guix derivation”.  Say 50MB max.
> >
> > Well, the “locate” database for my “/gnu/store” (~30GB) is already to
> > ~50MB, and ~20MB when compressed with gzip.  And Pierre said:
> >
> >       The database will all package descriptions and synopsis is 46 MiB
> >       and compresses down to 11 MiB in zstd.
>
> I should have benchmarked with Lzip, it would have been more useful.  I
> think we can get it down to approximately 8 MiB in Lzip.

Well, I think it will be larger once it includes all the items of all
the packages.  My point is: the database will be comparable in size to
what "guix pull" downloads; it is not much but still something.
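
To compare sizes on other stores, a rough sketch follows.  It assumes
gzip is installed; the function name "size_report" is made up for
illustration, and the same approach would apply to zstd or lzip.

```shell
# Hypothetical helper: print the raw and gzip-compressed size (in bytes)
# of a database file, e.g. the /tmp/store.db produced by updatedb.
size_report() {
    file="$1"
    # Arithmetic expansion strips any padding that wc may print.
    raw=$(( $(wc -c < "$file") ))
    packed=$(( $(gzip -c "$file" | wc -c) ))
    echo "raw=$raw gzip=$packed"
}
```

For instance, "size_report /tmp/store.db" on the locate database
mentioned above.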


> > which is better but still something.  Well, it is not affordable to
> > fetch the database with “guix pull”, In My Humble Opinion.
>
> We could send a "diff" of the database.

This means setting something up on the server side, right?  So
implementing the "diff" in "guix publish", right?  Hmm, I feel it is
overcomplicated.


> For instance, if the user already has a file database for the Guix
> generation A, then guix pulls to B, the substitute server can send the
> diff between A and B.  This would probably amount to less than 1 MiB if
> the generations are not too far apart.  (Warning: actual measures needed!)

Well, what is the size of the database for a full /gnu/store/
containing all the packages of one specific revision?  Sorry if you
already provided this information and I missed it.
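
To make the "diff" idea a bit more concrete, here is a minimal sketch.
It assumes that the database of each generation can be dumped as a
sorted text file with one entry per line; that dump format and the
function name "db_diff" are hypothetical and not what the filesearch
code does.

```shell
# Hypothetical sketch: print the entries removed ("-") and added ("+")
# between two sorted, one-entry-per-line dumps of the file database.
# Both inputs must be sorted in the same locale, as comm(1) requires.
db_diff() {
    old="$1"
    new="$2"
    # Entries only in the old dump: removed in the new generation.
    comm -23 "$old" "$new" | sed 's/^/-/'
    # Entries only in the new dump: added in the new generation.
    comm -13 "$old" "$new" | sed 's/^/+/'
}
```

Piping the output through "wc -c" would give a first estimate of what
the server has to send between two generations.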


> > Therefore, the database would be fetched at the first “guix search”
> > (assuming point above).  But now, how “search” could know what is custom
> > build and what is not?  Somehow, “search” should scan all the store to
> > be able to update the database.
> >
> > And what happens each time I am doing a custom build then “filesearch”.
> > The database should be updated, right?  Well, it seems almost unusable.
>
> I mentioned this previously: we need to update the database on "guix
> build".  This is very fast and would be mostly transparent to the user.
> This is essentially how "guix size" behaves.

Ok.


> > The model “updatedb/locate” seems better.  The user updates “manually”
> > if required and then location is fast.
>
> "manually" is not good in my opinion.  The end-user will inevitably
> forget.  An out-of-sync database would return bad results which is a
> big no-no for search.  On-demand database updates are ideals I think.

The tradeoff is:
 - when is "on-demand"?  When is the database updated?
 - searching should still be fast
 - other guix subcommands should not be slowed down


What you are proposing is:

 - when "guix search --file":
     + if the database does not exist: fetch it
     + otherwise: use it
 - after each "guix build", update the database

Right?
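
As a sketch of the flow above, assuming a single local database file:
the function name and the action names it prints are placeholders, not
existing Guix code.

```shell
# Hypothetical sketch: decide what to do with the file database.
# It only prints the name of the action; nothing is fetched or updated.
file_search_db_step() {
    event="$1"   # "search" or "build"
    db="$2"      # path to the local file database (hypothetical)
    case "$event" in
        search)
            if [ -e "$db" ]; then
                echo "use"      # database present: query it directly
            else
                echo "fetch"    # database missing: fetch it first
            fi
            ;;
        build)
            echo "update"       # after "guix build": add the new items
            ;;
        *)
            echo "unknown event: $event" >&2
            return 1
            ;;
    esac
}
```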

I am still missing the other mechanism for updating the database.

(Note that the "fetch it" could be done at "guix pull" time which is
more meaningful since pull requires network access as you said.  And
the real computations for updating could be done at the first "guix
search --file" after the pull.)


> Possibly using a "diff" to shrink the download size.
>
> >  - otherwise: use this database
> >  - optionally update the database if the user wants to include new
> >  custom items.
>
> No need for the optional point I believe.

Note that since the same code is used on build farms and their store
is several TB (see recent discussion about "guix gc" on Berlin that
takes hours), the build and update of the database need some care. :-)


> >> - Find a way to garbage-collect the database(s).  My intuition is that
> >>   we should have 1 database per Guix checkout and when we `guix gc` a
> >>   Guix checkout we collect the corresponding database.
> >
> > Well, the exact same strategy as
> > ~/.config/guix/current/lib/guix/package.cache can be used.
>
> Oh!  I didn't know about this file!  What is it used for?

Basically for "--news".  Otherwise, it is used by
"fold-available-packages", "find-packages-by-name" and
"find-packages-by-location".  It is used only if "--load-path" is not
provided (cache-is-authoritative?).  And it is computed at the end of
"guix pull".  The discussions about improving "guix search" were first
about replacing it with an SQL database, then adding another file
mimicking it, then extending it (which leads to backward compatibility
issues).

For example, compare:

--8<---------------cut here---------------start------------->8---
time guix package --list-available > /dev/null

real    0m1.025s
user    0m1.866s
sys     0m0.044s

time guix package --list-available -L /tmp/foo > /dev/null

real    0m4.436s
user    0m6.734s
sys     0m0.124s
--8<---------------cut here---------------end--------------->8---

The first uses the cache, the second does not.


Cheers,
simon


