Re: One vs many directories

emacs-orgmode

From:	Ihor Radchenko
Subject:	Re: One vs many directories
Date:	Mon, 23 Nov 2020 22:16:24 +0800
Dear Jean Louis,

Your description of the database reminds me of how org-roam handles the
files - it also uses an external database for linking and allows quick
incremental search that does not really depend on where the
files/headings are stored.

However, what you are talking about goes against the org-mode philosophy,
as I know it. Currently, the devs' stance is that org files must be
self-sufficient. Org-mode should not depend on an external database to
manage the org files and operations on them. Everything must be plain
text! Moreover, some devs are even reluctant to hide metadata (like the
unique IDs implemented in the org-id module) from users (which is possible
and also partially implemented).

Best,
Ihor


Jean Louis <bugs@gnu.support> writes:

> * Texas Cyberthal <texas.cyberthal@gmail.com> [2020-11-23 12:51]:
>> Hi Dr. Arne,
>> 
>> > The only part that hits performance limits is the agenda.
>> 
>> Well, IIRC your Org Textmind is much smaller than mine.
>> 
>> > My current guess is that the agenda is slow because it has to parse all my 
>> > 7500 clock entries, and it has to check the Todo states of around 1200 
>> > headings.
>> 
>> Ouch.  I'd rather keep a "ramble log" so I can reconstruct an exactly
>> honest time accounting, with discounts for partial attention, without
>> worrying about fiddly clockin/outs.  At least when working from home.
>> If clocking into a work site, that's different, because one can
>> reasonably bill for the entire time, with minimal clock toggling.
>> 
>
>> > Did you check against filesystem limits? At 10k entries in a
>> > directory typical filesystems start becoming slow. That's the main
>> > reason I see for adding hierarchies.
>
> From ext4 manual:
>
>  dir_index
>   Use hashed  b-trees to speed  up name lookups  in large
>   directories.   This feature  is supported  by ext3  and
>   ext4 file systems, and is ignored by ext2 file systems.
>
>  dir_nlink
>   This ext4 feature allows more than 65000 subdirectories
>   per directory.
>
> I think that file systems should be unlimited and fast in relation to
> that. I have ~/Maildir with over 50000 subdirectories, direct access
> is very easy and fast while listing takes some time.
>
> If file system does not allow fast access it is time to replace it
> with one that does allow it.
>
> Now I wonder if HAMMER in DragonflyBSD is also slow with 50000
> directories.
>
> My PostgreSQL database is not huge: about 50 MB when packed, 810 MB on
> the file system.
>
> Selecting the 2469 contacts that belong to a certain group, out of
> 204048 contacts, usually gives no feeling of delay; it looks instant
> to a human.
>
> My Org work is on meta-level so my truly important headings or subtree
> names are in the database. Subtrees have their various properties,
> like I can place any tags there inside, like TODO or designate type of
> TODO. My work is intertwined with text and Org mode mostly, but I
> could use any kind of mime type or any kind of Emacs mode. Some nodes
> are on file system while some are in the database.
>
> Nodes within subtree are hyperdocuments, they are all linkable and
> could be on file system or not on file system.
>
> Everything is together in one tree and it does not matter, as access to
> the nodes does not necessarily go through the tree. There are 19197 nodes.
> Finding the 76 that are tagged with TODO gives me no visible delay,
> definitely not even 0.2 seconds. When I press enter it
> is right there.
>
> From the system I am using personally I am thinking that Org mode
> could get its database connection so that headings and properties are
> managed in the database fully while text could be managed in files. It
> seems very possible.
>
> The only thing that would be needed to add to Org in that case is some
> heading tag that would uniquely designate where in the database that
> heading is managed. It could be very lightly displayed on the screen
> and would not be exported by default.
>
> Something like
>
> *** TODO Heading                                     :ID-123:
>
> That would be all. All other meta data belonging to the heading could
> be managed in the database. If heading is deleted it need not be
> deleted in the database. Text belonging to heading could be managed in
> the text file. Properties in the database. It can be simple database
> such as GDBM if such is fast enough.
>
> Meta data for the heading would or could be updated automatically from
> time to time.
>
> User could easily decide to show the properties in the Org file or not
> to show. It does not matter much as long as :ID-123: tag is there.
>
> All things like tags, properties, clock-in and out, schedule,
> deadlines, custom_id and everything else as heading meta data could be
> manageable in the database. It could be copied into new headings.
> Creation of heading like this:
>
> *** TitleRET
>
> would automatically invoke creation of heading 124 in the database and it 
> would appear as:
>
> *** Title                                          :ID-124:
>
> From there on user would be doing anything as usual in the Org mode
> with the difference that properties would be displayed in the updated
> manner and would not be really in the Org file. They would be
> displayed on the fly. Any properties and plethora of other new
> properties could be included.
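
The automatic tagging of new headings described above could be sketched roughly as follows. This is only an illustration in Python, not how Org or the author's system works: the function name `tag_headings`, the regular expressions, and the idea of passing the next free number as an argument (instead of asking a database for it) are all assumptions. The `:ID-n:` marker is the notation proposed in the message.

```python
import re
from typing import Tuple

HEADING = re.compile(r"^\*+ ")              # Org heading: stars, then a space
ID_TAG = re.compile(r":ID-\d+:")            # the proposed marker, e.g. :ID-124:
TAGS = re.compile(r"(?<=\s):[\w@#%:]+:$")   # an existing trailing tag group

def tag_headings(text: str, next_id: int) -> Tuple[str, int]:
    """Append :ID-n: to every heading that lacks one.

    Returns the new text and the next unused number. In the proposed
    system the number would come from the database (assumption: here it
    is a plain counter passed in by the caller).
    """
    out = []
    for line in text.splitlines():
        if HEADING.match(line) and not ID_TAG.search(line):
            line = line.rstrip()
            if TAGS.search(line):
                # Join an existing tag group: ":work:" becomes ":work:ID-n:"
                line = f"{line}ID-{next_id}:"
            else:
                line = f"{line} :ID-{next_id}:"
            next_id += 1
        out.append(line)
    return "\n".join(out), next_id

org_text = "*** Title\nSome text\n*** TODO Heading :ID-123:"
tagged, next_free = tag_headings(org_text, 124)
print(tagged)
```

Headings that already carry a marker are left alone, so running the function again on its own output changes nothing.
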
>
> System would recognize automatically by saving the Org file or by
> opening it:
>
> - If headings are in the right file, if file changed its place it
> would be automatically updated in the database. 
>
> - the heading ID would always remain unique no matter what, so users
> linking to any heading would not need to worry of title remaining. The
> unique ID that links to heading would basically link to the database
> entry. Opening the link would ask database where the entry is located
> and it would open up proper Org file at proper location without
> parsing the Org file in usual manner. Org file would then remain
> pretty much more text than it is now.
>
> - all the parsing and searching and indexing would be automatically
> solved and human readable SQL queries could be easily customized by
> user. Suddenly there would be much less commotion in work. Org files
> would look much more human-readable than they do now.
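
To make the three points above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, and helper functions (`add_heading`, `locate`, `move_file`) are hypothetical and chosen only for illustration; the message itself mentions PostgreSQL or GDBM and does not specify any schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.executescript("""
CREATE TABLE headings (
    id    INTEGER PRIMARY KEY,   -- the number behind the :ID-123: marker
    title TEXT NOT NULL,
    file  TEXT NOT NULL,         -- Org file that currently holds the text
    todo  TEXT                   -- e.g. 'TODO', 'DONE', or NULL
);
CREATE TABLE properties (
    heading_id INTEGER NOT NULL REFERENCES headings(id),
    key        TEXT NOT NULL,
    value      TEXT,
    PRIMARY KEY (heading_id, key)
);
""")

def add_heading(title, file, todo=None):
    """Register a heading and return its unique ID."""
    cur = conn.execute(
        "INSERT INTO headings (title, file, todo) VALUES (?, ?, ?)",
        (title, file, todo),
    )
    return cur.lastrowid

def locate(heading_id):
    """Resolve a link by ID to (file, title) without parsing any Org file."""
    return conn.execute(
        "SELECT file, title FROM headings WHERE id = ?", (heading_id,)
    ).fetchone()

def move_file(old, new):
    """Record that an Org file changed its place."""
    conn.execute("UPDATE headings SET file = ? WHERE file = ?", (new, old))

# Two headings with the same title stay distinguishable by ID.
a = add_heading("Heading", "notes.org", "TODO")
b = add_heading("Heading", "notes.org")
move_file("notes.org", "archive/notes.org")

# A human-readable query: every heading currently marked TODO.
todos = conn.execute(
    "SELECT id, title FROM headings WHERE todo = 'TODO'"
).fetchall()
print(locate(b), todos)
```

Because links resolve through the ID, renaming a heading or moving its file only requires updating one row.
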
>
>> 10k entries in a directory sounds inhumanely unergonomic.  I guess my
>> biggest flat name directory might eventually reach that size?  In
>> which case I could just split it in the middle of the alphabet, or
>> similar solution.
>
> Like by first letters, like
>
> ~/Maildir/a/d/a/adam@example.com
>
> Such sorting of files would be automatic. You would need to invoke a
> command that sorts files that way automatically and that may also
> quickly access such files automatically.
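
A first-letters layout like the one above can be computed from the name alone, so no directory listing is ever needed. The following Python sketch is an assumption about how such a command might work, not the author's actual tool; the function name `sharded_path`, the default depth of three, and the `_` padding are all illustrative choices.

```python
from pathlib import PurePosixPath

def sharded_path(root: str, name: str, depth: int = 3) -> PurePosixPath:
    """Build root/<c1>/<c2>/<c3>/name from the first `depth` characters."""
    # Lowercase so "Adam@..." and "adam@..." share a bucket. Characters that
    # are not letters or digits (".", "@", "/") become "_", and short names
    # are padded with "_", so every entry sits at the same depth.
    head = name.lower()[:depth]
    prefix = "".join(c if c.isalnum() else "_" for c in head).ljust(depth, "_")
    return PurePosixPath(root, *prefix, name)

print(sharded_path("~/Maildir", "adam@example.com"))
# prints ~/Maildir/a/d/a/adam@example.com
```

The same function serves both for filing a new entry and for jumping straight to an existing one.
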
>
> I have a command that I often use, mkdatedir, that makes a directory for
> the current date.
>
> If I wish to make a database note for the day, the command today-note would 
> make sure there is:
>
> - Year 2020 (formatted how I customize it)
>   - November (also formatted by custom)
>     - 2020-11-23
>       And entry is automatically opened for the note.
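
A year / month / day hierarchy like this one is simple to create on demand. The sketch below is a guess at what a command in the spirit of mkdatedir might do on the file system; the function name `dated_dir` and the exact directory name formats are assumptions, since the message only says they are customizable.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional

def dated_dir(root: Path, day: Optional[date] = None) -> Path:
    """Create (if needed) and return root/<year>/<month>/<YYYY-MM-DD>."""
    day = day or date.today()
    # Assumed layout: 2020/11-November/2020-11-23. The month name comes
    # from strftime's %B and therefore follows the active locale.
    target = root / f"{day:%Y}" / f"{day:%m-%B}" / day.isoformat()
    target.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return target

# Demonstration in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    made = dated_dir(Path(tmp), date(2020, 11, 23))
    print(made.relative_to(tmp))
```

Calling it a second time for the same day returns the existing directory, so several notes can be placed under the same date.
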
>
> The system helps that I locate quickly the note that relates to the
> day. But I can put multiple notes under same date and I can also have
> same titles for those multiple notes. This is because each note has
> its unique ID.
>
> I do not know how Org handles multiple identical headings when linking
> to them. It does not by default:
>
> [[Heading][Heading]]
>
> * Heading
>
>   Text here
>   
> * Heading
>
>
>   More text here. But if I wish to link here I need to do hac
>
> To me and my thinking that is not really logical. There should always
> be a unique ID for each heading. My mind is not comforted by the Org
> system in that sense. And I should not have to think about the unique
> ID, nor should I have to write links like [[Something][Here]] myself,
> as they should be constructed automatically.
>
> Myself I would like to come with cursor to second Heading and capture
> the link to the heading. I would kill [[Heading][Heading]] into
> special memory for those links. Then I could go to any other place in
> the Org file and insert it there without thinking how link looks like
> or constructing the link myself as it already exists in front of me.
>
> Constructing links by hand is fine for those which are external.
>
> Headings of Org files could be managed by the database in background.
> Then all that distributed or sparse meta information (mess) disappears. 
>
> What people are now trying to handle with Org files is management of a
> database. Only the entries of that database are pretty much
> disconnected from each other, vague, and in unknown positions, so Org
> algorithms try to manage everything that is built into all SQL
> databases anyway. The mess grows over time.
>
>> A 10k entry directory is getting into enterprise territory, and I'm
>> sure enterprise has tech tricks that become worthwhile at that scale.
>
> I will try with those options dir_index and dir_nlink to see if my
> 50000+ directory becomes somewhat faster. Direct access to the
> subdirectory is always very fast. I almost never do ls there neither
> enter any such directory manually. They store emails, so I just click
> one key in mutt, that key extracts the current email address such as
> person@example.com and opens up ~/Maildir/person@example.com, one
> among 50000. It is accessed by wanting to see previous conversation
> with the contact, not by knowing what is the directory name or email
> address, computer does that. It is simple system I use for years and
> it is blazing fast.
>
>> > There are scaling problems in every direction: Too many files per
>> > directory, too large files, too much content per heading, too many
>> > headings.
>
> To list more than 200,000 contacts does take some time but access to
> the list from database is so much faster than ls in the ~/Maildir with
> more than 50000 entries or subdirectories. I can relate to that. And I
> still think that file systems should manage any numbers of entries.
>
>> There are scaling problems from too much deep tree nesting, namely too
>> much fiddly ambiguous manual refiling.  Solution is flat "solid name"
>> directories just below feasible 10 Bins.  Work fine.
>
> I have tried your solution and could not find the mental concept to
> relate to my thinking. And I do agree that such solution could help
> other people.
>
> For images I have some command like `sort-images.lisp' that just sorts
> images by its embedded dates. Many times I sort even downloads per day.