[O] Problems with org publish cache checking
From: Matt Lundin
Subject: [O] Problems with org publish cache checking
Date: Tue, 24 Nov 2015 09:14:47 -0600
User-agent: Gnus/5.130014 (Ma Gnus v0.14) Emacs/25.1.50 (gnu/linux)
I've been doing some testing of org-publish functions and have found a
few problems with org-publish-cache-file-needs-publishing. They arise
from the fact that it attempts to take included files into account.
The logic is simple enough: while a file may not have changed, the files
it includes may have. So during the publishing process the function
scans every file in a project for #+INCLUDE keywords, comparing the last
modified time of those included files against the timestamps of the
included files stored in the cache. However, there are several
limitations:
1. Unlike org-export-expand-include-keyword,
org-publish-cache-file-needs-publishing takes no account of recursive
includes: i.e., included files within included files.
2. It does not cache timestamps for included files that are not also
project files (i.e., files stored outside of the project or excluded
via the :exclude plist option). Since org-publish caches the
timestamps of only those files that are published directly (i.e., not
as includes), the result is that files that include files
outside of a publishing project are always republished.
3. It is slow!!! The function visits every file in a project to check
for #+INCLUDE declarations, thus offsetting much of the benefit of
caching timestamps. To test this, I created a dummy project with over
1000 pages (not typical usage, of course, but possible for someone
writing a blog over several years or creating a large interlinked
wiki).
During the first publishing run on an old (2007) dual-core machine,
org-mode generated the entire site in 3 minutes (not bad). However,
over 40 seconds of that time was spent by
org-publish-cache-file-needs-publishing (something that is entirely
redundant on the first publishing run).
--8<---------------cut here---------------start------------->8---
Function                                 Calls  Elapsed (s)    Average (s)
org-publish-all                              1  180.82396367   180.82396367
org-publish-projects                         1  180.82375580   180.82375580
org-publish-file                          1008  180.41644274   0.1789845662
org-publish-org-to                        1000  138.45729874   0.1384572987
org-publish-needed-p                      1008  41.538426420   0.0412087563
org-publish-cache-file-needs-publishing   1008  41.210540305   0.040883472
--8<---------------cut here---------------end--------------->8---
During subsequent runs, publishing still took over 40 seconds, despite
the existence of the cache. This is chiefly because
org-publish-cache-file-needs-publishing checks every file for includes:
--8<---------------cut here---------------start------------->8---
Function                                 Calls  Elapsed (s)    Average (s)
org-publish-all                              1  41.335711491   41.335711491
org-publish-projects                         1  41.335444938   41.335444938
org-publish-file                          1008  40.918752137   0.0405940001
org-publish-needed-p                      1008  40.669991543   0.0403472138
org-publish-cache-file-needs-publishing   1008  40.566117665   0.040244164
--8<---------------cut here---------------end--------------->8---
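To put the two profiles side by side, the share of each run spent in
org-publish-cache-file-needs-publishing can be worked out from the
elapsed times above. This is only arithmetic on the reported figures,
written in Python for convenience; the variable names are made up for
the illustration.

```python
# Elapsed times in seconds, copied from the two profiles above.
first_run_total = 180.82396367    # org-publish-all, first run
first_run_check = 41.210540305    # org-publish-cache-file-needs-publishing
second_run_total = 41.335711491   # org-publish-all, cached run
second_run_check = 40.566117665   # org-publish-cache-file-needs-publishing


def share(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


first_share = share(first_run_check, first_run_total)
second_share = share(second_run_check, second_run_total)

print(f"first run:  {first_share:.1f}% of the time in the cache check")
print(f"cached run: {second_share:.1f}% of the time in the cache check")
```

Roughly a fifth of the first run and nearly all of the cached run go to
the check itself.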
Perhaps the simplest solution to all this would be to give users an
option to turn off checking for #+INCLUDE declarations. This would
reduce subsequent publishing runs to a mere second, so long as one does
not use included files.
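As a rough illustration of what such an option would change, here is a
language-neutral sketch in Python. It is not Org's Emacs Lisp code: the
function name, the check_includes flag, the cache layout (a dict from
path to last-modified time) and the simplified #+INCLUDE regexp are all
assumptions made for the example.

```python
import os
import re

# Simplified matcher for '#+INCLUDE: "file" ...' lines (an assumption:
# Org's real parsing handles more cases than this).
INCLUDE_RE = re.compile(r'^[ \t]*#\+INCLUDE:[ \t]*"?([^"\s]+)"?',
                        re.IGNORECASE | re.MULTILINE)


def needs_publishing(path, cache, check_includes=True):
    """Return True if path, or a file it directly includes, looks stale.

    cache maps file paths to the modification time seen at last publish.
    """
    if cache.get(path) != os.path.getmtime(path):
        return True
    if not check_includes:
        # Fast path: a single stat call, the file is never opened.
        return False
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    base = os.path.dirname(path)
    for match in INCLUDE_RE.finditer(text):
        # Drop any "::location" suffix and resolve relative to the file.
        name = os.path.expanduser(match.group(1).split("::")[0])
        included = os.path.join(base, name)
        if not os.path.exists(included):
            return True
        # An include with no cache entry always compares as changed,
        # which mirrors limitation 2 above.
        if cache.get(included) != os.path.getmtime(included):
            return True
    return False
```

With check_includes set to False the check costs one stat per file, at
the price of missing changes in included files.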
A more complex solution would be to cache the names of included files
and to store timestamps for the included files if they are outside of
the project (optionally including recursive logic). I am still trying to
figure out the best way to do this.
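One possible shape for this, again as a hedged Python sketch rather
than Org code: when a file is published, record its own timestamp
together with the names and timestamps of everything it includes,
recursively. Later checks then only stat the recorded files and never
reopen the source. All names here (record, needs_publishing,
all_includes, the cache layout, the simplified regexp) are hypothetical.

```python
import os
import re

# Simplified '#+INCLUDE:' matcher (an assumption, not Org's parser).
INCLUDE_RE = re.compile(r'^[ \t]*#\+INCLUDE:[ \t]*"?([^"\s]+)"?',
                        re.IGNORECASE | re.MULTILINE)


def direct_includes(path):
    """Return absolute paths of the files that path includes directly."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    base = os.path.dirname(os.path.abspath(path))
    names = (m.group(1).split("::")[0] for m in INCLUDE_RE.finditer(text))
    return [os.path.abspath(os.path.join(base, os.path.expanduser(n)))
            for n in names]


def all_includes(path):
    """Collect includes recursively; the seen set guards against cycles."""
    root = os.path.abspath(path)
    seen = {root}
    pending = [root]
    while pending:
        for inc in direct_includes(pending.pop()):
            if inc not in seen and os.path.exists(inc):
                seen.add(inc)
                pending.append(inc)
    return seen - {root}


def record(path, cache):
    """Call after publishing: store path's mtime and its includes' mtimes."""
    cache[path] = {
        "mtime": os.path.getmtime(path),
        "includes": {inc: os.path.getmtime(inc)
                     for inc in all_includes(path)},
    }


def needs_publishing(path, cache):
    """Stat the file and its cached includes without opening anything."""
    entry = cache.get(path)
    if entry is None or entry["mtime"] != os.path.getmtime(path):
        return True
    for inc, mtime in entry["includes"].items():
        if not os.path.exists(inc) or os.path.getmtime(inc) != mtime:
            return True
    return False
```

The include list can only change when the file or one of its includes
changes, and either case already forces a republish that refreshes the
record, so the cached list stays valid between runs. Includes outside
the project are covered because their timestamps live in the including
file's entry.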
Advice on how to proceed would be greatly appreciated.
Thanks,
Matt