Re: Performance questions: workspace_size default value and temp file di
From: John Darrington
Subject: Re: Performance questions: workspace_size default value and temp file directory
Date: Sat, 16 Mar 2013 13:33:41 +0100
User-agent: Mutt/1.5.20 (2009-06-14)
On Fri, Mar 15, 2013 at 12:57:46PM +0100, Stefan Tzeggai wrote:
Hi everybody and thanks for this powerful piece of free software.
I use GNU pspp 0.7.9 (Fri Jun 29 19:31:48 UTC 2012) to batch convert CSV
to SAV files. The script basically does
GET DATA /TYPE=TXT
VARIABLE LABELS
VALUE LABELS
SAVE OUTFILE /COMPRESSED
My "bigger" CSV files are between 100MB and 1GB in filesize, 300
columns, 3000000 rows, mostly numerics. PSPP performance is pretty bad
on the big files. One single CPU core uses only 20%, top's wait flickers
up to 20%wa.
I started to investigate solutions and came up with these questions:
SET WORKSPACE=workspace_size
The maximum amount of memory that PSPP will use to store data being
processed. If memory in excess of the workspace size is required,
then PSPP will start to use temporary files to store the data.
Setting a higher value will, in general, mean procedures will run
faster, but may cause other applications to run slower. On platforms
without virtual memory management, setting a very large workspace
may cause PSPP to abort.
1. Question: Is this the amount in BYTES? Any more recommendations on
this setting? Will the amount be reserved on demand (a bit more, a bit
more, a bit more) while processing or fully as soon as the command is
executed?
What is the default value and how can I query the present setting? "SHOW
workspace;" did not work.
The value is in bytes. The default is 64 MB (64 * 1024 * 1024). It is an
upper limit, so it will only be used if needed. It is a little more
complex than that, because it is the maximum amount PER READER - some
operations require multiple readers.
I don't know why SHOW WORKSPACE doesn't work. Maybe that's a bug.
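As a quick check of the byte arithmetic, here is a minimal shell sketch.
It only reproduces the two figures mentioned in this thread (the 64 MB
default and the 256 MB value tried above); the variable names are made up
for illustration and nothing here calls PSPP itself.

```shell
# Default workspace as stated in this thread: 64 MB expressed in bytes.
default_bytes=$((64 * 1024 * 1024))

# The larger value tried in this thread: 256 MB expressed in bytes.
requested_bytes=$((256 * 1024 * 1024))

echo "default workspace:   $default_bytes bytes"
echo "requested workspace: $requested_bytes bytes"
```

The second figure is the number passed to SET WORKSPACE in the question.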
When I set workspace=268435456 (256mb) the process uses 100% CPU and IO
wait is down. So it is an approach for more performance.
That is what I would expect. Basically, the bigger the workspace, the
faster the processing. But clearly if the pspp engine is running at 100%
CPU, then there is nothing left for other processes. This can be an issue
for people who are using the GUI and want it to remain responsive, or who
want other applications to keep working while they wait for results to be
processed.
When I provide a low WORKSPACE, the disk IO increases. Where are these
files stored? I could not find any hints in the documentation and I
could not see any files being created in /tmp. Is there an option to set
this directory?
You can see this if you type SHOW TEMPDIR. On my system it is indeed
under /tmp, but this varies according to operating system. You can
override it with the TMPDIR environment variable, or some operating
systems have their own ways of defining a temporary directory. You might
see a performance advantage if you set it to a directory which is mounted
on a different physical disk from the one you are working on.
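A minimal sketch of the TMPDIR override, under some assumptions:
SCRATCH_DIR is a hypothetical variable naming a directory on a second
physical disk, the /tmp fallback is only a placeholder so the snippet
runs anywhere, and run.sps stands for your own syntax file.

```shell
# Assumption: SCRATCH_DIR names a directory on a different physical disk.
# The /tmp fallback is a placeholder; it gives no speed benefit by itself.
scratch_dir="${SCRATCH_DIR:-/tmp/pspp-scratch}"
mkdir -p "$scratch_dir"

# TMPDIR is set for this one command only, not for the whole shell session.
if command -v pspp >/dev/null 2>&1 && [ -f run.sps ]; then
    TMPDIR="$scratch_dir" pspp run.sps
else
    echo "pspp or run.sps not found; would run: TMPDIR=$scratch_dir pspp run.sps"
fi
```

Adding SHOW TEMPDIR to the syntax file should confirm whether the
override took effect.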
Any more ideas on performance? Can SAVE output be piped to zip-command
directly, so some more disk IO could be saved?
I suppose you could use a fifo, like this:
mkfifo myfifo
cat myfifo | gzip -c > foo.sav.gz &
pspp run.sps
where run.sps contains the line SAVE OUTFILE='myfifo'.
But I am unsure that it would provide any speed benefit.
If ALL you are trying to do is convert text to a .sav file, then running
PSPP is probably not a good idea. It will be much faster if you write a
small Perl script which uses the Perl modules which come with PSPP.
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://keys.gnupg.net or any PGP keyserver for public key.