pspp-users

Re: Excessive file system usage


From: Dave Trollope
Subject: Re: Excessive file system usage
Date: Tue, 24 Dec 2019 10:21:28 -0600

I ran some more tests on this and found that there is a temp file being
stored in a directory /tmp/pspp*, and that file is where all the space is
going while PSPP writes the actual CSV.
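
If the location of that temp file (rather than its size) is the immediate
problem, one workaround may be to point the temp directory at a volume with
more headroom. This is only a sketch: it assumes PSPP picks up the standard
TMPDIR environment variable for its temporary files (which I haven't
verified), and the paths and script name are made up.

  # Redirect temp files to a roomier volume; assumes PSPP honors TMPDIR.
  mkdir -p /data/pspp-tmp
  TMPDIR=/data/pspp-tmp pspp convert.sps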

I tested pspp-convert as Ben P suggested and can confirm that it doesn't
suffer from the same issue; it only uses the space needed to write the CSV.
Unfortunately, we aren't ready to switch to it because it doesn't support
labels or variable selection. That will be something to address next year.
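
For the archives, the basic invocation we tested was essentially the
one-liner below; the file names are placeholders, and as far as I can tell
pspp-convert infers the output format from the extension:

  pspp-convert input.sav output.csv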

Cheers
Dave

On Wed, Dec 4, 2019 at 9:54 PM Dave Trollope <address@hidden>
wrote:

> I can confirm that the same behavior exists in non-Docker environments. I
> went back to my trusty dev VM running in VirtualBox and see the same
> behavior. Here is the config of the VM:
>
> deploy@app1[local]:~$ uname -a
> Linux app1 4.4.0-169-generic #198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux
> deploy@app1[local]:~$ cat /etc/issue
> Ubuntu 16.04.6 LTS \n \l
>
> And monitoring the filesystem behavior of /dev/sda1:
>
> Starting a conversion:
>
> Every 1.0s: df -H /; ls -ltr /tmp                  Thu Dec  5 03:49:26 2019
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1        26G  8.9G   18G  35% /
> total 166384
> drwxr-xr-x 3 www-data www-data      4096 Dec  5 03:41 resources
> drwxr-xr-x 6 root     root          4096 Dec  5 03:41 vagrant-chef
> -rw-rw-r-- 1 deploy   deploy   158002235 Dec  5 03:46 k42tn4xv.csv
> -rw------- 1 deploy   deploy           0 Dec  5 03:48 90ctcu3h.csv
> drwx------ 2 deploy   deploy        4096 Dec  5 03:48 pspp92Vei8
> -rw-rw-r-- 1 deploy   deploy    12357632 Dec  5 03:48 90ctcu3h.csvtmphbUHkN
>
> At end of conversion:
>
> Every 1.0s: df -H /; ls -ltr /tmp                  Thu Dec  5 03:49:26 2019
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1        26G   16G   11G  60% /
> total 270324
> drwxr-xr-x 3 www-data www-data      4096 Dec  5 03:41 resources
> drwxr-xr-x 6 root     root          4096 Dec  5 03:41 vagrant-chef
> -rw-rw-r-- 1 deploy   deploy   158002235 Dec  5 03:46 k42tn4xv.csv
> -rw-rw-r-- 1 deploy   deploy   118785163 Dec  5 03:48 90ctcu3h.csv
> drwx------ 2 deploy   deploy        4096 Dec  5 03:48 pspp92Vei8
>
> You’ll see the filesystem’s free space dropped by about 7 GB even though
> only 118 MB was written.
>
> I will try pspp-convert as Ben suggested and report back.
>
> Cheers
> Dave
>
> On Dec 4, 2019, 2:06 PM -0600, Dave Trollope <address@hidden>,
> wrote:
>
> Once the conversion is complete the space is returned, so it’s not a
> long-term problem, only an issue during the conversion. It became visible
> because in Kubernetes you control your resources much more tightly, which
> is why it was highlighted.
>
> I’m not sure there is anything special about the SAV files, so yes, I
> would expect it to be easily reproducible. But at this point I don’t know
> what I don’t know that might be relevant ;-)
>
> I will try running the same thing on a regular EC2 instance vs. Docker, as
> mentioned in my earlier email, and verify whether this is truly unique to
> Docker-based environments. My gut tells me it is not; we just didn’t
> notice before because we had lots of space on the machine.
>
> Cheers
> Dave
> On Dec 4, 2019, 11:15 AM -0600, Alan Mead <address@hidden>, wrote:
>
> I'm curious to see what the devs say. I think they use Debian, but I don't
> know about Docker.
>
> So is the excessive disk space used and then returned when PSPP is done,
> so that only 150 MB end up consumed? Or do many GB of storage seem to
> disappear (so the CSV file shows a size of 150 MB but the Docker container
> is 7 GB bigger)?
>
> If I wanted to replicate the behavior, are there any special aspects to
> the data files? I'd create a SAV file with a few columns and enough rows
> of random data to make a 1 GB SAV file, right? Then I'd run your script to
> create the CSV, right? And if I did this on a stock Linux host without
> Docker/ramfs/etc., I wouldn't see 7 GB of space consumed during the
> conversion, but if I then arranged to do the same test using Docker or
> ramfs, I would? Is that correct? (A sketch of how I'd generate the test
> file follows below.)
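>
> Something like this is what I have in mind for generating the test data.
> The variable names and row count are made up, and I'm assuming PSPP's
> RV.NORMAL/RV.UNIFORM random-number functions behave as I remember:
>
> INPUT PROGRAM.
> LOOP #i = 1 TO 10000000.
> COMPUTE x1 = RV.NORMAL(0, 1).
> COMPUTE x2 = RV.UNIFORM(0, 100).
> END CASE.
> END LOOP.
> END FILE.
> END INPUT PROGRAM.
> SAVE OUTFILE="bigtest.sav".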
>
> If so, that seems to indicate something to do with Docker/ramfs, right?
> Or are you saying this would affect a physical Linux host equally?
>
> -Alan
>
>
> On 12/4/2019 9:24 AM, Dave Trollope wrote:
>
> Hi Alan,
>
> Sorry, yes, I forgot to mention this is Linux: Debian GNU/Linux 9,
> Linux e1e6db1d8408 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019
> x86_64 GNU/Linux
>
> I’ve reproduced this behavior in Kubernetes and outside Kubernetes in a
> raw Docker container, so it’s not Kubernetes-specific, but it may be
> related to the way the containerized image is built in Docker.
>
> We haven’t observed this on our standard EC2 instances, but to be honest
> we haven’t monitored them in the same way. We have enough space there that
> it could have gone unnoticed. I will try that and see.
>
> What I'm doing is watching the filesystem while the SAVE TRANSLATE command
> is running, using: watch -n 0.5 "df -H; ls -ltr /tmp"
>
> The only file being written is the CSV, but the filesystem's free space is
> dropping at a much higher rate than data is being written. No other temp
> files are being placed in /tmp.
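>
> Two generic checks that might help narrow down where the space is going
> (neither is PSPP-specific): ls -ltr /tmp doesn't count space used inside
> subdirectories, which du does, and lsof can reveal files that were
> unlinked while still held open and are therefore invisible to ls:
>
> du -sh /tmp/* 2>/dev/null | sort -h
> sudo lsof +L1 | grep /tmp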
>
> I also reproduced this using a RAM-based filesystem; if you watch the
> usage it behaves the same, so I don't think it's specific to Dockerized
> filesystems, but I might yet be wrong on that.
>
> The link you shared describes a common problem when starting out with
> containers, where the build process creates lots of images. As you build
> lots of images, you have to clean up. It's one of the first things you
> learn as you step into the container world!
>
> Appreciate the quick reply. It certainly was a shocking observation when I
> found it :-)
>
> Cheers
> Dave
>
>
> On Dec 4, 2019, 8:29 AM -0600, Alan Mead <address@hidden>, wrote:
>
> Wow, that's a lot. Do you mean that 7 GB of space are needed (for, I
> guess, temporary files)? And you did not observe that previously?
>
> Maybe the devs are familiar with Kubernetes; I only know the name. Can you
> describe the environment (e.g., OS)? And the PSPP version? In how many
> conversions have you observed this behavior?
>
> And you're sure this isn't a Kubernetes problem (like it's making
> snapshots as it writes the file or something)? I ask because when I google
> about this, it looks like there are sharp edges. Glancing through, the
> links below don't seem to directly and specifically address the behavior
> you're seeing, but it looks like there could be these kinds of issues with
> Kubernetes, and the PSPP devs wouldn't be able to help unless they knew
> Kubernetes:
>
>
> https://cntnr.io/whats-eating-my-disk-docker-system-commands-explained-d778178f96f1
>
> https://softwareengineeringdaily.com/2019/01/11/why-is-storage-on-kubernetes-is-so-hard/
>
> -Alan
>
>
> On 12/4/2019 6:40 AM, Dave Trollope wrote:
>
> We just moved PSPP to Kubernetes containers, where we use it to extract
> CSVs from SAV files. The SAV files are about 1 GB and each CSV is about
> 150 MB.
>
> We’ve watched the file system while the conversion runs, and over 7 GB of
> the file system is used while writing 150 MB. I assume the SAVE command is
> doing lots of seeks and insertions in the file, magnifying the file system
> usage. Are there any options to limit this behavior?
>
> Here is the script we are using:
> GET FILE = "{}".
>
> SAVE TRANSLATE
>   /OUTFILE="{}"
>   /TYPE=CSV
>   /FIELDNAMES
>   /REPLACE
>   /KEEP={}
>   /MISSING=RECODE
>   /CELLS=LABELS.
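>
> One knob that might help, if the extra usage is PSPP spilling its working
> data to /tmp: PSPP's SET command has a WORKSPACE setting, a memory limit
> in kilobytes if I'm reading the docs right. Something like the line below
> before the GET, sized to the RAM actually available (the 2 GB figure is
> only an illustration):
>
> SET WORKSPACE=2097152.
>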
> Cheers
> Dave
>
>
>
> --
>
> Alan D. Mead, Ph.D.
> President, Talent Algorithms Inc.
>
> science + technology = better workers
> http://www.alanmead.org
>
> The irony of this ... is that the Internet is
> both almost-infinitely expandable, while at the
> same time constrained within its own pre-defined
> box. And if that makes no sense to you, just
> reflect on the existence of Facebook. We have
> the vastness of the internet and yet billions
> of people decided to spend most of their time
> within a horribly designed, fake-news emporium
> of a website that sucks every possible piece of
> personal information out of you so it can sell it
> to others. And they see nothing wrong with that.
>
> -- Kieren McCarthy, commenting on why we are not
>                     all using IPv6
>

