pspp-users
Re: Excessive file system usage


From: Dave Trollope
Subject: Re: Excessive file system usage
Date: Wed, 4 Dec 2019 21:53:59 -0600

I can confirm the same behavior exists in non-Docker environments. I went back 
to my trusty dev VM running in VirtualBox and see the same behavior. Here is 
the config of the VM:

deploy@app1[local]:~$ uname -a
Linux app1 4.4.0-169-generic #198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux
deploy@app1[local]:~$ cat /etc/issue
Ubuntu 16.04.6 LTS \n \l

And monitoring the filesystem behavior of /dev/sda1:

Starting a conversion:

Every 1.0s: df -H /; ls -ltr /tmp Thu Dec 5 03:49:26 2019

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        26G  8.9G   18G  35% /
total 166384
drwxr-xr-x 3 www-data www-data      4096 Dec  5 03:41 resources
drwxr-xr-x 6 root     root          4096 Dec  5 03:41 vagrant-chef
-rw-rw-r-- 1 deploy   deploy   158002235 Dec  5 03:46 k42tn4xv.csv
-rw------- 1 deploy   deploy           0 Dec  5 03:48 90ctcu3h.csv
drwx------ 2 deploy   deploy        4096 Dec  5 03:48 pspp92Vei8
-rw-rw-r-- 1 deploy   deploy    12357632 Dec  5 03:48 90ctcu3h.csvtmphbUHkN

At end of conversion:

Every 1.0s: df -H /; ls -ltr /tmp Thu Dec 5 03:49:26 2019

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        26G   16G   11G  60% /
total 270324
drwxr-xr-x 3 www-data www-data      4096 Dec  5 03:41 resources
drwxr-xr-x 6 root     root          4096 Dec  5 03:41 vagrant-chef
-rw-rw-r-- 1 deploy   deploy   158002235 Dec  5 03:46 k42tn4xv.csv
-rw-rw-r-- 1 deploy   deploy   118785163 Dec  5 03:48 90ctcu3h.csv
drwx------ 2 deploy   deploy        4096 Dec  5 03:48 pspp92Vei8

You’ll see that used space on the file system grew by about 7 GB (8.9G to 16G), but only about 118 MB was written to the CSV.
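
For anyone who wants to reproduce the measurement without eyeballing the watch output, here is a minimal Python sketch that runs a command and samples used space on a filesystem while it runs. The function name peak_disk_usage and its parameters are made up for illustration, and a short-lived spike between samples can be missed, so treat the peak as a lower bound.

```python
import shutil
import subprocess
import time


def peak_disk_usage(cmd, path="/", interval=0.5):
    """Run cmd and sample used bytes on the filesystem that holds path.

    Returns a dict with the used-byte count before the run ("baseline"),
    the highest value seen ("peak"), the value after the run ("final"),
    and the command's exit code ("returncode").
    """
    baseline = shutil.disk_usage(path).used
    peak = baseline
    proc = subprocess.Popen(cmd)
    # Poll until the child exits, recording the highest usage seen.
    while proc.poll() is None:
        peak = max(peak, shutil.disk_usage(path).used)
        time.sleep(interval)
    final = shutil.disk_usage(path).used
    peak = max(peak, final)
    return {
        "baseline": baseline,
        "peak": peak,
        "final": final,
        "returncode": proc.returncode,
    }
```

A hypothetical call would be peak_disk_usage(["pspp", "convert.sps"], path="/tmp"), where convert.sps is a placeholder for the syntax file; comparing peak minus baseline against the size of the finished CSV gives the same picture as the two df snapshots above.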

I will try pspp-convert as Ben suggested and report back.
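
As a rough sketch of what the pspp-convert run might look like, here is the command built as an argument list in Python. The function name and the file names are made up. The options -O csv, --labels and --recode are my understanding of the pspp-convert counterparts to /TYPE=CSV, /CELLS=LABELS and /MISSING=RECODE, so check pspp-convert --help on your version before relying on them. /KEEP is not mapped here.

```python
def pspp_convert_command(sav_path, csv_path, labels=True, recode=True):
    """Build an argv list for pspp-convert (assumed options, see note above)."""
    cmd = ["pspp-convert", "-O", "csv"]
    if labels:
        # Write value labels instead of raw values (like /CELLS=LABELS).
        cmd.append("--labels")
    if recode:
        # Replace user-missing values by system-missing (like /MISSING=RECODE).
        cmd.append("--recode")
    cmd += [sav_path, csv_path]
    return cmd
```

The resulting list can be passed straight to subprocess.run, which avoids shell quoting problems with file names.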

Cheers
Dave

On Dec 4, 2019, 2:06 PM -0600, Dave Trollope <address@hidden>, wrote:
> Once the conversion is complete the space is returned, so it's not a long-term 
> problem - only during the conversion. This became an issue because in 
> Kubernetes you control your resources much more tightly, and that’s why this 
> was highlighted.
>
> I’m not sure there is anything special about the SAV files, so yes I would 
> expect it to be easily reproducible - but at this point I don’t know what I 
> don’t know that might be relevant ;-)
>
> I will try running the same thing on a regular ec2 vs docker as mentioned in 
> my earlier email and verify if this is truly unique to docker based 
> environments - but my gut tells me it is not, we just didn’t notice before 
> because we had lots of space on the machine.
>
> Cheers
> Dave
> On Dec 4, 2019, 11:15 AM -0600, Alan Mead <address@hidden>, wrote:
> > I'm curious to see what the devs say. I think they use Debian, but I don't 
> > know about docker.
> >
> > So is the excess disk space used and then returned when pspp is 
> > done, so that only 150MB is consumed? Or do many GB of storage seem to 
> > disappear (so the CSV file shows a size of 150MB but the docker 
> > container is 7GB bigger)?
> >
> > If I wanted to replicate the behavior, are there any special aspects to the 
> > datafiles? I'd create a SAV file with a few columns and enough rows of 
> > random data to make a 1GB SAV file. Right?
> > Then I'd run your script to create the CSV. Right? And if I did this on a 
> > stock Linux host without docker/ramfs/etc., I wouldn't see 7GB of space 
> > consumed during the conversion, but if I then arranged to do the same test 
> > using docker or ramfs, I would? Is that correct?
> >
> > If so, that seems to indicate something to do with docker/ramfs, right? Or, 
> > you're saying this would affect a physical linux host equally?
> >
> > -Alan
> >
> >
> > On 12/4/2019 9:24 AM, Dave Trollope wrote:
> > > Hi Alan,
> > >
> > > Sorry, yes I forgot to mention this is linux, Debian GNU/Linux 9
> > > Linux e1e6db1d8408 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 
> > > x86_64 GNU/Linux
> > >
> > > I’ve reproduced this behavior in Kubernetes and outside Kubernetes in a 
> > > raw Docker container, so it's not Kubernetes-specific, but it may be related to 
> > > the way the containerized image is built in Docker.
> > >
> > > We haven’t observed this on our standard ec2, but to be honest we haven’t 
> > > monitored in the same way - I can try that and see. We have enough space 
> > > there that it could have gone unnoticed. I will try.
> > >
> > > What I'm doing is watching the filesystem as the SAVE TRANSLATE command 
> > > is running, using watch -n 0.5 "df -H; ls -ltr /tmp"
> > >
> > > The only file being written is the csv but the filesystem used space is 
> > > dropping at a much higher rate than data being written. No other temp 
> > > files are being placed in /tmp
> > >
> > > I also reproduced this using a RAM-based fs - if you watch the usage it 
> > > behaves the same, so I don't think it's specific to dockerized filesystems, 
> > > but I might yet be wrong on that.
> > >
> > > The link you shared describes a common problem when starting out with containers, 
> > > where the build process creates lots of images. As you build lots of 
> > > images, you have to clean up. It's one of the first things you learn as you 
> > > step into the container world!
> > >
> > > Appreciate the quick reply. It certainly was a shocking observation when 
> > > I found it :-)
> > >
> > > Cheers
> > > Dave
> > >
> > >
> > > On Dec 4, 2019, 8:29 AM -0600, Alan Mead <address@hidden>, wrote:
> > > > Wow, that's a lot. Do you mean that 7GB of space are needed (for, I 
> > > > guess temporary files)? And you did not observe that previously?
> > > >
> > > > Maybe the devs are familiar with kubernetes; I only know the name. Can 
> > > > you describe the environment (e.g., OS)? And pspp version? In how many 
> > > > conversions have you observed this behavior?
> > > >
> > > > And you're sure this isn't a kubernetes problem (like it's making 
> > > > snapshots as it writes the file or something)? I ask because when I 
> > > > google about this, it looks like there are sharp edges; glancing 
> > > > through, these don't seem to directly and specifically address the 
> > > > behavior you're seeing, but it looks like there could be these kinds of 
> > > > issues with kubernetes and the PSPP devs wouldn't be able to help 
> > > > unless they knew kubernetes:
> > > >
> > > > https://cntnr.io/whats-eating-my-disk-docker-system-commands-explained-d778178f96f1
> > > > https://softwareengineeringdaily.com/2019/01/11/why-is-storage-on-kubernetes-is-so-hard/
> > > >
> > > > -Alan
> > > >
> > > >
> > > > On 12/4/2019 6:40 AM, Dave Trollope wrote:
> > > > > We just moved PSPP to Kubernetes containers, where we use it to 
> > > > > extract CSVs from SAV files. The SAV files are about 1 GB and each CSV 
> > > > > is about 150 MB.
> > > > >
> > > > > We’ve watched the file system as it does it, and over 7 GB of the file 
> > > > > system is used while writing 150 MB. I assume the SAVE command is 
> > > > > doing lots of seeks and insertions in the file, magnifying the file 
> > > > > system usage. Any options to limit this behavior?
> > > > >
> > > > > Here is the script we are using
> > > > > GET FILE = "{}"
> > > > >
> > > > > SAVE TRANSLATE
> > > > >  /OUTFILE="{}"
> > > > >  /TYPE=CSV
> > > > >  /FIELDNAMES
> > > > >  /REPLACE
> > > > >  /KEEP={}
> > > > >  /MISSING=RECODE
> > > > >  /CELLS=LABELS.
> > > > > Cheers
> > > > > Dave
> > > > >
> > > >
> > > > --
> > > >
> > > > Alan D. Mead, Ph.D.
> > > > President, Talent Algorithms Inc.
> > > >
> > > > science + technology = better workers
> > > >
> > > > http://www.alanmead.org
> > > >
> > > > The irony of this ... is that the Internet is
> > > > both almost-infinitely expandable, while at the
> > > > same time constrained within its own pre-defined
> > > > box. And if that makes no sense to you, just
> > > > reflect on the existence of Facebook. We have
> > > > the vastness of the internet and yet billions
> > > > of people decided to spend most of their time
> > > > within a horribly designed, fake-news emporium
> > > > of a website that sucks every possible piece of
> > > > personal information out of you so it can sell it
> > > > to others. And they see nothing wrong with that.
> > > >
> > > > -- Kieren McCarthy, commenting on why we are not
> > > >                    all using IPv6
> >

