pspp-users

Re: Excessive file system usage


From: Dave Trollope
Subject: Re: Excessive file system usage
Date: Wed, 4 Dec 2019 14:06:55 -0600

Once the conversion is complete the space is returned, so it’s not a long-term 
problem - only during the conversion. This became an issue because in 
Kubernetes you control your resources much more tightly, and that’s why this was 
highlighted.

I’m not sure there is anything special about the SAV files, so yes I would 
expect it to be easily reproducible - but at this point I don’t know what I 
don’t know that might be relevant ;-)

I will try running the same thing on a regular EC2 instance vs Docker, as 
mentioned in my earlier email, and verify whether this is truly unique to 
Docker-based environments - but my gut tells me it is not; we just didn’t notice 
before because we had lots of space on the machine.
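
A minimal sketch of how that comparison could be automated on both machines, assuming Python 3 is available there. The function names `snapshot` and `monitor` are hypothetical helpers, not part of PSPP; the script only polls the same two numbers that `df` and `ls` report (filesystem bytes used and the size of the CSV) so the growth of each can be compared directly.

```python
import os
import shutil
import time


def snapshot(mount_point, output_path):
    """Return (filesystem bytes used, output file size in bytes) right now."""
    used = shutil.disk_usage(mount_point).used
    # The CSV may not exist yet when monitoring starts; treat that as size 0.
    size = os.path.getsize(output_path) if os.path.exists(output_path) else 0
    return used, size


def monitor(mount_point, output_path, interval=0.5, samples=10):
    """Poll usage and return growth relative to the first sample.

    Each returned row is (filesystem growth in bytes, file growth in bytes).
    """
    base_used, base_size = snapshot(mount_point, output_path)
    rows = []
    for _ in range(samples):
        time.sleep(interval)
        used, size = snapshot(mount_point, output_path)
        rows.append((used - base_used, size - base_size))
    return rows
```

For example, calling `monitor("/tmp", "/tmp/out.csv", interval=0.5, samples=120)` while the conversion runs (both paths are placeholders) would give one row every half second. If the first number in each row grows much faster than the second, the extra space is going somewhere other than the CSV itself.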

Cheers
Dave
On Dec 4, 2019, 11:15 AM -0600, Alan Mead <address@hidden>, wrote:
> I'm curious to see what the devs say. I think they use Debian, but I don't 
> know about docker.
>
> So is the excessive disk space used and then returned when pspp is done, 
> so only 150MB are consumed? Or is it that many GB of storage seem to 
> disappear (so maybe the file shows a CSV file size of 150MB but the docker 
> container is 7GB bigger)?
>
> If I wanted to replicate the behavior, are there any special aspects to the 
> datafiles? I'd create a SAV file with a few columns and enough rows of random 
> data to make a 1GB SAV file. Right?
> Then I'd run your script to create the CSV. Right? And if I did this on a 
> stock Linux host without docker/ramfs/etc., I wouldn't see 7GB of space 
> consumed during the conversion, but if I then arranged to do the same test 
> using docker or ramfs, I would? Is that correct?
>
> If so, that seems to indicate something to do with docker/ramfs, right? Or 
> are you saying this would affect a physical Linux host equally?
>
> -Alan
>
>
> On 12/4/2019 9:24 AM, Dave Trollope wrote:
> > Hi Alan,
> >
> > Sorry, yes I forgot to mention this is linux, Debian GNU/Linux 9
> > Linux e1e6db1d8408 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 
> > x86_64 GNU/Linux
> >
> > I’ve reproduced this behavior in Kubernetes and outside Kubernetes in a raw 
> > Docker container, so it’s not Kubernetes-specific but may be related to the 
> > way the containerized image is built in Docker.
> >
> > We haven’t observed this on our standard ec2, but to be honest we haven’t 
> > monitored in the same way - I can try that and see. We have enough space 
> > there that it could have gone unnoticed. I will try.
> >
> > What I'm doing is watching the filesystem as the SAVE TRANSLATE command is 
> > running, using watch -n 0.5 "df -H; ls -ltr /tmp"
> >
> > The only file being written is the csv but the filesystem used space is 
> > dropping at a much higher rate than data being written. No other temp files 
> > are being placed in /tmp
> >
> > I also reproduced this using a RAM-based fs - if you watch the usage it 
> > behaves the same, so I don't think it's specific to dockerized filesystems, 
> > but I might yet be wrong on that.
> >
> > The link you shared describes a common problem when starting out with 
> > containers, where the build process creates lots of images. As you build 
> > lots of images, you have to clean up. It's one of the first things you 
> > learn as you step into the container world!
> >
> > Appreciate the quick reply. It certainly was a shocking observation when I 
> > found it :-)
> >
> > Cheers
> > Dave
> >
> >
> > On Dec 4, 2019, 8:29 AM -0600, Alan Mead <address@hidden>, wrote:
> > > Wow, that's a lot. Do you mean that 7GB of space are needed (for, I guess, 
> > > temporary files)? And you did not observe that previously?
> > >
> > > Maybe the devs are familiar with kubernetes; I only know the name. Can 
> > > you describe the environment (e.g., OS)? And pspp version? In how many 
> > > conversions have you observed this behavior?
> > >
> > > And you're sure this isn't a kubernetes problem (like it's making 
> > > snapshots as it writes the file or something)? I ask because when I 
> > > google about this, it looks like there are sharp edges; glancing through, 
> > > these don't seem to directly and specifically address the behavior you're 
> > > seeing, but it looks like there could be these kinds of issues with 
> > > kubernetes and the PSPP devs wouldn't be able to help unless they knew 
> > > kubernetes:
> > >
> > > https://cntnr.io/whats-eating-my-disk-docker-system-commands-explained-d778178f96f1
> > > https://softwareengineeringdaily.com/2019/01/11/why-is-storage-on-kubernetes-is-so-hard/
> > >
> > > -Alan
> > >
> > >
> > > On 12/4/2019 6:40 AM, Dave Trollope wrote:
> > > > We just moved PSPP to Kubernetes containers, where we use it to extract 
> > > > CSVs from SAV files. The SAV files are about 1 GB and each CSV is about 
> > > > 150 MB.
> > > >
> > > > We’ve watched the file system as it does it, and over 7 GB of the file 
> > > > system is used while writing 150 MB. I assume the SAVE command is doing 
> > > > lots of seeks and insertions in the file, magnifying the file system 
> > > > usage. Any options to limit this behavior?
> > > >
> > > > Here is the script we are using:
> > > > GET FILE = "{}".
> > > >
> > > > SAVE TRANSLATE
> > > >  /OUTFILE="{}"
> > > >  /TYPE=CSV
> > > >  /FIELDNAMES
> > > >  /REPLACE
> > > >  /KEEP={}
> > > >  /MISSING=RECODE
> > > >  /CELLS=LABELS.
> > > > Cheers
> > > > Dave
> > > >
> > >
> > > --
> > >
> > > Alan D. Mead, Ph.D.
> > > President, Talent Algorithms Inc.
> > >
> > > science + technology = better workers
> > >
> > > http://www.alanmead.org
> > >
> > > The irony of this ... is that the Internet is
> > > both almost-infinitely expandable, while at the
> > > same time constrained within its own pre-defined
> > > box. And if that makes no sense to you, just
> > > reflect on the existence of Facebook. We have
> > > the vastness of the internet and yet billions
> > > of people decided to spend most of them time
> > > within a horribly designed, fake-news emporium
> > > of a website that sucks every possible piece of
> > > personal information out of you so it can sell it
> > > to others. And they see nothing wrong with that.
> > >
> > > -- Kieren McCarthy, commenting on why we are not
> > >                    all using IPv6

