Re: good sample data sets for use in documentation

From: Jason Stover
Subject: Re: good sample data sets for use in documentation
Date: Tue, 28 Oct 2008 09:59:03 -0400
User-agent: Mutt/1.5.18 (2008-05-17)

On Mon, Oct 27, 2008 at 09:55:26AM -0700, Ben Pfaff wrote:
> Jason Stover <address@hidden> writes:
> > On Sun, Oct 26, 2008 at 10:49:52AM -0700, Ben Pfaff wrote:
> >> I would like to start including examples in the PSPP
> >> documentation that work with realistic, interesting data sets
> >> that we also include with PSPP.  To do this, I need some freely
> >> distributable (ideally, public domain) data sets.  I have found
> >> some of these on the web, but none seems really perfect, and I
> >> wonder whether any of you have data sets to suggest?
> >
> > Do you mean data sets posted by organizations that collected data as
> > part of a designed experiment or observational study, or just anything
> > we cobbled together?
> >
> > I have some of the latter.
> It's probably good to have a mix of both.  Yesterday, I was
> looking around for the former.  Based on my web searches, other
> things that are nice, but not entirely necessary, are:
>         - Not too specific to any particular country or region,
>           so that they will be more likely to be interesting to
>           users throughout the world.
>         - Formatted to be easily imported.  Notably, Excel
>           spreadsheets are not particularly easy at the moment,
>           and there are lots of websites with HTML tables that
>           don't provide any other format.
>         - I find it at least mildly interesting, and I understand
>           what it's about.  (Obviously this is highly
>           subjective.)

I have several text files with data sets of different types. I
gathered them from online sources and did some reshuffling to make
them presentable. Here is a list of the data sets I know I have:

- Text data scraped from 158 novels I downloaded from
  Each row represents one sentence. Most columns represent the
  frequency of a word used in that sentence. One column holds the
  author's name. Another holds the title. This is a large data file,
  with about 1.3 million cases and around 10 variables.

- Data on crashes that occurred on US Highways in 2004, taken from the
  National Highway Traffic Safety Administration. Each row represents
  one vehicle that collided with something. Variables include the
  estimated speed at which the vehicle was traveling when it collided,
  severity of injury of the occupants, and the cause of the
  collision. There are around 25000 cases (I think).

- Climate data. I took these data from an online database at some
  university (I can find the source if you want to include it). They
  include about 600000 cases, with the following variables: country,
  weather station ID, year, month, (I think) day, and average
  temperature. The data for some of the older stations go as far back
  as about 1800. Most of the others have records as far back as the

- Stock market data. I have some data involving price changes in the
  New York Stock Exchange. I have a lot of this, for different

All files are tab- or comma-delimited text.
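
For anyone who wants to peek at files in this shape before importing
them, here is a minimal Python sketch of reading tab- or comma-delimited
text. It assumes each file starts with a header row, which may not hold
for every file; the function name and the sample column names are
hypothetical and only loosely modeled on the climate data described
above.

```python
import csv
import io


def read_delimited(text):
    """Parse tab- or comma-delimited text into a list of dicts.

    Assumes the first line is a header row naming the variables.
    """
    first_line = text.splitlines()[0]
    # Guess the delimiter from the header: use tab only if it occurs
    # more often than comma, otherwise fall back to comma.
    delimiter = "\t" if first_line.count("\t") > first_line.count(",") else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


# Hypothetical sample rows; the real files may use other column names.
sample = "country\tstation\tyear\tmonth\tavg_temp\nUS\t1001\t1995\t7\t24.5\n"
rows = read_delimited(sample)
print(rows[0]["avg_temp"])
```

All values come back as strings, so numeric variables such as the
temperature would still need converting before any analysis.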

I'll rummage around my data directories and look for some other files.

