
[Savannah-hackers] Re: copyright/license utility


From: Aubrey Jaffer
Subject: [Savannah-hackers] Re: copyright/license utility
Date: Sun, 2 Feb 2003 22:56:58 -0500 (EST)

 | From: Mathieu Roy <address@hidden>
 | Date: 17 Nov 2002 01:50:01 +0100
 | 
 | Ok, it seems interesting and could help in project approval task.
 | 
 | Is it packaged or something? Is it perl, shell?

SCM.  I haven't had the time to do the rewrite, so I'm releasing the
prototype.  I expect there is a dynamic programming algorithm for
matching against multiple fixed strings.  Anyone interested in working
on this?

http://swissnet.ai.mit.edu/~jaffer/Docupage/copyrights

------------------------------------------------------------------------

The *copyrights* program searches through the given files looking for
"copyright", "Copr.", "&#169;", or "©" (followed by years and name of
holder in either order). It also recognizes the phrase "public domain".
For each file it reports each distinct copyright once.
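
As a rough illustration of the scan described above (not the SCM
prototype itself), here is a minimal Python sketch. The marker list
comes from the description; the year pattern, the function name
find_copyrights, and the way duplicates are normalized are assumptions
made for this example.

```python
import re

# Assumed pattern: a copyright marker followed on the same line by a
# four-digit year, or the phrase "public domain". The real program also
# checks for a holder name, which this sketch does not attempt.
NOTICE = re.compile(
    r"(?:copyright|copr\.|&#169;|©)[^\n]*?\b(?:19|20)\d{2}\b|public domain",
    re.IGNORECASE,
)

def find_copyrights(lines, max_reports=10, max_lines=1000):
    """Return (line_number, text) pairs, one per distinct notice."""
    seen = set()
    found = []
    for number, line in enumerate(lines[:max_lines], start=1):
        if not NOTICE.search(line):
            continue
        # Normalize whitespace and case so repeats are reported once.
        key = " ".join(line.split()).lower()
        if key in seen:
            continue
        seen.add(key)
        found.append((number, line.strip()))
        if len(found) >= max_reports:
            break
    return found
```

The max_reports and max_lines defaults mirror the NUM and CNT defaults
given under Usage.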

------------------------------------------------------------------------

      Quick Start

    * Obtain SCM <../SCM> Scheme Implementation.
    * Obtain SLIB <../SLIB> Scheme Library.
    * Obtain copyrights.scm (6 kB) and install it in a PATH directory
      as "copyrights".

      Usage

Usage: copyrights [-m NUM] [-n CNT] FILE1 FILE2 ...

  Displays at most NUM (default 10) of each FILE's copyrights occurring
  within the first CNT (default 1000) lines or groups of 100 chars.
  Returns 0 if a copyright is found; otherwise returns 1.

Usage: copyrights [-m NUM] [-n CNT] -
  As above, but reads filenames from standard input.


      Examples

$ copyrights [oO]vertones.*
Overtones.html:59:Copyright 2003 Aubrey Jaffer</ADDRESS>
overtones.png:30:Copyright 2003 Aubrey Jaffer/z
overtones.ps:184:Copyright 2003 Aubrey Jaffer) show

$ copyrights AnaLugojana.*
AnaLugojana.abc:29:Copyright 1999 Voluntocracy.
AnaLugojana.mid:1:Copyright 1999 Voluntocracy.
AnaLugojana.pdf:

$ copyrights sharpbang.c
sharpbang.c:2:   This program is in the public domain.


      Further Development

This first release of *copyrights* is a simple proof of concept. The
list of desired improvements is so extensive that the next program
need not reuse anything in the current version.

    * Each copyright instance should be scored, and the user should
      be able to specify thresholds for detection.

    * Misspellings of key words should be recognized and ranked with a
      lower score.

    * Two-digit years (98) should be recognized and scored depending
      on the presence of 4-digit years in the same instance.

    * Copyright years are usually given in numerical order; if the
      years in an instance are out of order, its score should be
      lowered.

    * Copyright years are usually given either before the holder name
      or after the holder name, but not both places. Instances should
      be scored accordingly.

    * The user should be able to specify year ranges to be penalized
      in scoring.

    * The user should be able to specify that high quality copyright
      instances of particular holders not be reported.

    * For copyright instances with specified holders, and on the basis
      of dates in CVS/Entries, the program should be able to report
      which files lack a copyright year for the date they were last
      modified.

    * Loic Dachary suggests a grep-like -v option to report those
      files lacking a copyright. I suggest that the user also be able
      to specify a file size minimum for signaling this.

    * There is a large collection of software licenses at
      http://www.gnu.org/philosophy/license-list.html. The
      *copyrights* program should be able to scan for the presence of
      a license immediately following the copyright instance and
      identify the close (and closest) matches among the GNU license
      collection.
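
To make the scoring ideas in the list above concrete, here is a small
Python sketch covering three of them: two-digit years, year ordering,
and years on both sides of the holder name. The function name
score_instance and every weight in it are illustrative guesses, not
values from the prototype.

```python
import re

YEAR4 = re.compile(r"\b(?:19|20)\d{2}\b")  # four-digit years
YEAR2 = re.compile(r"\b\d{2}\b")           # standalone two-digit years

def score_instance(text):
    """Score a candidate notice between 0 and 1 (weights are guesses)."""
    matches = list(YEAR4.finditer(text))
    years = [int(m.group()) for m in matches]
    short_years = YEAR2.findall(text)
    if not years and not short_years:
        return 0.3                      # no year at all
    score = 1.0
    if short_years:
        # A two-digit year is more believable next to a four-digit one.
        score *= 0.9 if years else 0.5
    if years != sorted(years):
        score *= 0.7                    # years out of numerical order
    if len(matches) > 1:
        between = text[matches[0].end():matches[-1].start()]
        if re.search(r"[A-Za-z]{2,}", between):
            score *= 0.8                # years on both sides of a name
    return score
```

A detection threshold would then be a simple comparison, for example
reporting only instances where score_instance(text) >= 0.6.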

Nearly every aspect of the future program incorporating these
improvements is approximate. The steady progress of genomics matching
algorithms lets us find close matches in linear time and space (some
titles tease about sub-linear times).
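
For a sense of what approximate matching of a key word looks like,
here is a Python sketch of the classical dynamic program for finding
the substring of a text closest to a pattern. It runs in time
proportional to pattern length times text length, so it is not one of
the linear-time methods mentioned above; the function name
best_approximate_match is hypothetical.

```python
def best_approximate_match(pattern, text):
    """Return (distance, end_index) for the closest substring of text."""
    m = len(pattern)
    # prev[i] is the cost of matching pattern[:i] against a substring
    # ending at the previous text position.
    prev = list(range(m + 1))
    best = (prev[m], 0)
    for j, ch in enumerate(text, start=1):
        curr = [0]  # a match may start anywhere, so the empty prefix is free
        for i in range(1, m + 1):
            cost = pattern[i - 1] != ch
            curr.append(min(prev[i] + 1,          # extra character in text
                            curr[i - 1] + 1,      # pattern character missing
                            prev[i - 1] + cost))  # match or substitution
        if curr[m] < best[0]:
            best = (curr[m], j)
        prev = curr
    return best
```

A misspelling such as "copyrigth" then shows up as a small nonzero
distance, which could feed a lower score rather than a rejection.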

The SLIB function diff:edit-length returns the edit distance between
tokenized word sequences and should be usable for license matching.
A dynamic-programming search for copyright instances can likely do the
rest (see Curriculum Vitae: Gene Myers
<http://www.cs.arizona.edu/people/gene/vita.html> for approximate
pattern matching algorithms).
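
The license-matching idea can be sketched in Python as an edit
distance over word tokens. This uses a standard Levenshtein-style
recurrence, which may count edits differently from SLIB's
diff:edit-length; the names tokenize, edit_distance, and
closest_license are hypothetical, as is the dictionary of license
texts the caller supplies.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def edit_distance(a, b):
    """Insert/delete/substitute distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,             # delete x
                            curr[j - 1] + 1,         # insert y
                            prev[j - 1] + (x != y))) # keep or substitute
        prev = curr
    return prev[-1]

def closest_license(candidate, licenses):
    """Return (name, distance) for the nearest text in licenses.

    licenses maps a license name to its full text (caller-supplied).
    """
    tokens = tokenize(candidate)
    scored = ((name, edit_distance(tokens, tokenize(text)))
              for name, text in licenses.items())
    return min(scored, key=lambda pair: pair[1])
```

Comparing words rather than characters keeps the distance insensitive
to reflowed lines and changed punctuation.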

Histocomputability is computation modeled on biological immune
function. In this paradigm *copyright* is the Major Histocompatibility
Complex presented by nearly every cell (data file) type in an
individual organism (software package). The *copyrights* program
assumes the function of CD4 and CD8 T-cells in recognizing self and
non-self copyrights.

Copyright 2002, 2003 Aubrey Jaffer



