[Savannah-hackers] Re: copyright/license utility
From: Aubrey Jaffer
Subject: [Savannah-hackers] Re: copyright/license utility
Date: Sun, 2 Feb 2003 22:56:58 -0500 (EST)
| From: Mathieu Roy <address@hidden>
| Date: 17 Nov 2002 01:50:01 +0100
|
| Ok, it seems interesting and could help in project approval task.
|
| Is it packaged or something? Is it perl, shell?
SCM. I haven't had the time to do the rewrite, so I'm releasing the
prototype. I expect there is a dynamic programming algorithm for
matching against multiple fixed strings. Anyone interested in working
on this?
http://swissnet.ai.mit.edu/~jaffer/Docupage/copyrights
------------------------------------------------------------------------
The *copyrights* program searches through the given files looking for
"copyright", "Copr.", "©", or "©" (followed by years and name of
holder in either order). It also recognizes the phrase "public domain".
For each file it reports each distinct copyright once.
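
The search described above can be sketched roughly as follows. This is a
minimal Python illustration, not the SCM prototype: the regular
expressions, the function name find_copyrights, and the requirement
that a four-digit year follow the keyword are all assumptions made for
the example.

```python
import re

# Assumed patterns; the real prototype may recognize more (or fewer) forms.
KEYWORD = re.compile(r"(copyright|copr\.|©)\s*(.*)", re.IGNORECASE)
PUBLIC_DOMAIN = re.compile(r"public domain", re.IGNORECASE)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def find_copyrights(lines, max_reports=10, max_lines=1000):
    """Return (line_number, text) pairs, one per distinct notice."""
    seen = set()
    found = []
    for number, line in enumerate(lines[:max_lines], start=1):
        text = None
        match = KEYWORD.search(line)
        if match and YEAR.search(match.group(2)):
            # Keyword followed somewhere on the line by a year.
            text = match.group(0).strip()
        elif PUBLIC_DOMAIN.search(line):
            text = line.strip()
        if text is not None and text not in seen:
            seen.add(text)  # report each distinct notice only once
            found.append((number, text))
            if len(found) >= max_reports:
                break
    return found
```

The max_reports and max_lines parameters mirror the NUM and CNT
defaults given under Usage; everything else is a guess at behavior.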
------------------------------------------------------------------------
Quick Start
* Obtain SCM <../SCM> Scheme Implementation.
* Obtain SLIB <../SLIB> Scheme Library.
* Obtain copyrights.scm (6 kB) and install it in a directory on your
PATH as "copyrights".
Usage
Usage: copyrights [-m NUM] [-n CNT] FILE1 FILE2 ...
Displays at most NUM (default 10) of each FILE's copyrights occurring
within the first CNT (default 1000) lines or groups of 100 chars.
Returns 0 if a copyright is found; otherwise returns 1.
Usage: copyrights [-m NUM] [-n CNT] -
As above, but reads filenames from standard input.
Examples
$ copyrights [oO]vertones.*
Overtones.html:59:Copyright 2003 Aubrey Jaffer</ADDRESS>
overtones.png:30:Copyright 2003 Aubrey Jaffer/z
overtones.ps:184:Copyright 2003 Aubrey Jaffer) show
$ copyrights AnaLugojana.*
AnaLugojana.abc:29:Copyright 1999 Voluntocracy.
AnaLugojana.mid:1:Copyright 1999 Voluntocracy.
AnaLugojana.pdf:
$ copyrights sharpbang.c
sharpbang.c:2: This program is in the public domain.
Further Development
This first release of *copyrights* is a simple proof of concept. The
list of desired improvements is so extensive that the next program
need not reuse anything in the current version.
* Each copyright instance should be scored, and the user should be
able to specify thresholds for detection.
* Misspellings of key words should be recognized and ranked with a
lower score.
* Two-digit years (98) should be recognized and scored depending
on the presence of 4-digit years in the same instance.
* Copyright years are usually given in numerical order; if an
instance is scrambled, its score should be lowered.
* Copyright years are usually given either before the holder name
or after the holder name, but not both places. Instances should
be scored accordingly.
* The user should be able to specify year ranges to be penalized
in scoring.
* The user should be able to specify that high quality copyright
instances of particular holders not be reported.
* For copyright instances with specified holders, and on the basis
of dates in CVS/Entries, the program should be able to report
which files lack a copyright year for the date they were last
modified.
* Loic Dachary suggests a grep-like -v option to report those
files lacking a copyright. I suggest that the user also be able
to specify a file size minimum for signaling this.
* There is a large collection of software licenses at
http://www.gnu.org/philosophy/license-list.html. The
*copyrights* program should be able to scan for the presence of
a license immediately following the copyright instance and
identify the close (and closest) matches among the GNU license
collection.
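
A few of the scoring ideas in the list above might look like this in
practice. This is only a sketch in Python: the function name
score_instance, the 0 to 100 scale, and the penalty weights are
arbitrary assumptions, and only the two-digit-year and year-ordering
rules are covered.

```python
import re

# Standalone two-digit or four-digit numbers are treated as candidate years.
YEAR_TOKEN = re.compile(r"\b(\d{4}|\d{2})\b")


def score_instance(text):
    """Return a heuristic score from 0 to 100 for one candidate notice."""
    tokens = YEAR_TOKEN.findall(text)
    if not tokens:
        return 0  # no year at all: not a scored copyright instance here
    four_digit = [int(t) for t in tokens if len(t) == 4]
    two_digit = [t for t in tokens if len(t) == 2]
    score = 100
    if two_digit and not four_digit:
        score -= 30  # two-digit years with no four-digit year alongside
    if four_digit != sorted(four_digit):
        score -= 20  # years out of numerical order
    return score
```

A user-specified detection threshold would then be a simple comparison,
for example reporting only instances with score_instance(text) >= 80.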
Nearly every aspect of the future program incorporating these
improvements is approximate. The steady progress of genomics matching
algorithms lets us find close matches in linear time and space (some
titles tease about sub-linear times).
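
As an illustration of approximate matching for misspelled key words,
here is a small Python sketch of the classic dynamic-programming
approach to finding a pattern inside a text with errors. It runs in
time proportional to pattern length times text length, so it is not one
of the linear-time algorithms mentioned above; the function name
best_approximate_match is made up for the example.

```python
def best_approximate_match(pattern, text):
    """Return the smallest edit distance between pattern and any substring of text."""
    # previous[i] is the cheapest way to align pattern[:i] with some
    # substring of the text ending at the previous text position.
    previous = list(range(len(pattern) + 1))
    best = previous[-1]
    for ch in text:
        current = [0]  # a match may start at any text position for free
        for i, p in enumerate(pattern, start=1):
            current.append(min(
                previous[i] + 1,              # extra character in the text
                current[i - 1] + 1,           # pattern character missing from the text
                previous[i - 1] + (p != ch),  # exact match or substitution
            ))
        best = min(best, current[-1])
        previous = current
    return best
```

A line could then be treated as containing a misspelled key word when
best_approximate_match("copyright", line.lower()) is small but nonzero,
with the distance feeding into the instance's score.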
The SLIB function diff:edit-length returns the edit distance between
tokenized word sequences, and should be usable for license matching.
Applying dynamic programming to the copyright instance search can
likely do the rest (or see Curriculum Vitae: Gene Myers
<http://www.cs.arizona.edu/people/gene/vita.html> for approximate
pattern matching algorithms).
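
To make the license-matching idea concrete, here is a Python sketch of
an edit distance over word tokens and a nearest-license lookup. It is a
plain Levenshtein distance and may not compute the same quantity as
SLIB's diff:edit-length; the names word_edit_distance and
closest_license are hypothetical, and the license collection is assumed
to be a dictionary mapping names to texts.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two texts, counted in whole words."""
    xs, ys = a.lower().split(), b.lower().split()
    row = list(range(len(ys) + 1))  # distance from an empty prefix of xs
    for i, x in enumerate(xs, start=1):
        new_row = [i]
        for j, y in enumerate(ys, start=1):
            new_row.append(min(
                row[j] + 1,             # drop word x
                new_row[j - 1] + 1,     # insert word y
                row[j - 1] + (x != y),  # keep or replace the word
            ))
        row = new_row
    return row[-1]


def closest_license(text, licenses):
    """Return (name, distance) for the license text nearest to text."""
    return min(
        ((name, word_edit_distance(text, body)) for name, body in licenses.items()),
        key=lambda pair: pair[1],
    )
```

Reporting every license whose distance falls under a threshold, rather
than only the minimum, would give the "close (and closest)" matches
asked for in the list of improvements.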
Histocomputability is computation modeled on biological immune
function. In this paradigm *copyright* is the Major Histocompatibility
Complex presented by nearly every cell (data file) type in an
individual organism (software package). The *copyrights* program
assumes the function of CD4 and CD8 T-cells in recognizing self and
non-self copyrights.
Copyright 2002, 2003 Aubrey Jaffer