# Created 2015-02-02 Mon 09:49
#+TITLE: A Proposal for Org citation syntax
#+AUTHOR: Richard Lawrence

* Introduction

In brief, the proposal is:

1. Use the Pandoc syntax for basic, inline citations.
2. Extend the Pandoc syntax modestly to accommodate backend-agnostic
   formatting of inline citations.
3. Also allow non-inline citation definitions, with a syntax
   comparable to non-inline footnotes, to accommodate
   backend-specific formatting.

Basing this proposal on the Pandoc syntax is a `merely practical'
choice. It might not be the most Org-like, and it might produce too
much conceptual divergence between citations and links. But it is a
syntax that is already well-tested and known to work elsewhere, and
which has easily-scriptable tools for processing it (namely, Pandoc's
own), which Org users could rely on in the meantime, while Org's own
implementation of this syntax catches up.

Beyond the features provided by the basic Pandoc syntax, I have tried
to ensure that the other features are as Org-like as possible, are
already in use in Org documents, and (I hope) could be implemented
with minimal work.

* Citation syntax

(I repeat the list of requirements I posted earlier here, for easy
reference; so far, I don't think anyone has suggested we need any
others.)

A citation is a textual reference to one or more individual works,
together with other information about those works, grouped together
in a single place.

Within a citation, each reference to an individual work needs to be
capable of containing:

1. a database key that references the cited work
2. prefix / pre-text
3. suffix / post-text
4. references to page/chapter/section/whatever numbers and ranges.
   This is likely part of the prefix or suffix, but might be worth
   parsing separately for localization or link-following behavior.
5. a way of indicating backend-agnostic formatting properties.
   Examples of some properties users might want to specify are:
   - displaying only some fields (or suppressing some fields) from a
     reference record (e.g., journal, date, author)
   - indicating that the referenced works should *only* appear in the
     bibliography of the exported document (equivalent of LaTeX
     \nocite)

Citations as a whole also need:

6. [@6] a way of indicating formatting properties for specific export
   backends. Examples of some properties that users might want to
   specify are:
   - a citation command to use for each individual reference (LaTeX;
     others?)
   - a multi-cite command to apply to all references together (LaTeX)
   - CSS or other styling class (HTML and derived backends; also ODT?)
   - properties describing how to treat emphasis and other formatting
     that cannot appear in plain text (ASCII and other plain text
     backends)

** Starting point

I assume, to start, the basic Pandoc [ ... @key1 ...; ... @key2 ...]
syntax for inline citations, documented
[[http://pandoc.org/README.html#citations][here]]. This defines a
syntax for inline citations that allows grouping multiple individual
references together between brackets, with semicolons as separators.

Previous discussions have suggested beginning citation definitions
with a tag, like [cite: ...] or [citation: ...], by analogy with
footnotes and links. As far as I can see, the tag doesn't really
provide any advantages for inline citations, and is just unnecessary
markup. This is because the syntax of citations is (or should be)
more constrained than footnotes or links; a citation is already
recognizable, and parseable as such, by the required presence of a
reference key.

The tag would also immediately break compatibility with the basic
Pandoc syntax if it were required for inline citation definitions, a
result which I am trying to avoid in this proposal.

A syntax for /non-inline/ citation definitions, however, comparable
to the syntax for footnotes, would make good use of such a tag.
This is what I propose below.

** Backend-agnostic formatting properties

*** Selecting specific fields

Selecting specific fields to display could be done by appending field
names to cite keys after colons, much like Org tags:

#+BEGIN_QUOTE
  [See @Doe99, pp. 34--45; also @Doe00:year, section 6]

  [See their article in @Doe99:journal:year.]
#+END_QUOTE

Note that this would make for an extension of Pandoc syntax. This
extension is not a strict superset, since Pandoc allows internal `:'
characters in cite keys, and thus would treat `@Doe99:journal:year'
as a single key, rather than treating the key as ending at the first
colon, with other data afterward. (More compatible but uglier
alternatives for the field selector include `!', `{', `}', and `^'.
If an alternative is desired, I suggest `@Doe99{journal,year}'.)

When specific fields are requested, ONLY data from those fields
should appear in the exported document. Backends would choose how to
export these citations based on the selected fields.

I would think the default behavior during export should be something
like: get the reference record from the database, then pass it and a
list of the requested fields to a user-customizable function which is
expected to return a string to insert in the output. (The default
function could, say, intersperse the requested field data with
whitespace and add parentheses. More sophisticated functions could
rely on external tools to format the citation using the Citation
Style Language.) Of course, this assumes that the exporter has a way
of querying the reference database, which would be fine for bibtex
and org-bibtex databases, but may not be a good assumption in
general.

Specific backends could also do something different with field
selectors when it makes sense to do so. For example, the LaTeX
backend could choose \citeyear as the command to place in the
exported document when just `:year' is requested in the citation.
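The field-selector syntax and the default export behavior described above can be sketched in Python. This is only an illustration under stated assumptions, not a proposed implementation: the names =parse_citation= and =format_fields= are hypothetical, cite keys are assumed to consist of letters, digits and underscores, prefixes and suffixes are assumed to contain no semicolons, and a plain dictionary stands in for a reference record.

```python
import re

# Assumption: a key is '@' plus letters, digits or underscores, and it
# ends at the first colon; each ':name' segment after it is a field selector.
CITE_REF = re.compile(r"@(?P<key>[A-Za-z0-9_]+)(?P<fields>(?::[A-Za-z]+)*)")


def parse_citation(text):
    """Split '[prefix @key:field suffix; ...]' into individual references."""
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    refs = []
    for part in body.split(";"):
        match = CITE_REF.search(part)
        if match is None:
            continue  # no reference key, so this part is not a reference
        fields = [f for f in match.group("fields").split(":") if f]
        refs.append({
            "key": match.group("key"),
            "fields": fields,
            "prefix": part[:match.start()].strip(),
            "suffix": part[match.end():].strip(" ,"),
        })
    return refs


def format_fields(record, fields):
    """Hypothetical default formatter: join the requested field data
    with whitespace and wrap the result in parentheses."""
    values = [str(record[f]) for f in fields if f in record]
    return "(" + " ".join(values) + ")"
```

With this sketch, parsing the first example citation yields two references with keys =Doe99= and =Doe00=, the second carrying the field list =["year"]=. A backend could inspect that list before deciding, for instance, to emit \citeyear instead of calling the formatter.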
*** Non-cited works that should appear in the bibliography

A special field selector `:nocite' would be one way to achieve
citations that, for whatever reason, should appear in the Org source
and in the exported bibliography, but should not appear in the
exported text where they are placed. This would allow referencing
them at relevant places in the document, like:

#+BEGIN_QUOTE
  Smith said a lot of things, but no one can remember what they were.
  address@hidden:nocite]
#+END_QUOTE

One drawback of this syntax is that it does not provide an easy way
to list all the nocite references, since the user would have to add
`:nocite' to each one individually. This is not a huge problem for
small numbers of references, but it would also be nice to have some
equivalent of LaTeX's \nocite{*}. On this point, see the proposal for
non-inline citation definitions below.

** Non-inline citation definitions and backend-specific formatting

The syntax proposed above assumes citations are defined inline. A
complementary alternative would be to treat citations like
(non-inline) footnotes, with an inline reference and a definition
elsewhere in the document. This could be convenient for citations
that have lots of pre- or post-text. In that case, a citation could
look like:

#+BEGIN_QUOTE
  Doe provides an interesting analysis. [cite:1]

  ...

  * Citations
  [cite:1] See @Doe99, pp. 34--45; also @Doe2000:year, ch. 1.
#+END_QUOTE

That is, a citation /pointer/ would occur inline in the document
text, which refers (via a number or a label) to a citation
/definition/ in a specially-named subtree. The definition begins by
repeating the pointer, and has the same syntax as proposed above,
minus the enclosing square brackets.

This approach could peacefully coexist with the above proposal for
inline citations, in the same way that inline and non-inline footnote
definitions now peacefully coexist.
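The relationship between citation pointers and definitions can be illustrated with a short Python sketch. It rests on simplifying assumptions that are mine, not part of the proposal: the function names are hypothetical, a definition is taken to be any line that begins with a =[cite:LABEL]= pointer (the sketch does not check that the line sits in the specially-named subtree), and each definition fits on one line.

```python
import re

# Assumption: a pointer is '[cite:LABEL]' where LABEL has no spaces or ']'.
POINTER = re.compile(r"\[cite:(?P<label>[^\]\s]+)\]")


def collect_definitions(lines):
    """Map each label to its definition text (the pointer itself removed)."""
    definitions = {}
    for line in lines:
        stripped = line.strip()
        match = POINTER.match(stripped)  # only matches at the line start
        if match:
            definitions[match.group("label")] = stripped[match.end():].strip()
    return definitions


def inline_citations(text, definitions):
    """Rewrite each inline pointer as the equivalent bracketed inline
    citation; pointers without a definition are left untouched."""
    def expand(match):
        label = match.group("label")
        if label in definitions:
            return "[" + definitions[label] + "]"
        return match.group(0)
    return POINTER.sub(expand, text)
```

Expanding pointers this way shows why the two forms can coexist: a non-inline definition reduces to the inline syntax once the enclosing brackets are restored, so a single parser for the inline form could serve both.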
*** Backend-specific formatting

In general, it would be nice to avoid formatting properties which are
specific to a particular export backend when a backend-agnostic
solution is available, but some backend-specific formatting needs are
probably inevitable, so we need a syntax for specifying them.

Another advantage of the non-inline citation syntax is that it would
allow using the existing #+ATTR_BACKEND syntax to specify
backend-specific formatting properties, since the citation
definitions would be block-level elements:

#+BEGIN_QUOTE
  * Citations
  #+ATTR_LATEX: :command citet
  #+ATTR_HTML: :class my-citation
  [cite:1] See @Doe99, pp. 34--45; @Foobar2000, ch.1.
#+END_QUOTE

This automatically makes the syntax readily extensible as new needs
come up and target formats evolve.

(Originally, I had thought about how to extend the inline citation
definition syntax above to include backend-specific formatting
information. But everything I came up with seemed pretty ugly, and
not worth the extra syntax it would require. When I realized that
non-inline definitions could leverage the existing syntax for
backend-specific properties, I tossed that part of the proposal,
though I'm happy to share it if anyone wants to see it.)

Thus, I propose that, for authors who /need/ backend-specific
formatting, this should be the way to do it. The above inline
citation syntax should remain limited to uses where no
backend-specific behavior is required.

Note however that there is a tension here with the proposal above for
backend-agnostic field selectors. I am not sure what should happen
if, say, the user selects individual fields in the citation but also
requests an incompatible citation command for a particular backend.

*** Bibliography-only entries

Non-inline definitions would also provide a convenient place to list
non-cited references that should appear in the bibliography. For
example:

#+BEGIN_QUOTE
  * Citations
  ...
  [nocite:] @Doe99; @Foobar2000; @Baz98.
#+END_QUOTE

As a special case,

#+BEGIN_QUOTE
  * Citations
  [nocite:*]
#+END_QUOTE

could introduce bibliography entries for everything in the reference
database.

* Document metadata

In addition to the syntax of citations themselves, the Org document
would also need to represent the following metadata to support
citations:

7. [@7] a pointer to one or more backend reference databases,
   including in-document databases in org-bibtex format
8. a reference to a citation style or style file
9. a reference to a locale file
10. an indication of where the bibliography should be found in the
    exported document (equivalent to \printbibliography, etc. in
    LaTeX)

** #+BIBLIOGRAPHY: reference database, style, locale

The #+BIBLIOGRAPHY keyword already exists, in ox-bibtex.el (in
contrib), though its current syntax does not quite meet all the above
needs. I suggest changing the syntax to support in-file databases and
a locale file.

The point of specifying the style and locale as part of the
#+BIBLIOGRAPHY definition is for compatibility with both LaTeX and
Citation Style Language bibliography and citation formatting.

In keeping with other metadata keyword lines (like #+OPTIONS), I
suggest a key:value syntax for the arguments to #+BIBLIOGRAPHY, like
so:

#+BEGIN_QUOTE
  #+BIBLIOGRAPHY: db:/path/to/some/file.bib style:chicago
  #+BIBLIOGRAPHY: db:/path/to/some/file.bib style:plain locale:en_GB
  #+BIBLIOGRAPHY: db:"*Reference DB"
#+END_QUOTE

In the last example, the leading "*" is meant to indicate that the
reference database is a subtree with headline "Reference DB", whose
branches are in org-bibtex format.

By specifying where the reference data is (and implicitly what format
it is in, e.g., via the file extension), link-following and export
behavior for citations can differ depending on the format of this
database. For example, if the database is a .bib file, `following' a
citation key could mean finding the corresponding entry in this file.
If the database is an in-document tree in org-bibtex format,
following a key could mean jumping to the headline whose :CUSTOM_ID:
property agrees with that key.

Likewise, if the database is in a format that the exporter knows how
to read, then export backends could potentially look up information
from it to create bibliography entries and citations in the exported
document, possibly relying on an external tool (like citeproc-*) to
transform them into the requested style. This would be particularly
useful for non-LaTeX backends (which is what ox-bibtex.el focuses on
at the moment).

** Bibliography placement

The other issue is that Org documents must say where the bibliography
should appear in exported documents. A reasonable default would be
placing the bibliography at the end of the document. But some
documents, in particular long ones, may need more flexibility in
specifying where to place the bibliography.

The simplest solution seems to be just allowing the #+BIBLIOGRAPHY
keyword to appear anywhere in the document, to be replaced on export
with the formatted bibliography. I think this is what ox-bibtex now
does.
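
To make the key:value syntax suggested for #+BIBLIOGRAPHY more concrete, here is a rough Python sketch of how such a line might be read, and of how the database format might be inferred from the =db= value. The function names and the returned labels (="org-subtree"=, ="bibtex-file"=) are hypothetical, and I am assuming shell-style quoting for values that contain spaces; none of this describes what ox-bibtex.el actually does.

```python
import shlex

PREFIX = "#+BIBLIOGRAPHY:"


def parse_bibliography(line):
    """Parse one '#+BIBLIOGRAPHY: key:value ...' line into a dict.

    Returns None when the line is not a #+BIBLIOGRAPHY keyword line.
    """
    stripped = line.strip()
    if not stripped.upper().startswith(PREFIX):
        return None
    args = {}
    # shlex.split keeps a quoted value such as "*Reference DB" together.
    for token in shlex.split(stripped[len(PREFIX):]):
        key, sep, value = token.partition(":")  # split at the first colon
        if sep:
            args[key] = value
    return args


def database_kind(db):
    """Guess the database format from the 'db' value (assumed labels)."""
    if db.startswith("*"):
        return "org-subtree"  # in-document tree in org-bibtex format
    if db.endswith(".bib"):
        return "bibtex-file"
    return "unknown"
```

Link-following or export code could then branch on the result of =database_kind=: search the .bib file for the entry in one case, or jump to the headline with the matching :CUSTOM_ID: in the other.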