Instiki
Pie in the sky Refbase integration

No decent academically-oriented wiki software should be without a facility for handling citations. For a long time, I’d toyed with, and rejected, the idea of building a bibliographic software subsystem into Instiki. Such a subsystem would

  • have to interface with the APIs of a number of sources: the arXiv, SPIRES, MathSciNet, …
  • be able to manipulate Bibtex, a gawdawful format, for which no decent Ruby libraries exist
  • amount to reinventing the wheel, since there are already quite a number of bibliographic software systems out there, with existing userbases and developer communities.

One such bibliographic software system is RefBase. Andrew Stacey has recently begun hacking on RefBase, so it occurs to me that one might use RefBase as a bibliographic back-end, and have Instiki grab citations from it, via its XML API.

The purpose of this page is to try to map out what would be required to implement such a scheme.

First of all, we’ll need a new model

 class Citation < ActiveRecord::Base
   # join table: citations_revisions
   has_and_belongs_to_many :revisions
   belongs_to :web
 end

Citations are associated to Revisions, rather than to Pages, as the list of citations, on a page, can and will change between revisions of that page.

What should the Citation table look like?

 def self.up
   create_table :citations do |t|
     t.string :bibtex_key
     t.text :xhtml
     t.datetime :created_at
     t.datetime :updated_at
     t.string :etag
     t.integer :web_id
     t.integer :refbase_record
   end

   create_table :citations_revisions, :id => false do |t|
     t.integer :revision_id
     t.integer :citation_id
   end

   add_index :citations_revisions, [:revision_id, :citation_id], :unique => true
   add_index :citations_revisions, :revision_id,  :unique => false
   add_index :citations, :bibtex_key
   add_index :citations, :web_id

   add_column :webs, :refbase_url, :string
   add_column :webs, :refbase_username, :string
   add_column :webs, :refbase_password, :string
 end

Since we’re using RefBase as a backend, we don’t need to be able to edit, or otherwise manipulate, the bibliographic information. All we need to be able to do is output it in two different formats

  1. as a \cite{bibtex_key}, in the LaTeX output
  2. as a formatted citation (a blob of XHTML) at the bottom of the page, in the show output.
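
A minimal sketch of those two outputs, using a plain Struct in place of the ActiveRecord model so that it runs on its own. The names `CitationStub`, `to_latex` and `to_xhtml_item`, and the `cite-` id prefix, are assumptions made for illustration; none of them exists in Instiki.

```ruby
require 'erb'

# Stand-in for the Citation model: only the two columns needed for output.
CitationStub = Struct.new(:bibtex_key, :xhtml) do
  # LaTeX export: nothing but the \cite command.
  def to_latex
    "\\cite{#{bibtex_key}}"
  end

  # Show view: one list item wrapping the stored XHTML blob.
  # The id gives an in-text hyperlink something to point at.
  def to_xhtml_item
    %(<li id="cite-#{ERB::Util.html_escape(bibtex_key)}">#{xhtml}</li>)
  end
end

c = CitationStub.new('Weinberg:2008si', 'Steven Weinberg, Phys. Rev. D79, 043504')
puts c.to_latex
puts c.to_xhtml_item
```

The XHTML blob is emitted as stored, on the assumption that it was already sanitized when it was fetched.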

Question: might we also want to be able to spit out bibtex format, retrieved from RefBase? If so, add another column for that.

We also need to be able to update citations by re-fetching them from RefBase. Hence the etag column. I assume RefBase supports the If-None-Match header. If not, who do I have to beat around the temples?
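
A sketch of what the re-fetch could look like, assuming (as above) that the server honours If-None-Match. The helper names `conditional_get` and `refreshed` and the example URL are hypothetical; only the Net::HTTP calls are real.

```ruby
require 'net/http'
require 'uri'

# Build a GET that lets the server answer 304 if our cached copy is current.
def conditional_get(url, etag)
  request = Net::HTTP::Get.new(URI(url))
  request['If-None-Match'] = etag if etag
  request
end

# Decide what to keep, given the status code of the reply.
# Returns the [xhtml, etag] pair that should be stored afterwards.
def refreshed(xhtml, etag, status, new_body, new_etag)
  case status
  when 304 then [xhtml, etag]         # Not Modified: keep the cached copy
  when 200 then [new_body, new_etag]  # changed: replace blob and etag
  else          [xhtml, etag]         # any error: fall back to the cache
  end
end

# Hypothetical URL, for illustration only.
req = conditional_get('http://refbase.example/citation?key=Weinberg:2008si', '"v1"')
puts req['If-None-Match']
```

To send the request, something like `Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }` would do; the reply’s `code.to_i`, `body` and `['ETag']` then feed into `refreshed`.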

Of course, we’ll also modify the Revision, Page and Web classes:

 class Revision < ActiveRecord::Base
   belongs_to :page
   has_and_belongs_to_many :citations
   composed_of :author, :mapping => [ %w(author name), %w(ip ip) ]
 end

and

 class Page < ActiveRecord::Base
   belongs_to :web
   has_many :revisions, :order => 'id', :dependent => :destroy
   has_many :citations, :through => :revisions
   has_many :wiki_references, :order => 'referenced_name'
   has_one :current_revision, :class_name => 'Revision', :order => 'id DESC'

and

 class Web < ActiveRecord::Base
   ## Associations

   has_many :pages,      :dependent => :destroy
   has_many :wiki_files, :dependent => :destroy
   has_many :citations,   :dependent => :destroy

   has_many :revisions,  :through => :pages

We’ll need a new Controller

 class CitationController < ApplicationController

with, at least, the usual CRUD actions. The create and update actions will involve querying a RefBase server.

We’ll also need to be able to configure the Citation Controller (supplying a URL for the RefBase server, and perhaps authentication information).

Question: Should that be done in the AdminController or in the CitationController?

Question: What other actions does the CitationController need? Should we, for instance, be able to obtain a bibtex file of the citations associated to a given page?

Modifications to the Chunk Handler

We need to support citations in the Chunk Handler.

Questions:

  1. When do these Refbase lookups happen? Presumably, when the page is rendered. That seems like a real performance hit, but is there any alternative?

    This seems interesting, in that regard. But there are probably other, simpler, asynchronous solutions.

  2. How, exactly do we do the lookup? Collect all the citations on the page and query the server in one batch?

  3. The RefBase API seems to require our knowing the RefBase record number. That’s brain-dead. Can we retrieve the data we want, using the bibtex_key instead?

    In the above database schema, I assumed we need to keep track of the refbase_record. It would be nice if we could dispense with that.

  4. What do we do when the RefBase server is unavailable, when the given bibtex_key is not found on the server, etc?
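
One possible answer to questions 2 and 4, sketched in plain Ruby: collect all the keys on the page in one pass, query the server only for those not already cached, and fall back to a visible placeholder for anything that cannot be resolved. The function names and the placeholder markup are assumptions, not existing Instiki code.

```ruby
# Collect every key cited in the page source, in order of first use.
# Handles \cite{a,b} with several comma-separated keys.
def cite_keys(source)
  source.scan(/\\cite\{([^}]*)\}/).flatten
        .flat_map { |keys| keys.split(',') }
        .map(&:strip).reject(&:empty?).uniq
end

# Split keys into [already cached, still to be fetched in one batch].
def partition_keys(keys, cache)
  keys.partition { |k| cache.key?(k) }
end

# Server down or key unknown: show the raw key rather than failing the render.
def formatted_or_placeholder(key, cache)
  cache.fetch(key) { %(<span class="unresolved-citation">[#{key}]</span>) }
end

page  = 'See \cite{Weinberg:2008si} and \cite{as8, 782}; again \cite{as8}.'
cache = { 'as8' => 'A. Stacey, Comparative Smootheology' }
cached, missing = partition_keys(cite_keys(page), cache)
puts "cached: #{cached.inspect}, to fetch: #{missing.inspect}"
```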

So many questions …

To Heck With RefBase?

Below, Andrew points out that, perhaps, RefBase is ill-suited to the task at hand. He also points out that RefBase may contain stale data.

Perhaps we should simply query the public information servers directly. For instance, consider the example I gave below. In my text, I have a \cite{Weinberg:2008si} (based on the SPIRES Bibtex key).

We could send a query to SPIRES

http://www.slac.stanford.edu/spires/find/hep/xmlpublic?texkey=Weinberg:2008si

to obtain an XML representation of the desired citation. Or we could GET

http://www.slac.stanford.edu/spires/find/hep/wwwbriefbibtex?texkey=Weinberg:2008si

to obtain the Bibtex version (embedded in a <pre> element) of the same information.

Of course, if we don’t know the SPIRES Bibtex key, we could query on the eprint number, instead

http://www.slac.stanford.edu/spires/find/hep/xmlpublic?eprint=arXiv:0810.2831
http://www.slac.stanford.edu/spires/find/hep/wwwbriefbibtex?eprint=arXiv:0810.2831
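
The four URLs above follow one pattern, which could be sketched as below. The helper names `spires_url` and `bibtex_from_html` are hypothetical, and the sketch only builds URLs and parses a reply; it makes no assumption that the server still answers. Note that `URI.encode_www_form` percent-encodes the colon, which is equivalent to the raw form shown above.

```ruby
require 'uri'

SPIRES_BASE = 'http://www.slac.stanford.edu/spires/find/hep'
SPIRES_ENDPOINTS = { xml: 'xmlpublic', bibtex: 'wwwbriefbibtex' }.freeze

# format is :xml or :bibtex; field is 'texkey' or 'eprint'.
def spires_url(format, field, value)
  endpoint = SPIRES_ENDPOINTS.fetch(format)
  "#{SPIRES_BASE}/#{endpoint}?#{URI.encode_www_form(field => value)}"
end

# The Bibtex reply is wrapped in a <pre> element; pull out its contents.
# Returns nil when no <pre> element is present.
def bibtex_from_html(html)
  match = html[%r{<pre[^>]*>(.*?)</pre>}m, 1]
  match && match.strip
end

puts spires_url(:xml, 'texkey', 'Weinberg:2008si')
puts spires_url(:bibtex, 'eprint', 'arXiv:0810.2831')
```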

Unfortunately,

  1. not all references are on SPIRES (or MathSciNet or …)
  2. Even for those which are, we still might want to edit the information by hand. (SPIRES’s Bibtex output is somewhat cruddy, and invariably needs some hand-tweaking.)

So, if we went that route, we’d need to build an interface for hand-entering, and editing the citations. In other words, we’d need to re-create another bibliographic software package, a road we don’t want to go down.

Comments

Andrew Stacey: Firstly, let me say (as I know the RefBase team are aware of this!) that I’m finding RefBase (modulo a few hacks to make it more mathematically friendly) an extremely useful tool, and I’ve not yet explored all its possibilities. However, I’m no longer so sure that it’s quite what is needed here. I originally thought that the idea was simply a link-up system between a reference database and an instiki wiki. Now it seems that a much tighter integration is possible (via Active Resource).

One of my reasons for pulling back is the user authentication for RefBase. It seems that RefBase is designed to be a central server that users log on to and can store personalised information (mainly tags and citation keys, but there are plans afoot to have commentaries as well). This is common to the other database systems that I looked at before settling on RefBase. This seems opposite to the Instiki-style of complete openness (at least on the public webs).

Also, RefBase and the like copy information across from the various sources. One of my hacks was to add a MathSciNet import tool (arXiv was already there). The problem with this is what to do when the source record is updated - which becomes more likely as more preprint servers spring into existence. It’s entirely possible for the RefBase information to be out of sync with the arXiv (say) information. Automatically updating is problematic since other information may have been added by the user.

On the other hand, one does want to allow for some local storage since it may be useful to point to people’s homepages where it is less sure that information will stay.

I’m trying to imagine what would be a useful system for the user. There are two places that I can see a bibliographic system coming into use: as a source of citations, and as a “thing to be discussed”.

As a source of citations, I guess we want the user to put some code like \cite{key} and have the Instiki process ask the database process for the record associated with key. The problem with that is that the user has to go across to the database to get the correct value for the key anyway (can anyone remember all their bibtex citation keys?), so why not just have the reference program spit out the right markdown syntax to copy-and-paste in? One could easily just use footnotes for this. Much better would be if the user could put in a search term instead of the key and have the reference system offer a list of possible matches from which the user could select the right one. That needs a couple more passes of the data, I suppose, but would be way more useful.

As a reader of the page, I want to be able to click on the citation and be taken to the page for that citation, from which I can, in a click or two, go to a variety of bits of information for that reference. Here’s where it’s most like the current systems and doesn’t need too much integration.

These are just my initial thoughts. No doubt they’ll change as I think a bit more about it.

Jacques responds: One of the original motivations for my work on Instiki was to produce an environment where people could bat around ideas which might eventually work their way into a paper. So the LaTeX export function is important. I’d like to be able to spit out a LaTeX version of a page, paste that into a paper I am writing, and have it “just work”. In that version, citations look like \cite{Weinberg:2008si}. But, on the wiki, I want a hyperlink to a formatted reference at the bottom of the page, which looks like

Steven Weinberg, “Non-Gaussian Correlations Outside the Horizon II: The General Case,” Phys. Rev. D79, 043504, arXiv:0810.2831 [hep-ph].

I realize that this use-case may not accord very well with the use-cases that RefBase was designed for. But I don’t think it’s too far off the mark.

Andrew wrote: “This seems opposite to the Instiki-style of complete openness (at least on the public webs).”

Not irremediably. We’d just need to create an “Instiki” user on the RefBase server (or, for finer control, one such user for each web on our Instiki installation).

I also agree that building an interface to RefBase’s search facility would be a nice 2nd step. But first things first: being able to extract data from RefBase, store it in Instiki, and spit out a formatted version on-demand, would be quite a nice first step.

Andrew Stacey: It’s tempting to use RefBase. It’s there, I’ve already a little experience in hacking it (though I should make clear that most of my hacks are “surface level”, I’ve not had to look too deeply into the code), and it almost does what is wanted. I picked it because it was almost what I wanted, but I then found that it wasn’t quite what I wanted and so started hacking it (if the RefBase developers are keeping an eye on this, I have a half-written email to you guys describing my hacks and asking if you want me to send any to you. One day I’ll finish the email …). Of the various programs I tried, it seemed the easiest to hack; however, that wasn’t because it had lots of plugins and modules, but because the code was well commented.

But my point is that it probably isn’t quite right, so to get something right for Instiki then it’s going to need a little tweaking. So I’m wondering whether or not it’s best to work out exactly how this thing would be used and then see if RefBase (or other) is close to what is wanted.

So here’s a workflow from the writer’s point of view.

  1. Start writing an article. Want to cite something external to the wiki.
  2. Find the relevant cite key and write \cite{key} at the appropriate juncture.
  3. There is no step three (TM).

Here’s a workflow from the reader’s point of view.

  1. Start reading an article. Notice a citation and decide to follow it.
  2. Click on the citation, it takes me somewhere useful, preferably with an easy way back again (I know that the browser’s back button ought to do this, but …).
  3. Click on the ‘TeX Export’, get a single document that contains everything that is needed.

Okay, so the Instiki+RefBase/Whatever process needs to do the following:

  1. Make it easy to find the relevant cite keys.
  2. Make it easy to create a useful relevant page.
  3. Produce citations in the relevant format for inclusion in the corresponding output.

It feels as though you’re concentrating on the 3rd step here.

But let’s take that and run with it - I’m happy to work on several levels here. And let’s stick with RefBase since I have a working installation that I’m happy to play around with and I already have a little familiarity with the code.

One could easily subvert the authentication. If the installation was dedicated to the relevant Instiki process then you’d want it open, just with good logging. So then you want to send RefBase a list of references and get a nicely formed XML back. RefBase certainly can export XML so this should be no problem. Then Instiki can cache the XML, updating it if the citations on the page change (though it may be good to have a way of manually forcing a reload).

Either RefBase can itself export the bibtex, or Instiki can internalise the conversion. Doing it internally would make it easier to change the RefBase component for something else, but as you say right at the top that would involve writing a Ruby BibTeX library. On the other hand, one could simply use the bibtool program (which is what the main branch of RefBase does anyway).

Ah, I’ve just noticed something in what you said above. You want to be able to ‘cut and paste’ from an Instiki TeX export into another paper. That means that you’ll need your citation keys to match, and what you choose and what I choose won’t necessarily be the same unless we all use the same RefBase installation as our source for references. That is, the paper that you’re writing must also get its reference data from that RefBase installation. That complicates matters a little, either way.

Suppose we’re writing a joint paper. We’re not doing the whole thing on Instiki, but parts get developed there (maybe I’m being a bit pigheaded about using \( instead of dollars) and other bits are developed locally. I figure I’ll get in a bit of self-promotion and flagrantly cite ‘Comparative Smootheology’. As it’s a paper I wrote, it’s been in my reference database since before anything appeared in public and I have a citation key from that time. So I cite it \cite{as8}. You, in a gesture of generosity, meanwhile decide that you’ll help me with my citation index and cite it in the part you’re developing offline. So you cite it as \cite{math/0802.2225}. Meanwhile, over in the public area we also include it and cite it as \cite{782} (since that’s its key in the database). At some point, this lot needs sorting out.

The easiest way is if we’ve agreed to use the RefBase installation from the start. But that means that I need to be able to put up all sorts of crud that no-one else is particularly interested in because if I’m using it for one of my papers then I’m going to use it for all of them otherwise it’s yet another system to learn.

The more complicated way is for someone to manually sort out the various references. I guess the easiest way is for each program to export its references in BibTeX format, compile the document, and then look for duplicates.

The most elegant way would be if the reference program was really just a portal and that when I type \cite{as8}, you type \cite{math/0802.2225}, and we both type \cite{782} then each reference program simply converts that to a citation to the arXiv reference and the BibTeX version of the arXiv reference is extracted. This would involve a layer on top of the usual \cite command: there would need to be a “look-up list” so that several keys could point to the same citation. It’d be a pain to do in TeX, but not difficult (I’ve a little experience of hacking TeX to do associative arrays and object-oriented structures; I probably didn’t find the best way of doing it, but I found ways that worked).
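
The “look-up list” can be illustrated outside TeX. The sketch below is a Ruby preprocessing pass over the LaTeX source rather than the TeX-level layer described above, so it is a swapped-in technique, and the class `CiteAliases` is hypothetical. It maps several personal keys onto one canonical key and rewrites each \cite accordingly.

```ruby
# Several personal keys resolve to one canonical identifier.
class CiteAliases
  def initialize
    @canonical = {}
  end

  # Register a canonical key together with any number of aliases for it.
  def add(canonical, *aliases)
    ([canonical] + aliases).each { |key| @canonical[key] = canonical }
    self
  end

  # Unknown keys are passed through unchanged.
  def resolve(key)
    @canonical.fetch(key, key)
  end

  # Rewrite every \cite{...} in a LaTeX source to use canonical keys,
  # dropping duplicates that arise when two aliases meet in one \cite.
  def rewrite(latex)
    latex.gsub(/\\cite\{([^}]*)\}/) do
      keys = Regexp.last_match(1).split(',').map { |k| resolve(k.strip) }
      "\\cite{#{keys.uniq.join(',')}}"
    end
  end
end

aliases = CiteAliases.new.add('math/0802.2225', 'as8', '782')
puts aliases.rewrite('Mine: \cite{as8}. Yours: \cite{math/0802.2225}. Ours: \cite{782}.')
```

After the rewrite all three citations carry the same key, so a single Bibtex entry serves the merged document.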

Again, I’m just trying to think of the right design before integrating the system too closely with RefBase.

My other caution on RefBase - which is more about it being PHP than anything else - is the difficulty of having two obviously separate systems. It appears that some still have issues with RSS and won’t notice an additional system like this unless it is grafted into Instiki. That would be easier if it were a Rails app.

Andrew Stacey (5 minutes later). I hadn’t spotted your added section above. Okay, so you don’t want to write your own reference software. Fair enough, I’m not too keen on it either! RefBase doesn’t currently do SPIRES, but it shouldn’t be hard to add (I added MathSciNet which just goes to show how easy it is to do). The stale data is a problem, but then it’s a problem that RefBase should think about anyway. That makes me think that there are two classes of hacks that we’d need to make to RefBase: general improvements, and specific integration issues. I’ve no hesitation about doing the first. What I’m a little hesitant at is doing the second. So what I’m trying to figure out is how much of the second type will there be? Will we look back in 6 months’ time and say “It would have been easier just to write our own app.”? I hope not, but to get a fair idea then I think it’s good to play through the possible usage scenarios.

Andrew wrote:

  1. Make it easy to find the relevant cite keys.
  2. Make it easy to create a useful relevant page.
  3. Produce citations in the relevant format for inclusion in the corresponding output.

It feels as though you’re concentrating on the 3rd step here.

Step 2. is the raison-d’etre of Instiki. Step 1. is (or should be) the raison-d’etre of any bibliographic software system (of which RefBase is the proximate example). That leaves Step 3…

I don’t really know whether RefBase is the “right” bibliographic software system to be working with. Indeed, if you look at most of what I’ve written above, it’s essentially independent of the bibliographic software system being used (largely, but not entirely, due to the vagueness of what I’ve written).

If there’s a better choice than RefBase, let’s use that. I agree that a Rails App would be easier:

  1. It could be integrated more tightly with Instiki.
  2. Rails is much easier to work with than any PHP program I have ever looked at.

Unfortunately, there isn’t — to my knowledge — an existing Rails bibliographic software package to work with. We could write one from scratch. It might even be fun. But would it be the best use of our time?

The rest of the problems you cite (our using different bibliography managers, different citation keys, etc …) are very real. But I don’t expect to solve them. Let’s even forget (for the moment) about the whole paper-writing enterprise. Let’s just ask if we can build something that would streamline adding citations to an Instiki wiki (like the nlab).

November 2016 comments from a refbase developer:

Not sure if refbase is the right choice for you, but there was some somewhat poor information above, so I thought I’d clarify some things:

  • refbase has integration with MediaWiki. In this case, the answer to “When do these Refbase lookups happen?” is “whenever an uncached version of the page is loaded”. Caching within the wiki solves the performance issue for any data that is loaded from outside the wiki.
  • refbase citekeys are user-specific and not guaranteed to be unique; the latter made them unsuitable for UnAPI. But they can be queried in the MediaWiki extension and in the SRU API.
  • Cite-keys of other users can be searched in this way. If users are drawing from the same refbase instance, presumably this could allow you to programmatically convert between different people’s preferred keys (or allow you to tie the keys you’re using to a single refbase user account so the proper record is retrieved).
  • Most of refbase is completely open and, if anything, we’ve tried to make user-data more shareable between users.
  • refbase records are dated; it is conceivable that you could check into arXiv or SPIRES or similar for a newer version of the record.
  • In addition to returning XML, BibTeX, or other structured data, refbase can just return formatted citations.
  • Andrew should feel free to send any of his contributions our way.
  • Please ping our mailing list or forums if you have questions!
