
A small FS in DATA and a pure Ruby compiler (in the classical sense)

DATA is one of those features one rarely sees in use, but it can be quite handy at times. I used it in rcov to include the xx markup generation library while ensuring the rcov executable remained self-contained (the extension is optional).

I've written a simple FS meant to be used with DATA, in order to structure it into individually accessible files. I then used it to implement a very simplistic pure Ruby compiler (in the sense of composing one .rb file out of many, i.e. a compiler like the very first ones, before the term started to be misused).

A small FS for the DATA section

I could have used minitar to create POSIX tar files in DATA, but I'd have had to implement random access on top of it, so I just defined a feeble YAML-based format:

 #!/usr/bin/env ruby
 # ...
 # this is the .rb file
 1 + 1
 __END__
 <length of the toc>
 <YAML-serialized toc (obvious from the code)>
 data for all the files in the DataFS
 just one after the other
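
To make the layout concrete, here is a minimal sketch that builds such a payload by hand and reads one file back out of it. It is not the code from this post: the helper names build_payload and read_file are made up for illustration, the TOC is simplified to a plain hash of name => [size, offset], and a StringIO stands in for DATA.

```ruby
require 'yaml'
require 'stringio'

# Hypothetical helper: emit the TOC length, the YAML TOC, then the raw file data.
def build_payload(files)
  offset = 0
  toc = {}
  files.sort.each do |name, contents|
    toc[name] = [contents.bytesize, offset]   # [size, offset], both in bytes
    offset += contents.bytesize
  end
  yaml = YAML.dump(toc)
  "#{yaml.bytesize}\n" + yaml + files.sort.map { |_, contents| contents }.join
end

# Hypothetical helper: locate one file through the TOC and read exactly its bytes.
def read_file(io, name)
  toc_length = io.gets.to_i              # first line: length of the TOC
  toc = YAML.load(io.read(toc_length))   # then the TOC itself
  data_start = io.pos                    # file data begins right after the TOC
  size, offset = toc.fetch(name)
  io.pos = data_start + offset
  io.read(size)
end

payload = build_payload({ "a.txt" => "hello", "b.rb" => "puts 1" })
puts read_file(StringIO.new(payload), "b.rb")   # prints: puts 1
```

Offsets are stored relative to the end of the TOC, so the TOC can be serialized before anything is known about its own length.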

Creating the FS

The utterly simplistic API for the Writer class is

datafs = DataFS::Writer.new
datafs.add("filename", "file contents")
datafs.add("whatever.rb", "puts 1")
datafs.dump(someIO) # dump to someIO
puts datafs.dump    # just return the serialized representation

The implementation is trivial; a hash mapping each filename to an FStat object (holding name, content length and position in the DATA stream) is serialized with YAML and used as the TOC:

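  # (inside module DataFS; needs require 'yaml' and require 'stringio')
  FStat = Struct.new(:name, :size, :offset)

  class Writer
    def initialize
      @files = {}
    end

    def add(filename, contents)
      @files[filename] = contents
    end

    def dump(anIO = nil)
      unless anIO
        ret_content = true
        anIO = StringIO.new("")
      end
      offset = 0
      index = {}
      @files.keys.sort.each do |name|
        contents = @files[name]
        # bytesize rather than size: the Reader counts bytes, not characters
        index[name] = FStat.new(name, contents.bytesize, offset)
        offset += contents.bytesize
      end
      serialized_index = YAML.dump(index)
      anIO.puts serialized_index.bytesize   # <length of the toc>
      anIO.write serialized_index           # the TOC itself
      @files.keys.sort.each{|name| anIO.write(@files[name]) }

      if ret_content
        anIO.rewind
        anIO.read
      end
    end
  end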

Reading is a tiny bit harder; the basic API looks like

datafs = DataFS::Reader.new(DATA)
datafs.open("blergh.dat") do |f|
  f.read   # the rest of the file; f.read(n) returns at most n bytes
  # also defined: f.eof? and f.rewind, but no other IO goodies
end

When a DataFS file is open()ed, a FileStream object representing a bounded section of the DATA area is returned/yielded. FileStreams respond to #eof?, #read and #rewind, and are implemented with some care so that you only get the data from the corresponding DataFS file (and not from the following ones, after you get to EOF).

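  class Reader
    def initialize(io)
      @io = io
      idx_size = @io.gets            # first line: length of the TOC
      # NOTE: under Psych 4 (Ruby 3.1+) YAML.load rejects Struct instances by
      # default; YAML.unsafe_load would be needed there.
      @index = YAML.load(@io.read(idx_size.to_i))
      @initial_pos = io.pos          # file data starts right after the TOC
    end

    def fstat(filename)
      @index[filename]
    end

    def open(filename)
      raise Errno::ENOENT unless fstat = @index[filename]
      file_entry = FileStream.new(@io, @initial_pos + fstat.offset, fstat.size)
      if block_given?
        yield file_entry
      else
        return file_entry
      end
    end

    class FileStream
      def initialize(io, offset, size)
        @io = io.dup
        @offset = offset
        @size = size
        @pos = 0
        @io.pos = @offset   # start at the first byte of this file's data
      end

      def eof?; @pos == @size end
      def rewind
        @io.pos = @offset
        @pos = 0
      end

      # never reads past the end of this file's section
      def read(size = @size)
        ret = @io.read([@size - @pos, size].min)
        @pos += ret.size if ret
        ret
      end
    end
  end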
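
The bounded-read behaviour can be tried in isolation with the sketch below. BoundedStream is a hypothetical stand-in for FileStream, not the class from this post. It differs in one design choice, named plainly: instead of dup'ing the IO, it seeks before every read, so several windows over one IO cannot disturb each other. A StringIO again replaces DATA.

```ruby
require 'stringio'

# Hypothetical read-only window of `size` bytes starting at `offset` in `io`.
class BoundedStream
  def initialize(io, offset, size)
    @io = io
    @offset = offset
    @size = size
    @pos = 0
  end

  def eof?
    @pos >= @size
  end

  def rewind
    @pos = 0
  end

  # Return at most `len` bytes, never crossing the end of the window;
  # nil once the window is exhausted.
  def read(len = @size)
    n = [@size - @pos, len].min
    return nil if n <= 0
    @io.pos = @offset + @pos   # seek on every read
    chunk = @io.read(n)
    return nil unless chunk
    @pos += chunk.bytesize
    chunk
  end
end

io = StringIO.new("AAAABBBBBBCC")        # three "files" laid end to end
window = BoundedStream.new(io, 4, 6)     # only the BBBBBB part
p window.read(4)   # => "BBBB"
p window.read(4)   # => "BB" (stops at the boundary, no C's leak in)
p window.eof?      # => true
```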

How to require() from the DataFS

Loading code contained in the DataFS just involves a call to Kernel#eval, but reproducing Kernel#require's semantics takes a few more lines:

module Kernel
  DATAFS = DataFS::Reader.new(DATA)
  alias_method :__pre_datafs_require, :require
  def require(name, *args, &b)
    if ["", ".rb"].include? File.extname(name)
      # very naïf, 1.9 issues, etc.
      return false if $".include?(name) || $".include?(name + ".rb")

      # evaluate the embedded file (if present) and record it in $"
      try_and_load = lambda do |n|
        DATAFS.fstat(n) and
          (eval(DATAFS.open(n).read, TOPLEVEL_BINDING, n) || true) and $" << n
      end
      return true if try_and_load[name] || try_and_load[name + ".rb"]
    end

    # not in the DataFS: fall back to the original require
    __pre_datafs_require(name, *args, &b)
  end
end

This is just a quick hack so it could be improved a fair bit.
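
The load-once semantics can be tried without touching Kernel#require at all. The sketch below is only an illustration under stated assumptions: EMBEDDED is a made-up in-memory hash standing in for the DataFS, LOADED plays the role of $", and embedded_require is a hypothetical name. Like the code above it evaluates in TOPLEVEL_BINDING.

```ruby
# Hypothetical stand-in for the DataFS: file name => source code.
EMBEDDED = { "greeter.rb" => "def greet; 'hi from ' + __FILE__; end" }
LOADED = []   # plays the role of $" in this sketch

# Returns true the first time a file is evaluated, false afterwards,
# and raises LoadError if no embedded file matches.
def embedded_require(name)
  candidates = [name, name + ".rb"]
  return false if candidates.any? { |n| LOADED.include?(n) }
  found = candidates.find { |n| EMBEDDED.key?(n) }
  raise LoadError, "no such embedded file -- #{name}" unless found
  # the third argument makes __FILE__ inside the evaluated code report `found`
  eval(EMBEDDED[found], TOPLEVEL_BINDING, found)
  LOADED << found
  true
end

p embedded_require("greeter")   # => true  (source evaluated)
p embedded_require("greeter")   # => false (already loaded)
puts greet                      # prints: hi from greeter.rb
```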

Compiling pure-Ruby scripts

It's somewhat unfortunate that the word compile has been taken to mean something different, so you could s/compile/compose/g or s/compile/assemble/g in the above header...

The basic idea is:

  1. creating a DataFS serialization with the desired .rb files
  2. appending the datafs_require magic (inside a BEGIN block, so it gets executed first)
  3. dumping the DataFS representation to the destination DATA area

Dependency auto-discovery would be fairly easy to implement, but I wanted to control which files get embedded (no need to redistribute the stdlib normally), so I just structured the command line as in

   ruby compose.rb main.rb lib.rb anotherlib.rb=foo.rb > main2.rb

which would put lib.rb and anotherlib.rb (renamed to foo.rb) in the DataFS contained in main2.rb.
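
A rough sketch of the argument handling and the final assembly follows. Neither function is taken from compose.rb: parse_embed_args and compose are hypothetical names, and the loader and payload are passed in as placeholder strings rather than generated.

```ruby
# Hypothetical: "src.rb" is embedded under its own name,
# "src.rb=dest.rb" embeds src.rb under the name dest.rb.
def parse_embed_args(args)
  args.map do |arg|
    src, dest = arg.split("=", 2)
    [src, dest || src]
  end
end

# Hypothetical: main source first, then the loader inside a BEGIN block
# (BEGIN blocks run before the rest of the file), then the DATA area.
def compose(main_source, loader_source, payload)
  [main_source.chomp, "BEGIN {", loader_source.chomp, "}", "__END__", payload].join("\n")
end

main, *libs = %w[main.rb lib.rb anotherlib.rb=foo.rb]
parse_embed_args(libs).each { |src, dest| puts "#{main}: embed #{src} as #{dest}" }

puts compose("puts 'main'", "# loader code would go here", "<DataFS payload>")
```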


This is our main file:

require 'thelib'

And this is the library to be embedded:

require 'pp'

def foo
  puts "YES THIS IS THE foo! -> #{__FILE__}"
  pp caller
end

I just run

 ruby compose.rb bla.rb thelib.rb > bla2.rb

so that bla2.rb will include thelib.rb, and require 'thelib' will load the file from the DataFS.


A simple script containing all the above code: compose.rb

kinda like darb... - vjoel (2006-04-29 (Sat) 14:23:07)

Your Data FS is much nicer, but there is a similar idea in http://raa.ruby-lang.org/project/darb/.

One question: is it correct to use TOPLEVEL_BINDING? That means that required files can see (and can affect) local variables in the file they were required from. IIUC, that is not the way Kernel#require works, and darb follows that semantics.

I wish there were an easy way to implement autorequire, without using const_missing.... Any ideas?


mfp 2006-04-29 (Sat) 15:17:36

Hey, /me didn't know about darb :) Yes, your def empty_binding; binding; end looks much better. As for autoload, the one way to implement it w/o const_missing (using proxies that load the lib when they are sent a message) I can think of right now would be much more fragile.
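
To illustrate the binding question, a small sketch (the demo method is made up for this example, and it uses a method-local binding rather than TOPLEVEL_BINDING so that it behaves the same however it is run): code evaluated in the caller's binding can see and overwrite the caller's locals, while code evaluated in a fresh binding cannot.

```ruby
# A binding with no local variables of its own.
def empty_binding
  binding
end

def demo
  secret = "caller's local"
  # sharing the caller's binding: the evaluated code overwrites `secret`
  eval("secret = 'clobbered'", binding)
  # fresh binding: `secret` is simply not defined there
  isolated = begin
    eval("secret", empty_binding)
  rescue NameError
    :not_visible
  end
  [secret, isolated]
end

p demo   # => ["clobbered", :not_visible]
```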

vjoel 2006-04-29 (Sat) 16:05:26

Also, I'm wondering why you went with alias_method instead of http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/151855. It was the latter technique that I used in darb (I even put "thanks Batsman" in the code!). I can't recall for sure what the problem with using an alias would be. By the way, is there some reason my comments in this blog interface are limited to 1 line? When I hit <enter>, the comment is posted :(

mfp 2006-04-30 (Sun) 01:44:59

The problem with the

 old = instance_method(:foo)
 define_method(:foo){|*args| ... old.bind(self).call(*args) }  

idiom is that you cannot propagate blocks under 1.8 (it is possible under 1.9, where blocks accept a block arg). In the above code, I used

  __pre_datafs_require(name, *args, &b)   

so that new Kernel#require definitions that use blocks (there's at least one possible use for this) still work.
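
A minimal illustration of the alias-based wrapper forwarding a block (Repeater and its methods are invented for this example and have nothing to do with require):

```ruby
class Repeater
  def run(n)
    n.times { |i| yield i }
  end
end

class Repeater
  attr_reader :calls

  alias_method :__orig_run, :run
  # The wrapper is an ordinary method, so it can take &blk and pass it on.
  def run(*args, &blk)
    @calls = (@calls || 0) + 1
    __orig_run(*args, &blk)
  end
end

r = Repeater.new
seen = []
r.run(3) { |i| seen << i }
p seen      # => [0, 1, 2]
p r.calls   # => 1
```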

Regarding the comment interface: whereas top-level comments ("new threads") use a textarea, I limited replies to a much smaller inputbox. I thought this would make discussions more lively. And I ... was wrong, so I'm changing that right now :)

mfp 2006-04-30 (Sun) 02:18:23

Alright, a 60x2 textarea doesn't look too bad. I could also make that 60x1 to enable multi-line replies while encouraging short, to-the-point comments.

Last modified:2006/04/08 11:18:49
Keyword(s):[blog] [ruby] [data] [DATA] [compiler] [fs] [filesystem] [datafs] [snippet] [frontpage]