
Purging referrer URLs concurrently

After writing a "pooling executor" that assigns tasks to a number of handlers in parallel, it becomes easier to filter referrer URLs somewhat efficiently. Checking multiple referrers concurrently helps maximize bandwidth usage, which matters when you have to fetch ~8000 pages. Done serially, the process could easily take a couple of hours or more.

The script described below helped me remove over 90% of the referrer URLs, going from 13595 (7909 unique) down to 933 (653 unique).

Task description

My HTTP referrers (and the corresponding hits) are stored as serialized hashes in a number of files, marshalled with TMarshal, AMarshal's elder (yet simpler) sibling:

 {
 "http://1470.net/mm/mylist.html/272?date=2006-3-1" => 1,
 "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/181885" => 21,
 "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/181963" => 2,
 "http://datavet.htu.se/kurser/KVC010/index.html" => 5,
 "http://dev.rubyonrails.org/ticket/4269" => 1,
 "http://fi.wikipedia.org/wiki/Uutisryhm%C3%A4" => 2,
 "http://ja.reddit.com/goto?id=2i9j" => 5,
 "http://ja.reddit.com/user/matz/" => 1,
 "http://lists.debian.org/debian-devel/2006/03/msg00875.html" => 1,
 "http://moonrock.jp:23000/articles/2006/03/03/rcov" => 1,
 #....
 }

I just have to iterate over all these files, verify that each referrer page actually exists and looks legitimate, and overwrite the data.
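
Since the TMarshal dumps are plain Ruby hash literals, the round trip is trivial; here's a minimal sketch (the file name is made up, the real code operating on my files appears below):

# The dump is a plain Ruby hash literal, so reading it back is just an eval.
# "referers.dat" is a hypothetical file name used only for illustration.
hits = eval(File.read("referers.dat"))   # => {"http://..." => 21, ...}

# Writing it back in the same sorted, one-pair-per-line layout.
File.open("referers.dat", "w") do |f|
  f.puts "{"
  hits.keys.sort.each { |url| f.puts "#{url.inspect} => #{hits[url].inspect}," }
  f.puts "}"
end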

Checking a referrer

Some URLs can be discarded right away: search engines, common RSS aggregators (mostly bloglines, which sends me lots of hits), google groups... I also whitelisted a few patterns to make the script a bit faster, and the results are cached so no page is downloaded twice.

GOOD_REFERER_RE = %r{^http://(www\.)?eigenclass\.org/hiki\.rb\?[^?;]+$|
                     http://(www\.)?anarchaia\.org}x
BAD_REFERER_RE = %r{a9\.com/|google.*/search|mail\.google\.|
                    bloglines\.com|eigenclass\.org|
                    search\?q=cache|search\.msn\.|blogsearch\.google\.|
                    del\.icio\.us|search\?q=|
                    comp\.lang\.ruby
                    }x 
LINK_TEXT = %r{eigenclass\.org}

def check_referer(referer, count, ok_referer_hash, referer_info_cache)
  record_good = lambda do
    ok_referer_hash[referer] = count
    referer_info_cache[referer] = true
  end
  record_bad = lambda{ referer_info_cache[referer] = false }

  # non-HTTP referrers are rejected right away
  return record_bad.call unless %r{http://}i =~ referer

  begin
    if referer_info_cache.has_key?(referer)
      puts "CACHE HIT: #{referer}"
      record_good.call if referer_info_cache[referer]
    elsif GOOD_REFERER_RE =~ referer && BAD_REFERER_RE !~ referer
      record_good.call
    elsif BAD_REFERER_RE =~ referer
      # ignore them
      puts "DISCARDING: #{referer}"
      record_bad.call
    else
      times = 0
      begin
        Timeout.timeout(10) do
          open(referer) do |is|
            # could use a more sophisticated test (e.g. content classification)
            if is.read(100000)[LINK_TEXT]
              record_good.call
            else
              record_bad.call
            end
          end
        end
      rescue Timeout::Error
        times += 1
        if times < 3 
          retry
        else
          raise
        end
      end
    end
  rescue Timeout::Error
    puts "GET #{referer} timed out."
    record_bad.call
  end
end

I'm using Timeout and allowing up to three attempts per URL for the lazy servers out there. The "check" isn't the epitome of sophistication, but it's very effective.
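
Just to show the calling convention (and the cache at work), here's a hypothetical one-off use of check_referer; the URL is made up, and open-uri/timeout are assumed to be required as in the full script below:

ok_refs   = {}   # referrers to keep, with their hit counts
ref_cache = {}   # shared cache: referrer => true/false

# The first call may actually fetch the page (unless a regexp settles it)...
check_referer("http://example.com/some-post.html", 3, ok_refs, ref_cache)
# ...the second one is answered from ref_cache, so nothing is downloaded twice.
check_referer("http://example.com/some-post.html", 3, ok_refs, ref_cache)

p ok_refs   # => {"http://example.com/some-post.html" => 3} if the page links back here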

Concurrent verifiers

The key to making this run faster is checking several URLs at a time. My PoolingExecutor handles that easily, creating the threads and throttling HTTP requests for me:

require 'fileutils'
require 'poolingexecutor'
require 'open-uri'
require 'timeout'

def purge_bad_referers(filename, good_referrers = {})
  referer_hash = eval(File.read(filename))
  ok_referer_hash = {}
  executor = PoolingExecutor.new do |handlers|
    15.times do 
      # we could as well register a plain Object and use it as a "token",
      # with the actual code in the executor.run { ... }  block
      handlers << lambda do |referer, count|
        check_referer(referer, count, ok_referer_hash, good_referrers)
      end
    end
  end
  referer_hash.each do |referer, count|
    executor.run do |handler|
      puts "Processing #{referer}"
      puts "Keeping #{referer}" if handler[referer, count]
    end
  end
  executor.wait_for_all
  File.open(filename, "w") do |f|
    f.puts "{"
    ok_referer_hash.keys.sort.each do |key|
      f.puts "#{key.inspect} => #{ok_referer_hash[key].inspect},"
    end
    f.print "}"
  end
end
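
In case you don't have the PoolingExecutor code at hand, here's a minimal stand-in showing just the interface this script relies on, built around a Queue of free handlers used as tokens. It's only an illustration, not the original implementation:

require 'thread'

# Naive stand-in for PoolingExecutor: new takes a block that fills the
# handler pool, #run waits for a free handler and processes one task in
# its own thread, #wait_for_all joins the worker threads.
class NaivePoolingExecutor
  def initialize
    @handlers = Queue.new    # free handlers double as concurrency "tokens"
    yield @handlers          # the block pushes the handlers into the pool
    @threads = []
  end

  # Blocks until a handler is free, then runs the task in its own thread,
  # so at most pool-size tasks are in flight at any time.
  def run(&task)
    handler = @handlers.pop  # Queue#pop waits if the pool is empty
    thread = Thread.new do
      begin
        task.call(handler)
      rescue => e
        puts "task failed: #{e.message}"
      ensure
        @handlers.push(handler)   # hand the token back to the pool
      end
    end
    @threads << thread
  end

  def wait_for_all
    @threads.each { |t| t.join }
  end
end

The throttling comes from run blocking until one of the 15 handlers is free; wait_for_all simply joins the worker threads.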

All that's left is iterating over the files I have to process:

good_referrers = {}
ARGV.each do |fname|
  FileUtils.cp(fname, "#{fname}.bak")
  puts
  puts "=" * 80
  puts fname
  puts "=" * 80
  purge_bad_referers(fname, good_referrers)
end
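
Assuming the whole thing is saved as purge_referers.rb (both the script name and the glob below are made up), processing every data file is just a matter of

 ruby purge_referers.rb referer_data/*

and since each file gets a .bak copy before being rewritten in place, a botched run is easy to undo.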

