A better backup system based on Git
A fast, powerful backup system built upon Git and efficient, compact tools written in OCaml (faster than the C counterpart with 1/5th of the code :)
UPDATE (2008-03-31) gibak 0.3.0 released
Recent events have pushed me to get serious about backing up my data. I'm naturally inclined to use simple solutions over specialized backup systems, preferring something like rsync to a special-purpose tool. As far as "standard" tools go, however, git provides a very nice infrastructure that can be used to build your own system, to wit:
- it is more space-efficient than most incremental backup schemes, since it does file compression and both textual *and* binary deltas (in particular, it's better than solutions relying on hardlinks or incremental backups à la tar/cpio)
- its transport mechanism is more efficient than rsync's
- it is fast: recovering your data is *faster* than cp -a
- you keep the full revision history
- powerful toolset with a rich vocabulary
I'm of course not the first one to think of git as the basis for a backup system. You can find many blog articles about this, but few people have gone beyond saying "stuff your data in a git repo" and tried to fill the holes and automate things properly. The most serious projects I've found are etckeeper and git-home-history. None of them gets it entirely right for my purposes, however (more on this below).
The tool I've written retains all the advantages from Git, and supplements it in some key areas:
- metadata support
- management of submodules (nested Git repositories)
- automation of common operations; for instance, a commit consists of several steps:
  - determining if some files which were committed earlier are ignored now and removing them from the index
  - adding new and modified files to the index
  - registering new git submodules and copying them to a special area under .git
  - committing changes in the index
  - compaction and optimization of the repository
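As a rough illustration of the steps above, here is a minimal sketch that uses only plain git commands. It is not gibak's actual code: the function name snapshot_sketch is made up for this example, and submodule handling and the metadata snapshot are left out entirely.

```shell
#!/bin/sh
# Hypothetical helper, not part of gibak: snapshot directory $1 with plain git.
snapshot_sketch() (
    cd "$1" || exit 1

    # 1. Drop files from the index that are tracked but now ignored.
    #    (Newline-delimited for brevity; unusual file names would need -z.)
    git ls-files --cached --ignored --exclude-standard |
    while IFS= read -r path; do
        git rm -q --cached -- "$path"
    done

    # 2. Stage new, modified and deleted files.
    git add -A .

    # 3. Commit only when the index differs from the last snapshot.
    if ! git diff --cached --quiet; then
        git commit -q -m "Committed on $(date)"
    fi

    # 4. Compact and optimize the repository.
    git gc --quiet
)
```

Calling snapshot_sketch "$HOME" in an already initialized repository would take one snapshot; files that have become ignored since the previous run stay on disk and only leave the index.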
Only one command is needed in practice:
gibak commit
I'm using it to save 2GB in over 200000 files, and it normally takes under 20 seconds to take a snapshot (AFAICS it scales linearly, so I expect to be able to back up 10GB in under 2 minutes...). The full power of the git toolset is available, so earlier versions can be restored with "git checkout", remote copies can be created/synchronized with git clone/push/fetch/pull, you can see what has changed with "git diff", and so on.
You can get the code with
git clone http://eigenclass.org/repos/git/gibak/.git/
The repository can be browsed at http://eigenclass.org/repos/gitweb
The metadata issue
The major thing missing in Git when used as a backup tool is support for file metadata (mostly file permissions) and empty directories (git just ignores them). git-home-history doesn't handle them at all, and etckeeper relies on metastore to preserve a snapshot of the metadata (owner, group, permissions, mtime, etc.) in a .metastore file located at the top of the git repository (/etc in the case of etckeeper).
The problem with metastore is that it doesn't know about Git's file exclusion mechanisms (.gitignore), so it ends up storing the metadata of files that aren't actually to be saved. Even though metastore is quite small and simple (only some 1500 lines of C code), extending it to honor .gitignore files seemed fairly involved because the semantics is a bit tricky (subdirectories inherit patterns from their parents and there's more than one kind of pattern) and harder to express without higher-order functions and closures.
I decided to reimplement metastore in OCaml (I named it very unimaginatively ometastore) and I'm glad I did so. I implemented metastore's functionality in one fifth of the code (1500 lines of C vs. under 300 of OCaml), the resulting executable is faster by 10% without any optimization effort and needs less memory. Even with functionality like path prefix compression (which shrinks the snapshot by 50%) and Git-like semantics for ignored files, ometastore took 4 times less code than metastore (I've since optimized glob matching, making my directory traversal routine faster than git-ls-files', so it's up to 1/3rd of the size of metastore now).
The backup system
Once I had implemented ometastore, I worked on the "gibak" script (derived from git-home-history by Jean-François Richard), which provides the main interface to the backup system (normal git commands can also be used, but gibak automates things like removing newly-ignored files from the repository):
$ gibak
usage: gibak <action>
<action> can be:
help
init
commit
eat <file_or_dir>
show <file> as of <time_spec>
ls-changed-files
ls-new-files
ls-ignored-files
ls-newly-ignored-files
ls-stored-files [as of <time_spec>]
archive-to <file>
extract-archive-to <dir>
rm-all
rm-older-than <time_spec>
<time_spec> examples:
5 days ago
2 days 2 hours 3 seconds ago
1979-02-26 18:30:00
This is the functionality I added over git-home-history:
- proper handling of metadata using ometastore (this encompasses support for empty directories too)
- submodule management: git repositories are registered as submodules and rsync'ed to $HOME/.git/git-repositories automatically
- better (faster and more correct) handling of newly-ignored files
- fixed and expanded the ls-*-files commands
Using gibak
This is how you use gibak:
$ gibak init # run once to initialize the backup system
$ vim .gitignore # edit to make sure you don't import unwanted files
# edit .gitignore files in other subdirectories
# you can get a list of the files which will be saved
# with find-git-files or gibak ls-new-files
$ gibak commit # first commit, which imports all the files not
# ignored in .gitignore
.... later ....
$ gibak commit # saves modified files, adds new ones
The first commit can be fairly slow: it took 5 minutes to import some 2GB of data over nearly 200000 files, and another 25 minutes to optimize and compact the repository (this is done automatically when you do gibak commit).
Subsequent commits, however, are very fast. It takes 15 seconds for gibak commit to determine which of those 200000 files have changed, find new ones and import them. In fact, this is faster than rsync would be, because Git stores most of the information it needs in the index, avoiding many expensive fstat syscalls.
Since gibak relies on Git's infrastructure, you can simply clone the repository in some remote host for extra safety (you can use "bare" repositories in that case as you don't need a working tree). You can also mount an external disk under ~/.git if you worry about disk crashes as opposed to hazards like fire or lightning. Git's transport mechanism is faster than rsync, and restoring your data is faster than cp -a would be. Git is a full-fledged VCS, so you have access to the full revision history.
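The extra copy described above could be kept up to date with something like the following sketch. The name mirror_sketch is invented for this example, both arguments are assumed to be absolute paths, and the destination is a local path (say, a mounted external disk); a copy on a remote host would use an ssh URL and a different existence check.

```shell
#!/bin/sh
# Hypothetical helper: keep a bare copy of repository $1 at path $2.
# Both paths should be absolute, since git -C changes directory first.
mirror_sketch() {
    src=$1 dest=$2
    if [ -d "$dest" ]; then
        # Later runs: send all branches, forcing updates after rewrites.
        git -C "$src" push -q "$dest" '+refs/heads/*:refs/heads/*'
    else
        # First run: a bare clone has no working tree.
        git clone -q --bare "$src" "$dest"
    fi
}
```

Running mirror_sketch "$HOME" /mnt/backup/home.git after each snapshot would be one way to use it; /mnt/backup is only a placeholder.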
What's left
There are basically two ways to manage submodules: copying them entirely, including their working trees, or cloning and sync'ing them (with a bare repository). With the first method (currently implemented), you won't lose uncommitted files, but you can only go one step back in time (unless you take snapshots of ~/.git/git-repositories). With the second method, nothing is lost even if you delete all your branches in a fit of insanity, but the working tree (and its uncommitted files) isn't preserved.
I chose to simply rsync nested git repositories because I don't see myself deleting branches randomly and committing with gibak before I realize my mistake, but I could easily forget to commit a file and remove it by accident before the next "gibak commit".
It would be desirable to preserve both the full change history even in the event of random branch deletion and the working tree. Even though I can see ways to implement this (a git clone + another repository with auto-committed files for the working tree), they don't feel entirely right, so I'm leaving it there for now.
omake error with gibak - Zeno Davatz (2008-03-05 (Wed) 09:01:15)
Hi
I just did a "git clone" of the above repository and I get the following error after running "omake":
~/.software/gibak> omake
- omake: reading OMakefiles
- omake: finished reading OMakefiles (0.0 sec)
- build . ometastore.cmi
+ ocamlc.opt -warn-error A -dtypes -g -I . -c ometastore.ml
File "ometastore.ml", line 199, characters 17-32:
Unbound value Printf.ifprintf
- omake: 44/51 targets are up to date
- omake: failed (0.1 sec, 0/6 scans, 1/23 rules, 0/66 digests)
- omake: targets were not rebuilt because of errors:
ometastore.o
depends on: ometastore.ml
ometastore.cmx
depends on: ometastore.ml
ometastore.cmo
depends on: ometastore.ml
ometastore.cmi
depends on: ometastore.ml
Thank you for your Feedback.
Best Zeno
Adam Wendt 2008-03-05 (Wed) 10:09:17
Not sure if this is the correct fix, but I modified the 'ifprintf' to be 'fprintf' which seemed to work.
Zeno Davatz 2008-03-05 (Wed) 11:07:40
Worked with latest OCaml and omake, sorry for the noise!
Now I get:
~/.software/gibak> sudo omake
Password:
- omake: reading OMakefiles
--- Checking for ocamlfind... (FAILED - no ocamlfind found)
--- Checking for ocamlc.opt... (found /usr/bin/ocamlc.opt)
--- Checking for ocamlopt.opt... (found /usr/bin/ocamlopt.opt)
--- Checking whether ocamlc understands the "z" warnings... (yes)
- omake: finished reading OMakefiles (0.29 sec)
--- Checking if ocamldep understands -modules... (yes)
--- Checking for gcc... (found /usr/bin/gcc)
--- Checking for g++... (found /usr/bin/g++)
- omake: done (3.31 sec, 6/6 scans, 13/28 rules, 42/150 digests)
Zeno
Worked with latest OCaml and omake - Zeno Davatz (2008-03-05 (Wed) 11:05:40)
Thanks! Zeno
mfp 2008-03-05 (Wed) 12:21:40
Good to know. It seems Printf.ifprintf was introduced in OCaml 3.10.0. Replacing it with fprintf will probably make it work on 3.09, but in that case ometastore will always tell you the changes it is applying.
omake failed - RyanTM (2008-03-05 (Wed) 11:44:13)
omake: Symbol `FamErrlist' has different size in shared object, consider re-linking
- omake: reading OMakefiles
- omake: finished reading OMakefiles (0.1 sec)
- omake: 9/9 targets are up to date
- omake: failed (0.1 sec, 0/0 scans, 0/0 rules, 0/33 digests)
- omake error:
File clean: line 0, characters 0-0 Do not know how to build: clean
I've never used omake before..
mfp 2008-03-05 (Wed) 12:39:36
It seems there's something wrong with your omake executable (it seems to be built against a different libfam0... maybe you upgraded it after installing omake?).
Here's a Makefile; it's a quick hack and it doesn't discover inter-module dependencies the way omake does, but it should work (don't forget the TABs when you copy&paste it):
OCAMLOPT=ocamlopt
OCAMLOPTFLAGS=-inline 10

%.cmx: %.ml
	ocamlopt $(OCAMLOPTFLAGS) -c $<

all: ometastore find-git-files find-git-repos

ometastore_stub.o: ometastore_stub.c
	$(OCAMLOPT) $(OCAMLOPTFLAGS) -c $<

ometastore: util.cmx folddir.cmx ometastore_stub.o ometastore.cmx
	$(OCAMLOPT) -o $@ unix.cmxa $^

find-git-files: util.cmx folddir.cmx ometastore_stub.o find-git-files.cmx
	$(OCAMLOPT) -o $@ unix.cmxa $^

find-git-repos: util.cmx folddir.cmx ometastore_stub.o find-git-repos.cmx
	$(OCAMLOPT) -o $@ unix.cmxa $^
Infinite growing? - Pau Garcia i Quiles (2008-03-05 (Wed) 14:11:42)
It looks quite interesting but I have one question: does it grow indefinitely or is there a way to remove the past of a file which no longer exists? (gibak eat seems to just remove it from future backups).
mfp 2008-03-05 (Wed) 16:20:05
Good question. A priori, nothing is ever removed for good from the Git repository, but there are at least four fronts to be explored:
- gibak rm-older-than
- shallow clones
- manual history rewriting
- discarding history and starting anew
gibak rm-older-than is almost what you want, as it can discard older parts of the history. I don't know how well it works, however: it was inherited from git-home-history and its author states he's not sure it's safe.
The second option is to use a shallow clone (git clone --depth N) and replace your .git with it. This introduces some limitations: you won't be able to clone or fetch from $HOME/.git, nor push to it, but it's OK if you don't want remote backups or if sync'ing the .git repository with e.g. rsync is acceptable.
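A minimal sketch of the shallow-clone step, under a few assumptions: the function name shallow_copy_sketch is invented here, the source is given as an absolute path, and the depth is whatever you consider enough history. Swapping the result in for the real .git is left out on purpose.

```shell
#!/bin/sh
# Hypothetical helper: copy only the newest $3 commits of repository $1
# (an absolute path) into $2. The file:// URL matters: git ignores
# --depth when cloning from a plain local path.
shallow_copy_sketch() {
    git clone -q --depth "$3" "file://$1" "$2"
}
```

For example, shallow_copy_sketch "$HOME" /tmp/home-shallow 30 would keep the last 30 snapshots; the target path is only a placeholder.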
History rewriting seems sexy a priori: you can remove all traces of the existence of a single file. The idea is essentially to
git rebase -i <commit before the one where you added a large file>
mark that commit as "edit" and remove the file (git rm LARGEFILE, git commit --amend, git rebase --continue). This involves rewriting the history since that point, so it could be slow (it will certainly involve repository compaction and re-compression).
The final way is the simplest: (optionally) clone the repository before the commit where the file was added, then rm .git and start a new backup (gibak init, gibak commit). You will lose some history, though.
I haven't explored these approaches yet. I hope history rewriting won't be too costly; it would also make it possible to squash several revisions (discarding intermediate states), thereby compressing the repository further (you could for instance squash all the patches in a month to get monthly snapshots).
mkc 2008-03-05 (Wed) 17:49:27
Cool idea!
What I'd really want is a way to do standard multi-level backups (every night for a week, every week for a month, every month for a year, keep every year forever). One way to fake this is to have multiple repos, one or two for each level, but this'd be some work to get right. What would be really useful would be a way to say "smash all of the deltas between Jan 1 and Feb 1 into exactly one delta". There might already be a tool to do this; I'm not sure.
mfp 2008-03-05 (Wed) 18:37:25
That is exactly what git rebase -i allows. Taking gibak's repository as an example, if I do
git rebase -i HEAD^^^
vim opens with the following file:
pick 7140377 gibak: added ls-newly-ignored-files, more info msgs, fixed newly-ignored logic.
pick ac99b63 Fixed remaining references to git-home-history's manpage in gibak.
pick 7df8c82 Assert Jean-Francois Richard's copyright in README.txt.

# Rebase 46e5604..7df8c82 onto 46e5604
#
# Commands:
#  pick = use commit
#  edit = use commit, but stop for amending
#  squash = use commit, but meld into previous commit
#
# If you remove a line here THAT COMMIT WILL BE LOST.
# However, if you remove everything, the rebase will be aborted.
#
I can just s/^pick/squash/g in all but the first line, save, and git will create a single commit holding the changes in all of them, rewriting the whole history from that point.
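For the special case of folding the newest N commits into one, a non-interactive approximation is possible with git reset --soft. This is a different technique from git rebase -i and only works at the tip of the history; the function name squash_last_sketch is made up for the example, and it has to be run inside the repository.

```shell
#!/bin/sh
# Hypothetical helper: fold the newest $1 commits of the current
# repository into a single commit with message $2.
squash_last_sketch() {
    # Remember the tree so we can check that no content changed.
    tree_before=$(git rev-parse 'HEAD^{tree}')

    # Move the branch back $1 commits but keep index and working tree.
    git reset -q --soft "HEAD~$1" &&
    git commit -q -m "$2" &&
    [ "$(git rev-parse 'HEAD^{tree}')" = "$tree_before" ]
}
```

Something like squash_last_sketch 30 "Monthly snapshot" would turn the last thirty snapshots into one. As with any history rewriting, existing clones will no longer fast-forward.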
error - joe (2008-03-05 (Wed) 15:32:24)
gibak commit worked, but I got this error while doing the commit:
Managing submodules.
fatal: pathspec '.gitmodules' did not match any files
Is that something to be concerned about?
mfp 2008-03-05 (Wed) 15:50:42
No big problem :). I think the .gitmodules file was not created because no nested Git repository was found, causing git-add -f .gitmodules to fail. The fix is simply "touch .gitmodules" (you can do it manually and gibak commit, it should work).
I've pushed the fix (just adding that "touch .gitmodules" in the routine that handles submodules) to the git repository. Thank you for reporting the problem; I'd never have found it myself since I do have a few submodules in the backup.
Zeno Davatz 2008-03-06 (Thr) 04:26:48
I tried "touch .gitmodules" and then ran "gibak commit". Same error:
~> touch .gitmodules
~> gibak commit
Committing to repository, this may take a long time
Adding new and modified files.
add '.gitmodules'
Managing submodules.
fatal: pathspec '.gitmodules' did not match any files
Committing.
# On branch master
nothing to commit (working directory clean)
Optimizing and compacting repository (might take a while).
Thank you for your Feedback.
Best Zeno
mfp 2008-03-06 (Thr) 09:12:31
oops, I forgot that gibak removes .gitmodules before re-creating it (which doesn't happen in your case since you have no submodules). The latest version should work though.
Zeno Davatz 2008-03-07 (Fri) 02:35:51
Thank you! This worked now. No more errors when I run "gibak commit".
Just use ZFS - James McCarthy (2008-03-05 (Wed) 16:34:00)
Or you could you know... just use ZFS, it's coming to a Linux distro near you.
Or if you can't wait use the latest version of Solaris, also free.
Adrian 2008-03-05 (Wed) 17:45:42
HammerFS looks very cool too. This is a pretty interesting comparison between ZFS and HammerFS.
mfp 2008-03-05 (Wed) 18:16:29
They are sexy indeed. According to that link, one advantage HammerFS has over ZFS is clustering (available on Lustre ZFS but not native to ZFS yet), which makes it possible to share a single FS across multiple machines, achieving failure resilience, so I assume that ZFS cannot work across machines by default.
As attractive as they are, however, Git seems to have three advantages over them as the basis for a backup system:
- repository compression and compaction (binary deltas in packs)
- ability to clone the repository efficiently (i.e., remote backups are possible)
- the extra functionality expected in a VCS: history rewriting, changeset inspection, etc.
On a more practical level, gibak has got two key advantages in my case:
- it's available right now and doesn't require an OS switch (the Debian netinst CD didn't let me choose ZFS for my home partition...)
- I don't need to buy more HDs to be protected against disk crashes (I happen to have a number of older IDE drives, close to 1TB, which are too slow to be used along with the newer ones in a ZFS setup, but just fine for backups with Git).
nested git repositories - Chris Double (2008-03-05 (Wed) 20:03:49)
The main time usage in running gibak for me is the rsync of the nested git repositories. I have quite a few very large git repositories, each containing a large build directory containing binary files I don't need to be backed up (since they are the results of a build).
I have a .gitignore for the build directory in the nested git repository, but rsync ignores this of course. So the end result is a few gigabytes of wasted disk space (9GB!) and a lot of time spent rsyncing them. Any thoughts on working around this?
Zeno Davatz 2008-03-06 (Thr) 01:44:31
I do not quite understand. Why do you not just add these directories / files to your .gitignore file?
mfp 2008-03-06 (Thr) 09:20:21
As Zeno said, you can exclude the whole directory by listing it in .gitignore. The list of repositories that will be rsync'ed can be obtained as follows:
$ find-git-repos -i | tail -3
src/ghh
src/gibak
src/git-backup
$ echo /ghh >> src/.gitignore
$ find-git-repos -i | tail -3
mess/2008/03/relational
src/gibak
src/git-backup
I'm soon making it possible to specify the files to be ignored by rsync in a .rsyncignore file at the top level of the submodule, stay tuned.
mfp 2008-03-06 (Thr) 12:43:15
Turns out rsync's -F option does what you need; just pushed the patch that includes it.
Simply add the exclusion patterns to a .rsync-filter file inside the submodule to have rsync ignore files when sync'ing the git repository + working tree, e.g.:
$ cat .rsync-filter
- *.o
- *.a
- *.cm*
- *.s
- *.omc
- *.opt
- *.annot
The pattern rules are documented in man 1 rsync.
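To see the effect outside gibak, a sketch like the following can be tried on a scratch directory. The name sync_filtered_sketch is made up, and the options gibak itself passes to rsync may differ; the only thing relied on here is that a single -F is shorthand for --filter='dir-merge /.rsync-filter'.

```shell
#!/bin/sh
# Hypothetical helper: copy directory $1 into $2, honouring any
# .rsync-filter files found along the way (that is what -F enables).
sync_filtered_sketch() {
    rsync -a -F "$1"/ "$2"/
}
```

With the .rsync-filter shown above in the source directory, object files and other build products are skipped while the sources (and the filter file itself) are copied.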
Chris Double 2008-03-06 (Thr) 14:08:40
Perfect, thanks for that!
No Title - Jonno (2008-03-06 (Thr) 00:30:44)
Why not tell rsync to ignore the files? Rsync can take a list of patterns to ignore.
mfp 2008-03-06 (Thr) 09:21:30
rsync is invoked by gibak, so he cannot control this directly. I'm implementing the .rsyncignore mechanism to solve this.
vcs-home - madduck (2008-03-06 (Thr) 01:42:48)
You might be interested in the vcs-home mailing list: http://vcs-home.madduck.net We would love to have you there to share your experience on gibak and join in other discussions.
mfp 2008-03-06 (Thr) 09:23:17
I'm joining as soon as I restore my mail setup (it was partially lost in the accident that motivated gibak), thanks for the pointer.
Zumastor - Pau Garcia i Quiles (2008-03-06 (Thr) 06:00:03)
re ZFS and Hammer, you should also try Zumastor: http://zumastor.org
It works on Linux on top of any filesystem. It works very well with hardware RAID. Currently there is a bug in trunk which makes it write very slowly if using software RAID but that will be fixed in a few days.
mac os metadata - robel (2008-03-06 (Thr) 10:00:28)
How about Mac OS metadata? Is it getting backed up?
mfp 2008-03-06 (Thr) 12:48:07
Even though ometastore's file format supports it, extended attributes aren't saved currently because I have yet to wrap the listxattr/getxattr syscalls (I didn't have any extended attrs to backup myself). I'll get there shortly.
pdumpfs - buggs (2008-03-06 (Thr) 11:22:11)
Another nice ruby tool for daily backups is pdumpfs:
http://raa.ruby-lang.org/project/pdumpfs/
help needed to compile - Jonno (2008-03-06 (Thr) 23:44:24)
I am running leopard and installed ocaml and omake via macports. The ocaml lib dir is /opt/local/lib/ocaml
What do I need to put in the OMakefile so that gcc finds those files?
Everything I have tried has not worked.
I get this error:
$ omake
- omake: reading OMakefiles
- omake: finished reading OMakefiles (0.02 sec)
- scan . scan-c-ometastore_stub.c
+ gcc -I. -MM ometastore_stub.c
ometastore_stub.c:4:27: error: caml/mlvalues.h: No such file or directory
ometastore_stub.c:5:25: error: caml/memory.h: No such file or directory
ometastore_stub.c:6:23: error: caml/fail.h: No such file or directory
- omake: 20/54 targets are up to date
- omake: failed (0.14 sec, 1/2 scans, 1/4 rules, 5/80 digests)
- omake: targets were not rebuilt because of errors:
<scanner scan-c-ometastore_stub.c>
depends on: ometastore_stub.c
Anonymous 2008-03-07 (Fri) 02:16:51
I should have RTFM ...
I added this line to the omakefile
INCLUDES += /opt/local/lib/ocaml
mfp 2008-03-07 (Fri) 09:40:26
I'm adding this as a comment in the OMakefile since I suspect more people might run into this very issue. Thanks.
.gitignore - robel (2008-03-07 (Fri) 07:02:14)
Hi! I have added "Library/" to my ~/.gitignore, but I see that subdirectories are still being backed up. Can you please help me? Thanks in advance.
mfp 2008-03-07 (Fri) 09:37:31
Make that "/Library" (nothing will match "Library/"). The gitignore rules are a bit tricky, and man 5 gitignore isn't entirely clear; there are things I only grasped by reading the sources.
A leading slash is discarded and matches the beginning of a pathname while forcing fnmatch-based shell glob matching. It thus makes the pattern "local" in some sense.
# ignores *.c in the current dir and its subdirs
*.c
# ignores *.c only in the current dir
/*.c
# ignores foo in the current dir and its subdirs
foo
# ignores foo only in the current dir
/foo
You can use find-git-files -s to get the list of files that will be included in the next snapshot, and gibak ls-new-files to get those that will be added.
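The leading-slash rule can be mimicked in a few lines of shell. This is only a sketch of the behaviour described above, not git's or ometastore's implementation: the function name ignored_sketch is invented, paths are taken relative to the directory holding the .gitignore, and patterns with a slash in the middle or negated patterns are not handled.

```shell
#!/bin/sh
# Hypothetical helper: succeed if pattern $1 would ignore path $2,
# where $2 is relative to the directory containing the .gitignore.
ignored_sketch() {
    pattern=$1 path=$2
    case $pattern in
        /*)
            # Anchored: only the entry in the current dir is compared.
            pattern=${pattern#/}
            first=${path%%/*}
            case $first in $pattern) return 0 ;; esac
            return 1 ;;
    esac
    # Unanchored: any path component may match, at any depth.
    rest=$path
    while :; do
        comp=${rest%%/*}
        case $comp in $pattern) return 0 ;; esac
        case $rest in
            */*) rest=${rest#*/} ;;
            *)   return 1 ;;
        esac
    done
}
```

With it, ignored_sketch '/*.c' sub/a.c fails while ignored_sketch '*.c' sub/a.c succeeds, matching the four cases listed above.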
Anonymous 2008-03-17 (Mon) 09:57:27
gibak ls-new-files
gives
usage: git-ls-files [-z] [-t] [-v] (--[cached|deleted|others|stage|unmerged|killed|modified])* [ --ignored ] [--exclude=<pattern>] [--exclude-from=<file>] [ --exclude-per-directory=<filename> ] [--full-name] [--abbrev] [--] [<file>]*
Use 'gibak commit' to store them
Any ideas?
$ git --version
git version 1.5.2.5
sean 2008-03-23 (Sun) 22:38:17
I noticed this as well on OSX. It appears that you have to have git version 1.5.4.x. They have changed some of the flag options.
Also, I am finding I have to change a few things in the gibak script in order to get it to work on OSX.
I had to comment out the section that calls getent since it doesn't exist on OSX. Also I had to put single quotes around the pathnames of the rsync section at line 84 in order to handle paths with spaces in the name.
There also seems to be a problem where something makes a call to time with the -r option, which doesn't exist in OSX either. When I do find it, I will likely replace it with a call to a Perl or Ruby statement.
sean 2008-03-24 (Mon) 00:49:25
Sorry, this is what I really needed to change in order to handle submodules with spaces in the path: I had to add the following line to the top of the __handle_git_repositories method, right around line 75:
IFS=" "
And I changed line 239 to read:
git-commit $modifier -m"Committed on $( date +"%a, %d %b %Y %H:%M:%S %z" )" -- $@
sean 2008-03-24 (Mon) 00:53:26
And changed line 285 to:
local removal_date=$( date +"%a, %d %b %Y %H:%M:%S %z" )
mfp 2008-03-27 (Thr) 05:43:35
gibak has been developed against git 1.5.4.2. The problem with git-ls-files seems solvable (I just have to use --modified, --other... instead of -m, -o, etc.), but git's interface keeps changing across versions and it'll be hard to ensure that gibak works with all of them.
Sean: thank you, I'll commit your patches and proceed to test on OSX (I'm also adding support for extended attributes; the code is written and I only need to update the build system).
sean 2008-03-27 (Thr) 08:58:07
It's great to hear that you will be looking into testing on OSX a bit more. I'm still having trouble with commits. The last problem seems to be with ometastore. It always fails at the end for me, usually giving very cryptic messages such as
Fatal error: exception not found
I'd love to get this fully working on OSX, because TimeMachine is just not cutting it for me. ;) I really like the idea of only storing binary deltas, much more efficient. And the fact that your script removes ref to any objects that are added to .gitignore is very beneficial.
mfp 2008-03-28 (Fri) 13:38:53
You can get the backtrace for the Not_found exception by setting the OCAMLOPTFLAGS in OMakefile to
OCAMLOPTFLAGS += -g -inline 10 -S
rebuilding, and setting the OCAMLRUNPARAM environment variable to "b" before you run gibak/ometastore.
I'm trying to reproduce the problem on OSX.
Jeremy Rayman 2008-04-08 (Tue) 01:28:34
Awesome thing you've built in gibak, thank you! I hit the same thing as the OP, here's the trace from those directions:
% ometastore -x -s -i --sort : /home/jlr
Fatal error: exception Not_found
Raised at file "ometastore.ml", line 37, characters 38-52
Called from file "util.ml", line 9, characters 51-54
Called from file "ometastore.ml", line 64, characters 20-44
Called from file "folddir.ml", line 51, characters 30-43
Called from file "list.ml", line 69, characters 12-15
Called from file "util.ml", line 5, characters 14-17
Re-raised at file "util.ml", line 5, characters 62-63
Called from file "folddir.ml", line 55, characters 39-133
Called from file "list.ml", line 69, characters 12-15
Called from file "util.ml", line 5, characters 14-17
Re-raised at file "util.ml", line 5, characters 62-63
Called from file "folddir.ml", line 55, characters 39-133
Called from file "list.ml", line 69, characters 12-15
Called from file "util.ml", line 5, characters 14-17
Re-raised at file "util.ml", line 5, characters 62-63
Called from file "folddir.ml", line 55, characters 39-133
Called from file "list.ml", line 69, characters 12-15
Called from file "util.ml", line 5, characters 14-17
Re-raised at file "util.ml", line 5, characters 62-63
Called from file "folddir.ml", line 55, characters 39-133
Called from file "list.ml", line 69, characters 12-15
Called from file "util.ml", line 5, characters 14-17
Re-raised at file "util.ml", line 5, characters 62-63
Called from file "ometastore.ml", line 71, characters 16-64
Called from file "ometastore.ml", line 303, characters 58-78
Called from file "ometastore.ml", line 314, characters 9-16
[1] 88157 exit 2 ometastore -x -s -i --sort
ometastore.ml line 37 is (as of Apr 7):
let group_name = memoized (fun gid -> (getgrgid gid).gr_name)
the error (characters 38-52) on that line comes from (getgrgid gid)
getgrgid raises Not_found when the group ID it's looking up doesn't exist. Well, it turns out that on my system I've done a bunch of 'cp -rpv'ing from various other computers, preserving the wrong group IDs. In order to fix the group IDs, I did:
chgrp -hR myusername ~/
That fixed all the group IDs in my home dir and ometastore completed successfully.
mfp 2008-04-25 (Fri) 11:11:00
Thank you! I've just pushed a commit that fixes this (it tries to use the name associated to the current gid or $USER if everything else fails).
git errors - das16 (2008-03-07 (Fri) 21:32:27)
I'm a git newbie, so maybe I'm doing something wrong. When I run
git clone http://eigenclass.org/repos/git/gibak/.git/
it thinks for a while and comes back with
walk eae04a9f68c259ef76332c5d6f2b5a914ba4a886
Getting alternates list for http://eigenclass.org/repos/git/gibak/.git//
Getting pack list for http://eigenclass.org/repos/git/gibak/.git//
error: Unable to find 0000000000000000000000000000000000000000 under http://eigenclass.org/repos/git/gibak/.git//
Cannot obtain needed object 0000000000000000000000000000000000000000
Any ideas as to what's wrong?
mfp 2008-03-08 (Sat) 04:14:01
It works for me (git version 1.5.4.2) and AFAIK for the few hundred people who have cloned the repos. What's your git --version? If upgrading git doesn't help, you can download the files manually from http://eigenclass.org/repos/git/gibak/ .
Venti. - Vladimir Sedach (2008-03-10 (Mon) 13:06:10)
Because of the way git works (immutable tree of blobs identified by SHA digests), this scheme is actually very similar to Plan9's Venti backup filesystem (http://plan9.bell-labs.com/sys/doc/venti/venti.html).
Virtual Machines - Stu (2008-03-11 (Tue) 06:23:25)
This sounds interesting; would it be feasible to use it to keep vmware disc images in, or would this be really slow?
(Really looking for a way to version control virtual machines, so I can pull the differences down over the network)
mfp 2008-03-17 (Mon) 13:21:23
git uses binary deltas so it should be able to version disc images quite efficiently. I haven't tested it myself, but if xdelta works well, git should also. You'll probably need to set pack.windowMemory in order to limit memory consumption when performing delta compression.
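Setting that option is a one-liner; the sketch below wraps it in a function only so that the repository and the limit are explicit. The name limit_pack_memory_sketch and the 256m default are assumptions made for this example, not recommended values.

```shell
#!/bin/sh
# Hypothetical helper: cap the memory git may use for its delta window
# in repository $1. The 256m default is an arbitrary example value.
limit_pack_memory_sketch() {
    git -C "$1" config pack.windowMemory "${2:-256m}"
}
```

The setting is stored in the repository's own configuration, so it applies to later repacking of that repository only.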
File "going away" kills it - Bruce Dillahunty (2008-03-24 (Mon) 15:34:20)
I was running a backup on a large directory and while it was running, one of the files was deleted.
fatal: .XXXXXXXXXXXXXXXX: unable to stat (No such file or directory)
Could not complete addition of files to history store!
Any ideas (other than don't let that happen)?
mfp 2008-03-27 (Thr) 04:44:13
It seems to be dying here:
git-add -v . \
|| die "Could not complete addition of files to history store!"
I believe git-add should work OK if you run gibak commit again. If it doesn't, run git reset and try again. I don't see any reason why git-add would fail, but if it does we'll have to delve into git's sources.
sean 2008-03-27 (Thr) 09:07:39
I've seen where the add stage of gibak can fail, particularly if you are starting your backup with a large number of files. What seems to happen is that sometime between ls-new-files and when the actual file is added, that file is either moved or removed. This most certainly happens when I have Mail open, which frequently moves and deletes new mail files that arrive.
mfp 2008-03-31 (Mon) 04:41:30
What I meant is that I don't see any reason for git-add to fail the second time you run it (assuming that nothing's messing with the files while it executes).
When you commit, gibak doesn't build a list with ls-new-files; it just runs git-add . instead, which is what was failing. Here's my guess of what is happening:
- git-add opens the directory and iterates over its contents (readdir), appending to the list of files to be added (it doesn't have to use stat(2) on Linux at this point, since there's a non-standard extension that allows you to know whether an entry returned by readdir(3) is a directory).
- it then processes the list: for each item, stat(2) it to get the perms, open(2) it and add it to the repository.
git-add could ignore errors in the second step (files moved or deleted in between), but aborting is perfectly reasonable.
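Since re-running the commit is the suggested workaround, a small retry wrapper is one way to automate it from cron. This is only a sketch: the retry function is a made-up helper, not part of gibak, and it assumes that a failed run can simply be repeated.

```bash
# Hypothetical helper: run a command up to N times until it succeeds.
retry() {
    local attempts=$1 i=1
    shift
    while [ "$i" -le "$attempts" ]; do
        # "$@" re-runs the command with its arguments intact.
        "$@" && return 0
        echo "attempt $i of $attempts failed: $*" >&2
        i=$(( i + 1 ))
    done
    return 1
}
```

Usage would look like "retry 3 gibak commit". If a file keeps vanishing on every pass (a busy mail spool, say), the retries will not help and the directory is probably better off in .gitignore.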
Size of backups over time - Tim (2008-03-26 (Wed) 12:24:41)
Normally when you back up data, you only want to keep your last 30-90 days of incrementals. I'm not too familiar with git, but by using git, wouldn't your backup repo (on both the live machine and the backup machine) continually grow?
I'm really interested in gibak, but I have 2 problems:
1. The repo will continually grow even if the data stays the same.
2. The growing repo is not only on the remote backup machine, but also on the local machine.
If your data was, say, 50GB, after a year of changes to that data, your repo+data on both the live machine and remote backup machine could now be up to 200GB, while your data is still only 50GB.
Is there a way to combat this? Such as trimming the git repo from X days ago and beyond?
tim-remove-this-[at)-simulat-dot-com
mfp 2008-03-27 (Thr) 05:17:54
You can use git's history rewriting capabilities (git rebase -i) to coalesce commits. For instance, you can combine one month's worth of (incremental) backups into a single commit, multiple 1-month commits into semestral commits, and so on. If you want to remove a large file from the history, you can delete it, save that change with "git commit --no-verify" (the file deletion should be the only thing in that commit; --no-verify is used to keep the .ometastore file unchanged), and then use git rebase -i to squash that commit with the one where the file was added (be careful when rebasing if you have installed gibak in the directory being saved...). (Another option is the "rm-older-than" command, inherited from git-home-history, which removes older data, but it is untested and might eat your data.)
If you have cloned the repository remotely, you'll have to either perform the same history rewriting (with exactly the same commit messages) or delete/reset the clone to the last commit before the rewritten part and re-clone/pull to it.
As for the second issue, the solution is twofold:
- you can use a "bare" remote repository, with no tree checkout, by cloning the repository with "git clone --bare". In the example you gave, this would reduce disk usage to 150GB.
- instead of using a cloned remote repository, you can mount a remote disk in ~/.git, thereby using 50GB in your local disk for the latest data plus ~50GB for the .git directory stored in another disk (actually, this might be less, thanks to compression and the inherent duplicate file coalescing, especially if you use history rewriting to remove intermediate steps) for a total of 100GB.
If you store your .git directory in another disk and the latter fails, you'll only keep the last version of your data (in the local disk). This is also what would happen if you were using traditional incremental backups (using rsync or some similar system) with a single copy. If you use a clone, more space will be needed but in exchange you'll be able to recover any past version of your data no matter which medium fails. If you want extra safety, you can use multiple (bare) clones in remote sites.
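The disk-usage figures above can be checked with a bit of shell arithmetic. The numbers are the ones from Tim's example (50GB of data, and the assumption that the history also takes about 50GB); the variable names are only for illustration.

```bash
# All sizes in GB, taken from the example in this thread.
data=50      # checked-out files (the working tree)
history=50   # .git directory after a year of changes (assumed)

# Live machine plus a remote clone that also has a checkout.
full_clone=$(( (data + history) + (data + history) ))
# Live machine plus a bare remote clone (history only).
bare_clone=$(( (data + history) + history ))
# Working tree on the local disk, .git mounted from another disk.
remote_git=$(( data + history ))

echo "full clone: ${full_clone}GB"
echo "bare clone: ${bare_clone}GB"
echo "remote .git: ${remote_git}GB"
```

This prints 200GB, 150GB and 100GB respectively, matching the figures quoted above.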
.gitignore - stat (2008-04-03 (Thr) 05:07:56)
Hi! I understand that my question is not directly related to gibak, but perhaps you can still help me. In the root of my home directory I have a directory called "bin". I want to exclude it, but still keep a nested directory, "bin/test".
I added this to .gitignore:
/bin
/bin/test
But git ignores it altogether.
Can you help me with my problem, please?
Thanks.
Bruce Dillahunty 2008-04-17 (Thr) 09:32:17
Can't you put "!" in front of the /bin/test and get it to include that? I'm not sure how that works with nested directories, but I thought that was the syntax for negating an ignore for special cases.
mfp 2008-04-25 (Fri) 10:33:05
If you exclude /bin at the top level, git will not scan it at all, and thus skip /bin/test too.
The way to achieve what you want is to create a .gitignore file under bin/:
/*
!/test
This excludes everything in bin/ except test.
other directories than ~ - skolima (2008-05-06 (Tue) 14:59:36)
How can I use gibak to version directories other than ~? I'd like to cron 'gibak commit' to version all users' home directories.
kentling 2008-05-27 (Tue) 15:34:25
Anyone? I would like this, too. I have too much stuff in my home directory, not all of which I want backed up.
sMark 2008-07-06 (Sun) 16:55:18
Seems like .gitignore can address a lot of the "too much stuff in the home directory" issue.
GUI from git-home-history - matt (2008-05-16 (Fri) 15:23:27)
Have you seen the "newest" versions of git-home-history at http://jean-francois.richard.name/ghh/? They are not completely new - the newest commit happened in December 2007.
Anyway, the reason I mention this is the addition of a nice little GUI to git-home-history that wasn't included from the beginning and I therefore haven't noticed before. I especially like the restore dialog. It's written in Python and uses GTK2. It'd be easy to port it to gibak: basically s/ghh/gibak the way I see it. I'll look into it...
Multi-level snapshots with git-rebase - matt (2008-05-16 (Fri) 16:25:06)
I was playing around with git rebase for a little while. However I couldn't get the desired results of squashing several commits into one to reduce repo-size. I was basically trying to emulate some multi-level snapshot behaviour (hourly, daily, monthly, etc. snapshots) in order to eventually integrate it in cron scripts.
Here is what I did:
Initialize new git repository
git init
Create first file and commit it
dd if=/dev/urandom of=file1 bs=10k count=1024
git add file1
git commit -m "add file1"
Create second file, commit it, delete it and commit again
dd if=/dev/urandom of=file2 bs=10k count=1024
git add file2
git commit -m "add file2"
rm file2
git commit -a -m "delete file2"
By now the repo-size should be around 20MB and the working dir 10MB. Let's rebase the last commit onto the first commit to get rid of the middle commit. (Replace pick with squash in second commit.)
git-rebase -i HEAD^^
git-gc --prune
After garbage collection I was expecting to see the repo-size shrink to 10MB because now file2 was added and deleted in one single combined commit. However the repo stayed at 20MB size... Control case: Adding and subsequently deleting a file and then committing the whole thing gives the desired effect of actually not including said file in the repo. This must mean that my rebasing was not equivalent to a single commit that adds and deletes the same file.
Similarly I modified file2 instead of deleting it. By combining the two last commits (create and modify file2) I expected to delete the point in history where file2 is created but not modified yet. However the repo size doesn't shrink either in this case.
Therefore the way I was rebasing would reduce the number of commits but not the size of the repository. It's not removing any data from the history and thus defeats the purpose of using rebasing to create multi-level snapshots. Anybody see what I'm missing?
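One possible explanation, not confirmed by anyone in this thread: after a rewrite, the old commits are no longer on the branch but are still referenced from the reflog, so garbage collection keeps their objects until those entries expire. The sketch below reproduces the experiment on a smaller scale in a scratch directory. The helper names, file sizes and identity settings are placeholders, and "git reset --hard" stands in for the interactive squash (the add/delete pair cancels out, so the result is the same tree).

```bash
# Scratch repository; nothing here touches an existing backup.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Placeholder identity so the demo does not depend on the global git config.
commit() { git -c user.name=demo -c user.email=demo@example.invalid commit -q "$@"; }
# Size of all packs in KiB, as reported by git count-objects.
pack_kib() { git count-objects -v | awk '/^size-pack:/ { print $2 }'; }

head -c 262144 /dev/urandom > file1 && git add file1 && commit -m "add file1"
head -c 262144 /dev/urandom > file2 && git add file2 && commit -m "add file2"
git rm -q file2 && commit -m "delete file2"

# Stand-in for squashing "add file2" and "delete file2" into nothing.
git reset -q --hard HEAD~2

# A plain gc keeps file2, because the reflog still points at the old commits.
git gc -q
before=$(pack_kib)

# Expire the reflog entries, then prune objects that are now unreachable.
git reflog expire --expire=now --all
git gc -q --prune=now
after=$(pack_kib)

echo "pack size: ${before} KiB before, ${after} KiB after"
```

On a real backup repository, expiring the reflog also throws away the safety net for undoing the rewrite, so it is worth doing only once the rewritten history has been checked.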
Is there a manpage for gibak? - till (2008-05-17 (Sat) 11:08:03)
gibak --help displays at the end of the output:
see 'man gibak' for more information
But I cannot find any manpage for gibak in the tarball.
it's the 1940s all over again - Paul Phillips (2008-05-27 (Tue) 12:22:39)
Trying to clone my home dir between two OSX 10.5.2 systems. Running ometastore on the clone sends all the timestamps back into the 1940s. I think this must involve an unsigned int being treated as a signed one, since we're somehow before 1970... interestingly, on reboot all files were reset to Dec 31 1969.
I hacked around ometastore a little bit trying to fix it myself, but I don't know OCaml and am rusty in C, so it wasn't very fruitful. I did notice that utime(3) is declared obsoleted on OSX by utimes(2).
Another bug: ometastore chokes and quits with error "utime" when trying to restore permissions if it encounters a symbolic link that points to a non-existent file.
more on ometastore - Paul Phillips (2008-05-27 (Tue) 13:28:54)
Is the ometastore_stub.c file in the repository generated from an IDL file? This is all foreign to me but I found this statement here.
Special attention must be paid to integer fields or variables. By default, integer IDL types are mapped to the Caml type int, which is convenient to use in Caml code, but loses one bit when converting from a C long integer, and may lose one bit (on 32-bit platforms) when converting from a C int integer. When the range of values represented by the C integer is small enough, this loss is acceptable. Otherwise, you should use the attributes nativeint, int32 or int64 so that integer IDL types are mapped to one of the Caml boxed integer types. (We recommend that you use int32 or int64 for integers that are specified as being exactly 32 bit wide or 64 bit wide, and nativeint for unspecified int or long integers.)
Losing one bit sounds like a real problem when it comes to seconds since 1970, since 1970 + (2038-1970)/2 = 2004 < 2008.
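The arithmetic can be sketched in shell. This assumes the timestamp passes through a 31-bit signed integer (OCaml's native int on a 32-bit build, as the quoted passage describes); whether that is what actually happens inside ometastore is not confirmed here. The function names and the sample timestamp are made up for the illustration.

```bash
# Assumption: the mtime is squeezed into a 31-bit signed integer.
wrap31() {
    # Reduce a non-negative value to the signed range [-2^30, 2^30 - 1].
    echo $(( ( ($1 + (1 << 30)) % (1 << 31) ) - (1 << 30) ))
}

approx_year() {
    # Floor-divide seconds since the epoch by the mean Gregorian year.
    local secs=$1 year_len=31556952
    local q=$(( secs / year_len ))
    if [ $(( secs % year_len )) -lt 0 ]; then q=$(( q - 1 )); fi
    echo $(( 1970 + q ))
}

now=1211900000   # a timestamp from late May 2008
wrapped=$(wrap31 "$now")
echo "largest 31-bit timestamp falls in $(approx_year $(( (1 << 30) - 1 )))"
echo "$now wraps to $wrapped, i.e. around $(approx_year "$wrapped")"
```

Under that assumption the largest representable time falls in 2004, and a 2008 timestamp wraps to a negative value around 1940, which would be consistent with the "1940s" symptom reported above.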
Backup other directories - jimk (2008-06-02 (Mon) 11:44:36)
I would like to use gibak to backup the /etc directory and possibly others. gibak appears to be hardcoded to backup only the home directory. (This may be copied from git-home-history.) Is there any way around that?
Would it be safe to change this line in the gibak script?
cd ~
Otherwise, would creating symlinks in my home directory work?
Merge of ometastore data - Jörg Sommer (2008-06-08 (Sun) 16:11:37)
Hi,
how do you do merges of branches? What happens with the metastore data? Metastore saves the data as binary, which makes it impossible for git to help. The script setgitperms.perl shipped with the git source saves its data in a text file, so git can do the merge for you.
Does ometastore have an ASCII format?
Clone, metadata - sMark (2008-07-06 (Sun) 17:00:49)
Hi, I'm testing out gibak on my machine (OSX 10.5.4; git 1.5.6; ocaml 3.10.2; omake 0.9.8.5).
I just used git clone to copy my .git (which is on an external hdd) to another external hdd, however the working copy that it built there had the time stamps all from the time of copy -- how do I actually restore the info in .ometastore?
Also:
I had the mmap error reported above, it appears to be an issue with git and large files, adding more media to my .gitignore caused the problem to go away.
I also ran into the ometastore getxattr bug -- running off of the latest checkout seemed to fix it.
sMark 2008-07-06 (Sun) 22:01:05
Hi, refining my issue; I now see how hooks/post-* is used to run ometastore. The issue is that when I run:
ometastore -v -x -a -i
I receive:
Fatal error: exception Failure("float_of_string")
fascinating idea - rarecacuts (2008-07-07 (Mon) 01:20:28)
Fascinating idea-- storing your whole home directory in a version control system.
I see a lot of comments to the effect that "gibak storage requirements will only grow over time; isn't that a serious problem?" I think what these people are forgetting is that disk storage capacities are still doubling every few months. As long as your home directory doesn't beat Moore's law, you're fine :)
run as root - pabloe (2008-07-09 (Wed) 07:27:31)
Is it dangerous to run gibak as root? I want to back up dirs like /etc or /var.
pablo
function ...() { } in gibak - danko (2008-07-10 (Thr) 04:54:40)
either function <name> {...} or <name>() {...}, NOT both: http://wooledge.org:8000/BashPitfalls#head-e3cb700b456b5523f7eaebd64454dbcdc7d16879
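For reference, a minimal sketch of the two forms side by side (the function names are invented for the example). As far as I know bash itself accepts the combined "function name() {...}" spelling too, but it is not portable to other shells, which is presumably the point of the linked page.

```bash
# Form 1: POSIX style, understood by every sh-compatible shell.
greet() {
    echo "hello, $1"
}

# Form 2: the "function" keyword (a bash/ksh extension), without parentheses.
function farewell {
    echo "goodbye, $1"
}

greet world      # prints: hello, world
farewell world   # prints: goodbye, world
```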
Bash Scripting Bugs. - lhunath (2008-07-24 (Thr) 00:44:37)
OK, I just looked over your gibak bash script; and here's a few things I noticed that you should fix:
1. Don't use which. Use type -P. type is a bash builtin (no bash forking and external process invoking), and lots of distros (or other *NIX based OSes, or stupid admins) like to mess with which to make it output some "extras". Additionally; which does not set a reliable exit status if the command was not found; type does. This can be invaluable (and I see you actually do rely on which's exit status being set!).
2. Almost additional to point 1, don't use external processes for things Bash can do easily internally. Such are: basename (use "${var##*/}"), pwd (use "$PWD"), hostname (use "$HOSTNAME").
3. Now; we *are* using bash here, right? So.. stop using sh-isms. Putting #!/bin/bash at the top of the script breaks any sh portability; obviously, so stop pretending to be sh compatible inside your code. Use the safer and more powerful [[ to do your tests, not the crappy and unsafe (when not using proper quotes) test or [. [[ even provides you with glob pattern matching which you will undoubtedly be able to use. See http://wooledge.org:8000/BashFAQ#head-b1c292ce2f4fdfa6a7b5389da1892d65b6f37cda
4. In some parts of the code you seem very good at quoting (too good, even); then in other parts you miss quotes. I even see you quoting things that don't need quotes and then leaving out quotes in places that ABSOLUTELY need quotes. It's like you tried to sabotage yourself. In any case; to make things perfectly plain:
Assignments don't need quotes unless you're going to have to put whitespace in there literally:
foo=$1 # safe.
foo=$1 $2 # obviously not safe; sets foo to $1 and runs the command that is the FIRST WORD in $2.
foo="$1 $2" # good.
Inside [[ you'll also not need quotes. That's because [[ is a bash keyword and bash does special parsing in there. You needn't worry about word splitting:
[[ $1 = foobar$2 ]] # safe.
test -e $1 # NOT safe! Tests whether the first word in $1 denotes an existing file. All other words are considered additional test expressions!
For the rest; quote EVERY parameter expansion!
mkdir $1 # NOT safe! Makes a directory for every word in $1.
mkdir "$1" # safe.
echo foo > $log # NOT safe! Echos the word foo followed by *all but the first word* from $log to the file denoted by the first word in $log. (Either that or errors with "ambiguous redirect" if $log contains multiple words).
echo foo > "$log" # safe.
ALSO $@ needs quotes!
rm $@ # NOT safe! Deletes every file denoted by every word in each of the positional parameters. I actually saw you do this, not a good idea; At All.
rm "$@" # safe.
Assigning $@ to a string is also a bug.
foo=$@ # Assigns the ARRAY $@ to the STRING foo. No-no.
foo=("$@") # Makes an array foo with the elements in the array $@.
Also remember to quote other expansions, like command substitution:
test $(git-stuff) # broken.
[[ $(git-stuff) ]] # good.
test "$(git-stuff)" # good, too, though you really should be using [[.
Remember that users' homedirs contain files with spaces far more often than general UNIX directories. And users' homedirs is not something you want to be sloppy on file management with! Especially in a backup solution! So remember to quote properly.
5. I always find it odd when people use if's to test $?. Just use control operators:
Silly:
command
if test $? = 0; then echo success; fi
Good:
if command; then echo success; fi
Also Good:
command && echo success || die "Eek! That was bad."
See http://wooledge.org:8000/BashGuide/TheBasics/TestsAndConditionals
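To make the points above concrete, here is a short sketch that puts them together. None of these helper names come from the gibak script; they are invented for the illustration, and the script assumes bash.

```bash
#!/bin/bash
# Illustrative helpers only; they are not taken from gibak.

# Point 1: type -P prints the path of an external command and returns
# non-zero when there is none, so it can be used directly as a test.
have() { type -P "$1" >/dev/null; }

# Point 2: parameter expansion instead of forking basename(1).
base_of() { printf '%s\n' "${1##*/}"; }

# Point 3: [[ does glob matching and never word-splits its operands.
is_dotfile() { [[ ${1##*/} == .* ]]; }

# Point 4: "$@" keeps every argument intact; an array preserves them.
count_args() { local args=("$@"); printf '%s\n' "${#args[@]}"; }

# Point 5: branch on the command itself instead of inspecting $?.
if have sh; then
    echo "sh found at $(type -P sh)"
else
    echo "sh not found" >&2
fi

base_of "/home/user/My Documents/notes.txt"   # prints: notes.txt
count_args "one file" "another file"          # prints: 2
```

One caveat: unlike basename, "${var##*/}" yields an empty string for a path that ends in a slash, so it is a drop-in replacement only when the input is known not to have one.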
For all the comments, I've also seen some pretty good bash scripting. So I'll definitely be using this once you've fixed those bugs (and some of them definitely are). Also, remember that should you ever need any help, have a question, or need some advice on bash scripting; feel free to join #bash on irc.freenode.org and ask away. I'm normally always there as "lhunath".
Keep up the good work! I've been dying for a solution like the one you've created.
A somewhat different approach to ignoring files. - lhunath (2008-07-24 (Thr) 00:49:21)
I've been meaning to use a somewhat different approach to ignoring files in my homedir.
Instead of using .gitignore as an exclusion filter; I'd actually like to exclude all files except those I manually add to my repository.
I can do this in two ways; the first, and cleanest, is by setting git-config status.showUntrackedFiles no
Then I can just add whatever files I want to include and git-status isn't going to show me any of the other files.
Unfortunately, gibak commit DOES add them. Very unfortunate indeed. On to the second option.
Adding '*' to .gitignore and then unignoring each file manually by adding it to .gitignore and prefixing it with a !. This includes all directories leading up to those files.
This seems to work so far, but it's very messy. I'd love for the first option to be feasible. Would you mind terribly looking into this?
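The second option is tedious mostly because every leading directory has to be un-ignored by hand. A small generator could take care of that. The sketch below is an assumption about how one might do it, not something gibak provides: emit_whitelist is a made-up name, and it does not escape glob characters or other special characters in file names.

```bash
# Hypothetical helper: print a whitelist-style .gitignore for the given
# paths (relative to the repository root).
emit_whitelist() {
    local path dir
    # Ignore everything by default.
    printf '%s\n' '*'
    for path in "$@"; do
        dir=
        # Un-ignore each leading directory: a/b/c.txt -> a/, then a/b/
        while [[ $path == */* ]]; do
            dir+="${path%%/*}/"
            path=${path#*/}
            printf '!/%s\n' "$dir"
        done
        # Finally un-ignore the file itself.
        printf '!/%s\n' "$dir$path"
    done
}

emit_whitelist .bashrc projects/notes/todo.txt
```

For the two example paths this prints "*", "!/.bashrc", "!/projects/", "!/projects/notes/" and "!/projects/notes/todo.txt", one per line. Redirecting the output to ~/.gitignore would replace whatever is there, so it is safer to inspect it first.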
Branching? - Adam (2008-07-28 (Mon) 07:06:02)
Great application you've written there, I was wondering whether it's possible to checkout new branches when using gibak? I ask because I already maintain an rsync'd home directory between my different computers (for ease and backup) and hoped that I might be able to create branches per machine (e.g. laptop/desktop branches)
Keyword(s):[blog] [frontpage] [git] [backup] [ocaml]
References:[gibak 0.3.0 (backup tool using Git): OSX support, extended attributes, bugfixes]