Squeezing More Onto a Backup Hard Drive

I've got a backup hard drive that is running out of room, but I don't want to replace it. I have a more interesting plan.

Given a file system tree, I know that there are identical copies of the same files scattered throughout it. Since the tree represents a backup, the files are not going to change. If I could take groups of identical files and hard link them together, I could save a lot of disk space.

Here's my solution: I've taken the Socorro processor and replaced its iterator, its transform function, its source, and its destination (a rough Python sketch follows the list):
  1. the iterator is now a file system walker that just yields the full pathname of each file in the target file system.
  2. the transform function now takes a pathname, uses os.stat to fetch the file's inode, then starts a subprocess to run md5sum on the file to produce a checksum.
  3. the destination is now a table (called "files") in a Postgres database of this form:
  • pathname text,
  • inode text,
  • checksum text
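
For the curious, here is a minimal sketch of what the iterator and transform boil down to, with all of the Socorro plumbing stripped away. The names walk_files and transform are mine, the target path is a placeholder, and the print stands in for the insert into the "files" table:

    import os
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def walk_files(top):
        # the iterator: yield the full pathname of every file under 'top'
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                yield os.path.join(dirpath, name)

    def transform(pathname):
        # the transform: fetch the inode with os.stat, then shell out to md5sum
        inode = os.stat(pathname).st_ino
        md5sum_output = subprocess.run(
            ["md5sum", pathname], capture_output=True, text=True, check=True
        ).stdout
        checksum = md5sum_output.split()[0]
        return pathname, str(inode), checksum

    # sixteen worker threads; each result would become one row in the
    # Postgres "files" table
    with ThreadPoolExecutor(max_workers=16) as pool:
        for row in pool.map(transform, walk_files("/media/backup001")):
            print(row)  # stand-in for the database insert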

Running with 16 threads, the program chewed through the four-terabyte file system in about seven hours, computing an md5sum for every file and saving that information in the Postgres table. Then I ran this query:

with 
A as (select checksum, inode, count(inode) as n from files 
      group by inode, checksum order by 3 desc),
B as (select checksum, max(n) as m from A group by checksum),
C as (select b.checksum, a.inode from B b join A a 
      on b.checksum = a.checksum and b.m = a.n),
D as (select f.pathname, c.inode from files f join C c 
      on f.checksum = c.checksum and c.inode <> f.inode)
select d.pathname, max(f.pathname) from D d join files f 
      on d.inode = f.inode group by d.pathname;


The output of this query is a series of rows, each a pair of file names:

/media/backup001/2010.08.01/images/img_2938.jpg
/media/backup001/2012.05.04/blog/photos/palouse-falls.jpg


Each pair of file names is then used, in effect, to run this shell command:

    ln -f file2 file1


This replaces each redundant file (file1, the first column) with a hard link to the surviving copy (file2, the second column).
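
Here is a minimal Python sketch of that replacement step, assuming the query results have been fetched as (redundant, keeper) pairs of pathnames; the function name and temporary suffix are placeholders of mine. A plain os.link refuses to overwrite an existing file, so the sketch links to a temporary name first and then renames it over the redundant copy:

    import os

    def replace_with_hardlink(redundant, keeper):
        # give the keeper's inode a new, temporary name in the same directory...
        tmp = redundant + ".hardlink-tmp"
        os.link(keeper, tmp)
        # ...then rename it over the redundant copy
        os.replace(tmp, redundant)

    # one pair, taken from the sample output above
    pairs = [
        ("/media/backup001/2010.08.01/images/img_2938.jpg",
         "/media/backup001/2012.05.04/blog/photos/palouse-falls.jpg"),
    ]
    for redundant, keeper in pairs:
        replace_with_hardlink(redundant, keeper)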

My nearly full 4TB hard drive now has 1.5TB free!

Let's go back and re-examine that query. It may not be the most efficient or elegant query, but it was fun to write. I hope it does what I think it does. It uses four common table expressions, an alternative to subqueries that generally makes SQL easier to write and understand.

A) This first query goes through the "files" table, grouping by inode and checksum to identify all the files that are already hard linked together and to count how many files participate in each hard-link group.

B) This query then identifies, for each checksum, the size of the largest hard-link group.

C) This query identifies the inode of each hard-link group whose number of participants is the maximum for its checksum.

D) This query identifies the pathname of every file that does NOT participate in the hard-link group with the maximum number of participants. The inode in the select list is the inode of that largest group. In other words, the output of this query is the set of files that I want to turn into hard links to the listed inode. I don't have a way of making a hard link by specifying an inode, so I must now find a file name associated with the target inode:

E) (unnamed) The final query takes that inode and finds a pathname that uses it. The final output is, for each redundant file, the pair of pathnames that I need to hard link together.
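
To convince myself, here's a tiny hypothetical example. Suppose the "files" table holds four rows: /a/1.jpg and /b/1.jpg already share inode 100 and have checksum abc, /c/1.jpg is an identical copy sitting on its own inode 200, and /d/2.txt is a unique file on inode 300 with checksum def. Then A yields (abc, 100, 2), (abc, 200, 1), and (def, 300, 1); B keeps the largest count per checksum, (abc, 2) and (def, 1); C resolves those back to inodes, (abc, 100) and (def, 300); D picks out the files whose checksum matches but whose inode does not, which is just (/c/1.jpg, 100); and the final query maps inode 100 back to a pathname, producing the pair (/c/1.jpg, /b/1.jpg). That is exactly the redundant file I want to replace with a hard link.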

Are you a DBA?  How could I improve my query?  Does it really do what I think it does?