Finding unique files within Backups.backupdb

I'm writing a program that processes data in Time Machine's Backups.backupdb folder. The important thing is that there's no reason for me to process the same file twice, and Time Machine stores multiple links to the same files, one in every single incremental backup.

I can test whether a file has already been processed by recording the inode numbers of every file and comparing new files against the list (I use an NSMutableSet for this task). The problem is that even just going through the directories, without doing any significant processing of the duplicate links, takes an obscene amount of time.
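
Roughly, the check is something along these lines (a simplified sketch of what I'm doing; error handling omitted):
Code:
#import <Foundation/Foundation.h>
#include <sys/stat.h>

// Record each file's inode in an NSMutableSet and skip paths whose
// inode has already been seen (inodes are unique per volume).
static BOOL alreadyProcessed(NSString *path, NSMutableSet *seenInodes)
{
    struct stat sb;
    if (lstat([path fileSystemRepresentation], &sb) != 0)
        return YES; // couldn't stat it; skip it

    NSNumber *inode = [NSNumber numberWithUnsignedLongLong:(unsigned long long)sb.st_ino];
    if ([seenInodes containsObject:inode])
        return YES;

    [seenInodes addObject:inode];
    return NO;
}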

I estimate that there are about 20 million files (and 3.5 million folders) in my Backups.backupdb folder, but only about 500,000 unique files. Going through all those duplicates is just not reasonable; it would literally take all day to process the folder.

Is there any FAST way to ignore duplicate links? Maybe by accessing files directly by inode number instead of path? Is there any way to get a list of all paths pointing to a given inode, or of all inodes within a given directory?

I can't think of any way to get this done. Any ideas?
 
Each unique file appears in the backup as a real file once and only once; every other occurrence of it is a link.

So, instead of processing all files, simply process real files, and ignore links altogether (done with a simple UNIX 'file' test). Go through the backup reverse-chronologically, and you're guaranteed to always process the newest version of a file first.

Edit: Hmmm... I just thought about my algorithm, and it may be severely flawed... tee hee... if you process only the real files and skip the links, you still end up having to touch every entry in the backup just to tell which is which... not very efficient.

With the UNIX 'find' command, you can use the -f switch to ignore symbolic links. You could run the UNIX 'find' command once on your backup drive at the beginning of the program, printing out all paths to all real files. You could then parse this list of files and do your processing.
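
Or, if it's easier to do the check in your own code than to go through 'find', a simple lstat test should catch symbolic links (just a sketch, error handling omitted):
Code:
#import <Foundation/Foundation.h>
#include <sys/stat.h>

// True if the path itself is a symbolic link. lstat examines the
// link rather than whatever it points to.
static BOOL isSymbolicLink(const char *path)
{
    struct stat sb;
    if (lstat(path, &sb) != 0)
        return NO;
    return S_ISLNK(sb.st_mode) ? YES : NO;
}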
 
Thanks for the ideas. I actually do want to process all files, including old versions, so that's not a problem. That's the whole point, really — I'm writing a program to make a number of Burn Folders out of larger folders. It's just that dealing with Time Machine's data is proving trickier than I thought. I want to be able to drop my Backups.backupdb on it, and then burn all the (unique) data to DVDs for archiving. My entire backupdb should fit on just a handful of DVDs if I can properly deconstruct it. Then from there on I could easily burn just the new entries in Time Machine to new DVDs.

Time Machine uses hard links, though, not symbolic links, so I'm not sure that find trick will work. Or will it? I can't find it documented, so I don't know the details. The man page for find lists a -f flag, but it's just to specify the directory to search. From the man page: "-f Specify a file hierarchy for find to traverse. File hierarchies may also be specified as the operands immediately following the options." Am I looking in the wrong place?

That gives me another idea, though. If I used find to return only items modified after the date of the previous backup entry, then that should (in theory) return only new files. Whether it would be significantly faster than my current method, I'm not sure. I assume 'find' traverses the entire directory structure much the same way I do currently, in which case probably not. Spotlight might do a better job, but I'd hate to require Spotlight — especially since I myself exclude my Backups.backupdb folder from Spotlight for performance!

For that matter, maybe the thing that's really slowing me down is Cocoa. Right now I'm using [NSFileManager enumeratorAtPath:], and managing autorelease pools manually. Maybe I ought to use low-level functions instead.
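
If I do drop down to lower-level calls, the BSD fts routines look like the obvious candidate; a bare-bones sketch of the traversal (no processing, and I haven't benchmarked it against anything yet):
Code:
#include <fts.h>
#include <stdio.h>

// Walk a directory tree with fts instead of NSFileManager.
// FTS_PHYSICAL means symbolic links are not followed; each entry
// also comes with a filled-in struct stat (fts_statp) unless
// FTS_NOSTAT is passed.
static void walkTree(char *root)
{
    char *paths[] = { root, NULL };
    FTS *tree = fts_open(paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if (tree == NULL)
        return;

    FTSENT *node;
    while ((node = fts_read(tree)) != NULL) {
        if (node->fts_info == FTS_F) {
            // fts_path is the full path; fts_statp->st_ino is the inode
            printf("%s\n", node->fts_path);
        }
    }
    fts_close(tree);
}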
 
"That gives me another idea, though. If I used find to return only items modified after the date of the previous backup entry, then that should (in theory) return only new files."
I only see one problem with that -- hard links look just like the file itself to the operating system, and with Time Machine, new hard links to old files are created, so if you process those new hard links, you're processing old files. :(

I do believe there is a way to find out exactly how many hard links to a certain file there are... I think if you run 'ls -l <pathtofile>', the link count is right there in the output... perhaps that could help in some way? I would highly recommend reading up on that, though, before taking my word for it... hehe...
 
Yeah, I can see the number of links using 'stat' (I forget the exact options I need, but I know it's possible), but I can't see the locations of each one as far as I can tell.
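
The same numbers are available straight from stat(2) in code, for what it's worth; a quick sketch:
Code:
#include <sys/stat.h>
#include <stdio.h>

// stat(2) exposes both the inode number and the number of hard
// links pointing at it -- but not where those other links live.
static void showLinkInfo(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) == 0) {
        printf("inode %llu, %u link(s)\n",
               (unsigned long long)sb.st_ino,
               (unsigned int)sb.st_nlink);
    }
}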

I'm a little unclear on how hard links work. I was assuming that they all share the same creation/modification/etc. dates, regardless of when each link was made. So if I take a backup dated March 25, and look for only the items modified after March 24 (I'll need to be more precise than that, but you get the idea), it should work...in theory. I'm not thrilled with the idea, though, because it seems like there's a lot of chance for some weird problem I won't notice/consider, and it will be difficult to verify that the results are 100% accurate 'by hand'.

Also, it looks like the find command isn't all that fast to begin with. I guess I'm asking for the impossible here. Those links are there, and it seems I can't pretend they're not, at least not without spending a lot of CPU time on the pretending! ;)
 
Well, I think that all hard links will "take on" the creation/modification date of the original file... so, if you have three hard links to one file, ALL the hard links will have identical creation/modification dates.

I think I have another suggestion -- 'ls -i' will give you the inode number for each file. You could then compare the inode numbers to make sure you don't "dupe" a file along the way.
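
If you want the same information without shelling out to ls, readdir() also hands back the inode with each directory entry -- something like this, though I haven't tested it:
Code:
#include <dirent.h>
#include <stdio.h>

// List name + inode for every entry in a directory; the
// programmatic equivalent of 'ls -i'.
static void listInodes(const char *dirPath)
{
    DIR *dir = opendir(dirPath);
    if (!dir)
        return;

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        printf("%llu\t%s\n",
               (unsigned long long)entry->d_ino,
               entry->d_name);
    }
    closedir(dir);
}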

Argh... at this point, I'm kinda grasping for ideas. I do believe that what you're trying to do is inherently "intensive" since you are doing a lot of "looking" at a lot of different files.

If I think of something magical, I'll post back, but I think we're bearing down on the conclusion that there are a number of ways to accomplish this, none of which will be lightning-quick, unfortunately. :(
 
Check out FSEvents. From what I remember, you can get changes to the file system between two points in time from it, so that might work. I also wrote some code to process a large number of files looking only for executable binaries, and wound up finding some code that reads the first 4 bytes of each file so I can check that value against what it should be for executable code. It worked so much faster that way than launching thousands of NSTasks.
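
The 4-byte check was something along these lines, if memory serves (a sketch against the Mach-O and fat-binary magic numbers; double-check the constants before relying on it):
Code:
#import <Foundation/Foundation.h>
#include <stdio.h>
#include <stdint.h>
#include <mach-o/loader.h>
#include <mach-o/fat.h>

// Read the first 4 bytes and compare against the Mach-O / fat-binary
// magic numbers instead of launching 'file' in an NSTask.
static BOOL looksLikeExecutable(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NO;

    uint32_t magic = 0;
    size_t got = fread(&magic, sizeof(magic), 1, f);
    fclose(f);

    if (got != 1)
        return NO;
    return magic == MH_MAGIC    || magic == MH_CIGAM    ||
           magic == MH_MAGIC_64 || magic == MH_CIGAM_64 ||
           magic == FAT_MAGIC   || magic == FAT_CIGAM;
}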

I'm not sure how far back FSEvents would go, so that might not work, but I remember from WWDC that you could query it for changes since your last query and other things like that. It might only cover changes since your last query, and only if the process hasn't been quit and relaunched, but I don't really know; just guessing.
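
For what it's worth, FSEventStreamCreate() does take a "since when" event ID, so in theory something like this would replay changes recorded after a saved event ID (an untested sketch; how far back the event database actually goes, I don't know):
Code:
#include <CoreServices/CoreServices.h>
#include <stdio.h>

// Called with batches of changed paths.
static void eventsCallback(ConstFSEventStreamRef stream,
                           void *info,
                           size_t numEvents,
                           void *eventPaths,
                           const FSEventStreamEventFlags flags[],
                           const FSEventStreamEventId ids[])
{
    char **paths = eventPaths;
    for (size_t i = 0; i < numEvents; i++)
        printf("changed: %s\n", paths[i]);
}

// Create a stream that replays events since a saved event ID
// (or pass kFSEventStreamEventIdSinceNow for "from now on").
static FSEventStreamRef watchSince(CFStringRef path, FSEventStreamEventId lastEventId)
{
    CFArrayRef pathsToWatch = CFArrayCreate(NULL, (const void **)&path, 1, &kCFTypeArrayCallBacks);
    FSEventStreamRef stream = FSEventStreamCreate(NULL,
                                                  &eventsCallback,
                                                  NULL,        // context
                                                  pathsToWatch,
                                                  lastEventId,
                                                  1.0,         // latency in seconds
                                                  kFSEventStreamCreateFlagNone);
    CFRelease(pathsToWatch);

    FSEventStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(), kCFRunLoopDefaultMode);
    FSEventStreamStart(stream);
    return stream;
}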
 
I hadn't considered FSEvents. I remember hearing about them before Leopard came out. From a quick look through the docs, it looks like it's really meant for real-time monitoring of folders for changes, though. I'll take a closer look later.

I just ran a little test to see what the maximum speed I could expect would be when going through every file. I ran this:
Code:
find ./Backups.backupdb -print >> ~/Desktop/finddump
I let it run for 10 seconds to see how many files it could return. After those 10 seconds, the text file had 47,000 entries in it. Not bad at all. There are roughly 470,000 entries (files + folders) per snapshot, so that makes rough math very easy. :) So, 100 seconds per snapshot, 50 snapshots (increasing by one per week). That's 80 minutes just to dump the path names of all those items, without any special processing. Since my processing will most likely take longer than the directory traversal itself, I guess I need to lower my standards down to "fast enough to complete overnight".

My program currently goes through an abysmal 1200 entries in 10 seconds, almost 40x slower. Of course, it's doing a fair amount of processing in there (creating a SQLite database for future use, and of course recording inodes), and it would get a lot faster after it finished the first snapshot (which would take about 65 minutes, at that rate). I'll have to test how [NSFileManager enumeratorAtPath:] performs when I don't do any special processing, so I can better compare it to find. If it turns out to be a bottleneck, I'll rewrite my program. If it's only a bit slower, I'll accept it and look for something else to optimize.
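
The bare-enumeration test should be trivial; something along these lines (no per-file processing at all, just counting and timing; untested):
Code:
#import <Foundation/Foundation.h>

// Time a bare NSFileManager enumeration with no per-file work,
// to compare against the raw speed of 'find'.
static void timeEnumeration(NSString *root)
{
    NSDate *start = [NSDate date];
    unsigned long long count = 0;

    NSDirectoryEnumerator *e = [[NSFileManager defaultManager] enumeratorAtPath:root];
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    while ([e nextObject] != nil) {
        count++;
        if (count % 10000 == 0) {   // drain the pool periodically
            [pool release];
            pool = [[NSAutoreleasePool alloc] init];
        }
    }
    [pool release];

    NSLog(@"%llu entries in %.1f seconds", count,
          -[start timeIntervalSinceNow]);
}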

Thanks a lot for the ideas, guys.
 
I used NSFileManager's enumeratorAtPath: as well. The performance isn't bad, but I didn't try as many files as you are dealing with. I tried it in my program by scanning my entire Applications folder and Xcode folder; it took a few minutes to complete and found over 5,000 executable files. That's pretty decent, I'd say.
 
I just noticed something interesting. In Disk Utility's "verify disk" routine, one of the steps is "checking multi-linked files". And this step completes rather quickly. Hmm. What exactly is it doing in that step, and how? Maybe it's nothing useful, but it caught my attention.

Edit: I just noticed that another step is "Checking multi-linked directories". Since you can't make hard links to directories (AFAIK), this makes me wonder if they mean something else entirely.
 
I'd put this project of mine on hold for a while, and I now realize that was for the best; apparently there is a perfectly fast way to do what I need. I'm not sure exactly what it is, but tms does it. It's a command-line tool for processing Time Machine backups. You can use it to, among other things, find the differences between two snapshots, and it does it virtually instantly.
 
Well, the initial processing of all the files will take a substantial amount of time. Even if only 500,000 out of the 20 million files are unique, you still have to examine each of those 20 million entries to work out which ones are unique. There's no way around that unless you know something extra about the snapshots (e.g. you short-circuit the comparison by looking only at the differences, if Time Machine stores a change log of sorts).

If tms is doing it instantly, it must be using some information about Time Machine snapshots that we're not aware of.
 
Yes, so it seems. I notice now that there's an invisible ".Backup.log" file in every snapshot folder. However, it doesn't contain data on individual files. It just says when it was started, how many files were copied, how much disk space was used, and any errors that occurred (on an unrelated note, I'm a little unnerved by how many files it says it could not copy since it never alerted me to the fact; fortunately it just seems to be apps, not personal files).

The good news is that it looks like I can just base my program on tms and save myself a lot of trouble.
 