Mikuro
Crotchety UI Nitpicker
I'm writing a program that processes data in Time Machine's Backups.backupdb folder. The important thing is that there's no reason for me to process the same file twice, but Time Machine stores multiple hard links to the same files — one in every single incremental backup.
I can test whether a file has already been processed by recording the inode number of every file and comparing new files against the list (I use an NSMutableSet for this task). The problem is that even just traversing the directories, without doing any significant processing of the duplicate links, takes an obscene amount of time.
I estimate that there are about 20 million files (and 3.5 million folders) in my Backups.backupdb folder, but only about 500,000 unique files. Going through all those duplicates is just not reasonable. It would literally take all day to process the folder.
Is there any FAST way to ignore duplicate links? Maybe by accessing files directly by inode number instead of path? Is there any way to get a list of all paths pointing to a given inode, or of all inodes within a given directory?
I can't think of any way to get this done. Any ideas?