Thursday, February 3, 2011

Copy 10 million images in a single folder to another server

Now I know you shouldn't ever put 10 million files into a single directory to begin with. Blame it on the developers, but as it stands that’s where I am at. We will be fixing it and moving them into folder groups, but first we gotta get them copied off of the production box.

I first tried rsync, but it failed miserably. I assume that's because the list of file names and paths it keeps in memory was larger than the RAM and swap space.

Then I tried to compress it all into a tar.gz, but it couldn't be unzipped: it failed with a "file too large" error (the archive was 60 GB).

I tried to just do a tar-to-tar extraction, but I got a "cannot open: file too large" error:

tar c images/ | tar x -C /mnt/coverimages/

Extra Info:

/mnt/coverimages/ is an nfs share where we want to move the images to.

All files are images

OS: Gentoo

  • If you install version 3+ of rsync it will do a rolling list of files to transfer and won't need to keep the entire file list in memory. In the future you probably want to consider hashing the filenames and creating a directory structure based on parts of those hashes.

    You can see this answer to get an idea of what I mean with the hashing.
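
A minimal sketch of the hashed directory layout mentioned above, assuming GNU coreutils (md5sum) is available. The helper names hashed_subdir and move_to_bucket, and the choice of two 2-character levels, are hypothetical and only meant to illustrate the idea:

```shell
#!/bin/sh
# Hypothetical helper: map a filename to a two-level subdirectory built from
# the first four hex digits of its MD5 hash (shape "xx/yy").
hashed_subdir() {
    digest=$(printf '%s' "$1" | md5sum | cut -c1-4)
    first=$(printf '%s' "$digest" | cut -c1-2)
    second=$(printf '%s' "$digest" | cut -c3-4)
    printf '%s/%s\n' "$first" "$second"
}

# Hypothetical helper: move one file from the flat directory into its bucket.
# $1 = source file, $2 = root of the new hashed tree.
move_to_bucket() {
    name=$(basename "$1")
    dest="$2/$(hashed_subdir "$name")"
    mkdir -p "$dest" && mv "$1" "$dest/$name"
}
```

With two hex digits per level this gives 256 x 256 = 65,536 buckets, so 10 million files would average roughly 150 per directory if the hash spreads names evenly.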

  • If I could arrange the downtime I'd simply move the disk temporarily.

    matthew : or do a block level clone of the disk
    From Chopper3
  • I would imagine that your two options are to incrementally move the file set by using regexes to match filenames, or to unmount the filesystem and copy the entire device over.

    wow. Good luck.

    Luke has no name : Beat my answer by 20 seconds!
  • Segment your rsyncs or whatever tool you use into several jobs. If the files are named with some sort of logic or numbering, you could tell the system to rsync X number of files at a time.

    AndyN : I've used this technique in the past, using the file creation time as the segmenting criterion. This avoids the need for logical or sequential filenames, and assumes the files were created over some length of time. Roughly: For ($interval = 1..large number) find ctime less than $interval ago; do something ; $interval++.
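
One way the segmenting idea could look in practice, as a rough sketch only. It assumes the filenames start with a hex digit (an assumption, adjust the prefix list to the real naming) and uses cp so it stands alone; the same per-prefix loop could drive rsync instead. copy_batch and copy_in_batches are hypothetical names:

```shell
#!/bin/sh
# Hypothetical helper: copy only the files whose names start with a given
# prefix. $1 = source dir, $2 = destination dir, $3 = filename prefix.
# find does the matching, so the shell never expands a huge glob.
copy_batch() {
    find "$1" -maxdepth 1 -type f -name "$3*" -exec cp -p {} "$2"/ \;
}

# Walk the prefixes one at a time so each job handles a fraction of the files.
# The prefix list below is an assumption about how the files are named.
copy_in_batches() {
    for prefix in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
        copy_batch "$1" "$2" "$prefix" || return 1
    done
}
```

Each pass still has to read the whole directory, so this does not reduce the scanning cost. What it buys is smaller jobs that can be restarted or scheduled individually.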
  • have you tried using find and -exec (or xargs), something like

    find images/ -type f -exec cp "{}" /mnt/coverimages/ \;
    

    ?

  • I don't think you have the "tar | tar" command quite right. Try this:

    tar cf - images/ | (cd /mnt/coverimages && tar xf -)

    Another option would be to stream over SSH (some CPU overhead for encryption):

    tar cf - images/ | ssh user@desthost "cd /path/coverimages && tar xf -"

    There's also cpio, which is a bit more obscure, but offers similar functionality:

    find images/ | cpio -pdm /mnt/coverimages/

    From DAM
  • If you already have an NFS mount, wouldn't a simple cp -au suffice?

    I do believe a tar-over-network (ssh or netcat) could be faster (less waiting for round trips).

    Also, investigate ionice beforehand - your production users will thank you

    From Joris
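
A hedged sketch of combining the tar pipe with ionice, so the copy yields to production traffic. The wrapper names low_prio and tar_copy are hypothetical. ionice -c3 requests the idle I/O class; how much that helps depends on the I/O scheduler in use, so treat it as something to test rather than a guarantee:

```shell
#!/bin/sh
# Hypothetical wrapper: run a command at the lowest I/O and CPU priority when
# ionice (util-linux) works on this system, otherwise fall back to nice alone.
low_prio() {
    if ionice -c3 true >/dev/null 2>&1; then
        ionice -c3 nice -n 19 "$@"
    else
        nice -n 19 "$@"
    fi
}

# Stream a directory through tar into a destination directory.
# $1 = parent of the source dir, $2 = source dir name, $3 = destination dir.
tar_copy() {
    low_prio tar cf - -C "$1" "$2" | (cd "$3" && tar xf -)
}
```

For the network variant, the reading side could be wrapped the same way: low_prio tar cf - images/ | ssh user@desthost "cd /path/coverimages && tar xf -".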
  • A note: classically, NFS writes were synchronous, so performance would be better if you exported the directory you are reading from and then wrote locally on the destination.

    I think the find solution is the best choice (hmmm, I wonder about using nohup or the like, and spawning a background process to speed things up a bit, but that may not be a win). Find means you are just walking the dir and dealing with each file as you find it, which means you are never thinking of the whole list of files.
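
The "walk the directory and spawn background workers" idea could be sketched with find feeding xargs, assuming the GNU versions of find, xargs and cp (for -print0, -0/-P/-r and -t). parallel_copy is a hypothetical name, and the batch size of 500 is arbitrary:

```shell
#!/bin/sh
# Hypothetical helper: copy every regular file directly under $1 into $2,
# using $3 parallel cp workers with up to 500 files per cp invocation.
# find streams names as it reads the directory, so the full list of files
# is never held in memory at once.
parallel_copy() {
    find "$1" -maxdepth 1 -type f -print0 |
        xargs -0 -r -n 500 -P "$3" cp -p -t "$2"
}
```

For example, parallel_copy images /mnt/coverimages 4 would run four workers. As noted above, parallelism may not be a win if the disk or the NFS link is already the bottleneck.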

  • There might be a way to separate all the files into separate directories and then build a single directory that contains only hard links to those files? That way your developers could access the files either way, as a single directory, or as separate directories (separated by month, for example):

    /home/user/directory/file1-slink
    /home/user/directory/file2-slink
    
    /home/user/dir-month-07/file1
    /home/user/dir-month-06/file2
    
    From djangofan
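
A rough sketch of the hard-link layout from the last answer, assuming GNU date (for -r FILE) and that both directories live on the same filesystem, which hard links require. link_by_month is a hypothetical name, and grouping by modification month is just one possible criterion:

```shell
#!/bin/sh
# Hypothetical helper: for each regular file directly under the flat
# directory $1, create a hard link to it in a per-month directory under $2
# (named YYYY-MM from the file's modification time). The flat directory is
# left untouched, so both views point at the same data.
link_by_month() {
    find "$1" -maxdepth 1 -type f -exec sh -c '
        dest=$1; shift
        for f do
            month=$(date -r "$f" +%Y-%m)
            mkdir -p "$dest/$month" && ln "$f" "$dest/$month/${f##*/}"
        done' sh "$2" {} +
}
```

If the two trees have to sit on different filesystems, symbolic links (ln -s) would be the alternative, at the cost of the links breaking when the targets move.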
