Notes on Data Recovery

2016-03-06

Updated: 2016-03-08

Sometimes bad things happen to your data. Maybe you accidentally format a memory card, or a hard disk crashes.

Time for data recovery.

First, if you suspect mechanical failure, contact a data recovery specialist. They will hopefully be able to image the storage device and give you a big file with its contents, if not the files themselves.

1. A Success Story

Back in 2011 I shot some photos of my birthday dinner, but since it was my birthday and all, I just put the camera away and decided to do the import and post-processing some other day. The next day I had to test a new camera, a Nikon D3100. Since I didn't have a memory card for that camera, I just moved the card from my D40 - the one with the birthday shots on it - over to the D3100. Having totally forgotten about yesterday's photos, I pressed the "play" button, got the message "no photos in folder", decided that meant that the memory card was empty, and hit the "Format memory card" item in the menu.

Oops.

The D3100 was of course referring to the D3100 folder, while all my photos were in the now-obliterated D40 folder. A brief check confirmed that no, I really hadn't copied the photos off the camera.

Time for some technical self-support. A brief Googling turned up a bunch of small programs that claimed to be able to recover images - if you paid for them. Impressed as I was by this business model, paying was just way below my dignity. So I wrote my own program. It managed to recover 842 of 862 potential files on the card for a success rate of 97.7%. Of those, the oldest one was from August 3, 2010! It had a shutter count of 32,822 in the EXIF data. The newest photo had 47,865. That old photo had survived 15,043 photos being taken after it, and God knows how many reformats. All birthday photos were recovered successfully.

2. Another Story

Recently I was contacted by a friend who had a broken portable HD with all his photos going back over ten years. To his credit, he had had the photos backed up in two places, but a series of disk failures had brought him down to one copy. Then that one broke as well - just as he was about to duplicate it - and he had zero copies left.

I managed to image part of the HD, and then spent some time looking for the program I used on my own data back in 2011. It was nowhere to be found. I found the draft of an old blog entry (from which the preceding section was taken) but no data recovery code, and no notes about what I did back then.

So I whipped up a recovery program and recovered 123 GB of photos and movies from the 160 GB I had been able to image. It was far from everything, but it was a good start.

This time, I'll write down what I did.

3. Overview

These are some notes on how to recover data once you have successfully imaged the storage medium. The assumption is that we have access to the data from the affected storage device, but that for some reason the file system structure is lost. The technique boils down to scanning the data from start to end and identifying signs of the files that we're trying to recover. Once the start of a potential file has been found, we try to decode it. If the decode was successful, we know we've found a good file and we can store it somewhere safe. Otherwise, we keep scanning.

We therefore require some knowledge about the format of the files we are trying to recover.

4. General Program Structure

The program consists of a scanning loop that reads the image from start to end. When it finds a potential start, it hands over the current offset to a file format specific module that handles the verification and recovery.

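try (PushbackInputStream is = ...) {
    while (more data in image) {
        long skip = 0;
        if (looks like the start of a JPEG) {
            skip = tryRecoverJPEG (current offset in image);
        }

        if (skip > 0) {
            // Successful recovery, skip ahead.
            is.skip (skip);
        } else {
            // Nothing recovered here, keep scanning from the next byte.
            is.read ();
        }
    }
}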

The format-specific module, in this case tryRecoverJPEG, attempts to load a JPEG from the specified offset. If successful, it returns the size, in bytes, of the JPEG it recovered, otherwise -1:

long tryRecoverJPEG (long offset) throws Exception {
    InputStream is = ...;
    is.skip (offset);
    BufferedImage jpeg = ImageIO.read (is);
    if (jpeg != null) {
        return number of bytes read;
    } else {
        return -1;
    }
}

(The real module returns -1 on an exception being thrown by the ImageIO.read call, but you get the point.)
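
To make the control flow concrete, here is a minimal, self-contained sketch of the scanning loop. It is not the original program: it works on an in-memory byte array rather than a stream over a multi-gigabyte image, and ImageScanner, FormatModule and scan are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ImageScanner {
    /** A format-specific module (hypothetical interface). */
    interface FormatModule {
        /** Returns the size in bytes of the file recovered at offset, or -1 if none. */
        long tryRecover(byte[] image, int offset);
    }

    /** Scans the image from start to end; returns {offset, size} pairs of recovered files. */
    static List<long[]> scan(byte[] image, FormatModule module) {
        List<long[]> found = new ArrayList<>();
        int pos = 0;
        while (pos < image.length) {
            long size = module.tryRecover(image, pos);
            if (size > 0) {
                // Successful recovery: record it and skip past the recovered file.
                found.add(new long[] { pos, size });
                pos += Math.toIntExact(size);
            } else {
                // Nothing here, keep scanning from the next byte.
                pos++;
            }
        }
        return found;
    }
}
```

Note that the skip after a successful recovery is only safe if the reported size is accurate; an overestimate can jump past the start of the next file.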

5. File Formats

With the above in mind, we can turn to identifying the beginning and end of various file types.

5.1. JPEG

JPEGs can be identified by their SOI (Start Of Image) marker. For even better discrimination, we can check that the next byte is 0xff and that the one following isn't an EOI marker, as a JPEG just consisting of a SOI followed by an EOI marker isn't valid. The test is therefore for the sequence 0xff, 0xd8, 0xff, (anything except 0xd9).
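
The four-byte test can be sketched as follows. This is an illustration, not the original code; JpegSignature and its methods are hypothetical names, and the data is assumed to be in a byte array for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

final class JpegSignature {
    /** True if the four bytes at pos are 0xff, 0xd8, 0xff, (anything except 0xd9). */
    static boolean looksLikeJpegStart(byte[] data, int pos) {
        if (pos < 0 || pos + 4 > data.length) {
            return false;
        }
        return (data[pos] & 0xff) == 0xff
            && (data[pos + 1] & 0xff) == 0xd8   // SOI marker
            && (data[pos + 2] & 0xff) == 0xff   // start of the next marker
            && (data[pos + 3] & 0xff) != 0xd9;  // ...which must not be EOI
    }

    /** Returns every offset in data that passes the signature test. */
    static List<Integer> findCandidates(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int pos = 0; pos + 4 <= data.length; pos++) {
            if (looksLikeJpegStart(data, pos)) {
                offsets.add(pos);
            }
        }
        return offsets;
    }
}
```

A hit only marks a candidate. Whether it really is a JPEG is decided by trying to decode it.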

I ended up subclassing FilterInputStream to get an input stream that counts the number of bytes read, and then reading the image through it in order to figure out where the image ended. This turned out to be inexact: ImageIO would typically read about 64 kB too much, probably due to internal buffering, so the following skip in the main loop would sometimes skip over the start of the next recoverable file in the image. To clean up the recovered files I ended up running jpegtran -copy all, with no other options, on each recovered JPEG. Ideally, one could walk the sequence of markers in the JPEG and stop at the EOI, but this seemed to work well enough.
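
A byte-counting stream of that kind might look like the sketch below. It is a reconstruction under assumptions, not the original class: CountingInputStream, JpegProbe and decodeAndCount are hypothetical names, and mark/reset is not accounted for. The helper returns -1 both when ImageIO finds no decoder and when decoding throws.

```java
import java.awt.image.BufferedImage;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

/** Counts the bytes consumed from the wrapped stream (hypothetical name). */
final class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) {
            count += skipped;
        }
        return skipped;
    }

    long getCount() {
        return count;
    }
}

final class JpegProbe {
    /** Returns the number of bytes ImageIO consumed while decoding, or -1 on failure. */
    static long decodeAndCount(InputStream raw) {
        CountingInputStream counted = new CountingInputStream(raw);
        try {
            BufferedImage image = ImageIO.read(counted);
            return image != null ? counted.getCount() : -1;
        } catch (IOException e) {
            return -1;
        }
    }
}
```

The count is an upper bound on the image size, not the exact size, because the decoder may read ahead past the end of the image.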

5.2. QuickTime

For QuickTime movies I ended up looking for the character sequence "ftyp", which is the type code of the atom that indicates the file type. Once found, I would attempt to walk the sequence of atoms starting four bytes before the "ftyp", since each atom begins with its size field.

Walking the QuickTime atom sequence is easy. Each atom has a field that specifies its size, so if you seek forward from one atom you should end up at the start of another. Testing the FourCC of the suspected atoms will tell you if you are at a valid atom or not. Stop when reaching an atom whose FourCC is all zero bytes.
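
The walk can be sketched like this, assuming the usual atom header layout of a 32-bit big-endian size followed by a four-byte type, with a size of 1 meaning that a 64-bit size follows. AtomWalker is a hypothetical name, the printable-ASCII test is this sketch's own assumption about what a valid FourCC looks like, and a size of 0 ("extends to end of file") is simply treated as the end of the chain.

```java
import java.nio.ByteBuffer;

final class AtomWalker {
    /** Assumption: a plausible top-level FourCC consists of four printable ASCII characters. */
    static boolean isPlausibleFourCc(int type) {
        for (int shift = 24; shift >= 0; shift -= 8) {
            int c = (type >>> shift) & 0xff;
            if (c < 0x20 || c > 0x7e) {
                return false;
            }
        }
        return true;
    }

    /**
     * Walks the atoms starting at start (four bytes before "ftyp").
     * Returns the total length of the chain in bytes, or -1 if there is none.
     */
    static long walk(byte[] data, int start) {
        ByteBuffer buf = ByteBuffer.wrap(data);  // big-endian by default
        int pos = start;
        while (pos >= 0 && pos + 8 <= data.length) {
            long size = buf.getInt(pos) & 0xffffffffL;
            int type = buf.getInt(pos + 4);
            if (!isPlausibleFourCc(type)) {
                break;  // includes the all-zero FourCC that ends the walk
            }
            if (size == 1) {
                // Extended atom: the real size is a 64-bit value after the type.
                if (pos + 16 > data.length) {
                    break;
                }
                size = buf.getLong(pos + 8);
                if (size < 16) {
                    break;
                }
            } else if (size < 8) {
                break;  // too small to hold its own header
            }
            if (size > data.length - pos) {
                break;  // atom runs past the end of the available data
            }
            pos += (int) size;
        }
        return pos > start ? pos - start : -1;
    }
}
```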

5.3. AVI and RIFF

RIFF files are probably the easiest to recover. They start with the character sequence "RIFF", followed by a little-endian 32-bit length value, followed by a FourCC specifying the file type. If the FourCC makes sense, just copy as many bytes as the length value indicates, plus the eight bytes taken up by "RIFF" and the length field, which the length value does not count.
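
A sketch of the header check, with RiffProbe as a hypothetical name. The accepted form types "AVI " and "WAVE" are examples chosen for this sketch; extend the test to whatever "makes sense" for the files you expect. The RIFF length field counts the bytes that follow it, so the total number of bytes to copy from the "R" is the length value plus 8.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class RiffProbe {
    /**
     * Returns the total file length in bytes, counted from the "R" of "RIFF",
     * or -1 if there is no plausible RIFF header at pos.
     */
    static long riffFileLength(byte[] data, int pos) {
        if (pos < 0 || pos + 12 > data.length) {
            return -1;
        }
        if (data[pos] != 'R' || data[pos + 1] != 'I'
                || data[pos + 2] != 'F' || data[pos + 3] != 'F') {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long length = buf.getInt(pos + 4) & 0xffffffffL;  // unsigned 32-bit
        String formType = new String(data, pos + 8, 4, StandardCharsets.US_ASCII);
        // Example form types only; this list is an assumption of the sketch.
        if (!formType.equals("AVI ") && !formType.equals("WAVE")) {
            return -1;
        }
        return length + 8;  // "RIFF" and the length field are not included in length
    }
}
```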

5.4. NEF

Finding NEF files caused some initial despair before I figured out how to cheat. A NEF file is, like many RAW formats, a TIFF file [a]. Writing code to properly traverse a TIFF, however, is not high on my list of things I like to do - so I cheated.

A NEF file written by most Nikon cameras starts with the magic bytes 0x4d, 0x4d, 0x00, 0x2a. Once I found those, I would just copy 50 MB of data starting with the first 0x4d byte and then try to get image information from dcraw [b] using dcraw -i -v maybe.nef. If that worked, I assumed I had a bona fide NEF.
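
The copy-and-ask-dcraw step could be sketched as below. NefCarver and its methods are hypothetical names, and 50 MB is taken as 50 * 1024 * 1024 bytes. The sketch assumes that dcraw is on the PATH and that it exits with status 0 only when it can identify the file; check that against your dcraw build before relying on it.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class NefCarver {
    static final long CHUNK_SIZE = 50L * 1024 * 1024;

    /** True if the four bytes at pos are 0x4d, 0x4d, 0x00, 0x2a. */
    static boolean looksLikeNefStart(byte[] data, int pos) {
        return pos >= 0 && pos + 4 <= data.length
            && data[pos] == 0x4d && data[pos + 1] == 0x4d
            && data[pos + 2] == 0x00 && data[pos + 3] == 0x2a;
    }

    /** Copies up to maxBytes from image, starting at offset, into out. Returns bytes copied. */
    static long carve(Path image, long offset, long maxBytes, Path out) throws IOException {
        try (FileChannel src = FileChannel.open(image, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(out, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long remaining = Math.min(maxBytes, src.size() - offset);
            long copied = 0;
            while (copied < remaining) {
                long n = src.transferTo(offset + copied, remaining - copied, dst);
                if (n <= 0) {
                    break;
                }
                copied += n;
            }
            return copied;
        }
    }

    /** Runs dcraw -i -v on the candidate. Assumption: exit status 0 means "identified". */
    static boolean dcrawAccepts(Path candidate) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("dcraw", "-i", "-v", candidate.toString())
            .redirectErrorStream(true)
            .redirectOutput(ProcessBuilder.Redirect.DISCARD)
            .start();
        return p.waitFor() == 0;
    }
}
```

A candidate would then be written with carve(image, offset, NefCarver.CHUNK_SIZE, maybeNef) and kept only if dcrawAccepts(maybeNef) returns true.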

Finally I had to trim the fat from that 50 MB chunk. Again, I could have written code to traverse the TIFF structure and figure out the end - but since it was the people at Adobe who came up with the convoluted format, I thought it would be fair and just to let them do it for me: I ran the Adobe DNG converter [c] on the NEF file. It will only read as much as it has to, and will deliver an archival quality DNG file.
