How to secure (a bit) your file uploads?

At JobNinja, we're dealing with loooooots of files on a daily basis.

Mostly, they are inoffensive and consist of simple CVs saved as PDF.

However, once in a while, a corrupted or infected file comes along, and that's pretty much a drama for our customers.

Why does such a problem happen?

Files are pretty complicated for your OS to handle: because the variety of formats keeps growing, your OS does not really know what a file is, so it relies on the extension (for instance the "PDF" extension -> example.pdf) to associate the file with a program. Browsers have the same problem: they assign the so-called application/pdf MIME type to a file only because its name ends with .pdf.

Where is the leak?

I'll teach you a "hack": take a random file, let's say a picture, and rename it mySuperCoolPicture.pdf. If you try to open it with your image viewer, it might not work, because an image viewer may not be designed to open files named .pdf. However, this is not a PDF: the file is still the same, we just changed the name on the box!
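To make the trick concrete, here is a minimal sketch of guessing a type from the name alone. The guessMimeFromName helper and its tiny lookup table are hypothetical and written for illustration only; a real OS or browser uses a much larger table, but the principle is the same: the content is never read.

```javascript
// Hypothetical, deliberately tiny extension -> MIME table (illustration only).
const MIME_BY_EXTENSION = {
  pdf: 'application/pdf',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  png: 'image/png',
};

// Guess a MIME type by looking only at the file name, never at the content.
function guessMimeFromName(fileName) {
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return ''; // no extension, no guess
  const extension = fileName.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION[extension] || '';
}

// The same picture, before and after renaming: only the label on the box changes.
const typeBeforeRename = guessMimeFromName('mySuperCoolPicture.jpg');
const typeAfterRename = guessMimeFromName('mySuperCoolPicture.pdf');
console.log(typeBeforeRename, typeAfterRename);
```

The picture's bytes are identical in both cases, yet the guess flips from image/jpeg to application/pdf.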

[Image: pdf-jpeg]

What was the problem for JobNinja?

We want to deliver quality throughout the whole process, so we have an issue when our customers cannot open the file we sent them by email (because the color of the box is not the same as the color of the file).

What did we do?

I found a cool lib and started to play around.

Problems

ODT

There was no support for ODT files (which we accept because we support open source!).

That was pretty easy: JobNinja contributed to the library to make it work!

Microsoft Word Documents (.DOC)

Before going further, I need to explain how the library I mentioned works.

That's fairly simple: most file formats have so-called magic codes (also known as magic numbers). These are fixed byte sequences, usually at the very start of the file, that are the same for all files sharing a format, and they tell computers: "Hey dude, what you're about to read is a JPG" (for instance).

[Image: Magic code on JPG]

That's exactly how this library works: it reads the file as binary and tries to find these magic codes.

For instance, to detect a JPG the library has code like this:

if (check([0xFF, 0xD8, 0xFF])) {
    return {
        ext: 'jpg',
        mime: 'image/jpeg'
    };
}
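To show what a helper like check could be doing under the hood, here is a minimal sketch. It is not the library's actual implementation: hasSignature, SIGNATURES and detectType are hypothetical names, and the table only holds three well-known signatures (JPG, PNG, PDF).

```javascript
// Return true when `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes, signature, offset = 0) {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

// A few well-known signatures (hypothetical table; a real library knows many more).
const SIGNATURES = [
  { ext: 'jpg', mime: 'image/jpeg', magic: [0xFF, 0xD8, 0xFF] },
  { ext: 'png', mime: 'image/png', magic: [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A] },
  { ext: 'pdf', mime: 'application/pdf', magic: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
];

// Return { ext, mime } for the first matching signature, or null if none matches.
function detectType(bytes) {
  const match = SIGNATURES.find(({ magic }) => hasSignature(bytes, magic));
  return match ? { ext: match.ext, mime: match.mime } : null;
}

// The first bytes of a JPG, whatever the file happens to be called.
const renamedPicture = new Uint8Array([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]);
console.log(detectType(renamedPicture));
```

Because only the bytes are inspected, renaming the picture to .pdf changes nothing: it is still detected as image/jpeg.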

The issue is that MICROSOFT DOESN'T USE SUCH A TRICK: pretty much all the old Microsoft files (doc, xls, ppt, msi) start with 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, but that's it. These bytes alone cannot tell you whether the document is Word or Excel: you would have to read much further into the file...

There is one reason, though: a doc can contain an xls or a ppt, a ppt can contain an xls, and so on. However, that doesn't really explain why Microsoft chose not to add a format-specific magic code.
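A small sketch makes the limitation visible. The header bytes are the ones quoted above; looksLikeOldMicrosoftFile and the two fake byte arrays are made up for illustration and are not real documents.

```javascript
// Header shared by the old Microsoft formats (doc, xls, ppt, msi), as quoted above.
const OLE_HEADER = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

// True when the bytes start with the shared header; says nothing about Word vs Excel.
function looksLikeOldMicrosoftFile(bytes) {
  return bytes.length >= OLE_HEADER.length &&
    OLE_HEADER.every((value, i) => bytes[i] === value);
}

// Two made-up file beginnings: same 8-byte header, different (fake) content after it.
const fakeDoc = Uint8Array.from([...OLE_HEADER, 0x01, 0x02]);
const fakeXls = Uint8Array.from([...OLE_HEADER, 0x0A, 0x0B]);

console.log(looksLikeOldMicrosoftFile(fakeDoc), looksLikeOldMicrosoftFile(fakeXls));
```

Both checks succeed, so a header-only test can say "this is an old Microsoft file" but not which kind.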

Two extra things:

  • The file-type library misbehaves on old Microsoft files: I reopened a PR.

  • If detecting .doc files interests you, here is a great thread about it.

What I ended up with

So I found no "client-side" solution for the .doc problem, but all other formats work. On the server side, we have extra checks to ensure that a doc really is a doc (see olefile).

Browser problem

Sindre Sorhus (the creator of file-type and a great developer) is a NodeJS developer first and finds the browser world over-complex. Therefore, the plugin was designed primarily for NodeJS and is not extremely browser-friendly.

However, it can be adapted, and here is how I did it:

// Assumption: a file-type version whose default export is a synchronous
// function that accepts bytes, as called in this snippet.
import filetype from 'file-type';

// Returns a Promise because FileReader is an asynchronous API.
// Resolves when the detected type matches the announced one, rejects otherwise.
export function isCvValid(file) {
  return new Promise((resolve, reject) => {
    // An empty file cannot contain any magic code
    if (file.size === 0) {
      reject(new Error('Empty'));
      return;
    }
    const announcedType = file.type;
    const reader = new FileReader();

    // Reading failed (for instance, the file became unreadable)
    reader.onerror = () => reject(reader.error);
    // Loading is done: compare the detected type with the announced one
    reader.onload = () => {
      const bytes = new Uint8Array(reader.result);
      const result = filetype(bytes);
      if (!result || result.mime !== announcedType) {
        reject(new Error('Type mismatch'));
      } else {
        resolve();
      }
    };
    // The lib only needs the first 4100 bytes of the file,
    // so slicing drastically speeds up the process
    reader.readAsArrayBuffer(file.slice(0, 4100));
  });
}

Notes:

So this function is the core of our verification: if the "announced" MIME type (the type the browser derives from the extension: application/pdf for example.pdf) differs from the one the lib detects by reading the binary, then chances are pretty high that the file is not legit.

[Image: Fake File]