Apache Tika - Detect JSON / PDF Specific Mime Type

I'm using Apache Tika to detect a file's MIME type from its Base64 representation. Unfortunately I don't have any other information about the file (e.g. its extension).

Is there something I can do to make Tika be more specific?

I'm currently using this:

Tika tika = new Tika();
tika.setMaxStringLength(-1);
String mimetype = tika.detect(Base64.decode(fileString));

and it gives me text/plain for JSON and PDF files, but I would like to obtain more specific information: application/json, application/pdf, etc.

Hope someone can help me!

Thanks.

Answer

Tika#detect(String)

Detects the media type of a document with the given file name.

Passing the content of a PDF or JSON file won't work, as this method expects a file name. Tika will fall back to text/plain as it won't find any matching file names.

PDF

For PDF, pass Tika some of the file's bytes (as a byte array or a stream) rather than a string. Tika then uses Mime Magic Detection, looking for special ("magic") patterns of bytes near the start of the file, which for PDF is the plain-text marker %PDF:

import java.nio.charset.StandardCharsets;
import org.apache.tika.Tika;

String pdfContent = "%PDF-1.4\n%\\E2\\E3\\CF\\D3"; // i.e. base64 decoded
Tika tika = new Tika();
System.out.println(tika.detect(pdfContent.getBytes(StandardCharsets.US_ASCII))); // "application/pdf"
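
To make the idea of magic-byte detection concrete, here is a minimal sketch that uses only the JDK, no Tika. It decodes a Base64 string and checks whether the bytes begin with %PDF. The class and method names (PdfMagicCheck, startsWithPdfMagic, isBase64Pdf) are made up for illustration, and it assumes the standard Base64 alphabet. Tika's real detector knows many more signatures, so treat this as an illustration of the principle rather than a replacement.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Tika: checks for the "%PDF" signature by hand.
final class PdfMagicCheck {
    // The marker the answer above mentions, as ASCII bytes.
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with the PDF signature. */
    static boolean startsWithPdfMagic(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Decodes a Base64 string (standard alphabet assumed) and checks the signature.
     * Base64.Decoder#decode throws IllegalArgumentException on malformed input.
     */
    static boolean isBase64Pdf(String base64) {
        return startsWithPdfMagic(Base64.getDecoder().decode(base64));
    }
}
```

Calling PdfMagicCheck.isBase64Pdf(fileString) on the Base64 string from the question would then tell you whether the decoded content carries the PDF marker, without needing a file name.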

JSON

For JSON, though, even this method will return text/plain, and Tika is correct: JSON has no magic bytes, so application/json is like a subtype of plain text that indicates the text should be interpreted differently. So that's what you'll have to do yourself if you get text/plain. Use a JSON library (e.g. Jackson) to parse the content and see whether it's valid JSON:

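import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.ObjectMapper;

String json = "[1, 2, 3]"; // an array in JSON
try {
    final JsonParser parser = new ObjectMapper().getFactory().createParser(json);
    while (parser.nextToken() != null) {
        // consume every token; malformed input makes nextToken() throw
    }
    System.out.println("Probably JSON!");
} catch (Exception e) {
    System.out.println("Definitely not JSON!");
}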

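Just be careful about how strict you want to be: Jackson treats a single number such as 1 as valid JSON. The current JSON specification (RFC 8259) allows that, but the original RFC 4627 required an object or array at the top level, and a bare scalar is rarely what you mean by "a JSON file". To get round that, you could first test that the string starts with either { or [ (possibly preceded by whitespace) using something like json.matches("(?s)^\\s*[{\\[].*") before even attempting to parse it as JSON. The (?s) flag lets . match line breaks, which matters because matches() has to cover the whole string and JSON often spans several lines.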
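
The pre-check described above can be sketched with a precompiled pattern, which avoids recompiling the regex on every call and sidesteps the multi-line issue by only looking at the start of the input. JsonPrecheck and looksLikeJsonContainer are hypothetical names. The check only says the text could be a JSON object or array, so you would still run the parser afterwards.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a cheap filter to run before handing the text to a JSON parser.
final class JsonPrecheck {
    // Optional leading whitespace followed by '{' or '['.
    private static final Pattern CONTAINER_START = Pattern.compile("\\s*[{\\[]");

    /** Returns true if the text begins (after whitespace) with '{' or '['. */
    static boolean looksLikeJsonContainer(String text) {
        // lookingAt() anchors at the start of the input but does not
        // require the pattern to match the entire string.
        return text != null && CONTAINER_START.matcher(text).lookingAt();
    }
}
```

With this in place, a bare 1 or an ordinary sentence is rejected up front, and only strings that pass go on to the Jackson parse.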

Here's a DZone tutorial for Jackson.

source: stackoverflow.com