Get Title And Year From File Name Using Regex

- 1 answer

How do I write a regex that gets title and, if available, year from filenames? See examples below.

This solution works for php but im having problem translating it into javascript Seprate movie name and year from moviefile name

 The Forbidden Girl 2013 BRRIP Xvid AC3-BHRG.avi
 How to train your dragon 2.mkv
 10,000BC (2010).1080p.avi


The solution provided below works for all the test cases you provided (and some extra as the titlelize, see the code below) and is intended to be customizable.

Long story short, try the snippet below:

// Live Test
var input  = document.getElementById('input');
var output = document.getElementById('output');
input.oninput = function() { output.textContent = extractData(input.value); }

// Samples
var tests = ['The.Great.Gatsby.2013.BluRay.1080p.DTS.x264-CHD.mkv', 'The Forbidden Girl 2013 BRRIP Xvid AC3-BHRG.avi', 'Pain.&.Gain.2013.720p.BluRay.DD5.1.x264-HiDt.mkv', 'Se7en.(1995).avi', 'How to train your dragon 2.mkv', '10,000BC (2010).1080p.avi', 'The.Great.Gatsby.BluRay.1080p.DTS.x264-CHD.mkv', 'Se7en.avi', '2001 A Space Odyssey.BluRay.1080p.DTS.x264-CHD.mkv','Sand.Castle.2017.FRENCH.1080.WEBRip.AAC2.0-NEWCiNE-WwW.Zone-Telechargement.Ws.mkv'];

while (t = tests.pop()) {
   document.getElementById('list').innerHTML += '<b>INPUT</b>: "' + t + '"<br>';
   document.getElementById('list').innerHTML += extractData(t,true) + '<hr>';

function titlelize(title) {
   return title.replace(/(^|[. ]+)(\S)/g, function(all, pre, c) { return ((pre) ? ' ' : '') + c.toUpperCase(); });

function extractData(it, html) {
   var regex = /^(.+?)[.( \t]*(?:(19\d{2}|20(?:0\d|1[0-9])).*|(?:(?=bluray|\d+p|brrip|webrip)..*)?[.](mkv|avi|mpe?g|mp4)$)/i;

   var out = '&#8627;&nbsp;';
   if ( m = regex.exec(it) ) {
       title = titlelize(m[1]) || '-'; year = m[2] || '-';
       out += '<font color="green"><b>Title</b>: "' + title + 
          '"&emsp; <b>Year</b>: "' + year + '"</font>';
   } else {
       out += '<font color="red">No match</font>';
   //the replace is an hack to remove html in live input text
   return (html) ? out : out.replace(/<[^>]+>|&[^;]+;/g,'');
<mark><b>Paste and Try!</b></mark> &rArr; <input id="input" type="text" size="70" />
<br>&#8627;&emsp;<span id="output" style="line-height:40px;">No Match</span>

<div id="list"></div>


Assuming the title is structured roughly as this:

Title* || [ Year* ] || [ Codec ] Extension
The fields enclosed in square brackets are optional (e.g [field1])
* : field saved

The key is to match everything as title till the last valid year found (valid years: 1900-2016) or till the file extension (structured as a dot plus 3 letters, simple to change if needed).

Exceptions: in the case where a film does not contain a valid year at all the section starting with (case insensitive) bluray or [0-9]+p (e.g. 720p, 1080p) or brrip is stripped from the title section.

Regex BreakoutRegex101 Demo

  (.+?)                           # Save title into group $1
  [.( \t]*                        # Remove some separators
  (?:                             # Non capturing group
    (19\d{2}|20(?:0\d|1[0-6])).*  # Save years (1900-2016) in $2
  |                               # OR
    (?:(?=bluray|\d+p|brrip)..*)? # Match string starting with bluray,brrip,720p...
    [.](mkv|avi|mpe?g)$)          # Match extension (.mkv,.avi.,mpeg) add your own
/i                                # make the regex case insensitive

Regular expression visualization

Regex Customization

The list of exceptions and extensions can be easily filled with new values little by little during tests when/if needed (as file extension, e.g. to add .wmv and .flv add them to the (mkv|avi|mpe?g|wmv|flv) section of the regex) or to make the section generic replace it with [.]\w{3,4}$.