Regular expression in dealing with datetime

Last week I encountered a medical dataset that has multiple types of date time format. My task is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.

Some of the variants are like this:

* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010
Python

There are several rules that I need to follow:

Assume all dates in xx/xx/xx format are mm/dd/yy
Assume all dates where year is encoded in only two digits are years from the 1900’s (e.g. 1/5/89 is January 5th, 1989)
If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
Watch out for potential typos as this is a raw, real-life derived dataset.

Firstly we need to import a few modules and convert the txt file to pd.DataFrame:

import numpy as np
import dateutil.parser
import pandas as pd
import re

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
Python

Module dateutil.parser is used to convert strings to datetime objects.

Now we need to find out all date formats that belong to different categories. Regular expression is very handy for this kind of task

df = df.str.lower()
lst1 = df.str.findall(r'(\d{1,2}[-/]\d{1,2}[-/](?:\d{4}|\d{2}))')
lst2 =  df.str.findall(r'(\d{2}\s(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:\.\s|,\s|\s)\d{4})')
lst3 = df.str.findall(r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:-|\s|\.\s)\d{2}(?:,\s|\s|-)\d{4})')            
lst4 = df.str.findall(r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:\.\s|,\s|\s)\d{4})')
lst5 = df.str.findall(r'(\d{1,2}/\d{4})')
lst6 = df.str.findall(r'((?:19|20)\d{2})')
Python

lst1 finds date in form like:

    [03/25/93]
     [6/18/85]
      [7/8/71]
     [9/27/75]
      [2/6/96]
Python

lst2 finds date form as:

    [24 jan 2001]
    [10 sep 2004]
    [26 may 1982]
   [28 june 2002]
    [06 may 1972]
Python

lst3 finds date form like:

      [april 11, 1990]
        [may 30, 2001]
        [feb 18, 1994]
   [february 18, 1981]
   [october. 11, 2013]
Python

lst4 finds date form as:

  [september 1985]
       [june 2011]
        [may 1986]
        [may 2016]
       [july 1977]
Python

lst5 is similar as lst4, both only contain year and month:

   [6/1998]
   [6/2005]
  [10/1973]
   [9/2005]
  [03/1980]
Python

lst6 finds dates that only have years:

  [1984]
  [2000]
  [2001]
  [1982]
  [1998]
Python

Remember we have some rules to follow, the final date form has to includedate,month,year. I defined specialized functions to transform dates in lst4, lst5 and lst6:

def add_m_d(value):
    return "January 1 "+value

def add_d1(value):
    return value[:-4]+'1/'+value[-4:]

def add_d2(value):
    return value[:-4]+' 1, '+value[-4:]

def add_19(value):
    if len(value) == 8:
        return value[:6]+'19'+value[6:]
    elif len(value) == 7:
        return value[:5]+'19'+value[5:]
    elif len(value) == 6:
        return value[:4]+'19'+value[4:]
    else:
        return value

def date_parser(value):
    return dateutil.parser.parse(value).strftime("%Y-%m-%d")
Python

Function date_parser is to convert all forms of date into single one Year-Month-Day. Then df.sort_values sorted all date from the earliest to the latest. The final form is as following:

     date
1971-04-10
1971-05-18
1971-07-08
1971-07-11
1971-09-12
...	...
2016-05-01
2016-05-30
2016-10-13
2016-10-19
2016-11-01
rows × 1 columns
Python