3 minutes
Regular expression in dealing with datetime
Last week I encountered a medical dataset that has multiple types of date time format. My task is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.
Some of the variants are like this:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010
There are several rules that I need to follow:
- Assume all dates in xx/xx/xx format are mm/dd/yy
- Assume all dates where year is encoded in only two digits are years from the 1900’s (e.g. 1/5/89 is January 5th, 1989)
- If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
- If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
- Watch out for potential typos as this is a raw, real-life derived dataset.
Firstly we need to import a few modules and convert the txt
file to pd.DataFrame
:
import numpy as np
import dateutil.parser
import pandas as pd
import re
doc = []
with open('dates.txt') as file:
for line in file:
doc.append(line)
df = pd.Series(doc)
Module dateutil.parser
is used to convert strings to datetime objects.
Now we need to find out all date formats that belong to different categories. Regular expression is very handy for this kind of task
df = df.str.lower()
lst1 = df.str.findall(r'(\d{1,2}[-/]\d{1,2}[-/](?:\d{4}|\d{2}))')
lst2 = df.str.findall(r'(\d{2}\s(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:\.\s|,\s|\s)\d{4})')
lst3 = df.str.findall(r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:-|\s|\.\s)\d{2}(?:,\s|\s|-)\d{4})')
lst4 = df.str.findall(r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:\.\s|,\s|\s)\d{4})')
lst5 = df.str.findall(r'(\d{1,2}/\d{4})')
lst6 = df.str.findall(r'((?:19|20)\d{2})')
lst1
finds date in form like:
0 [03/25/93]
1 [6/18/85]
2 [7/8/71]
3 [9/27/75]
4 [2/6/96]
lst2
finds date form as:
125 [24 jan 2001]
126 [10 sep 2004]
127 [26 may 1982]
128 [28 june 2002]
129 [06 may 1972]
lst3
finds date form like:
194 [april 11, 1990]
195 [may 30, 2001]
196 [feb 18, 1994]
197 [february 18, 1981]
198 [october. 11, 2013]
lst4
finds date form as:
228 [september 1985]
229 [june 2011]
230 [may 1986]
231 [may 2016]
232 [july 1977]
lst5
is similar as lst4
, both only contain year and month:
343 [6/1998]
344 [6/2005]
345 [10/1973]
346 [9/2005]
347 [03/1980]
lst6
finds dates that only have years:
455 [1984]
456 [2000]
457 [2001]
458 [1982]
459 [1998]
Remember we have some rules to follow, the final date form has to includedate,month,year
. I defined specialized functions to transform dates in lst4, lst5 and lst6
:
def add_m_d(value):
return "January 1 "+value
def add_d1(value):
return value[:-4]+'1/'+value[-4:]
def add_d2(value):
return value[:-4]+' 1, '+value[-4:]
def add_19(value):
if len(value) == 8:
return value[:6]+'19'+value[6:]
elif len(value) == 7:
return value[:5]+'19'+value[5:]
elif len(value) == 6:
return value[:4]+'19'+value[4:]
else:
return value
def date_parser(value):
return dateutil.parser.parse(value).strftime("%Y-%m-%d")
Function date_parser
is to convert all forms of date into single one Year-Month-Day
. Then df.sort_values
sorted all date from the earliest to the latest.
The final form is as following:
date
9 1971-04-10
84 1971-05-18
2 1971-07-08
53 1971-07-11
28 1971-09-12
... ...
231 2016-05-01
141 2016-05-30
186 2016-10-13
161 2016-10-19
413 2016-11-01
500 rows × 1 columns