3 minutes
Regular expression in dealing with datetime
Last week I encountered a medical dataset that has multiple types of date time format. My task is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.
Some of the variants are like this:
There are several rules that I need to follow:
- Assume all dates in xx/xx/xx format are mm/dd/yy
- Assume all dates where year is encoded in only two digits are years from the 1900’s (e.g. 1/5/89 is January 5th, 1989)
- If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
- If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
- Watch out for potential typos as this is a raw, real-life derived dataset.
Firstly we need to import a few modules and convert the txt
file to pd.DataFrame
:
Module dateutil.parser
is used to convert strings to datetime objects.
Now we need to find out all date formats that belong to different categories. Regular expression is very handy for this kind of task
lst1
finds date in form like:
lst2
finds date form as:
lst3
finds date form like:
lst4
finds date form as:
lst5
is similar as lst4
, both only contain year and month:
lst6
finds dates that only have years:
Remember we have some rules to follow, the final date form has to includedate,month,year
. I defined specialized functions to transform dates in lst4, lst5 and lst6
:
Function date_parser
is to convert all forms of date into single one Year-Month-Day
. Then df.sort_values
sorted all date from the earliest to the latest.
The final form is as following: