Find DateTime in the Text Using Python

Why do we need to find DateTime in the text? In the field of data science, we often have to deal with various kinds of data and one of the common is Text data but sometimes datetime in the text has to be extracted. Before jumping into a topic let's first start with a problem I recently encountered. I was given a task to extract sent and received email messages from a long thread of multipart emails. The email I used to get would be something like the below:


eml = """Re: Documents Received

John Doe <john@doe.org>
Wed, Jun 1, 2011, 9:39 PM
to Emma, Don, Bucky

Lorem
Ipsum
Dorem

On 01/06/2011, at 7:57 PM, "Emma" <emma@thompson.com> wrote:

Lorem Ipsum?

Thanks John

On 1 June 2011 13:43, Bucky Hallam <bucky@barnes.com> wrote:

Lorem Ipsum is Dorem.

Thanks Emma"""

The above text is modified from an original multipart email. My goal was to separate received and sent emails from one another. I have written a blog about how to retrieve emails as well, please give it a try if you are interested. I split the entire text by wrote: and then have to split parts again using On sent date. The first part was easy but due to different variants of dates, the second part got terribly hard. Some of the first emails I worked on had dates like On Sun 11, 2022, and I created a list like below

[f'On {d},' for d in ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']]

['On Sun,', 'On Mon,', 'On Tue,', 'On Wed,', 'On Thu,', 'On Fri,', 'On Sat,']

It worked for some but when sent dates were in a different format based on mail servers, this failed. Now there is a number of ways one could do it. But all are based on finding the pattern of DateTime in the text. Usually, DateTime in the text has patterns like YYYY/MM/DD HH:MM:SS, DD/MM/YYYY HH:MM:SS and so on we could prepare regex for that and find where it matched.

import re

Finding Date Using Regex

Format 1

Let's try to find the datetime in the text from the format YYYY/MM/DD without any time.

pattern = r'\d{4}/\d{2}/\d{2}'
txt = "This is 2022/11/11 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022/11/11', '2022/11/12']

This works well but not in the case when another format like - is used instead of /.

pattern = r'\d{4}/\d{2}/\d{2}'
txt = "This is 2022-11-11 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022/11/12']

It missed date here. We can simply use the or operator to add another format there.

pattern = r'(\d{4}-\d{2}-\d{2}|\d{4}/\d{2}/\d{2})'
txt = "This is 2022-11-11 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022-11-11', '2022/11/12']

Format 2

Let's use time too.

pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\d{4}/\d{2}/\d{2})'
txt = "This is 2022-11-11 14:23:19 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022-11-11 14:23:19', '2022/11/12']

It worked but not much in the cases as we have in email. But we can split our text based on a found date too and it's very useful above.

pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\d{4}/\d{2}/\d{2})'
txt = "This is 2022-11-11 14:23:19 and we are waiting for 2022/11/12."
print(re.split(pattern, txt, re.DOTALL))

['This is ', '2022-11-11 14:23:19', ' and we are waiting for ', '2022/11/12', '.']

Our Email

There are different formats of datetime in the text in our above email.

Wed, Jun 1, 2011, 9:39 PM
01/06/2011, at 7:57 PM
1 June 2011 13:43

And all of the above 3 requires different pattern as well so it's little tricky and more hard work to find them.

For Jun 1, 2011, 9:39 PM

pattern = r'([0-3]?[0-9], \d{4}, [0-2]?[0-9]:[0-5][0-9] [AaPp][Mm])'
txt = 'This is Wed, Jun 1, 2011, 9:29 AM and Wed, Jun 1, 2011, 19:39 PM'
print(re.findall(pattern, txt, re.DOTALL))

['1, 2011, 9:29 AM', '1, 2011, 19:39 PM']

For 01/06/2011, at 7:57 PM

pattern = r'([0-1]?[0-2]/[0-3]?[0-9]/\d{4}, at [0-2]?[0-9]:[0-5][0-9] [AaPp][Mm])'
txt = 'This is 01/06/2011, at 7:57 PM and 01/06/2011, at 19:57 PM'
print(re.findall(pattern, txt, re.DOTALL))

['01/06/2011, at 7:57 PM', '01/06/2011, at 19:57 PM']

For 1 June 2011 13:43

pattern = r'([0-1]?[0-2] \w{3,} \d{4} [0-2]?[0-9]:[0-5][0-9])'
txt = 'This is 1 June 2011 13:43'
print(re.findall(pattern, txt, re.DOTALL))

['1 June 2011 13:43']

But finding the pattern for each format is not a good solution. And there is not a golden pattern either. However, there are some Python packages that can help us in these cases.

Using `dateutil`

If this package is not installed, please do it by pip install dateutil.

from dateutil.parser import parse

By simply calling the parse method, we can get datetime.

parse('This is 1 June 2011 13:43', fuzzy_with_tokens=True)

(datetime.datetime(2011, 6, 1, 13, 43), ('This is ', ' '))

But this doesn't work always.

parse('This is Wed, Jun 1, 2011, 9:29 AM and Wed, Jun 1, 2011, 19:39 PM', fuzzy_with_tokens=True)

---------------------------------------------------------------------------

ParserError                               Traceback (most recent call last)

 in 
----> 1 parse('This is Wed, Jun 1, 2011, 9:29 AM and Wed, Jun 1, 2011, 19:39 PM', fuzzy_with_tokens=True)

C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(timestr, parserinfo, **kwargs)
   1372         return parser(parserinfo).parse(timestr, **kwargs)
   1373     else:
-> 1374         return DEFAULTPARSER.parse(timestr, **kwargs)
   1375
   1376

C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    647
    648         if res is None:
--> 649             raise ParserError("Unknown string format: %s", timestr)
    650
    651         if len(res) == 0:

ParserError: Unknown string format: This is Wed, Jun 1, 2011, 9:29 AM and Wed, Jun 1, 2011, 19:39 PM

parse('This is 01/06/2011, at 7:57 PM and 01/06/2011, at 19:57 PM', fuzzy_with_tokens=True)

---------------------------------------------------------------------------

ParserError                               Traceback (most recent call last)

 in 
----> 1 parse('This is 01/06/2011, at 7:57 PM and 01/06/2011, at 19:57 PM')

C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(timestr, parserinfo, **kwargs)
   1372         return parser(parserinfo).parse(timestr, **kwargs)
   1373     else:
-> 1374         return DEFAULTPARSER.parse(timestr, **kwargs)
   1375
   1376

C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    647
    648         if res is None:
--> 649             raise ParserError("Unknown string format: %s", timestr)
    650
    651         if len(res) == 0:

ParserError: Unknown string format: This is 01/06/2011, at 7:57 PM and 01/06/2011, at 19:57 PM

Using `dateparser`

I found this package to be more effective than dateutil. Please install it using pip install dateparser.

from dateparser.search import search_dates

search_dates(eml)

[('Wed, Jun 1, 2011, 9:39 PM', datetime.datetime(2011, 6, 1, 21, 39)),
 ('On 01/06/2011, at 7:57 PM', datetime.datetime(2011, 1, 6, 19, 57)),
 ('On 1 June 2011 13:43', datetime.datetime(2011, 6, 1, 13, 43))]

We can see that it found all the date times. And it also returns in the native python DateTime object. Isn't it awesome?

Using `datefinder`

This is another package that can find dates from the text. Please install it using pip install datefinder

!pip install datefinder

Collecting datefinder
  Downloading datefinder-0.7.3-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: pytz in c:\programdata\anaconda3\lib\site-packages (from datefinder) (2020.1)
Requirement already satisfied: python-dateutil>=2.4.2 in c:\programdata\anaconda3\lib\site-packages (from datefinder) (2.8.1)
Requirement already satisfied: regex>=2017.02.08 in c:\programdata\anaconda3\lib\site-packages (from datefinder) (2020.10.15)
Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.4.2->datefinder) (1.15.0)
Installing collected packages: datefinder
Successfully installed datefinder-0.7.3

from datefinder import find_dates

list(find_dates(eml))

[datetime.datetime(2011, 6, 1, 21, 39),
 datetime.datetime(2011, 1, 6, 19, 57),
 datetime.datetime(2011, 6, 1, 13, 43)]

This also gets our job done but we are more concerned about the original date format.

That's all for now and for my use case, I found date parser to be best. What is yours?

For more content like this one, please stay exploring our site or signup for the newsletter.

Find DateTime in the Text Using Python

Finding Date Using Regex

Format 1

Format 2

Our Email

For Jun 1, 2011, 9:39 PM

For 01/06/2011, at 7:57 PM

For 1 June 2011 13:43

Using `dateutil`

Using `dateparser`

Using `datefinder`

Related

Leave a ReplyCancel reply

Finding Date Using Regex

Format 1

Format 2

Our Email

For Jun 1, 2011, 9:39 PM

For 01/06/2011, at 7:57 PM

For 1 June 2011 13:43

Using dateutil

Using dateparser

Using datefinder

Share this:

Related

Leave a ReplyCancel reply

Using `dateutil`

Using `dateparser`

Using `datefinder`