How to handle duplicate dates in dates ranges in python

52
October 15, 2021, at 06:40 AM

Imagine an employee that has worked for different companies and you want to know how many months he worked for each company. The problem is that he may have worked for two companies, overlapping some time (case of companies D and C). This overlapping time shall not be included. Another thing is that in the each month he worked, it counts as one month, no matter how long he worked, for example, in company A, duration shall be 3 months (as he worked in months 1, 2 and 3), although the real duration would be 2 months.

    Company    Start              End       Duration (months)
       A      2021-01-01       2021-03-01
       B      2021-03-03       2021-06-07
       C      2021-06-10       2021-08-28
       D      2021-04-10       2021-10-02

I thought of something like this:

duration_days = end - start # dates are datetime object
duration_months = duration_days.days/30 # not sure if it is the right way (months with 28 days?)
day = start
if day<= end:
    duration_months += 1
    day = day + relativedelta(months=+1)
df.append()

I did not figure out how to remove the duplicates and I am not sure if my approach above is the best.

Thanks

Answer 1

You could use the dateutil package and dateutil.relativedelta:

import io
import pandas as pd
import calendar
from datetime import datetime
from dateutil.relativedelta import relativedelta
data = io.StringIO("""
Company,Start,End
A,2021-01-01,2021-03-01
B,2021-03-03,2021-06-07
C,2021-06-10,2021-08-28
D,2021-04-10,2021-10-02
""")
df = pd.read_csv(data)
def to_datetime(date):
    return datetime.strptime(date, "%Y-%m-%d")
df["Start"] = df["Start"].apply(to_datetime)
df["End"] = df["End"].apply(to_datetime)
def get_duration(start, end):
    left = df[(df["End"] > start) & (df["Start"] < start)]
    right = df[(df["Start"] > start) & (df["End"] > end)]
    overlaps = df[(df["Start"] < start) & (df["End"] > end)]
    within = df[(df["Start"] > start) & (df["End"] < end)]
    
    if not left.empty:
        start = left["End"].max()
    if not right.empty:
        end = right["Start"].min()
    
    if not overlaps.empty:
        return 0
    
    if not within.empty:
        ...  # still to be solved
    
    last_day_in_month = calendar.monthrange(end.year, end.month)[1]
    end = end.replace(day=last_day_in_month)
    start = start.replace(day=1)
    return relativedelta(end, start).months
df["Duration"] = df.apply(lambda row: get_duration(row["Start"], row["End"]), axis=1)

Output for print(df) would be:

  Company      Start        End  Duration
0       A 2021-01-01 2021-03-01         2
1       B 2021-03-03 2021-06-07         1
2       C 2021-06-10 2021-08-28         0
3       D 2021-04-10 2021-10-02         4

The problem is that he may have worked for two companies, overlapping some time (case of companies D and C). This overlapping time shall not be included.

I read this like: No time is considered while working for 2 or more companies. NB. The example above handles 3 of 4 overlapping cases.

However, I assume you want the time allocated once. In that case, you may be able to achieve this by only handling two overlapping cases:

def get_duration(start, end):
    right = df[(df["Start"] > start) & (df["End"] > end)]
    overlaps = df[(df["Start"] < start) & (df["End"] > end)]
    
    if not right.empty:
        end = right["Start"].min()
    
    if not overlaps.empty:
        return 0
    
    last_day_in_month = calendar.monthrange(end.year, end.month)[1]
    end = end.replace(day=last_day_in_month)
    start = start.replace(day=1)
    return relativedelta(end, start).months

Output for print(df) would be:

  Company      Start        End  Duration
0       A 2021-01-01 2021-03-01         2
1       B 2021-03-03 2021-06-07         1
2       C 2021-06-10 2021-08-28         0
3       D 2021-04-10 2021-10-02         6

Another thing is that in the each month he worked, it counts as one month, no matter how long he worked [...]

You can achieve this with:

last_day_in_month = calendar.monthrange(end.year, end.month)[1]
end = end.replace(day=last_day_in_month)
start = start.replace(day=1)
months = relativedelta(end, start).months

Note: Depending on your data and the amount of the data, the example above may not perform well. However, hopefully my answer gives you an idea.

Answer 2

I suggest making a new variable for each date and then when your done append all of the variables to a list! Like this:

duration_days = end - start # dates are datetime object
duration_months = duration_days.days/30
day = start
if day<= end:
    duration_months += 1
    day = day + relativedelta(months=+1)
    date1 = (duration_months)
    date2 = (duration_months)
    date3 = (duration_months)
df.append(date1, date2, date3)

Tell me if that works!

READ ALSO
Add foldable sections to HTML email?

Add foldable sections to HTML email?

When looking for how webpages can make foldable sections, usually they dive straight into JavaScript and customizing the design

45
Cannot load drawable in image view Android Studio

Cannot load drawable in image view Android Studio

I am trying to capture a url from a EditText, download the image with that url and set it to an ImageViewI don't know why but it doesn't show any image when I run the program

75
React Framer-motion: Trying to add animation based on navigating back or forward

React Framer-motion: Trying to add animation based on navigating back or forward

I have made a replica of my component in the code sandbox link below:

61
window.print() not working on android devices

window.print() not working on android devices

I have prepared a method that prints the element , but it only works properly on the Windows platform and does not work on the Android platformAfter searching, I realized that Google Cloud Print can be used, but I realized that it is no longer supported

46