Interpolate Specific Entries In DataFrame Depending On Groups
I have AIS data from several trips from Rotterdam to Hamburg. The route is split into 6 sectors with pre-defined sector borders, and I need to know where and when a ship entered the next sector. I tried just using the last record within a sector, but the data's resolution isn't high enough for that. So I would like to interpolate the time and latitude based on the longitude of the sector border.
You can see the borders I decided on for this trip in the following image. The longitude of crossing the border is always exactly on the border line. What I need to determine is the latitude at which this line is crossed by a ship.
My DataFrame looks like this:
   TripID  time  Latitude  Longitude  SectorID
0      42     7      52.9        4.4         1
1      42     8      53.0        4.6         1
2      42     9      53.0        4.7         1
3      42    10      53.1        4.9         2
4       5     9      53.0        4.5         1
5       5    10      53.0        4.7         1
6       5    11      53.2        5.0         2
7       5    12      53.3        5.2         2
where the border between sectors 1 and 2 is pre-defined at longitude 4.8, so I would like to interpolate both the latitude and time at longitude 4.8 for each trip and sector border. I'm guessing a good solution will involve something like df.groupby(['TripID', 'SectorID']).
I tried adding an entry for each trip and sector that contains only the Longitude of the sector's border and then using interpolate, but adding the entries takes about an hour for me and interpolating the missing values crashes right away.
The result I'm looking for should look like this:
   TripID  time  Latitude  Longitude  SectorID
0      42     7      52.9        4.4         1
1      42     8      53.0        4.6         1
2      42     9      53.0        4.7         1
8      42   9.5     53.05        4.8         1
3      42    10      53.1        4.9         2
4       5     9      53.0        4.5         1
5       5    10      53.0        4.7         1
9       5  10.3     53.06        4.8         1
6       5    11      53.2        5.0         2
7       5    12      53.3        5.2         2
I would also be happy about and able to work with a result that looks like this:
TripID  SectorID  leave_lat  leave_lon  leave_time
    42         1      53.05        4.8         9.5
     5         1      53.06        4.8        10.3
Please ask if my description of the problem isn't quite clear.
Answer
Since the usual pandas crew hasn't spotted this nice question, I'll give you a solution with some caveats. Here is the sample input I used (the starred notes are annotations, not part of the file):
TripID time Latitude Longitude
42 7 52.9 4.4
42 8 53.0 4.6
42 9 53.0 4.7 * missing value
42 10 53.1 4.9
42 11 53.2 4.9
42 12 53.3 5.3 * missing value
42 15 53.7 5.6
5 9 53.0 4.5
5 10 53.0 4.7 * missing value
5 11 53.2 5.0
5 12 53.4 5.2
5 14 53.6 5.3 * missing value
5 17 53.4 5.5
5 18 53.3 5.7
34 19 53.0 4.5
34 20 53.0 4.7
34 24 53.9 4.8 ** value already exists
34 25 53.8 4.9
34 27 53.8 5.3
34 28 53.8 5.3 * missing value
34 31 53.7 5.6
34 32 53.6 5.7
This code:
import numpy as np
import pandas as pd
#import data
df = pd.read_csv("test.txt", sep=r"\s+")
#set floating point output precision to prevent excessively long columns
pd.set_option("display.precision", 2)
#remember original column order
cols = df.columns
#define the sector borders
sectors = [4.8, 5.4]
#create all combinations of sector borders and TripIDs
dfborders = pd.DataFrame(index = pd.MultiIndex.from_product([df.TripID.unique(), sectors], names = ["TripID", "Longitude"])).reset_index()
#delete those combinations of TripID and Longitude that already exist in the original dataframe
dfborders = pd.merge(df, dfborders, on = ["TripID", "Longitude"], how = "right")
dfborders = dfborders[dfborders.isnull().any(axis = 1)]
#insert missing data points
df = pd.concat([df, dfborders])
#and sort dataframe to insert the missing data points in the right position
df = df[cols].groupby("TripID", sort = False).apply(pd.DataFrame.sort_values, ["Longitude", "time", "Latitude"])
#temporarily set longitude as index for value-based interpolation
df.set_index(["Longitude"], inplace = True, drop = False)
#interpolate group-wise
df = df.groupby("TripID", sort = False).apply(lambda g: g.interpolate(method = "index"))
#create sector ID column assuming that longitude is between -180 and +180
df["SectorID"] = np.digitize(df["Longitude"], bins = [-180] + sectors + [180])
#and reset index
df.reset_index(drop = True, inplace = True)
print(df)
produces the following output:
    TripID   time  Latitude  Longitude  SectorID
 0      42   7.00     52.90        4.4         1
 1      42   8.00     53.00        4.6         1
 2      42   9.00     53.00        4.7         1
 3      42   9.50     53.05        4.8         2  * interpolated data point
 4      42  10.00     53.10        4.9         2
 5      42  11.00     53.20        4.9         2
 6      42  12.00     53.30        5.3         2
 7      42  13.00     53.43        5.4         3  * interpolated data point
 8      42  15.00     53.70        5.6         3
 9       5   9.00     53.00        4.5         1
10       5  10.00     53.00        4.7         1
11       5  10.33     53.07        4.8         2  * interpolated data point
12       5  11.00     53.20        5.0         2
13       5  12.00     53.40        5.2         2
14       5  14.00     53.60        5.3         2
15       5  15.50     53.50        5.4         3  * interpolated data point
16       5  17.00     53.40        5.5         3
17       5  18.00     53.30        5.7         3
18      34  19.00     53.00        4.5         1
19      34  20.00     53.00        4.7         1
20      34  24.00     53.90        4.8         2
21      34  25.00     53.80        4.9         2
22      34  27.00     53.80        5.3         2
23      34  28.00     53.80        5.3         2
24      34  29.00     53.77        5.4         3  * interpolated data point
25      34  31.00     53.70        5.6         3
26      34  32.00     53.60        5.7         3
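Note that np.digitize with its default settings puts a point lying exactly on a border into the following sector, which is why the interpolated rows above carry SectorID 2 and 3, whereas the question's desired output keeps the border point in the sector being left. A minimal sketch of the difference, assuming the same sectors list as above (the longitudes array is made up for illustration):

```python
import numpy as np

sectors = [4.8, 5.4]
bins = [-180] + sectors + [180]

# illustrative longitudes, two of them exactly on a sector border
longitudes = np.array([4.4, 4.7, 4.8, 4.9, 5.4, 5.6])

# default (right=False): bins[i-1] <= x < bins[i],
# so a border point is counted towards the next sector
next_sector = np.digitize(longitudes, bins=bins)

# right=True: bins[i-1] < x <= bins[i],
# so a border point stays in the sector that is being left
left_sector = np.digitize(longitudes, bins=bins, right=True)

print(next_sector)
print(left_sector)
```

Passing right=True in the SectorID line of the code above should therefore give the labelling asked for in the question, as long as no longitude is exactly -180.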
Now to the caveats. I do not know how to add the missing rows in place; I will ask a separate question about that and update this answer if I get a solution. Until then, the side effect is that the table is sorted by Longitude within each TripID, and the approach assumes that Longitude never decreases along a trip, which might not hold in reality.
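One possible way around the sorting caveat is to keep each trip in time order and look for the first pair of consecutive records that straddle a border, then interpolate linearly between just those two records. The following is only a sketch of that idea, not the method used above: border_crossings is a hypothetical helper, it only detects eastward crossings (longitude increasing across the border), it reports only the first crossing per trip and border, and it assumes the borders are listed from west to east so that border k is where sector k is left. The sample data is taken from the question.

```python
import numpy as np
import pandas as pd


def border_crossings(df, borders):
    """Interpolate the first eastward crossing of each border for every trip."""
    rows = []
    for trip_id, g in df.groupby("TripID", sort=False):
        g = g.sort_values("time")  # keep the records in travel order
        lon = g["Longitude"].to_numpy(dtype=float)
        lat = g["Latitude"].to_numpy(dtype=float)
        t = g["time"].to_numpy(dtype=float)
        for sector_id, border in enumerate(borders, start=1):
            # consecutive records i, i+1 with lon[i] <= border < lon[i+1]
            hits = np.flatnonzero((lon[:-1] <= border) & (lon[1:] > border))
            if hits.size == 0:
                continue  # this trip never crosses this border eastwards
            i = hits[0]
            # fraction of the way from record i to record i+1
            frac = (border - lon[i]) / (lon[i + 1] - lon[i])
            rows.append({
                "TripID": trip_id,
                "SectorID": sector_id,
                "leave_lat": lat[i] + frac * (lat[i + 1] - lat[i]),
                "leave_lon": border,
                "leave_time": t[i] + frac * (t[i + 1] - t[i]),
            })
    columns = ["TripID", "SectorID", "leave_lat", "leave_lon", "leave_time"]
    return pd.DataFrame(rows, columns=columns)


# sample data from the question
sample = pd.DataFrame({
    "TripID": [42, 42, 42, 42, 5, 5, 5, 5],
    "time": [7, 8, 9, 10, 9, 10, 11, 12],
    "Latitude": [52.9, 53.0, 53.0, 53.1, 53.0, 53.0, 53.2, 53.3],
    "Longitude": [4.4, 4.6, 4.7, 4.9, 4.5, 4.7, 5.0, 5.2],
})

result = border_crossings(sample, borders=[4.8])
print(result.round(2))
```

This produces the compact per-trip table from the question rather than inserting rows into the original DataFrame, so the original row order is left untouched.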