Python: A* Routing From Dataframe With Longitude And Latitude

I have a dataframe with 30,000 records in the following format:

ID | Name | Latitude | Longitude | Country |
1  | Hull | 53.744   | -0.3456   | GB      |

I would like to select one record to be the start location and one record to be the destination and return a path (list) for the shortest path.

I am using Geopy to find the distance between points in km

import geopy.distance

coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)

print (geopy.distance.vincenty(coords_1, coords_2).km)

I have read how to do A* in python from the following tutorial:

However they create a grid system to navigate through.

This is a visual representation of the records in the dataframe: enter image description here

This is the code I have so far however it fails to find a path:

def calcH(start, end):
    coords_1 = (df['latitude'][start], df['longitude'][start])
    coords_2 = (df['latitude'][end], df['longitude'][end])
    distance = (geopy.distance.vincenty(coords_1, coords_2)).km
    return distance

^Calculates the distance between points

def getneighbors(startlocation):
    neighborDF = pd.DataFrame(columns=['ID', 'Distance'])
    coords_1 = (df['latitude'][startlocation], df['longitude'][startlocation])
    for index, row in df.iterrows():
        coords_2 = (df['latitude'][index], df['longitude'][index])
        distance = round((geopy.distance.vincenty(coords_1, coords_2)).km,2)
        neighborDF.loc[len(neighborDF)] = [index, distance]
    neighborDF = neighborDF.sort_values(by=['Distance'])
    neighborDF = neighborDF.reset_index(drop=True)

    return neighborDF[1:5]

^Returns the 4 closest locations (ignoring itself)

openlist = pd.DataFrame(columns=['ID', 'F', 'G', 'H', 'parentID'])
closedlist = pd.DataFrame(columns=['ID', 'F', 'G', 'H', 'parentID'])

startIndex = 25479 # Hessle
endIndex = 8262 # Leeds

h = calcH(startIndex, endIndex)
openlist.loc[len(openlist)] = [startIndex,h, 0, h, startIndex]

while True:

#sort the open list by F score
openlist = openlist.sort_values(by=['F'])
openlist = openlist.reset_index(drop=True)

currentLocation = openlist.loc[0]
closedlist.loc[len(closedlist)] = currentLocation
openlist = openlist[openlist.ID != currentLocation.ID]

if currentLocation.ID == endIndex:

adjacentLocations = getneighbors(currentLocation.ID)

if(len(adjacentLocations) < 1):
    print("No Neighbors: " + str(currentLocation.ID))

for index, row in adjacentLocations.iterrows():
    if adjacentLocations['ID'][index] in closedlist.values:

    if (adjacentLocations['ID'][index] in openlist.values) == False:

        g = currentLocation.G + calcH(currentLocation.ID, adjacentLocations['ID'][index])
        h = calcH(adjacentLocations['ID'][index], endIndex)
        f = g + h
        openlist.loc[len(openlist)] = [adjacentLocations['ID'][index], f, g, h, currentLocation.ID]
        adjacentLocationInDF = openlist.loc[openlist['ID'] == adjacentLocations['ID'][index]] #Get location from openlist
        g = currentLocation.G + calcH(currentLocation.ID, adjacentLocations['ID'][index])
        f = g + adjacentLocationInDF.H
        if float(f) < float(adjacentLocationInDF.F):
            openlist = openlist[openlist.ID != currentLocation.ID]
            openlist.loc[len(openlist)] = [adjacentLocations['ID'][index], f, g, adjacentLocationInDF.H, currentLocation.ID]

if (len(openlist)< 1):
    print("No Path")

Finds the path from the closed list:

# return the path
pathdf = pd.DataFrame(columns=['name', 'latitude', 'longitude', 'country'])
def getParent(index):

    parentDF = closedlist.loc[closedlist['ID'] == index]
    pathdf.loc[len(pathdf)] = [df['name'][parentDF.ID.values[0]],df['latitude'][parentDF.ID.values[0]],df['longitude'][parentDF.ID.values[0]],df['country'][parentDF.ID.values[0]]]
    if index != startIndex:


Currently this implementation of A* isn't finding a complete path . Any suggestions?

Edit: I have tried increasing the number of considered neighbors from 4 to 10 and I got a path but not a optimum path:

enter image description here

We are trying to get from Hessle to Leeds.

enter image description here ^ available nodes

Raw Data: Link



I'm still not sure what's the problem with your appraoch, although there certainly are a few, as already mentioned in comments.

  • regarding only the closest four (or for that matter, any fixed number of) neighbors might lead to dead-ends or to certain portions of the graph being entirely cut off, e.g. isolated cities that are not within the "closest X" for any of their neighbors
  • your checks in the form x in dataframe.values will check whether x is any of the values in the numpy-array returned by values, not necessarily the ID field
  • using dataframes instead of a proper heap for the open-list and a hash-set for the closed-list makes the search unneccessarily slow, as you have to search and sort the entire lists all the time (not sure whether Pandas can speed up the lookup with indexing, but the sorting definitely takes time)

Anyhow, I found this to be an interesting problem and gave it a try. Turns out, though, that using dataframes as some kind of pseudo-heap was indeed very slow, and also I found the dataframe-indexing to be extremely confusing (and possibly error-prone?), so I changed the code to use namedtuple for data and a proper heapq heap for the openlist and a dict mapping nodes to their parents for the closedlist. Also, there are fewer checks than in your code (e.g. whether a node is already in the openlist) and those do not matter really.

import csv, geopy.distance, collections, heapq

Location = collections.namedtuple("Location", "ID name latitude longitude country".split())
data = {}
with open("stations.csv") as f:
    r = csv.DictReader(f)
    for d in r:
        i, n, x, y, c = int(d["id"]), d["name"], d["latitude"], d["longitude"], d["country"]
        if c == "GB":
            data[i] = Location(i,n,x,y,c)

def calcH(start, end):
    coords_1 = (data[start].latitude, data[start].longitude)
    coords_2 = (data[end].latitude, data[end].longitude)
    distance = (geopy.distance.vincenty(coords_1, coords_2)).km
    return distance

def getneighbors(startlocation, n=10):
    return sorted(data.values(), key=lambda x: calcH(startlocation, x.ID))[1:n+1]

def getParent(closedlist, index):
    path = []
    while index is not None:
        index = closedlist.get(index, None)
    return [data[i] for i in path[::-1]]

startIndex = 25479 # Hessle
endIndex = 8262 # Leeds

Node = collections.namedtuple("Node", "ID F G H parentID".split())

h = calcH(startIndex, endIndex)
openlist = [(h, Node(startIndex, h, 0, h, None))] # heap
closedlist = {} # map visited nodes to parent

while len(openlist) >= 1:
    _, currentLocation = heapq.heappop(openlist)

    if currentLocation.ID in closedlist:
    closedlist[currentLocation.ID] = currentLocation.parentID

    if currentLocation.ID == endIndex:
        for p in getParent(closedlist, currentLocation.ID):

    for other in getneighbors(currentLocation.ID):
        g = currentLocation.G + calcH(currentLocation.ID, other.ID)
        h = calcH(other.ID, endIndex)
        f = g + h
        heapq.heappush(openlist, (f, Node(other.ID, f, g, h, currentLocation.ID)))

This gives me this path from Hessle to Leeds, which seems more reasonable:

Location(ID=25479, name='Hessle', latitude='53.717567', longitude='-0.442169', country='GB')
Location(ID=8166, name='Brough', latitude='53.726452', longitude='-0.578255', country='GB')
Location(ID=25208, name='Eastrington', latitude='53.75481', longitude='-0.786612', country='GB')
Location(ID=25525, name='Howden', latitude='53.764526', longitude='-0.86068', country='GB')
Location(ID=7780, name='Selby', latitude='53.78336', longitude='-1.06355', country='GB')
Location(ID=26157, name='Sherburn-In-Elmet', latitude='53.797142', longitude='-1.23176', country='GB')
Location(ID=25308, name='Garforth Station', latitude='53.796211', longitude='-1.382083', country='GB')
Location(ID=8262, name='Leeds', latitude='53.795158', longitude='-1.549089', country='GB')

Even if you can't use this because you have to use Pandas (?), maybe this helps you finally spot your actual error.