How Do I Increase CPU Utilization For Processing A Dataframe In For Loop?
I have a dataset of about 200,000 addresses that I want to geocode (i.e., find the latitudes and longitudes of). My (simplified) code to do this is as follows:
    import pandas as pd
    import numpy as np

    df = pd.read_csv('dataset.csv')
    Latitudes = np.zeros(len(df))
    Longitudes = np.zeros(len(df))

    def geocode_address(address):
        ### The logic for geocoding an address
        ### and return its latitude and longitude
        ...  # placeholder for the elided logic

    for i in range(len(df)):
        try:
            lat, lon = geocode_address(df.Address[i])
        except Exception:
            lat = lon = np.nan  # '' cannot be stored in a float array
        Latitudes[i] = lat
        Longitudes[i] = lon
The problem is that each row (address) takes about 1-1.3 seconds to geocode, so this code will take at least a couple of days to finish running for the entire dataset. I am running this in a Jupyter notebook on Windows 10. When I look at the Task Manager, I see that the process jupyter.exe is taking only 0.3-0.7% of the CPU! That is surprisingly low, I think. Am I looking at the wrong process? If not, how do I increase the CPU utilization to at least, say, 50% for this code, so that the code can finish running in a few minutes or hours instead of taking a couple of days?
You're barking up the wrong tree. Your code is not CPU-bound, it's I/O-bound: there's no intensive computation going on, and most of the time is spent waiting on HTTP requests.
The canonical solution to such problems is parallelization (you may want to have a look at the multiprocessing module), and by itself it's quite easy to implement here, BUT you'll still have to deal with your geocoding API's rate limits.