Why Would Using a Database (Redis, SQL) Help When Loading Big Data and Running Out of RAM?
I need to take 100,000 images from a directory and put them all in one big dictionary where the keys are the IDs of the pictures and the values are the numpy arrays of the images' pixels. Creating this dict takes 19 GB of my RAM and I have 24 GB in total. Then I need to order the dictionary by key, take only the values of the ordered dictionary, and save them as one big numpy array. I need this big numpy array because I want to send it to sklearn's train_test_split function and split the whole data into train and test sets with respect to their labels. I found a question where people have the same problem of running out of RAM at the step where, after creating the 19 GB dictionary, I try to sort the dict: How to sort a LARGE dictionary. There, people suggest using a database.
import os
import numpy as np
from numpy.lib.format import open_memmap
# (assuming the Keras image utilities for load_img / img_to_array)
from keras.preprocessing import image
from keras.preprocessing.image import load_img

def save_all_images_as_one_numpy_array():
    data_dict = {}
    for img in os.listdir('images'):
        id_img = img.split('_')[1]
        loadimg = load_img(os.path.join('images', img))
        x = image.img_to_array(loadimg)
        data_dict[id_img] = x

    # Sort by numeric ID and stack the values into one big array
    data_dict = np.stack([v for k, v in sorted(data_dict.items(), key=lambda x: int(x[0]))])

    # Write the stacked array into a memory-mapped .npy file on disk
    mmamfile = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+', shape=data_dict.shape)
    mmamfile[:] = data_dict[:]

def load_numpy_array_with_images():
    a = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='r')
When using np.stack I am stacking each numpy array into a new array, and this is where I run out of RAM. I can't afford to buy more RAM. I thought I could use Redis in a Docker container, but I don't understand why and how using a database would solve my problem.
Answer
The reason using a DB helps is that the DB library stores data on the hard disk rather than in memory. If you look at the documentation for the library that the linked answer suggests, you'll see that the first argument is a filename, demonstrating that the hard disk is used.
https://docs.python.org/2/library/bsddb.html#bsddb.hashopen
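As a rough illustration of the same idea (this sketch uses the standard-library shelve module rather than the bsddb module from the linked docs), a disk-backed key/value store keeps the values in a file and only pulls into RAM the ones you actually ask for:

import shelve
import numpy as np

# 'image_store' and the array shape are just illustrative; shelve pickles values to a file on disk
with shelve.open('image_store') as store:
    store['42'] = np.zeros((224, 224, 3), dtype=np.float32)  # written to disk, not kept in RAM

with shelve.open('image_store') as store:
    arr = store['42']  # read back from disk on demand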
However, the linked question is talking about sorting by value, not key. Sorting by key will be much less memory-intensive, although you'll likely still have memory issues when training your model. I'd suggest trying something along the lines of:
# Get the list of file names
imgs = os.listdir('images')

# Create a mapping of ID to file name
# This will allow us to sort the IDs, then load the files in order
img_ids = {int(img.split('_')[1]): img for img in imgs}

# Get the list of file names sorted by ID
sorted_imgs = [v for k, v in sorted(img_ids.items(), key=lambda x: x[0])]

# Define a function for loading a named image as a numpy array
# (named load_img_array so it doesn't shadow Keras' load_img)
def load_img_array(img):
    loadimg = load_img(os.path.join('images', img))
    return image.img_to_array(loadimg)

# Iterate through the sorted file names and stack the results
data = np.stack([load_img_array(img) for img in sorted_imgs])
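If even the final stacked array doesn't fit in RAM, one further option is to skip np.stack entirely and write each image straight into the memory-mapped .npy file from your question, one at a time. A minimal sketch, reusing sorted_imgs and load_img_array from above and assuming every image has the same shape as the first one:

from numpy.lib.format import open_memmap

# Shape of a single image, taken from the first file (assumes all images match)
first = load_img_array(sorted_imgs[0])

# Pre-allocate the on-disk array, then fill it one image at a time
out = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+',
                  shape=(len(sorted_imgs),) + first.shape)
for i, name in enumerate(sorted_imgs):
    out[i] = load_img_array(name)  # only one image is held in RAM at a time
out.flush()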