Why Would Using a Database (Redis, SQL) Help When Loading Big Data and Running Out of RAM?


I need to load 100,000 images from a directory into one big dictionary, where the keys are the image ids and the values are the numpy arrays of the images' pixels. Creating this dict takes 19 GB of RAM, and I have 24 GB in total. Then I need to sort the dictionary by key, take only the values of the sorted dictionary, and save them as one big numpy array. I need this array because I want to send it to sklearn's train_test_split function and split the whole dataset into train and test sets with respect to their labels. I run out of RAM at the step where, after creating the 19 GB dictionary, I try to sort it. I found a question describing the same problem, "How to sort a LARGE dictionary", where people suggest using a database.

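import os

import numpy as np
from numpy.lib.format import open_memmap
# Assumed Keras imports; the original post does not show them
from keras.preprocessing import image
from keras.preprocessing.image import load_img

def save_all_images_as_one_numpy_array():
    # Map each image id (taken from the file name) to its pixel array
    data_dict = {}
    for img in os.listdir('images'):
        id_img = img.split('_')[1]
        loadimg = load_img(os.path.join('images', img))
        x = image.img_to_array(loadimg)
        data_dict[id_img] = x

    # Sort by numeric id and copy every image into one new array
    data_array = np.stack([v for k, v in sorted(data_dict.items(), key=lambda x: int(x[0]))])
    mmapfile = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+', shape=data_array.shape)
    mmapfile[:] = data_array[:]

def load_numpy_array_with_images():
    # Open the saved array read-only without loading it all into RAM
    return open_memmap('trythismmapfile.npy', dtype=np.float32, mode='r')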

When using np.stack, I am stacking each numpy array into a new array, and this is where I run out of RAM. I can't afford to buy more RAM. I thought I could use Redis in a Docker container, but I don't understand why and how using a database would solve my problem.



A database helps because the DB library stores data on the hard disk rather than in memory. If you look at the documentation for the library that the linked answer suggests, you'll see that the first argument is a filename, which shows that the hard disk is used.
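To make the disk-versus-memory point concrete, here is a minimal sketch using Python's standard-library shelve module. It is only a stand-in for a disk-backed store and is not necessarily the library the linked answer recommends; the function names store_arrays_on_disk and load_sorted are made up for this illustration.

```python
import shelve

import numpy as np


def store_arrays_on_disk(db_path, arrays_by_id):
    """Write (id, array) pairs to a disk-backed shelf, one entry at a time.

    Hypothetical helper: each array is pickled to the file at db_path,
    so the full collection never has to sit in RAM at once.
    """
    with shelve.open(db_path) as db:  # the first argument is a file name
        for key, arr in arrays_by_id:
            db[str(key)] = arr  # shelf keys must be strings


def load_sorted(db_path):
    """Yield the stored arrays one by one, ordered by numeric id."""
    with shelve.open(db_path, flag="r") as db:
        # Only the keys are sorted in memory; each array is read on demand
        for key in sorted(db.keys(), key=int):
            yield db[key]
```

Passing a generator of (id, array) pairs to store_arrays_on_disk means only one image needs to be in memory while writing, and load_sorted hands the arrays back in id order without loading them all.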

However, the linked question is about sorting by value, not by key. Sorting by key is much less memory-intensive because you can sort the file names before loading any image data, although you'll likely still have memory issues when training your model. I'd suggest trying something along the lines of:

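import os

import numpy as np
# Assumed Keras imports, matching the names used in the question
from keras.preprocessing import image
from keras.preprocessing.image import load_img

# Get the list of file names
imgs = os.listdir('images')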

# Create a mapping of ID to file name
# This will allow us to sort the IDs then load the files in order
img_ids = {int(img.split('_')[1]): img for img in imgs}

# Get the list of file names sorted by ID
sorted_imgs = [v for k, v in sorted(img_ids.items(), key=lambda x: x[0])]

# Define a function for loading a named img
# (named load_image_array so it does not shadow Keras's load_img and call itself)
def load_image_array(img):
    loadimg = load_img(os.path.join('images', img))
    return image.img_to_array(loadimg)

# Iterate through the sorted file names and stack the results
data_dict = np.stack([load_image_array(img) for img in sorted_imgs])