How To Create Large Number Of Entities In Cloud Datastore

My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files and, combined, the number of entities is around 50k. I have tried the following:

1. Read a CSV file line by line and create entities in the Datastore. Issue: it works, but the request times out and cannot create all the entities in one go.

2. Uploaded all the files to the Blobstore and read them into the Datastore. Issue: I tried a Mapper function to read the CSV files uploaded to the Blobstore and create entities in the Datastore, but the mapper does not work if the file size grows beyond 2 MB. I also tried simply reading the files in a servlet, but again hit the timeout issue.

I am looking for a way to create a large number (50k+) of entities in the Datastore in one go.



Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.

It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60 second deadline for task completion.

Task Queues

I would suggest you look into using Task Queues: when a CSV that needs processing is uploaded, push a task onto a queue for background processing.
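A minimal sketch of that hand-off. On App Engine you would `from google.appengine.api import taskqueue`; the stub class below only stands in for it so the pattern can be shown self-contained, and the `/process-csv` URL and `blob_key` parameter are hypothetical names for your worker endpoint:

```python
class _FakeTaskQueue(object):
    """Stand-in for google.appengine.api.taskqueue, for illustration only."""

    def __init__(self):
        self.tasks = []

    def add(self, url, params):
        # The real taskqueue.add also accepts queue_name=, countdown=, etc.
        self.tasks.append({'url': url, 'params': params})


taskqueue = _FakeTaskQueue()


def handle_upload(blob_key):
    """User-facing upload handler: enqueue the work and return immediately,
    instead of parsing the CSV inside the 60-second user request."""
    taskqueue.add(url='/process-csv', params={'blob_key': blob_key})
    return 'queued'
```

The key point is that the user request only records *what* to process; the actual Datastore writes happen later in the task handler, under the longer task deadline.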

When working with Task Queues, the tasks themselves still have a deadline, but one that is longer than 60 seconds (10 minutes on automatically scaled instances). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError, so that you can save where you are up to in a CSV and resume from that position when the task is retried.
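The resume logic might look like the sketch below, which checkpoints an offset instead of relying on catching the error. The deadline constants and the idea of returning the next row index are assumptions about how you would wire this into your task handler; actual entity creation is stubbed out:

```python
import csv
import io
import time

TASK_DEADLINE = 600   # 10 min for tasks on automatically scaled instances
SAFETY_MARGIN = 60    # stop well before the hard deadline


def process_csv(text, start_row=0, budget=TASK_DEADLINE - SAFETY_MARGIN):
    """Process CSV rows from start_row onward within a time budget.

    Returns (next_row, rows_created): next_row is the offset to resume
    from on retry, or None when the whole file has been processed.
    """
    started = time.time()
    created = []
    reader = csv.reader(io.StringIO(text))
    for i, row in enumerate(reader):
        if i < start_row:
            continue  # already handled by an earlier attempt
        if time.time() - started > budget:
            return i, created  # persist i somewhere; retry resumes here
        created.append(row)  # real code would put a Datastore entity here
    return None, created
```

When a retry comes in, you pass the saved offset as `start_row`, and the task picks up exactly where the previous attempt stopped.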

Caveat on catching DeadlineExceededError

Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.

If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:

  1. Switch to a manually scaled instance, which gives you a 24-hour deadline.
  2. Ensure your task saves progress and returns an error well before the 10 min deadline, so that it can be resumed correctly without having to catch the error.