How to handle processing of large files on GAE?

December 15, 2016, at 12:44 PM

I'm looking for a robust and fast way to process large files in Google App Engine.

It currently works as follows (a simplified workflow is at the end):

  1. The customer sends a CSV file that our server will process line by line.
  2. Once the file is uploaded, an entry is added to the Uploads NDB model with the CSV name, the file path (in Google Storage) and some basic information. Then a task called "pre-processing" is created.
  3. The pre-processing task loops over all the lines of the CSV file (there could be millions) and, for each line, adds an entry to the UploadEntries NDB model with the CSV id, the line, the data to extract/process, and two boolean flags indicating whether the line has started and finished processing ("is_treating", "is_done").
  4. Once the pre-processing task has ended, it notifies the client that "XXX lines will be processed".
  5. A dispatch method is then called. That method will:
    • Search for the UploadEntries that have both is_treating and is_done at False,
    • Add a task to a Redis datastore for the next line found (the Redis datastore is used because the actual work is done on servers not managed by Google),
    • Also create a new Process-healthcheck task (this task runs after 5 minutes and checks that step 7 was correctly executed; if not, it considers that the Redis/outside server has failed and does the same as step 7, but with "error" instead of the result),
    • Then update UploadEntries.is_treating to True for that entry.
  6. The outside server processes the data and returns the result by making a POST request to an endpoint on the main server.
  7. That endpoint updates the UploadEntries entry in the Datastore (including "is_treating" and "is_done") and then starts the next line.
  8. When the search for next entries returns nothing, I consider the file finally processed, and call the post-process task, which rebuilds the CSV with the processed data and returns it to the customer.
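The query-and-claim logic in steps 3–5 can be sketched without any GAE specifics. Everything below (UploadEntry, fetch_next_entries, dispatch) is an illustrative stand-in for the real NDB model and queries, not the actual code:

```python
from dataclasses import dataclass
from typing import List

# Illustrative stand-in for the UploadEntries NDB model described above.
@dataclass
class UploadEntry:
    csv_id: str
    line_no: int
    data: str
    is_treating: bool = False  # line has started processing
    is_done: bool = False      # line has finished processing

def fetch_next_entries(entries: List[UploadEntry], limit: int) -> List[UploadEntry]:
    """Step 5: find up to `limit` lines that are neither in progress nor done."""
    return [e for e in entries if not e.is_treating and not e.is_done][:limit]

def dispatch(entries: List[UploadEntry], limit: int) -> List[UploadEntry]:
    """Claim the next batch by marking each found entry as in progress.

    In the real system this would also push a task to Redis and create
    the healthcheck task before flipping is_treating.
    """
    batch = fetch_next_entries(entries, limit)
    for e in batch:
        e.is_treating = True
    return batch
```

In the real system the search would be an NDB query with `.fetch(limit)`, which is what makes the `limit` argument mentioned below possible.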

Here are a few things to keep in mind:

  1. The servers that do the real work are outside of Google App Engine; that's why I had to bring in Redis.
  2. The current design gives me flexibility over the number of entries processed in parallel: in step 5, the method has a limit argument that lets me search for n entries to process in parallel. It can be 1, 5, 20, 50.
  3. I can't just push all the lines from the pre-processing task directly to Redis, because then the next customer would have to wait for the first file to finish processing, and the backlog would pile up and take too long.

But this system has various issues, and that's why I'm asking for your help:

  1. Sometimes this system is so fast that the Datastore is not yet updated when the next query runs, and the entries returned are already being processed (entry.is_treating = True simply hasn't been pushed to the database yet).
  2. Redis or my server (I don't really know which) sometimes loses the task, or the POST request after processing is never made, so the task never reaches is_done = True. That's why I had to implement the healthcheck system, to ensure the line is correctly processed no matter what. This has a second advantage: the name of that task contains the CSV id and the line number, making it unique per file. If the Datastore is not up to date and the same task runs twice, creating the healthcheck fails because a task with the same name already exists. That tells me there is a concurrency issue, so I ignore that task, since it means the Datastore is not yet up to date.
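One way to remove the first race is to make the "claim" of a line atomic instead of a plain read-then-write: in NDB that means doing the get/check/put inside a transaction (e.g. with @ndb.transactional), so a second dispatcher that picks the same entry fails the check and skips it. Below is a framework-free sketch of that claim idea, with illustrative names (EntryClaims, try_claim) that are not part of any real API:

```python
import threading

class EntryClaims:
    """Sketch: the read-check-write must be atomic so two dispatchers
    can never claim the same line. A lock plays the role that an NDB
    transaction would play in the real system."""

    def __init__(self):
        self._lock = threading.Lock()
        self._treating = set()  # (csv_id, line_no) pairs already claimed

    def try_claim(self, csv_id, line_no):
        """Atomically claim a line; returns False if it was already claimed."""
        with self._lock:
            key = (csv_id, line_no)
            if key in self._treating:
                return False
            self._treating.add(key)
            return True
```

A dispatcher would only enqueue a line to Redis when try_claim returns True, which makes a duplicate pick-up harmless instead of relying on the healthcheck name collision to detect it.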

I initially thought about processing the file in one independent task, line by line, but that has the big disadvantage of not being able to process multiple lines in parallel. Moreover, Google limits a task's runtime to 24 hours on dedicated targets (not default), and when the file is really big, it can run for longer than 24 hours.

For information, if it helps: I'm using Python.

And to simplify the workflow, here's what I'm trying to achieve in the best way possible:

  • Process a large file, running multiple parallel processes, one per line.
  • Send the work to an outside server using Redis. Once done, that outside server returns the result via a POST request to the main server.
  • The main server then updates the information about that line and moves on to the next line.

I'd really appreciate it if someone has a better way of doing this. I really believe I'm not the first to do this kind of work, and I'm pretty sure I'm not doing it correctly.

(I believe Stack Overflow is the best Stack Exchange site for this kind of question, since it's an algorithm question, but it's also possible I missed a better network for it. If so, I'm sorry about that.)

Answer 1

The servers that do the real work are outside of Google App Engine

Have you considered using Google Cloud Dataflow to process large files instead? It is a managed service that handles the file splitting and processing for you.

Based on initial thoughts, here is an outline of the process:

  • The user uploads files directly to Google Cloud Storage, using signed URLs or the Blobstore API.
  • A request from App Engine launches a small Compute Engine instance that initiates a blocking request (BlockingDataflowPipelineRunner) to launch the Dataflow task. (I'm afraid it needs to be a Compute Engine instance because of sandbox and blocking I/O issues.)
  • When the Dataflow task is finished, the Compute Engine instance is unblocked and posts a message to Pub/Sub.
  • The Pub/Sub message invokes a webhook on the App Engine service that changes the task's state from 'in progress' to 'complete', so the user can fetch their results.
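Conceptually, the Dataflow suggestion replaces the whole Redis/healthcheck machinery with a managed parallel map over CSV lines. A framework-free sketch of that shape is below; process_line is a placeholder for the real per-line work, and the actual pipeline would be written with the Apache Beam SDK rather than a thread pool:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def process_line(row):
    # Placeholder for the real per-line work the outside servers do today.
    return [cell.upper() for cell in row]

def run_pipeline(csv_text, parallelism=4):
    """Read -> parallel transform -> write: the shape a Dataflow
    pipeline provides as a managed, auto-scaled service."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves input order, like a keyed/ordered collect.
        out_rows = list(pool.map(process_line, rows))
    buf = io.StringIO()
    csv.writer(buf).writerows(out_rows)
    return buf.getvalue()
```

The managed service handles the hard parts the question struggles with: splitting the input, retrying failed lines, and scaling the parallelism, so no per-line bookkeeping entities are needed.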