Hi MortifiedCrow63
I have to admit this is very strange, I think the fact it works for the artifacts and not for the model is kind of a fluke ...
If you use the "wait_on_upload" argument in upload_artifact you end up with the same behavior. Even if uploaded in the background, the issue is still there; for me it was revealed the minute I limited the upload bandwidth to under 300kbps. It seems the internal GS timeout assumes every chunk should be uploaded in under 60 seconds.
The default chunk size is 100MB (I think), and anything below it is a single stream upload.
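To put rough numbers on the bandwidth observation, here is a back-of-the-envelope sketch. It assumes the 60 second limit covers a whole upload request, and uses the ~32 MB model size and the 300kbps cap mentioned in this thread (1 MB taken as 10^6 bytes, 1 kbps as 1000 bit/s). The function names are only illustrative.

```python
def upload_seconds(size_bytes: float, bandwidth_kbps: float) -> float:
    """Time needed to push size_bytes through a link of bandwidth_kbps (kilobits/s)."""
    return size_bytes * 8 / (bandwidth_kbps * 1000)


def max_bytes_within(timeout_sec: float, bandwidth_kbps: float) -> float:
    """Largest payload that fits inside timeout_sec at the given bandwidth."""
    return bandwidth_kbps * 1000 * timeout_sec / 8


model_size = 32 * 10**6  # ~32 MB MNIST model from this thread (assumed 10^6 bytes per MB)

# Roughly 853 seconds for the whole file, far above a 60 second limit.
print(upload_seconds(model_size, 300))

# About 2.25 MB is the most that fits in 60 seconds at 300kbps.
print(max_bytes_within(60, 300))
```

So at that bandwidth any single request carrying more than a couple of MB would hit a 60 second limit, which matches the behavior described above.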
I'm not sure what's the right route to take here, should we externally configure the GS package? It seems like an internal issue of the GS package, and I'm not sure it is our place to fix it.
wdyt ?
BTW:
https://github.com/googleapis/python-storage/issues/263
https://github.com/googleapis/python-storage/issues/183
Just curious about the timeout, was it configured by clearML or the GCS? Can we customize the timeout?
I'm assuming this is GCS, in the end the actual upload is done by the GCS python package.
Maybe there is an env variable ... Let me google it
This one: https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py
My change is only adding output_uri to use a GCS path
Hi MortifiedCrow63
I finally got GS credentials, there is something weird going on. I can verify the issue, with model upload I get timeout error while upload_artifacts just works.
Just updating here that we are looking into it.
Hi AgitatedDove14 , any update on the bug of GCS timeout?
Noted AgitatedDove14 , so it’s likely a bandwidth issue. Let me try the suggestion from the GitHub issue first. Thanks man!
No worries AgitatedDove14 , thanks for helping me.
Just curious about the timeout, was it configured by clearML or the GCS? Can we customize the timeout?
Thanks AgitatedDove14 ,
I think so. Can we configure the timeout from the ClearML interface?
(I’m assuming the upload could take longer).
Hi MortifiedCrow63
saw
, ...
By default ClearML will only log the exact local place where you stored the file, I assume this is it.
If you pass output_uri=True to Task.init it will automatically upload the model to the files_server and then the model repository will point to the files_server (you can also have any object storage as model storage, e.g. output_uri=s3://bucket)
Notice you can also set it as default configuration (local or on the agent):
https://github.com/allegroai/clearml/blob/f46561629f1a7d4a05c7ae135de98db99439c989/docs/clearml.conf#L156
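For reference, a minimal sketch of passing output_uri to Task.init with a GCS destination. The bucket name, folder, project name and task name are placeholders, and the gs_uri helper is hypothetical and only there to build the path; it is not part of ClearML.

```python
def gs_uri(bucket: str, *parts: str) -> str:
    """Build a gs:// destination from a bucket name and optional sub-folders."""
    cleaned = [p.strip("/") for p in parts if p]
    return "/".join(["gs://" + bucket.strip("/"), *cleaned])


def init_task_with_gcs_output(bucket: str, folder: str = "models"):
    """Create a ClearML task whose auto-logged models are uploaded to GCS."""
    # Imported here so the helper above can be used without clearml installed.
    from clearml import Task

    return Task.init(
        project_name="examples",            # placeholder project name
        task_name="tensorflow mnist gcs",   # placeholder task name
        output_uri=gs_uri(bucket, folder),  # output_uri=True would use the files_server instead
    )


print(gs_uri("my-bucket", "models"))  # gs://my-bucket/models
```

Calling init_task_with_gcs_output("my-bucket") needs GCS credentials configured for ClearML, same as in the rest of this thread.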
Internally we use blob.upload_from_file, it has a default 60sec timeout on the connection (I'm assuming the upload could take longer).
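To check whether the timeout is really the culprit, one option is to call the GCS package directly, outside ClearML, with a longer timeout and a smaller chunk size. This is only a sketch: it assumes a google-cloud-storage release where the upload methods accept a timeout argument, and the bucket, file and blob names are placeholders. As far as I understand, setting chunk_size makes the upload resumable, so the timeout then applies to each chunk request rather than to the whole file.

```python
CHUNK_MULTIPLE = 256 * 1024  # the GCS package requires chunk sizes in multiples of 256 KiB


def round_chunk_size(requested_bytes: int) -> int:
    """Round down to a multiple of 256 KiB, never going below one multiple."""
    return max(CHUNK_MULTIPLE, (requested_bytes // CHUNK_MULTIPLE) * CHUNK_MULTIPLE)


def upload_with_timeout(bucket_name: str, local_path: str, remote_name: str,
                        timeout_sec: float = 600.0,
                        chunk_bytes: int = 5 * 1024 * 1024) -> None:
    """Upload a local file to GCS with an explicit timeout and a small chunk size."""
    # Imported here so round_chunk_size can be tried without the package installed.
    from google.cloud import storage

    client = storage.Client()  # uses the default Google credentials of the environment
    blob = client.bucket(bucket_name).blob(
        remote_name, chunk_size=round_chunk_size(chunk_bytes)
    )
    blob.upload_from_filename(local_path, timeout=timeout_sec)


print(round_chunk_size(5 * 1024 * 1024))  # 5242880
```

If upload_with_timeout("my-bucket", "model.h5", "models/model.h5") succeeds on the throttled connection while the default call fails, that points at the 60sec default rather than at the credentials.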
Could you test with the same file? Maybe timeout has something to do with the file size ?
Hi MortifiedCrow63
Sorry getting GS credentials is taking longer than expected 🙂
Nonetheless it should not be an issue (model upload is essentially using the same StorageManager internally)
noted AgitatedDove14 ,
just wondering why the behavior between auto logging and manual upload (using StorageManager) can yield different results. Do you think we’re using different components here?
If the problem is coming from GCS, the StorageManager should also fail, right?
Maybe that's the issue :
https://github.com/googleapis/python-storage/issues/74#issuecomment-602487082
AgitatedDove14 already did that and it works, my tested command: manager.upload_file
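For anyone who wants to repeat the manual test, a minimal sketch using StorageManager.upload_file. The bucket name, the "manual_test" prefix and the remote_url_for helper are placeholders made up for the example, not something taken from the thread.

```python
import os


def remote_url_for(local_file: str, bucket: str, prefix: str = "manual_test") -> str:
    """Build the gs:// target for a local file, keeping only its file name."""
    return "gs://{}/{}/{}".format(bucket, prefix.strip("/"), os.path.basename(local_file))


def manual_upload(local_file: str, bucket: str) -> str:
    """Upload one file through ClearML's StorageManager and return the remote URL."""
    # Imported here so the helper above can be used without clearml installed.
    from clearml import StorageManager

    return StorageManager.upload_file(
        local_file=local_file,
        remote_url=remote_url_for(local_file, bucket),
    )


print(remote_url_for("/tmp/model.h5", "my-bucket"))  # gs://my-bucket/manual_test/model.h5
```

Uploading the exact model file that fails in the auto-logging path makes the comparison fair, since a smaller test file could finish inside the timeout.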
ClearML version: 1.0.2
ClearML Server version: 1.0.0-93
I do not think this is the upload timeout, it makes no sense to me for GCP package (we do not pass any timeout, it's their internal default for the argument) to include a 60sec timeout for upload...
I'm also not sure where is the origin of the timeout (I'm assuming the initial GCP handshake connection could not actually timeout, as the response should be relatively quick, so 60sec is more than enough)
MortifiedCrow63 , hmmm can you test with manual upload and verify ?
(also what's the clearml version you are using)
Hi MortifiedCrow63 , thank you for pinging! (seriously greatly appreciated!)
See here:
https://github.com/googleapis/python-storage/releases/tag/v1.36.0
https://github.com/googleapis/python-storage/pull/374
Can you test with the latest release, see if the issue was fixed?
https://github.com/googleapis/python-storage/releases/tag/v1.41.0
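Before re-testing, it may help to confirm which google-cloud-storage version is actually installed in the environment. A small sketch using only the standard library (Python 3.8+); the 1.41.0 threshold is simply the release linked above, and the version parsing is deliberately simplistic.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def version_tuple(text: str) -> tuple:
    """Turn '1.41.0' into (1, 41, 0); non-numeric suffixes such as 'rc1' are ignored."""
    numbers = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 3:  # pad short versions such as '1.41'
        numbers.append(0)
    return tuple(numbers)


def is_at_least(installed: str, minimum: str = "1.41.0") -> bool:
    """True when the installed version is the minimum release or newer."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_gcs_version():
    """Return the installed google-cloud-storage version, or None if missing."""
    try:
        return version("google-cloud-storage")
    except PackageNotFoundError:
        return None


found = installed_gcs_version()
if found is None:
    print("google-cloud-storage is not installed")
else:
    print(found, "ok" if is_at_least(found) else "older than 1.41.0")
```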
That’s the question i want to raise too,
No file size limit
Let me try to run it myself
Thanks AgitatedDove14 , i missed that one.
Thanks for confirming AgitatedDove14 , any github issue that i can follow?
This looks exactly like the timeout you are getting.
I'm just not sure what's the diff between the Model autoupload and the manual upload.
That’s the question I want to raise too, is there any limit on the file size? The size is actually ~32 MB, just using your MNIST example
Can we raise the size limit?
The next question is about uploading the model artifact using cloud storage.
I’m trying to use Google Cloud Storage to store my model checkpoint, however it failed with the following errors:
2021-05-12 18:51:53,335 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2021-05-12 18:51:53,335 - clearml.Task - INFO - Completed model upload to
2021-05-12 18:51:54,298 - clearml.Task - INFO - Finished uploading
It said the uploading process timed out, but the next line said the upload is complete.
After checking the bucket, I found nothing (meaning the model was not actually uploaded).
Any idea about the timeout reason AgitatedDove14 ? I believe I already use the correct credentials, and I tested them manually using the StorageManager SDK
Thanks