Hi CurvedDolphin95
I would first check the free space on the instance (it might be that git is reporting an inaccurate error, and it's free space, not permissions, that is causing the clone to fail).
I would also check your GitHub account; notice that they now only support user/api-key (and not user/pass), which means you need to create an api-key and add it as your password in the clearml.conf.
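For example, a sketch of the relevant clearml.conf section (assuming the clone is done by a clearml-agent; the username and token values are placeholders you replace with your own):
agent {
    git_user: "your-github-username"          # placeholder
    git_pass: "your-personal-access-token"    # the API token goes here instead of a password
}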
Any chance that for some reason some of the Tasks are running as a different user? Or not using docker?
Hi UnevenDolphin73
In theory it "might" work. I have to admit that personally I'm not a fan of what Amazon did to Mongo, i.e. forking their code base and selling it as a service, just bad open-source practice
(The main issue might be API calls that might not fully match)
wdyt?
Hi ScantChimpanzee51
btw: this seems like an S3 internal error
https://github.com/boto/s3transfer/issues/197
Out of curiosity, if Task flush worked, when did you get the error, at the end of the process ?
So without the flush I got the error apparently at the very end of the script -
Yes... it's a Python thing: background threads might get killed in random order, so when something needs a background thread that has already died you get this error, which basically means you need to do the work in the calling thread.
This actually explains why calling Flush solved the issue.
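Something like this at the end of the script should be enough (a minimal sketch; the project/task names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="my-experiment")  # placeholder names
# ... training / logging code ...
# make sure all background reporting is sent before the interpreter starts tearing threads down
task.flush(wait_for_uploads=True)
task.close()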
Nice!
Hi DilapidatedDucks58
eg, we want max validation accuracy and all other metric values for the corresponding epoch
Is this the equivalent of nested sort ?
Wouldn't you get the requested behavior if you add all metric columns but sort based on the "accuracy" column ?
Okay, I think I lost you...
DilapidatedDucks58 you mean detect at which "iteration" the max value was reported, and then extract all the other metrics for that iteration ?
NastySeahorse61 it might be that the frequency it tests the metric storage is only once a day (or maybe half a day), let me see if I can ask around
(just making sure you can still log in to the platform?)
JuicyDog96 Yes please!
Let me check what's the status with the docs repository, and I'll get back to you soon
Awesome! Any way to hear the talk w/o registering for the whole conference?
CloudySwallow27 Anyway, we will make sure we upload the talk to the ClearML YouTube channel after the talk
So to conclude: it has to be executed manually first, then with trains agent?
Yes. That said, as you mentioned, you can always edit the "installed packages" manually once; from that point you are basically cloning the experiment, including the "installed packages", so it should work if the original worked.
Make sense ?
Hi RotundHedgehog76
Notice that the "queued" is on the state of the Task, as well as the tag
We tried to enqueue the stopped task at the particular queue and we added the particular tag
What do you mean by specific queue? This will trigger on any Queued Task with the 'particular-tag'?
This depends on how you spun up the server; basically, as long as you configure the clients (i.e. Python clients) correctly, there is no issue.
But the auto-generated configuration might be off (in the UI, when you create credentials, it tells clearml-init where the server is and the ports).
I would actually recommend subdomains if this is possible
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#sub-domain-configuration
wdyt?
For example, for some of our models we create PDF reports that we save in a folder on the NFS disk
Oh, why not as artifacts? At least you will be able to access them from the web UI, and avoid NFS credential hell
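Something along these lines (a sketch; the project/task names and file path are placeholders):
from clearml import Task

task = Task.init(project_name="reports", task_name="model-report")  # placeholder names
# upload the generated PDF as an artifact so it is viewable/downloadable from the web UI
task.upload_artifact(name="pdf_report", artifact_object="/path/to/report.pdf")  # placeholder path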
Regarding clearml datasets:
https://www.youtube.com/watch?v=S2pz9jn26uI
DilapidatedDucks58 I see ...
This might be more complicated than one would imagine. A simple solution might be to store a snapshot of the values every time we reach a new maximum; a quick hack might be to add it as text on one of the Task's parameters or properties (which we can later add to the table as a custom column).
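A rough sketch of that hack (the training loop, metric names and project/task names here are made up):
from clearml import Task

task = Task.init(project_name="examples", task_name="best-epoch-snapshot")  # placeholder names
best_acc = 0.0
for epoch in range(100):
    acc, loss = train_one_epoch()  # hypothetical training step returning the epoch metrics
    if acc > best_acc:
        best_acc = acc
        # snapshot all metrics at the new maximum; user properties can later be added as custom columns
        task.set_user_properties(best_epoch=str(epoch), best_accuracy=str(acc), best_loss=str(loss))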
wdyt?
That should work
BTW, you might play around with "clearml-agent execute --id <task_id_here>"
This will basically clone the code, create a venv with the python packages, apply uncommitted changes and run the actual code. This could be a replacement for your bash script. (Notice it means that you need to clone the Task in the UI, then you can change parameters, then run the agent manually in SLURM and it will take the params from the UI.)
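Roughly, the programmatic equivalent of that clone-then-execute flow (project/task/parameter names are placeholders; the clone can also be done from the UI as described above):
from clearml import Task

template = Task.get_task(project_name="examples", task_name="my-experiment")  # placeholder names
cloned = Task.clone(source_task=template, name="my-experiment (clone)")
cloned.set_parameter("General/learning_rate", 0.001)  # placeholder parameter
print(cloned.id)
# then on the SLURM node run:  clearml-agent execute --id <the printed task id>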
I guess it's on me to check whether this slowdown is negligible or not
Usually performance is negligible, especially with GPU
But if you really want the best:
Add --security-opt seccomp=unconfined to the extra_docker_arguments
See details:
https://betterprogramming.pub/faster-python-in-docker-d1a71a9b9917
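In the agent's clearml.conf that would look something like this (a sketch):
agent {
    extra_docker_arguments: ["--security-opt", "seccomp=unconfined"]
}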
Yes I do have a GOOGLE_APPLICATION_CREDENTIALS environment variable set, but nowhere do we save anything to GCS. The only usage is in the code which reads from BigQuery
Are you certain you have no artifacts on GS?
Are you saying that if GOOGLE_APPLICATION_CREDENTIALS is set and clearml.conf contains no "project" section, it crashed when starting?
Is this consistent on the same file? can you provide a code snippet to reproduce (or understand the flow) ?
Could it be two machines are accessing the same cache folder ?
ScaryKoala63
When it fails, what's the number of files you have in /home/developer/.clearml/cache/storage_manager/global/ ?
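For example, a quick way to count them:
import os

cache_dir = "/home/developer/.clearml/cache/storage_manager/global/"
print(len(os.listdir(cache_dir)))  # compare against the configured cache entry limit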
Are you suggesting the conf file did not set the default size? It sounds like a bug, can you verify?
And when you retrieve just this file, is it working?
(Maybe for some reason the file is corrupted?)
Hi ScaryKoala63
Sure, add the following to your clearml.conf:
sdk.storage.cache.default_cache_manager_size = 400
I think you are correct, it seems like for some reason you hit the cache limit, and a previous entry was deleted
PanickyMoth78
Is it limited to ... accounts?
unfortunately, yes, but I'm sure sales will be able to hook you up ...
Hi TrickyRaccoon92
... would any running experiment keep a cache of to-be-sent-data, fail the experiment, or continue the run, skipping the recordings until the server is back up?
Basically they will keep trying to send data to the server until it is up again (you should not lose any of the logs)
Is there any clever functionality for dumping experiment data to external storage to avoid filling up the server?
You mean artifacts or the database ?