my agents are all 0.16 and I install trains 0.16rc2 in each Task being executed by the agent
How exactly is the clearml-agent killing the task?
I have two controller tasks running in parallel in the trains-agent services queue
Hi SoggyFrog26 , https://github.com/allegroai/clearml/blob/master/docs/datasets.md
I am using clearml_agent v1.0.0 and clearml 0.17.5 btw
This is no coincidence - any data versioning tool you will find is somewhat close to how git works (dvc, etc.), since they all aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out imo is the straightforward CLI combined with the Pythonic API that lets you register/retrieve datasets very easily
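A minimal sketch of that Pythonic flow (the dataset/project names and paths here are just placeholders):

from clearml import Dataset

# register a new dataset version
ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets_demo")
ds.add_files(path="data/")   # stage local files
ds.upload()                  # push them to the configured storage
ds.finalize()                # lock this version

# retrieve it later as a cached local copy
local_path = Dataset.get(dataset_name="my_dataset", dataset_project="datasets_demo").get_local_copy()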
I also discovered https://h2oai.github.io/wave/ last week, would be awesome to be able to deploy it in the same manner
If the reporting is done in a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could that be the case?
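Roughly what I have in mind, as a hypothetical repro sketch (project/task names made up):

import multiprocessing as mp
from clearml import Task

def report(logger):
    # if reporting happens here, does the offset set in the parent still apply?
    for i in range(10):
        logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i)

if __name__ == "__main__":
    task = Task.init(project_name="debug", task_name="subprocess_reporting")
    task.set_initial_iteration(0)   # called in the main process only
    p = mp.Process(target=report, args=(task.get_logger(),))
    p.start()
    p.join()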
I now have a different question: when installing torch from wheel files, I am guaranteed to have the corresponding cuda library and cudnn bundled with it, right?
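For my own sanity I check what the wheel reports like this (not authoritative, just a quick check):

import torch

print(torch.__version__)               # e.g. something like 1.8.1+cu111
print(torch.version.cuda)              # CUDA runtime the wheel was built against
print(torch.backends.cudnn.version())  # bundled cuDNN version
print(torch.cuda.is_available())       # whether the host driver can actually be used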
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the aws creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
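For reference, this is roughly how I passed them, using the standard boto3 env vars (values redacted, region is just an example):

import os

# standard boto3/botocore environment variables, set before anything touches S3
os.environ["AWS_ACCESS_KEY_ID"] = "<redacted>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<redacted>"
os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"  # example region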
If I don’t start clearml-session, I can easily connect to the agent, so clearml-session is doing something that messes up the ssh config and prevents me from ssh-ing into the agent afterwards
AgitatedDove14 Should I create an issue for this to keep track of it?
Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
I reindexed only the logs to a new index afterwards; I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings
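Something along these lines with the Python client (the index names are placeholders for the actual events indices):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# copy documents into a new index created beforehand with the corrected mappings
es.reindex(
    body={
        "source": {"index": "events-training_stats_scalar-old"},
        "dest": {"index": "events-training_stats_scalar-new"},
    },
    wait_for_completion=False,  # large index: let it run as a background task
)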
I am trying to upload an artifact during the execution
awesome! Unfortunately, calling artifact["foo"].get() gave me: Could not retrieve a local copy of artifact foo, failed downloading file:///checkpoints/test_task/test_2.fgjeo3b9f5b44ca193a68011c62841bf/artifacts/foo/foo.json
It tries to get it from the local storage, but the json is stored in s3 (it does exist), and I did create both tasks specifying the correct output_uri (to s3)
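For context, the pattern on both sides is roughly this (bucket and names are placeholders):

from clearml import Task

# producer side: output_uri should route artifacts to S3
task = Task.init(project_name="debug", task_name="producer", output_uri="s3://my-bucket/checkpoints")
task.upload_artifact(name="foo", artifact_object={"bar": 42})

# consumer side: fetch the artifact from the other task
producer = Task.get_task(project_name="debug", task_name="producer")
foo = producer.artifacts["foo"].get()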
When installed with http://get.docker.com , it works
I am running on bare metal, and cuda seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39
I think the best case scenario would be that ClearML maintains a GitHub Action that sets up a dummy clearml-server, so that anyone can use it as a basis for running their tests: they would just have to point their code at the local server spun up in the action and could test everything seamlessly, wdyt?
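On the test side it could be as simple as overriding the server endpoints via env vars before Task.init (a sketch, assuming these are the right variables and default ports):

import os
from clearml import Task

# point the SDK at the locally spun-up dummy server instead of the real one
os.environ["CLEARML_API_HOST"] = "http://localhost:8008"
os.environ["CLEARML_WEB_HOST"] = "http://localhost:8080"
os.environ["CLEARML_FILES_HOST"] = "http://localhost:8081"

task = Task.init(project_name="ci_tests", task_name="smoke_test")
task.close()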
So there will be no concurrent cached files access in the cache dir?
ok, so even if that guy is attached, it doesn’t report the scalars
Is there any logic on the server side that could change the iteration number?
mmmh good point actually, I didn’t think about it
Oh nice, thanks for pointing this out!
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent mentioned there used the output from nvcc (2) before checking the nvidia driver version (1)
So it looks like it tries to register a batch of 500 documents