my agents are all 0.16 and I install trains 0.16rc2 in each Task being executed by the agent
How exactly is the clearml-agent killing the task?
I have two controller tasks running in parallel in the trains-agent services queue
Hi SoggyFrog26 , https://github.com/allegroai/clearml/blob/master/docs/datasets.md
I am using clearml_agent v1.0.0 and clearml 0.17.5 btw
This is no coincidence - any data versioning tool you will find is somewhat close to how git works (DVC, etc.), since they all aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out imo is the straightforward CLI combined with the Pythonic API that allows you to register/retrieve datasets very easily
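For reference, a minimal sketch of that register/retrieve flow with the clearml Dataset API (the project/dataset names and local paths below are placeholders):

```python
from clearml import Dataset

# register: create a dataset version, add local files, upload and finalize it
ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files(path="data/raw")   # folder (or single file) to version
ds.upload()                     # push the files to the storage backend
ds.finalize()                   # mark this version as immutable

# retrieve: get a read-only local copy of the dataset
local_path = Dataset.get(
    dataset_name="my_dataset", dataset_project="my_project"
).get_local_copy()
print(local_path)
```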
I also discovered https://h2oai.github.io/wave/ last week, would be awesome to be able to deploy it in the same manner
If the reporting is done in a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could that be the case?
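For context, this is roughly how the call is made (project/task names are placeholders); the open question is whether the offset also applies to reporting that happens in a child process:

```python
from clearml import Task

# set the iteration offset right after Task.init(), in the main process,
# before any training/reporting subprocess is spawned
task = Task.init(project_name="my_project", task_name="my_task")
task.set_initial_iteration(0)
```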
I now have a different question: when installing torch from wheel files, am I guaranteed to have the corresponding cuda library and cudnn bundled together with it?
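A quick way to check what a given wheel actually bundles (the official torch wheels ship their own CUDA runtime and cuDNN, but a matching NVIDIA driver still has to be installed on the host):

```python
import torch

print(torch.__version__)               # e.g. 1.3.1
print(torch.version.cuda)              # CUDA runtime the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN version bundled with the wheel
print(torch.cuda.is_available())       # False if no compatible NVIDIA driver is present
```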
After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the aws creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
I am trying to upload an artifact during the execution
awesome! Unfortunately, calling artifact["foo"].get() gave me:
Could not retrieve a local copy of artifact foo, failed downloading file:///checkpoints/test_task/test_2.fgjeo3b9f5b44ca193a68011c62841bf/artifacts/foo/foo.json
It tries to get it from local storage, but the json is stored in s3 (it does exist) and I did create both tasks specifying the correct output_uri (pointing to s3)
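For reference, a rough sketch of the upload/retrieve flow being described, with placeholder project/task names and bucket; the expectation is that with output_uri set, the artifact resolves to the S3 copy rather than a file:// path:

```python
from clearml import Task

# producer side: set output_uri at init time so artifacts go to S3
task = Task.init(
    project_name="my_project",            # placeholder
    task_name="artifact_demo",            # placeholder
    output_uri="s3://my-bucket/clearml",  # placeholder bucket
)
task.upload_artifact(name="foo", artifact_object={"answer": 42})
task.flush(wait_for_uploads=True)         # make sure the upload actually finished

# consumer side (a separate script/task): fetch the artifact back
source = Task.get_task(project_name="my_project", task_name="artifact_demo")
foo = source.artifacts["foo"].get()       # expected to resolve to the S3 copy, not file://
```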
When installed with http://get.docker.com , it works
I think the best-case scenario would be for ClearML to maintain a GitHub Action that sets up a dummy clearml-server, so that anyone can use it as a basis to run their tests: they would just have to point the server URL at the local instance started by the action and could seamlessly test all their code, wdyt?
So there will be no concurrent cached files access in the cache dir?
ok, so even if that guy is attached, it doesn’t report the scalars
Is there any logic on the server side that could change the iteration number?
mmmh good point actually, I didn’t think about it
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent mentioned used the output from nvcc (2) before checking the nvidia driver version (1)
AgitatedDove14 any chance you found something interesting? 🙂
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it’s not possible to change this value after the index creation, is it true?
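Assuming it indeed can't be changed in place, the usual Elasticsearch workaround is to create a new index with the desired shard count and reindex into it; a hedged sketch (index names and host are placeholders, and the clearml-server data should be backed up first):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# create a new index with 2 primary shards (number_of_shards is fixed at creation time)
es.indices.create(
    index="events-new",
    body={"settings": {"number_of_shards": 2, "number_of_replicas": 0}},
)

# copy the documents from the old single-shard index into the new one
es.reindex(
    body={"source": {"index": "events-old"}, "dest": {"index": "events-new"}},
    wait_for_completion=True,
)
```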
hoo that's cool! I could place torch==1.3.1 there
when can we expect the next self hosted release btw?
Traceback (most recent call last):
  File "devops/train.py", line 73, in <module>
    train(parse_args)
  File "devops/train.py", line 37, in train
    train_task.get_logger().set_default_upload_destination(args.artifacts + '/clearml_debug_images/')
  File "/home/machine/miniconda3/envs/py36/lib/python3.6/site-packages/clearml/logger.py", line 1038, in set_default_upload_destination
    uri = storage.verify_upload(folder_uri=uri)
  File "/home/machine/miniconda3/envs/py36/lib/python3.6/site...