mmmh good point actually, I didn't think about it
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent mentioned earlier used the output from nvcc (2) before checking the nvidia driver version (1)
AgitatedDove14 any chance you found something interesting?
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it's not possible to change this value after the index creation, is that true?
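For reference, a minimal sketch of the usual workaround when the shard count is fixed at creation time: create a new index with the desired settings and reindex into it. This assumes the elasticsearch-py 7.x client; the index names ("events-old", "events-new") and the cluster URL are placeholders, not the actual ClearML index names.

```python
# Sketch: change shard count by reindexing into a new index,
# since index.number_of_shards cannot be changed after creation.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

# 1. Create a new index with the desired number of shards
es.indices.create(
    index="events-new",
    body={"settings": {"index": {"number_of_shards": 2, "number_of_replicas": 0}}},
)

# 2. Copy all documents from the old index into the new one
es.reindex(
    body={"source": {"index": "events-old"}, "dest": {"index": "events-new"}},
    wait_for_completion=True,
)

# 3. Once the copy is verified, the old index can be deleted or aliased over
# es.indices.delete(index="events-old")
```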
hoo that's cool! I could place torch==1.3.1 there
when can we expect the next self-hosted release btw?
Traceback (most recent call last):
File "devops/train.py", line 73, in <module>
train(parse_args)
File "devops/train.py", line 37, in train
train_task.get_logger().set_default_upload_destination(args.artifacts + '/clearml_debug_images/')
File "/home/machine/miniconda3/envs/py36/lib/python3.6/site-packages/clearml/logger.py", line 1038, in set_default_upload_destination
uri = storage.verify_upload(folder_uri=uri)
File "/home/machine/miniconda3/envs/py36/lib/python3.6/site...
Hi DeterminedCrab71 Version: 1.1.1-135 • 1.1.1 • 2.14
Still failing with the same error
Thanks for the help SuccessfulKoala55 , the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to do docker-compose down & docker-compose up -d afterwards, and not docker-compose restart
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
Thanks for the explanations,
Yes that was the case. This is also what I would think, although I double-checked yesterday:
I create a task on my local machine with trains 0.16.2rc0
This task calls task.execute_remotely()
The task is sent to an agent running with 0.16
The agent installs trains 0.16.2rc0
The agent runs the task, clones it and enqueues the cloned task
The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
When I clone the task manually usin...
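For reference, a minimal sketch of the execute_remotely flow described above; the project, task, and queue names are placeholders, and the hyper-parameters connection is just the usual pattern, not taken from the actual script.

```python
# Sketch: create the task locally, then hand execution over to an agent.
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
args = parser.parse_args()

task = Task.init(project_name="my-project", task_name="remote-run")  # placeholders
task.connect(args)  # registers the args as the task's hyper-parameters

# Everything after this call runs on the agent; the local process exits here.
task.execute_remotely(queue_name="default", exit_process=True)

# ... training code using args.lr, checkpoints, etc.
```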
My use case is: on a spot instance marked for termination by AWS (2-minute warning), I want to close the current task and prevent the clearml-agent from picking up a new task afterwards.
If I manually call report_matplotlib_figure, yes. If I don't (I just create the figure), there is no memory leak.
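A minimal sketch of the two cases compared above, in case it helps reproduce: the loop count and figure contents are arbitrary, and only the presence of the report_matplotlib_figure call differs between the cases.

```python
# Sketch: memory-leak repro comparing explicit figure reporting vs. only
# creating the figure. Project/task names are placeholders.
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="debug", task_name="matplotlib-leak")
logger = task.get_logger()

for i in range(1000):
    fig, ax = plt.subplots()
    ax.plot(range(10))
    # Case 1 (leaks in my setup): explicitly report the figure to ClearML
    logger.report_matplotlib_figure(title="test", series="test", iteration=i, figure=fig)
    # Case 2 (no leak): comment out the report call above and only create/close
    plt.close(fig)
```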
To clarify: trains-agent runs a single service Task only
Ok I have a very different problem now: I did the following to restart the ES cluster:
docker-compose down
docker-compose up -d
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, which had always been the case so far.
mmh it looks like what I was looking for, I will give it a try
Usually one or two tags; indeed, task IDs are not so convenient, but only because they are not displayed on the page, so I have to go back to another page to check the ID of each experiment. Maybe just showing the ID of each experiment in the SCALARS page would already be great, wdyt?
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
ClearML has a task.set_initial_iteration, I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
But still the same issue, I am not sure whether I use it correctly and if it's a bug or not, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Something was triggered, you can see the CPU usage starting right when the instance became unresponsive - maybe a merge operation from ES?
So it looks like it tries to register a batch of 500 documents
AgitatedDove14 Yes that might work, also the first one (with conda) might work as well, I will give it a try, thanks!
This works well when I run the agent in virtualenv mode (i.e. removing --docker)