Hi CostlyOstrich36, most of the time I want to compare two experiments in the DEBUG SAMPLES tab, so if I click on one sample to enlarge it, I cannot see the others. Also, once I close the panel, the iteration number is not updated.
Thanks for the hack! The use case is the following: I have a controller that creates training/validation/testing tasks by cloning (so that the parent task id is properly set to the controller). Otherwise I could simply create these tasks with Task.init, but then I would need to set the parent task manually for each of them, probably with a similar hack, right?
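For reference, a minimal sketch of the cloning approach (the template task id, project name, and queue name are placeholders; Task.clone accepts a parent argument):
```python
from clearml import Task

# The controller task under which the training/validation/testing tasks are grouped
controller = Task.init(project_name="my_project", task_name="controller")

# "template_task_id" and the names below are placeholders for illustration
template = Task.get_task(task_id="template_task_id")
for split in ("training", "validation", "testing"):
    # Cloning with `parent` makes the new task point back to the controller
    child = Task.clone(source_task=template, name=f"{split}-task", parent=controller.id)
    Task.enqueue(child, queue_name="default")
```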
I am not sure what you mean by "unless the domain is different"? Personal Access Tokens are designed such that, to allow cloning a private repo, the user has to give the PAT full access to repos, including public ones. So it should also work with all other git repos.
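For reference, a sketch of how the PAT can be passed to the agent via clearml.conf (the user name and token values are placeholders):
```
agent {
    # The PAT is used as the git password when the agent clones the repo
    git_user: "my-git-user"
    git_pass: "my-personal-access-token"
}
```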
with the CLI, in a conda env located in /data
oh wait, actually I am wrong
Setting it after the training correctly updated the task and I was able to store artifacts remotely
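For reference, a minimal sketch of the sequence that worked, assuming "it" refers to the task's output_uri (the bucket path and artifact names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="training")
# ... training runs here ...

# Setting the destination only after training still updated the task,
# and the artifact was stored remotely
task.output_uri = "s3://my-bucket/artifacts"
task.upload_artifact(name="model-weights", artifact_object="weights.pkl")
```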
I should also rename the /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 folder to /opt/trains/data/elastic before running the migration tool, right?
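i.e. something like this, assuming the default data path (not verified):
```
mv /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 /opt/trains/data/elastic
```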
Here I have to do it for each task; is there a way to do it for all tasks at once?
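To illustrate what I mean by "all at once", a sketch using Task.get_tasks (the project name and the per-task operation are placeholders):
```python
from clearml import Task

# Apply the same change to every task in the project instead of one by one;
# set_parameter here stands in for whatever the per-task operation is
for task in Task.get_tasks(project_name="my_project"):
    task.set_parameter("General/batch_size", 64)
```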
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
Not of the ES cluster; I only created a backup of the clearml-server instance disk. I didn’t think there could be a problem with ES…
But that was too complicated, I found an easier approach
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist
Thanks! Corrected both, now it’s building
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?
no, one worker (trains-agent-1) "forgets from time to time" the current experiment it is running and picks up another experiment on top of the one it is currently running
with what I shared above, I now get:
docker: Error response from daemon: network 'host' not found.
Alright, so the steps would be:
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
That would create a base docker image base_env_services. Then how should I ensure that trains-agent uses that base image for the services queue? My guess is:
trains-agent daemon --services-mode --detached --queue services --create-queue --docker base_env_services --cpu-only
Would that work?
AgitatedDove14 So I’ll just replace task = clearml.Task.get_task(clearml.config.get_remote_task_id()) with Task.init() and wait for your fix 🙂
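i.e. swapping this, if I understand the fix correctly (project/task names are placeholders; when the script is executed by an agent, Task.init reattaches to the already-created task):
```python
from clearml import Task

# Before: fetching the remote task explicitly
# task = clearml.Task.get_task(clearml.config.get_remote_task_id())

# After: Task.init reconnects to the current task when run by an agent
task = Task.init(project_name="my_project", task_name="my_task")
```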
Ha nice, makes perfect sense thanks AgitatedDove14 !
Some more context: the second experiment finished and now, in the UI, in the Workers & Queues tab, I randomly see either:
trains-agent-1 | - | - | - | ...
or, after refreshing the page:
trains-agent-1 | long-experiment | 12h | 72000 |
They are, but this doesn’t work - I guess it’s because temporary IAM credentials have an extra session token that should be passed as well, but there is no such option in the web UI, right?
The simple workaround I imagined (not tested) at the moment is to sleep for 2 minutes after closing the task, to keep the clearml-agent busy until the instance is shut down:
self.clearml_task.mark_stopped()
self.clearml_task.close()
time.sleep(120)  # Prevent the agent from picking up new tasks
I am sorry that the information I can give is not very precise, but it’s the best I can do. Is this bug happening only to me?
did you try with another availability zone?