Mmmh unfortunately not easily… I will try to debug deeper today, is there a way to resume a task from code to debug locally?
Something like replacing `Task.init` with `Task.get_task`, so that `Task.current_task` is the same task as the output of `Task.get_task`
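Roughly what I have in mind, just a sketch (the task id is a placeholder, and the last line is the behaviour I'd like to get, not what happens today):

```python
from clearml import Task

# replace Task.init(...) with something like this when debugging locally
# (the task id below is just a placeholder)
task = Task.get_task(task_id="<id-of-the-task-to-resume>")

# what I'm after: the rest of the code keeps calling Task.current_task()
# and gets back the same task as the get_task call above
print(Task.current_task() is task)  # today this is not the case
```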
Yes that's what I did initially, but eventually I decided that it's too much added complexity for nothing really. I'd rather drop omegaconf and, if one day clearml supports it out of the box, take advantage of it
If I don't start clearml-session, I can easily connect to the agent, so clearml-session is doing something that messes up the ssh config and prevents me from ssh-ing into the agent afterwards
So when I create a task using `task = Task.init(project_name=config.get("project_name"), task_name=config.get("task_name"), task_type=Task.TaskTypes.training, output_uri="s3://my-bucket")` locally, the artifact is correctly logged remotely, but when I create the task remotely (from an agent) the artifact is logged locally (on the agent machine, not on s3)
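For context, the full flow is roughly this (the config dict, bucket name and artifact name here are just placeholders for my real ones):

```python
from clearml import Task

# placeholder config standing in for my real one
config = {"project_name": "my-project", "task_name": "my-task"}

task = Task.init(
    project_name=config.get("project_name"),
    task_name=config.get("task_name"),
    task_type=Task.TaskTypes.training,
    output_uri="s3://my-bucket",  # where I expect artifacts to end up
)

# later in the script, the artifact upload itself (name/object are placeholders)
task.upload_artifact(name="my-artifact", artifact_object={"some": "data"})
# run locally this lands on s3 as expected; when an agent executes the task,
# the artifact stays on the agent's local disk instead
```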
because I cannot locate libcudart or because cudnn_version = 0?
Oh I wasn't aware of that new implementation, was it introduced silently? I don't remember reading it in the release notes! To answer your question: no, for gcp I used the old version, but for azure I will use this one, and maybe send a PR if the code is clean 👍
I don't think it is, I was rather wondering how you handled it, to understand potential sources of slowdown in the training code
Ok, I got the following error when uploading the table as an artifact: `ValueError('Task object can only be updated if created or in_progress')`
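The call that raises it is basically the one below; I'll print the task status right before the upload to confirm what state it is in (the artifact name and table content are placeholders):

```python
import pandas as pd
from clearml import Task

task = Task.current_task()

# quick sanity check: what state is the task in when the upload happens?
print(task.get_status())  # the error suggests it is no longer created/in_progress

# the upload that raises the ValueError (name and table content are placeholders)
table = pd.DataFrame({"epoch": [1, 2], "metric": [0.1, 0.2]})
task.upload_artifact(name="results-table", artifact_object=table)
```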
Is it safe to turn off replication while a reindex operation is happening? The reindexing is rather slow and I am wondering if turning off replication will speed up the process
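Concretely I had something like this in mind, restoring the replicas once the reindex is done (the host and index name are placeholders):

```python
import requests

ES_HOST = "http://localhost:9200"  # placeholder for the ES endpoint
INDEX = "my-index"                 # placeholder for the index being reindexed

# disable replication for the index while the reindex runs
requests.put(f"{ES_HOST}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 0}})

# ... run the reindex ...

# restore replication once the reindex is done
requests.put(f"{ES_HOST}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 1}})
```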
If I manually call `report_matplotlib_figure`, yes. If I don't (just create the figure), there is no memory leak
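To be concrete, the reproduction I'm running is roughly this (project/task names and the figure content are placeholders):

```python
import matplotlib.pyplot as plt
from clearml import Task

# project/task names are placeholders
task = Task.init(project_name="debug", task_name="matplotlib-leak")
logger = task.get_logger()

for i in range(1000):
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])

    # with this call memory keeps growing; commenting it out removes the leak
    logger.report_matplotlib_figure(
        title="my-figure", series="debug", figure=fig, iteration=i
    )

    plt.close(fig)  # close the figure every iteration so matplotlib itself is not the culprit
```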
To clarify: trains-agent runs a single service Task only
Ok I have a very different problem now: I did the following to restart the ES cluster: `docker-compose down` followed by `docker-compose up -d`. And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, which it had always done so far.
mmh it looks like what I was looking for, I will give it a try 🙂
Usually one or two tags. Indeed, task ids are not so convenient, but only because they are not displayed on the page, so I have to go back to another page to check the ID of each experiment. Maybe just showing the ID of each experiment in the SCALAR page would already be great, wdyt?
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
ClearML has a `task.set_initial_iteration`, I used it as such:
`checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")`
`Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)`
`task.set_initial_iteration(engine.state.iteration)`
But still the same issue, I am not sure whether I use it correctly and if it's a bug or not, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Something was triggered, you can see the CPU usage starting right when the instance became unresponsive - maybe a merge operation from ES?
So it looks like it tries to register a batch of 500 documents
AgitatedDove14 Yes that might work, also the first one (with conda) might work as well, I will give it a try, thanks!
This works well when I run the agent in virtualenv mode (removing `--docker`)
because at some point it introduces too much overhead I guess
Hi TimelyPenguin76,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
Hi SuccessfulKoala55, how can I know if I am logged in in this free access mode? I assume I am, since on the login page I only see a login field, not a password field