AgitatedDove14 I would like to know if I should wait for the next release of trains or if I can already start implementing Azure support
I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt? (edited)
That sounds great!
I checked the commit date and branch, went to all the experiments, and scrolled until I found the experiment
Looking at the source code, it seems like I should do: data_processing_task._artifact_manager.flush() to make sure I have the latest version of the artifacts in the task, right?
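For reference, this is a minimal sketch of what I mean (assuming data_processing_task is a trains/clearml Task object and that the private _artifact_manager attribute is still there):
```python
# Sketch: flush pending artifact uploads so that reading the artifacts
# afterwards reflects the latest versions. _artifact_manager is a private
# attribute, so this is an assumption based on the current source code.
data_processing_task._artifact_manager.flush()
latest_artifacts = data_processing_task.artifacts  # read-only dict of Artifact objects
```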
I actually need to be able to overwrite files, so in my case it makes sense to give the DeleteObject permission in S3. But for other cases, why not simply catch this error, display a warning to the user, and store internally that delete is not possible?
So the new EventsIterator is responsible for the bug.
Is there a way for me to easily force the WebUI to always use the previous endpoint (v1.7)? I saw in the diff changes v1.1.0 > v1.2.0 that the ES version was bumped to 7.16.2. I am using an external ES cluster, and its version is still 7.6.2. Can it be that the incompatibility comes from here? I'll update the cluster to make sure it's not the case
That would be awesome, yes, though from my side I have zero knowledge of the pip codebase
Thanks AgitatedDove14 !
What would be the exact content of NVIDIA_VISIBLE_DEVICES if I run the following command? trains-agent daemon --gpus 0,1 --queue default &
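For context, this is how I would check it from inside a task picked up by that agent (just a sketch; I assume the agent exposes the GPU mask through this environment variable):
```python
# Sketch: print the GPU mask seen by the task process started by the agent.
# With "--gpus 0,1" I would expect something like "0,1", but that is an assumption.
import os

print(os.environ.get("NVIDIA_VISIBLE_DEVICES"))
```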
Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version the agent uses to decide which torch wheel to download
Regarding the cuda check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from nvidia-smi interface, worth checking though ...
Ok, but when nvcc is not ava...
Since it fails on the first machine (clearml-server), I try to run it on another, on-prem machine (also used as an agent)
if I want to resume a training on multi gpu, I will need to call this function on each process to send the weights to each gpu
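Something like this is what I have in mind (a minimal sketch assuming PyTorch DDP, with the process group already initialized and the rank used as the GPU index):
```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def resume_on_rank(rank, model, checkpoint_path):
    # Each process loads the checkpoint and maps the weights onto its own GPU.
    state = torch.load(checkpoint_path, map_location=f"cuda:{rank}")
    model.load_state_dict(state["model"])  # assumes the checkpoint stores a "model" entry
    model.to(rank)
    # Wrap in DDP only after the weights are on the right device.
    return DDP(model, device_ids=[rank])
```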
on /data or /opt/clearml? these are two different disks
As you can see, more hard waiting (initial sleep), and then before each apt action, make sure there is no lock
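In pseudo-Python, the logic is roughly this (a sketch assuming a Debian/Ubuntu host where fuser is available and the usual dpkg lock file path):
```python
import subprocess
import time

def apt_install(*packages, initial_sleep=30, poll=5):
    # Hard wait first, then poll until nothing holds the dpkg lock.
    time.sleep(initial_sleep)
    while subprocess.call(
        ["fuser", "/var/lib/dpkg/lock-frontend"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ) == 0:  # fuser returns 0 while some process holds the lock
        time.sleep(poll)
    subprocess.check_call(["apt-get", "install", "-y", *packages])
```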
So actually I don't need to play with this limit, I am OK with the default for now
That's why I suspected trains was installing a different version than the one I expected
Selecting multiple lines still works; you need to shift + click on the checkbox
DeterminedCrab71 Please check this screen recording
(Btw the instance listed in the console has no name, is that normal?)
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
AgitatedDove14 yes but I don't see in the docs how to attach it to the logger of the earlystopping handler
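To illustrate what I am after (a sketch assuming the pytorch-ignite EarlyStopping handler, which logs through a standard Python logger; trainer and score_fn are placeholders):
```python
import logging
from ignite.handlers import EarlyStopping

# trainer and score_fn are assumed to exist already.
early_stopping = EarlyStopping(patience=5, score_function=score_fn, trainer=trainer)
# Attach a plain logging handler so the EarlyStopping messages are actually emitted.
early_stopping.logger.addHandler(logging.StreamHandler())
early_stopping.logger.setLevel(logging.INFO)
```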
Hoo, that's cool! I could place torch==1.3.1 there
Yes, I would like to update all references to the old bucket, unfortunately... I think I'll simply delete the old s3 bucket, wait for its name to be available again, recreate it on the other aws account and move the data there. This way I don't have to mess with clearml data - I am afraid to do something wrong and lose data
torch==1.7.1 git+ .
Thanks a lot for the solution SuccessfulKoala55 ! I'll try that if the "delete old bucket, wait for its name to be available, recreate it with the other aws account, transfer the data back" solution fails
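For the "transfer the data back" part, I was thinking of something along these lines (a sketch using boto3 with hypothetical bucket names, assuming credentials with read access to the old bucket and write access to the new one):
```python
import boto3

s3 = boto3.resource("s3")
src_bucket = s3.Bucket("old-bucket-name")   # hypothetical names
dst_bucket_name = "new-bucket-name"

for obj in src_bucket.objects.all():
    # Server-side copy of every object into the recreated bucket, keeping the same keys.
    s3.meta.client.copy(
        {"Bucket": src_bucket.name, "Key": obj.key},
        dst_bucket_name,
        obj.key,
    )
```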
Hoo I found:
user@trains-agent-1: ps -ax
5199 ?  Sl  29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ?  Sl  30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
Hi SuccessfulKoala55 , will I be able to update all references to the old s3 bucket using this command?
SuccessfulKoala55 They do have the right filepath, e.g.: https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png


