In the comparison the problem will be the same, right? If I choose last/min/max values, it won’t tell me the corresponding values for the other metrics. I could switch to graphs, group by metric and look manually for the corresponding values, but that quickly becomes cumbersome as the number of compared experiments grows
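To illustrate what I am after, here is roughly what I do by hand today, written with the SDK (from memory, so treat the scalar-history call and its return shape as an assumption):
```python
from clearml import Task

def value_at_best(task_id, target_metric, target_series, other_metric, other_series):
    """Find the iteration where target_metric/target_series is minimal,
    then report what other_metric/other_series was at that same iteration."""
    task = Task.get_task(task_id=task_id)
    scalars = task.get_reported_scalars()  # assumed shape: {title: {series: {"x": [...], "y": [...]}}}

    target = scalars[target_metric][target_series]
    best_idx = min(range(len(target["y"])), key=lambda i: target["y"][i])
    best_iter = target["x"][best_idx]

    other = scalars[other_metric][other_series]
    # assumes both metrics were reported at the same iterations
    other_value = other["y"][other["x"].index(best_iter)]
    return best_iter, target["y"][best_idx], other_value

# hypothetical usage
# print(value_at_best("abc123", "val_loss", "val_loss", "val_accuracy", "val_accuracy"))
```
Doing this per experiment and per metric pair is exactly the manual lookup that gets painful in the UI.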
super, thanks SuccessfulKoala55 !
Yes AnxiousSeal95, a stopped instance means you don’t pay for it, only for its storage, as described in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html . So AgitatedDove14, increasing the IDLE timeout would still make me pay for the instances while they are idle.
Do you get stopped instances instantly when you ask for them?
Well, that’s a good question. That’s what I observed some time ago, but according to their https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/...
I am looking for a way to gracefully stop the task (clean up artifacts, shut down the backend service) on the agent
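Roughly what I have in mind, as a minimal sketch (assuming the agent delivers a SIGTERM to the task process when aborting it, which I have not verified, and with the cleanup itself reduced to a placeholder):
```python
import signal
import sys

def cleanup():
    # Placeholder: in the real task this would delete temporary artifacts
    # and shut down the backend service the task started.
    print("cleaning up artifacts and stopping backend service...")

def handle_term(signum, frame):
    # Assumption: the agent / OS delivers SIGTERM on abort;
    # run the cleanup before exiting so nothing is left dangling.
    cleanup()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_term)
signal.signal(signal.SIGINT, handle_term)
```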
I tested by installing flask in the default env, which put it in the ~/.local/lib/python3.6/site-packages folder. Then I created a venv with the --system-site-packages flag, activated it, and flask was indeed available
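For reference, the same check expressed with the stdlib venv module (the path and package name are just from my test):
```python
import subprocess
import venv

# Create a venv that can also see the user/system site-packages
# (equivalent to: python -m venv --system-site-packages /tmp/test-venv)
venv.create("/tmp/test-venv", system_site_packages=True, with_pip=True)

# Check that flask (installed in ~/.local/lib/python3.6/site-packages)
# is importable from inside the new venv
result = subprocess.run(
    ["/tmp/test-venv/bin/python", "-c", "import flask; print(flask.__file__)"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True,
)
print(result.stdout)
```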
Could you please share the stacktrace?
I was able to fix by applying for a license and registering it
Oof, now I cannot start the second controller in the services queue on that second machine; it fails with:
` Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r'...
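Those `/tmp/build/.../work` entries look like packages that were installed by conda and got recorded as local file paths in the installed packages, so the path simply does not exist on the agent machine. A rough sketch of the kind of sanitizing I have in mind (entirely illustrative, not an official ClearML mechanism):
```python
import re

# Lines like "cffi @ file:///tmp/build/80754af9/cffi_1605538068321/work"
# come from conda-built packages; the path only exists on the machine
# where the environment was originally built.
LOCAL_BUILD = re.compile(r"^\s*(?P<name>[A-Za-z0-9_.-]+)\s*@\s*file:///tmp/build/")

def sanitize(requirements):
    """Replace local conda-build references with plain package names
    so pip can resolve them from PyPI instead."""
    cleaned = []
    for line in requirements:
        match = LOCAL_BUILD.match(line)
        cleaned.append(match.group("name") if match else line)
    return cleaned

print(sanitize(["cffi @ file:///tmp/build/80754af9/cffi_1605538068321/work",
                "torch==1.7.1"]))
# -> ['cffi', 'torch==1.7.1']
```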
Which commit corresponds to the RC version? So far we tested with the latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff)
I also would like to avoid any copy of these artifacts on s3 (to avoid double costs, since some folders might be big)
should I try to roll back to clearml-server 1.0.2? I am very anxious now…
Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2
AgitatedDove14 yes, but I don't see in the docs how to attach it to the logger of the EarlyStopping handler
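What I was trying to do, roughly (assuming the EarlyStopping handler exposes a standard `logging` logger, which is my reading of the ignite source; the dummy engines and the "loss" metric are just placeholders):
```python
import logging

from ignite.engine import Engine, Events
from ignite.handlers import EarlyStopping

trainer = Engine(lambda engine, batch: None)    # dummy training step
evaluator = Engine(lambda engine, batch: None)  # dummy evaluation step

def score_function(engine):
    # higher is better for EarlyStopping; assumes a "loss" metric is reported
    return -engine.state.metrics["loss"]

handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)

# Attach a console handler to the EarlyStopping logger so its messages
# (e.g. "EarlyStopping: Stop training") actually show up in the task output
handler.logger.addHandler(logging.StreamHandler())
handler.logger.setLevel(logging.INFO)

evaluator.add_event_handler(Events.COMPLETED, handler)
```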
So the new EventsIterator is responsible for the bug.
Is there a way for me to easily force the WebUI to always use the previous endpoint (v1.7)? I saw in the v1.1.0 → v1.2.0 diff that the ES version was bumped to 7.16.2. I am using an external ES cluster, and its version is still 7.6.2. Could the incompatibility come from there? I’ll update the cluster to make sure it’s not the case
AgitatedDove14 Is it possible to shut down the server while an experiment is running? I would like to resize the volume and then restart it (should take ~10 mins)
I guess I’ll get used to it 😄
Nice, thanks!
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist
might be worth documenting 😄
SuccessfulKoala55 For the past 2 hours I have been getting 504 errors and I cannot ssh into the machine. AWS reports that the instance health checks fail. Is it safe to restart the instance?
Will it freeze/crash/break/stop the ongoing experiments?
AgitatedDove14 I am actually considering rolling back to 1.1.0, so 1.3.0 is not really an option for now
Yes, what happens in the case of installation with pip wheel files?
When installed with http://get.docker.com, it works
as it's also based on pytorch-ignite!
I am not sure to understand, what is the link with pytorch-ignite?
We're in the brainstorming phase of what are the best approaches to integrate, we might pick your brain later on
Awesome, I'd be happy to help!
AnxiousSeal95 Any update on this topic? I am very excited to see where this can go 🤩
Yes, not sure it is connected either, actually. To make it work, I had to disable both venv caching and set use_system_packages to off, so that it reinstalls the full env. I remember that we discussed this problem already, but I don't remember the outcome; I was never able to make it update the private dependencies based on the version. But this is most likely a problem with pip, which is not clever enough to parse the tag as a semantic version and check whether the installed package ma...
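To illustrate what I mean (purely a sketch, with a hypothetical package and tag): for a direct git requirement pinned to a tag, pip sees the package is already installed and never compares the tag with the installed version. The check I wish it did would look something like:
```python
from importlib.metadata import version  # Python 3.8+; importlib_metadata backport on older Pythons
from packaging.version import Version

def needs_reinstall(package_name, git_tag):
    """Compare the installed version with the version encoded in the git tag
    (e.g. 'v1.2.0') -- the comparison pip skips for direct git references."""
    installed = Version(version(package_name))
    wanted = Version(git_tag.lstrip("v"))
    return installed != wanted

# hypothetical usage: reinstall only if the tag moved
print(needs_reinstall("requests", "v2.26.0"))
```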
torch==1.7.1 git+ .
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?