It is likely you have mismatched CUDA versions. I presume you have cu113 locally but cu114 on the remote. Have you run any updates lately?
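A quick way to check which CUDA build you actually have on each machine (a minimal sketch; run it both locally and on the remote):
```python
import torch

# prints the installed wheel and the CUDA version it was built against,
# e.g. "1.12.1+cu113" and "11.3"
print(torch.__version__)
print(torch.version.cuda)
```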
MelancholyElk85 thank you, however I am not sure where to put that label?
This means that an agent only ever spins up one particular image? I'd like to define different container images for different tasks, possibly even build them in the process of starting a task. Is such a thing possible?
I haven't looked, I'll let you know next time it happens
To answer myself on the first part: task.get_parameters() retrieves all the arguments that can be set. The key syntax seems to be Args/{argparse destination}.
However, this does not return the commit hash :((
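A minimal sketch of what that looks like (the task ID below is a placeholder; only the standard clearml Task API is assumed):
```python
from clearml import Task

# fetch an existing experiment by ID (placeholder ID)
task = Task.get_task(task_id="<your-task-id>")

# returns the configurable arguments as a flat dict,
# keyed as "Args/<argparse destination>"
params = task.get_parameters()
for name, value in params.items():
    print(name, "=", value)  # e.g. Args/batch_size = 32
```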
We didn't change a thing from the defaults that are in your GitHub 😄 so it's 500M?
We have deployed clearml-agents as systemd services. This lets you tell systemd to restart an agent whenever it crashes, and the agents start automatically when the server boots!
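Roughly, the unit file looks like the sketch below (the paths, user name, and queue name are assumptions; adjust them to your environment). Note the agent runs in the foreground, without --detached, so systemd can supervise it:
```
# /etc/systemd/system/clearml-agent.service (illustrative example)
[Unit]
Description=ClearML agent
After=network-online.target

[Service]
User=clearml
ExecStart=/usr/local/bin/clearml-agent daemon --docker --gpus all --queue Q_NAME
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
Enable it with `sudo systemctl enable --now clearml-agent` and systemd takes care of restarts and boot-time startup.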
Okay, thank you for the suggestions, we'll try it out
I tried to build allegroai/clearml-agent-services on my laptop with ubuntu:22.04
and it failed
The log also suggests there is no cu113 installation:
Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113
Yup, absolutely. Otherwise it cannot run your code haha
Errors pop up occasionally in the Web UI. All we see is a dialog with the text "Error"
Yes, that's right. We deployed it on a GCP instance
CostlyOstrich36 this sounds great. How do I accomplish that?
clearml-agent daemon --docker --gpus all --queue Q_NAME --log-level DEBUG --detached
It's not because of the remote machine, it's the requirements 😅 As I said, the package is not on PyPI. Try adding this at the top of your requirements.txt:
-f
torch==1.12.1+cu113 ...other deps...
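For reference, the -f (find-links) line normally points at the PyTorch wheel index; assuming the usual index URL, the top of requirements.txt would look roughly like this (verify the URL and fill in your remaining dependencies):
```
# find-links URL below is the commonly used PyTorch wheel index; confirm it matches your setup
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.12.1+cu113
# ...other deps...
```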
This was actually a reset (of one experiment), not a delete
I guess I'll let you know the next time this happens haha
No errors in logs, but that's because I restarted the deployment :(
SOLVED: It was an expired service account key in a clearml config
SuccessfulKoala55 sorry for the bump, what's the status of the fix?
Hello, a similar thing happened today. In the developer's console there was this line
https://server/api/v2.19/tasks.reset_many 504 (Gateway time-out)