Hi DeliciousBluewhale87
You can achieve the same results programmatically with Task.create
https://github.com/allegroai/clearml/blob/d531b508cbe4f460fac71b4a9a1701086e7b6329/clearml/task.py#L619
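For example, something along these lines (a minimal sketch; the repo/branch/script values are placeholders, and the exact keyword set depends on your clearml version):
```python
from clearml import Task

# create a draft task directly from a repository (placeholder values)
task = Task.create(
    project_name='examples',
    task_name='my programmatic task',
    repo='https://github.com/your-org/your-repo.git',
    branch='master',
    script='train.py',
)
# then enqueue it for an agent to pick up ('default' is a placeholder queue name)
Task.enqueue(task, queue_name='default')
```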
Why? The task should have completed successfully, how is this aborting?
Early stopping by the HPO process, e.g. Hyperband: "this training run is going nowhere, let's stop it."
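For reference, this is roughly how those early-stopping knobs are wired when building the optimizer (a sketch based on the standard clearml HPO example; the task id, queue, metric names, and parameter range are all placeholders):
```python
from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

optimizer = HyperParameterOptimizer(
    base_task_id='<template_task_id>',  # placeholder: the task cloned per trial
    hyper_parameters=[
        UniformParameterRange('General/learning_rate', min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title='validation',  # placeholder metric
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    optimizer_class=OptimizerOptuna,
    execution_queue='default',            # placeholder queue
    total_max_jobs=10,
    # under-performing trials may be aborted anywhere inside this iteration window
    min_iteration_per_job=1000,
    max_iteration_per_job=10000,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```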
WickedGoat98
The trains-agent-services docker is always CPU; the idea is to put long-lasting services there (like auto cleanup, Slack integration, HPO, etc.)
To spin up an agent with GPU on any machine (regardless of where the trains-server is), check the trains-agent readme.
https://github.com/allegroai/trains-agent#running-the-trains-agent
Think I will have to fork and play around with it
NICE! (BTW: if you manage to get it working I'll be more than happy to help push the PR)
Maybe the quickest win is to store just the .py as a model?
Can you test with the credentials also in the global section?
key: "************"
secret: "********************"
Also, what's the clearml python package version?
Local changes are applied before installing requirements, right?
correct
BTW: I'm assuming that args is not the ArgumentParser object, as the ArgumentParser is automatically "connected"?
Hmm, you are correct
Which means this is some conda issue; basically, when installing from an env file, conda is not resolving the correct pytorch version 😞
Not sure why... Could you try to upgrade conda?
Hi ElegantCoyote26
is there a way to get a Task's docker container id/name?
you mean like `Task.get_task("task_id_here").get_base_docker()`?
Oh, a Task's results page also has a plot for this, but I guess it's at the machine level and not the task level?
This is actually on the container level, meaning checked from inside the container. It should be what you are looking for
It might be that the file upload was broken?
Still not supported 😞
That would be great! Might have to use `2>/dev/null` in some of my bash scripts
Feel free to test and PR :)
One other question regarding connecting: we have set up sshd inside the docker image we are using.
Actually the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port)
It actually runs an additional sshd inside the docker, setting its port.
And the clearml-session will ssh directly into the container sshd...
This is what I just used:
```python
import os
from argparse import ArgumentParser

from tensorflow.keras import utils as np_utils
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Activation, Dense, Softmax
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

from clearml import Task

parser = ArgumentParser()
parser.add_argument('--output-uri', type=str, required=False)
args = parser.parse_args()
```
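(The paste cuts off there; a guess at the continuation, assuming the argument feeds Task.init, with placeholder project/task names, not the original code:)
```python
# hypothetical continuation: use the CLI value as the task's output destination
task = Task.init(
    project_name='examples',   # placeholder
    task_name='keras mnist',   # placeholder
    output_uri=args.output_uri,
)
```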
or do you mean the machine I ran the experiment locally?
Yes this one
How so? Installing a local package should work, what am I missing?
It completed after the max_job limit (10)
Yep, this is Optuna "testing the water"
I could improve the cost-efficiency of my provisioned GCP A100 instances
But their pricing is linear; if you do not need an A100, get a cheaper instance, no?
WickedGoat98 Actually the fileserver replied, so it all looks fine to me.
Try to run the text example again, see if you are still getting the fileserver error.
BeefyHippopotamus73 this error seems like it is coming from boto3. Are you sure the credentials are properly configured and that you have read permission?
Okay. And `110` means 11.1 and not 11.0? (edited)
110 means 11.0; the odd thing is, it actually installed 11.1, and from the pytorch website this is exactly how they suggest installing with conda...
Let me know if forcing the CUDA version changes anything
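(e.g. by pinning it in the agent's clearml.conf; a sketch assuming you want 11.1:)
```
agent {
    # 0 = auto-detect; 111 forces package resolution for CUDA 11.1
    cuda_version: 111
}
```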
The problem is that clearml installs `cudatoolkit=11.0` but `cudatoolkit=11.1` is needed.
You suggested this fix earlier, but I am not sure why it didn't work then.
Hmm, could you test with clearml-agent 0.17.2? Making sure this actually solves the problem.
ComfortableShark77 it seems clearml-serving is trying to upload data to a different server (not download the model).
I'm assuming this has to do with CLEARML_FILES_HOST and missing credentials. It has nothing to do with downloading the model (which, as you posted, will be from the s3 bucket).
Does that make sense?
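If that's it, pointing the files server at the right host should fix it. A sketch of the relevant clearml.conf section (the host value is a placeholder; the CLEARML_FILES_HOST environment variable overrides it):
```
api {
    # placeholder: set to your actual fileserver address
    files_server: "http://your-clearml-server:8081"
}
```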
RoughTiger69
1. Move the files locally (i.e. based on the example, move folder b into folder a)
2. Create a new version with two parents ('a' and 'b')
3. Sync the local root folder ('a' in your case)

Only the meta-data should change (because the referenced files are already in one of the datasets), see the sketch below. wdyt?
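Something like this (a sketch with the Dataset API; the project, name, and ids are placeholders):
```python
from clearml import Dataset

# create the merged version with both existing datasets as parents (placeholder ids)
merged = Dataset.create(
    dataset_project='examples',
    dataset_name='a_plus_b',
    parent_datasets=['<dataset_a_id>', '<dataset_b_id>'],
)
# sync against the local root folder ('a', which now also contains 'b');
# files already referenced by a parent are not re-uploaded, only the meta-data changes
merged.sync_folder(local_path='/path/to/a')
merged.upload()
merged.finalize()
```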