Hi ShallowCormorant89
Can you verify the http link is valid? Can you download it from code on your machine (i.e. not via an agent)? Maybe port 8081 is blocked from the agent machine to the server?
Also, this path should follow a Linux folder structure, not point to a single file like the current .zip.
I like where this is going 🙂
So are we thinking of a "shared" folder where the data is kept "warm", plus a single source of truth where the packaged zip file is stored (like object storage, e.g. S3)?
Yep, automatically moving a tag
No, but you can get the last created/updated one with that tag (so I guess the same?)
meant like the best artifacts.
So artifacts can be retrieved like a dict:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts_retrieval.py
Task.get_task(project_name='examples', task_name='artifacts example').artifacts['name']
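Spelled out, a minimal sketch of retrieving an artifact from a finished task (the project/task/artifact names are placeholders from the example):
```python
from clearml import Task

# fetch a previously executed task and access its artifacts dict
task = Task.get_task(project_name='examples', task_name='artifacts example')
artifact = task.artifacts['name']        # Artifact object from the dict-like interface
local_path = artifact.get_local_copy()   # downloads (and caches) the artifact locally
print(local_path)
```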
The idea of queues is, on the one hand, not to let users have too much freedom, and on the other hand, to allow for maximum flexibility & control.
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example, I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only). The same argument goes for the number of CPUs.
Ch...
Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
error [Errno 13] Permission denied:
Seems like a permission issue?
Try to remove your entire clearml cache folder (by default ~/.clearml/cache)
But I do not know how it can help me :(
In your code itself, after the Task.init call, add:
task.set_initial_iteration(0)
See reply here:
https://github.com/allegroai/clearml/issues/496#issuecomment-980037382
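In context, a minimal sketch (project/task names are just placeholders):
```python
from clearml import Task

# continuing a previous run: reset the reported iteration so scalars start at 0
task = Task.init(project_name='examples', task_name='continue training')
task.set_initial_iteration(0)
```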
Hi @<1545216070686609408:profile|EnthusiasticCow4>
My biggest concern is what happens if the TaskScheduler instance is shutdown.
Good question. Follow-up: what happens to the cron service machine if it fails?!
TaskScheduler instance is shutdown.
And yes you are correct if someone stops the TaskScheduler instance it is the equivalent of stopping the cron service...
btw: we are working on moving some of the cron/triggers capabilities to the backend; it will not be as flexi...
Hi @<1578193378640662528:profile|MoodySeaurchin4>
but is it possible to log some metrics too, like rmse or the likes? If so, how would you do it?
Sure, I'm assuming this is part of the output? If not, this means it is part of your code, and if that's the case then yes, you should use collect_custom_statistics_fn
`collect_custom_statistics_fn({'rmse'...
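For reference, a sketch of how this is typically wired in a clearml-serving Preprocess class, assuming the standard method signature where collect_custom_statistics_fn is a callable taking a dict (the rmse value here is hypothetical):
```python
# clearml-serving preprocess.py sketch
class Preprocess(object):
    def postprocess(self, data, state, collect_custom_statistics_fn=None):
        if collect_custom_statistics_fn:
            # report any custom metric computed in your code, e.g. rmse
            collect_custom_statistics_fn({'rmse': float(data.get('rmse', 0))})
        return data
```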
Funny, it's the "h5" extension; it triggers a different execution path inside Keras...
Let me see what can be done 🙂
If you mean like Canary? Then yes, but only on the KFServing backend (coming soon), since the engines themselves do not support it (this is basically a "routing" feature)
but the debug samples and monitored performance metric show a different count
Hmm, could you expand on what you are getting, and what you are expecting to get?
In that case you should probably mount the .ssh from the host file-system into the docker, for example:
docker run -v /home/user/.ssh:/root/.ssh ...
WickedGoat98 the above assumes you are running the docker manually; if you are using a docker-compose.yml file, the same mount should be added to the docker-compose.yml
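And a sketch of the equivalent mount in docker-compose.yml (the service name "agent" is just a placeholder):
```yaml
services:
  agent:
    volumes:
      # mount the host SSH keys into the container, read-only
      - /home/user/.ssh:/root/.ssh:ro
```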
okay so the error should have been:
trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server http://<IP>:8008 ?
Not https nor 8010?!
Hi CharmingPuppy6
Basically yes there is.
The way clearml is designed is to have queues abstract different types of resources, for example a queue for single-gpu jobs (let's name it "single_gpu") and a queue for dual-gpu jobs (let's name it "dual_gpu").
Then you spin agents on machines and have the agents pull jobs from specific queues based on the hardware they have. For example, we can have a 4-GPU machine with 3 agents: one agent connected to 2xGPUs and pulling Tasks from the "dual_gpu...
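As a sketch, assuming clearml-agent is installed on that 4-GPU machine (queue names follow the example above):
```bash
# 3 agents on one box: one dual-GPU agent and two single-GPU agents
clearml-agent daemon --queue dual_gpu --gpus 0,1 --detached
clearml-agent daemon --queue single_gpu --gpus 2 --detached
clearml-agent daemon --queue single_gpu --gpus 3 --detached
```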
TrickySheep9
you are absolutely correct 🙂
I made a custom image for the VMSS nodes, which is based on Ubuntu and has multiple CUDA versions installed, as well as conda and docker pre-installed.
This is very cool, any reason for not using dockers for the multiple CUDA versions?
Hi VexedCat68
txt file or pkl file?
If this is a string, it just stores it (not as a file; this is considered a "link"):
https://github.com/allegroai/clearml/blob/12fa7c92aaf8770d770c8ed05094e924b9099c16/clearml/binding/artifacts.py#L521
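A minimal sketch of the difference (names and paths are placeholders):
```python
from pathlib import Path
from clearml import Task

task = Task.init(project_name='examples', task_name='artifact types')

# a plain string is stored as-is and displayed as a "link" (nothing is uploaded)
task.upload_artifact('my_link', artifact_object='https://example.com/data.csv')

# a Path object references an actual file, which gets uploaded
task.upload_artifact('my_file', artifact_object=Path('/tmp/model.pkl'))
```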
EnviousPanda91 please feel free to PR if it works 🙂
https://github.com/allegroai/clearml/blob/86586fbf35d6bdfbf96b6ee3e0068eac3e6c0979/clearml/binding/frameworks/catboost_bind.py#L114
I execute the clearml-session with the --docker flag.
This is to control the docker image the agent will spin up for you (think of the dev environment you want to work in, like the nvidia pytorch container that already has everything you need)
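For example (the image name is just an illustration):
```bash
# spin a remote interactive session inside a specific docker image
clearml-session --docker nvcr.io/nvidia/pytorch:23.03-py3
```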
JitteryCoyote63
Yes, this is extremely annoying. I think it was updated on the community server, let me check if we deployed a new docker with a fix ...
models been trained stored ...
mongodb will store url links; the upload itself is controlled via the "output_uri" argument to the Task.
If None is provided, Trains logs the locally stored model (i.e. a link to where you stored your model); if you provide one, Trains will automatically upload the model (into a new subfolder) and store the link to that subfolder.
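A minimal sketch (the bucket path is a placeholder; this thread predates the rename to clearml, hence trains):
```python
from trains import Task

# with output_uri set, model snapshots are uploaded there automatically
task = Task.init(project_name='examples', task_name='train',
                 output_uri='s3://my-bucket/models/')
```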
- how can I enable tensorboard and have the graphs stored in trains?
Basically if you call Task.init all your...