So when the agent fires up it gets the hostname, which you can then get from the API.
I think it does something like "getlocalhost", a Python function that is OS agnostic
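For reference, a minimal sketch of an OS-agnostic hostname lookup in Python (the exact function the agent uses may differ):

import socket

# Standard-library, OS-agnostic hostname lookup; the agent derives its
# default agent ID from a value like this
hostname = socket.gethostname()
print(hostname)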
that's the entire repo link ? not something like https://github.com/ ... ?
So if you set it, then all nodes will be provisioned with the same execution script.
This is okay in a way, since the actual "agent ID" is by default set based on the machine hostname, which I assume is unique ?
is there a built-in programmatic way to adjust development.default_output_uri ?
How about setting it in your Task.init(output_uri='...') call?
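For example (project, task, and bucket names here are placeholders):

from clearml import Task

# Setting output_uri per task overrides development.default_output_uri
task = Task.init(
    project_name='examples',
    task_name='my_task',
    output_uri='s3://my-bucket/artifacts',  # placeholder destination
)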
SubstantialElk6
Hmm do you have torch in the "installed packages" section of the Task ?
(This is what the agent uses to set up the environment inside the docker container, running as a pod)
SubstantialElk6 "Execution Tab" scroll down you should have "Installed Packages" section, what do you have there?
So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?
Same process?!
Also I'd like to create the queues programmatically, is that possible?
Yes, you can. You can also pass an argument for the agent to create the queue if it does not already exist: just add --create-queue to the agent execution command line
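For example (queue name is a placeholder):

clearml-agent daemon --queue my_new_queue --create-queue

And a sketch of creating a queue programmatically with the APIClient (assuming server credentials are already configured):

from clearml.backend_api.session.client import APIClient

client = APIClient()
# Create the queue on the server
client.queues.create(name='my_new_queue')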
just to check. Does the k8s glue install torch by default?
SubstantialElk6 what do you mean the glue installs torch ?
The glue will take a Task from the queue and create a k8s job (basically, use the same docker image and, inside the container, run the agent to execute the requested Task). Where would "torch" come into play?
ItchyJellyfish73
Unfortunately this needs backend support, and it is only available in the enterprise version. What is your use case for it? (It was designed to allow out-of-the-box bare-metal multi-GPU dynamic allocation; think of a DGX with 8 GPUs where, instead of spinning down agents when you want to change the queue->num-gpu mapping, you can do it on the fly.)
Hi @<1590514584836378624:profile|AmiableSeaturtle81>
I think you should use add_external_files instead of add_files (which is for local files).
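A minimal sketch (project, dataset, and bucket names are placeholders):

from clearml import Dataset

ds = Dataset.create(dataset_project='examples', dataset_name='my_dataset')
# Register the remote files by link instead of uploading local copies
ds.add_external_files(source_url='s3://my-bucket/data/')
ds.upload()
ds.finalize()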
Hi @<1523704757024198656:profile|MysteriousWalrus11>
"parents": [
"step_two",
"step_four"
],
Seems like step 5 depends on steps 2+4. How did you create it? What did the console say?
Could it be you're not actually passing any output from step3? How is it dependent on it?
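For reference, a minimal sketch of declaring step dependencies explicitly with PipelineController (step and task names are hypothetical):

from clearml import PipelineController

pipe = PipelineController(name='my_pipeline', project='examples', version='1.0.0')
pipe.add_step(name='step_three', base_task_project='examples', base_task_name='Step 3')
# step_five will only start once step_three completes
pipe.add_step(
    name='step_five',
    parents=['step_three'],
    base_task_project='examples',
    base_task_name='Step 5',
)
pipe.start()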
Seems correct.
I'm assuming something is wrong with the key/secret quoting ?!
Could you generate another one and test it ?
(you can have multiple key/secret pairs on the same user)
MoodyCentipede68 from your log
clearml-serving-triton | E0620 03:08:27.822945 41 model_repository_manager.cc:1234] failed to load 'test_model_lstm2' version 1: Invalid argument: unexpected inference output 'dense', allowed outputs are: time_distributed
This seems to be the main issue: triton is failing to load the model
Does that make sense to you? How did you configure the endpoint model?
Well, from the error it seems there is no layer called "dense", hence triton failing to find the layer that returns the result. Does that make sense?
Let me rerun the code and check
MoodyCentipede68 can you post the full docker-compose log (from spinning it up until you get the error)?
You can just pipe the output to a file with: docker-compose ... up > log.txt
PompousParrot44 obviously you can just archive a task and run the cleanup service; it will actually delete archived tasks older than X days.
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py
TeenyFly97 the TL;DR is:
Task.close() should be called when you previously used Task.init (i.e. in the code that created the task)
Task.mark_stopped() should be called to stop a remotely running Task.
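A sketch of both cases (the task ID is a placeholder):

from clearml import Task

# In the process that created the task:
task = Task.init(project_name='examples', task_name='my_task')
task.close()

# From anywhere else, to stop a remotely running Task:
remote_task = Task.get_task(task_id='<task-id>')
remote_task.mark_stopped()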
I hope it helps 🙂
It's stored on the Task; you can see it under the Execution tab in the UI
Hi @<1600299043865497600:profile|MagnificentSeaurchin90>
Any chance you can provide more info on the error?
if I want to compare two experiments the scalar plots do not load ( loading forever ).
I'm assuming the issue is the Plots tab? or is it the Scalars? what do you have in the Plots? can you send an image of the single experiment ?
BTW: could it be that Task.init is not called in the "module.name" entry point, but somewhere internally?
Hi PanickyMoth78
I had several pipeline components getting it and uploading files to it concurrently.
Should not be a problem
I've attached its log file, which only mentions skipping one file (a warning)
So what exactly is the error you are getting?
how can I for example convert it back to a pandas dataframe?
You can always report a csv file with report_media as well, or if this is not for debugging, maybe upload it as an artifact?
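A sketch of both options (names are placeholders):

import pandas as pd
from clearml import Task

task = Task.init(project_name='examples', task_name='report')
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('table.csv', index=False)

# Option 1: attach the csv as media, viewable/downloadable from the UI
task.get_logger().report_media(title='table', series='csv', local_path='table.csv')

# Option 2: upload the dataframe as an artifact; it can be fetched back
# with task.artifacts['table'].get()
task.upload_artifact(name='table', artifact_object=df)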
It seems to follow a structure specific to clearml,
Actually plotly.js 🙂
there is a bug wherein both Task.current_task() and Logger.current_logger() return None.
This is not a bug; it means something broke. The environment variable CLEARML_TASK_ID has to be set inside the agent's process
How are you running it? (also send the log 🙂, you can DM so it is not public here)
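A quick sanity check you can run inside the executing process (a sketch):

import os
from clearml import Task

# The agent injects CLEARML_TASK_ID into the process it runs; if it is
# missing, Task.current_task() will return None
print('CLEARML_TASK_ID =', os.environ.get('CLEARML_TASK_ID'))
print('current_task =', Task.current_task())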