okay so the error should have been:
trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server http://<IP>:8008 ?
Not https nor 8010 ?!
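For reference, a minimal sketch of the api section in trains.conf that matches this setup (the <IP> and ports here are placeholders; the point is that the API server listens on 8008 over plain http, not https/8010):

```
api {
    api_server: http://<IP>:8008
    web_server: http://<IP>:8080
    files_server: http://<IP>:8081
}
```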
is there a way to increase the size of the text input for fields or a better way to handle lists?
No 😞
Maybe an easier way is to use connect_configuration instead? It will take an entire dict and store it as text (the format is HOCON, which is YAML/JSON compatible, which means it is hard to break when editing).
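A minimal sketch of what that could look like (the project/task names and dict values are just placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="config demo")

# store the whole dict as editable HOCON text on the Task;
# when running via an agent, the returned dict reflects any UI edits
params = {"batch_size": 32, "lr": 0.001, "layers": [64, 64]}
params = task.connect_configuration(params, name="my config")
```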
But the same configuration does not work on the machine with the trains-agent?
Hi OddShrimp85
right place to ask about clearml serving.
It is 🙂
I did not manage to get clearml-serving to work with my own clearml server and triton setup.
Yes it should have been updated already, apologies.
Until we manage to sync the docs, what seems to be your issue, maybe we can help here?
In the documentation it warns about
.close()
"Only call Task.close if you are certain the Task is not needed."
Maybe this is not clear enough; "not needed" means you no longer need to automatically Add/Log/Track things into the Task from the current process.
This does Not mean you cannot access the Task or its artifacts
Mark closed means to externally (i.e. not from the process that created the Task, maybe even from a different machine) close and mark the task as completed (this...
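For example, a hedged sketch of doing exactly that from the outside (the task ID is a placeholder):

```python
from clearml import Task

# from a different process (or machine): fetch the Task by ID
task = Task.get_task(task_id="<task-id>")

# externally mark it as completed
task.mark_completed()
```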
Hi @<1541954607595393024:profile|BattyCrocodile47>
Has anyone used ClearML for this use case?
you mean as experiment management / model registry / data? I think this is the bread & butter of clearml 🙂
Regarding the other options in the list, I think most of them are alternatives to metaflow, not covering the parts you mentioned, no?
Hi @<1653207659978952704:profile|LovelyStork78>
I have a docker container with all the dependencies.
Well, I think the main question is: are you using the clearml-agent to launch jobs/experiments? If you do, it makes sense to specify your docker as the "base docker image" (in the UI, look under the Execution tab, Container section).
This means the agent will use the pre-installed environment and will add anything that your Task needs on top of it, this of course includes pushing your codebase i...
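If it helps, a minimal sketch of setting the same thing from code instead of the UI (the image name is a placeholder, assuming a recent clearml version):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker demo")

# equivalent to filling in the Container field under the Execution tab
task.set_base_docker(docker_image="my_repo/my_image:latest")
```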
Actually you cannot breakpoint at "atexit" calls (or at least it doesn't work with my gdb)
But I would add a few prints here:
https://github.com/allegroai/clearml/blob/aa4e5ea7454e8f15b99bb2c77c4599fac2373c9d/clearml/task.py#L3166
do you have your Task.init call inside the "train.py" script ? (and if you do, what are you getting in the Execution tab of the task) ?
would I have to execute each task in the pipeline locally (but still connected to trains),
Somehow you have to have the pipeline step Task in the system: you can import it from code, or you can run it once; then the pipeline will clone it and reuse it. Am I missing something?
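As a sketch, referencing an existing Task as a pipeline step could look like this (project/task names are placeholders):

```python
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0")

# point the step at a Task that already exists in the system;
# the pipeline will clone it and enqueue the clone
pipe.add_step(
    name="step1",
    base_task_project="examples",
    base_task_name="my step task",
)

pipe.start()
```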
WickedGoat98
for such pods instantiating additional workers listening on queues
I would recommend creating a "devops" user and having its credentials spread across all agents. Sounds good?
EDIT:
There is no limit on number of users on the system, so login as a new one and create credentials in the "profile" page :)
Hi PanickyMoth78
```python
torch.save(net.state_dict(), PATH)  # auto-uploads to GCS

# get all the models from the Task
output_models = Task.current_task().models["output"]

# get the last one
last_model = output_models[-1]

# set meta-data
last_model.set_metadata(key="my key", value="my value", type="str")
```
Hi GiganticTurtle0
Let me check
ColossalDeer61 btw, it turns out the docker-compose services section was misconfigured on GitHub 😞 I suggest you get the latest copy of it:
curl -o docker-compose.yml
Hi DilapidatedCow43
I'm assuming the returned object cannot be pickled (which is ClearML's way of serializing it)
You can upload it as a model with
```python
uploaded_model_url = Task.current_task().update_output_model(model_path="/path/to/local/model")
...
return uploaded_model_url
```
wdyt?
Yes, i basically plan to use ClearML as user-friendly cluster manager
and it is 🙂
I think the main "drawback" is that you cannot "reserve" nodes for the multi-node training. The easiest solution is to have a high-priority queue that is never otherwise used, and then have the DDP master process push into that high-priority queue, which will ensure these are the next Tasks to be executed (now the only thing that is missing is preemption of running Tasks, but this automation policy is unfortunate...
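A rough sketch of that master-process logic (the Task ID, queue name, and node count are all placeholders):

```python
from clearml import Task

BASE_TASK_ID = "<task-id>"   # the worker Task to replicate
WORLD_SIZE = 4               # hypothetical number of nodes

# clone the worker Task per node and push the clones into the
# otherwise-unused high priority queue so they are executed next
for rank in range(1, WORLD_SIZE):
    worker = Task.clone(source_task=BASE_TASK_ID)
    Task.enqueue(worker, queue_name="high_priority")
```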
BoredHedgehog47
is this ( https://clearml.slack.com/archives/CTK20V944/p1665426268897429?thread_ts=1665422655.799449&cid=CTK20V944 ) the same issue (or solution) ?
yes 🙂
But I think that when you get the internal_task_representation.execution.script you are basically already getting the API object (obviously with the correct version) so you can edit it in place and pass it too
Sorry @<1657918706052763648:profile|SillyRobin38> I missed this reply
Is ClearML-Serving using either System or CUDA shared memory?
This needs to be set on the docker-compose:
and I think this line actually includes `ipc: host`, which means there is no need to set the `shm_size`, but you can play around with it and let me know if you see a difference
[None](https://github.com/allegroai/clearml-serving/blob/7ba356efc97a6ae2159283d198d981b3c1ab85e6/docker/docker-compose-triton-gpu.yml#L1...
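For illustration, the relevant docker-compose knobs would look something like this (service name as in the linked compose file; the shm_size value is a placeholder):

```yaml
services:
  clearml-serving-triton:
    ipc: host          # share the host's shared memory with the container
    # shm_size: "1gb"  # alternative knob if you drop ipc: host
```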
Task.running_locally()
Should do the trick
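e.g. a minimal guard:

```python
from clearml import Task

if Task.running_locally():
    # True only when the script was launched manually,
    # not when it is executed by a clearml-agent
    print("running on the dev machine")
```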
or at least stick to the requirements.txt file rather than the actual environment
You can also force it to log the requirements.txt with:
Task.force_requirements_env_freeze(requirements_file="requirements.txt")
task = Task.init(...)
GiganticTurtle0 fix was pushed 🙂
you can test with: pip install git+ 🤞
ZanyPig66 this should have worked, any chance you can send the full execution log (in the UI "results -> console" download full log) and attach it here? (you can also DM it so it is not public)
SubstantialElk6 if you call Task.init with continue_last_task=<task_id> it will automatically add the last_iteration of the previous run, to any logging/report so you never overwrite the previous reports 🙂
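Something along these lines (project/task names and the previous task ID are placeholders):

```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="resume run",
    continue_last_task="<previous-task-id>",
)
```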
I can't find out how to pass my custom clearml.conf
Hi @<1544491301435609088:profile|TeenyElk27>
The easiest is to map it into the container in your docker-compose
(map a host clearml.conf into /root/clearml.conf inside the container)
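For example, a sketch of the volume mapping (the service name and host path are assumptions; adjust to your compose file):

```yaml
services:
  agent-services:
    volumes:
      - /home/me/clearml.conf:/root/clearml.conf
```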
This is strange... Could you send the browser console log, maybe there is an exception there
@<1523701304709353472:profile|OddShrimp85> are you trying to shut down the one running on your machine ?
Do you want to PR it? should be a quick fix