The latest image seems to require driver version 460+ on the host
try this one:
https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel_20-12.html#rel_20-12
BTW: Full RestAPI reference here
https://allegro.ai/clearml/docs/rst/references/clearml_api_ref/index.html
How or why is this the issue?
The main issue is a missing requirement on the Task component, and this is why it is failing.
You can, however, manually specify the package (and I'm assuming this will solve the issue), but it should have been autodetected, no?
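If you do want to specify it manually, here is a minimal sketch using the pipeline decorator interface (the component, function, and package pin below are just placeholders, not your actual code):
from clearml.automation import PipelineDecorator

# explicitly list the requirement the auto-detection missed
@PipelineDecorator.component(packages=["pandas==1.5.3"])
def preprocess(csv_path):
    import pandas as pd  # imports inside the component run on the agent
    return pd.read_csv(csv_path)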
Hi GrotesqueOctopus42 ,
BTW: is it better to post the long error message in a reply, to avoid polluting the channel?
Yes, that is appreciated 🙂
Basically, post logs in the thread of the initial message.
To fix this I had to spin up the agent using the --cpu-only flag (--docker --cpu-only)
Yes, if you do not specify --cpu-only it will default to trying to access the GPUs.
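For reference, a typical invocation looks something like this (the queue name is just an example):
clearml-agent daemon --queue default --docker --cpu-only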
Nice!
Thanks TroubledHedgehog16 for the context.
sdk.development.worker.report_period_sec
Yes please update to the latest version 1.8.0 for full support (to be released today, I think)
https://github.com/allegroai/clearml/blob/f6238b8a0fb662540bca9095cc0c22bd7af483c1/docs/clearml.conf#L196
https://github.com/allegroai/clearml/blob/f6238b8a0fb662540bca9095cc0c22bd7af483c1/docs/clearml.conf#L199
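For example, overriding it in clearml.conf could look like this (the 30-second value is just an illustration):
sdk {
  development {
    worker {
      # send status/resource reports every 30 seconds instead of the default
      report_period_sec: 30
    }
  }
}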
we have been running agents on 3 on-premise systems.
Do notice that by default an...
Wait, that makes no sense to me. The API from python and the API from the UI are getting the same data from the backend ...
What are you getting with:
from clearml import Task
task = Task.get_task(task_id=<put task id here>)
print(task.models)
Hi JitteryCoyote63
Wait a few hours, there is a new fix, I'll make sure we upload it later today (scheduled to be there anyhow, I'll push it forward)
Notice you have to configure the shared drive for Docker, as the volume mount doesn't work without it. https://stackoverflow.com/a/61850413
VictoriousPenguin97 I'm not sure there is an easy solution; basically you have to edit both MongoDB (artifacts) and Elastic (think debug samples) 🙂
I would recommend reading this blog post, it should give you a glimpse of what can be built 🙂
https://medium.com/pytorch/how-trigo-built-a-scalable-ai-development-deployment-pipeline-for-frictionless-retail-b583d25d0dd
IrateBee40 I think I have an idea what's wrong with the https connection.
Could it be there is some firewall in the middle intercepting the network, and without installing its SSL certificate the SSL connection is failing?
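If that's the case, one quick way to test (assuming the certificate really is the culprit, and only for debugging, never production) is disabling certificate verification in clearml.conf:
api {
  # testing only: skips SSL certificate verification
  verify_certificate: false
}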
I'm using the default operation mode, which uses kubectl run. Should I use templates and specify a service in there to be able to connect to the pods?
Ohh the default "kubectl run" does not support the "ports-mode" 🙂
There's a static number of pods which services are created for…
You got it! 🙂
Hi DeliciousBluewhale87
When you say "workflow orchestration", do you mean like a pipeline automation ?
If I were to push the private package to, say, Artifactory, is it possible to use that to do the install?
Yes, that's the recommended way 🙂
You add the private repo here, for the agent to use:
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L65
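A sketch of that section in the agent's clearml.conf (the Artifactory URL below is a placeholder):
agent {
  package_manager {
    # extra pip index the agent consults when installing task requirements
    extra_index_url: ["https://artifactory.example.com/api/pypi/my-repo/simple"]
  }
}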
RipeGoose2 yes that will work 🙂
That said, we should probably fix the S3 credentials popup 🙂
Do you mean it recently became part of the enterprise version?
I do not think so, but it seems the support for the open-source version is more like a PoC
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
Hi EnviousStarfish54
Verified with the frontend / backend guys.
The backend allows searching for "all" tags, and the frontend will add a toggle button in the UI to select any/all for the selected tags.
Should be part of the next release
Hi SarcasticSparrow10, so yes it does; this is more efficient when using pytorch loaders, and in some other situations.
To disable it, add to your clearml.conf:
sdk.development.report_use_subprocess = false
2. Interesting error; maybe we can revert to "thread mode" if running under a daemon. (I have to admit, I'm not sure why Python has this limitation, let me check it...)
FierceFly22 wow, that is a cool hack! Trains will capture any torch.save call, so I think the actual driver here is the model.summary. You can also upload any artifact with task.upload_artifact('name', 'modelsummary.txt')
Touching a file will not trigger Trains, as it does not monitor the files themselves. Make sense?
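For reference, a minimal sketch of the artifact route (the file name and summary text below are placeholders):
from clearml import Task  # on older Trains versions: from trains import Task

task = Task.init(project_name="examples", task_name="model summary upload")

# `summary_text` stands in for whatever your model.summary() / str(model) produces
summary_text = "Layer (type)  Output Shape  Param #\n..."
with open("model_summary.txt", "w") as f:
    f.write(summary_text)

# upload the file itself as a task artifact
task.upload_artifact("model summary", artifact_object="model_summary.txt")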
BTW, how will you get the file when running with the agent? If you are using the connect_configuration it will be downloaded from the trains-server for you. Otherwise you can alw...
PompousBeetle71 kudos on the solution!
What were the loggers you ended up setting?
I'd like to make sure we fix this issue
This means all the components of the pipeline use the exact same packages, and then it will just reuse the venv. Make sense?
ERROR: Could not install packages due to an EnvironmentError:
[Errno 28] No space left on device
BTW: @<1523703080200179712:profile|NastySeahorse61> this sounds like Docker running out of space on the main disk (/var/) where it stores all the images and temp file systems
This will cause your code to fail, as any runtime change to the container file system will raise this out-of-disk-space error.
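To confirm and reclaim space on the host, something like this usually helps (assuming a default Docker install under /var/lib/docker):
# check free space where Docker stores images and container layers
df -h /var/lib/docker
# reclaim space: remove unused images, stopped containers, and build cache
docker system prune --all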
WickedGoat98
Put the agent.docker_preprocess_bash_script in the root of the file (i.e. you can just add the entire thing at the top of the trains.conf).
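A sketch of what that could look like (the commands themselves are just placeholders):
agent.docker_preprocess_bash_script = [
  "echo 'running inside the container before the task starts'",
  "apt-get update -y",
]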
Might it be possible that I can place a trains.conf in the mapped local folder containing the filesystem and MongoDB data etc., e.g.
I'm assuming you are referring to the trains-agent services; if this is the case, sure you can.
Edit your docker-compose.yml, under line https://github.com/allegroai/trains-server/blob/b93591ec3226...