Could it be the code is not in a git repository? ClearML supports either a single script or a git repository, but not a collection of standalone files. wdyt?
VexedCat68 are you manually creating the OutputModel object?
I'm sorry, wrong line reference:
I'm assuming the error is due to a missing ulimit:
try adding 16777216 to both the soft and hard ulimit
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L58
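For reference, a minimal sketch of what the elasticsearch service's ulimits section in docker-compose.yml could look like (assuming the limit in question is the open-files limit; treat the exact values as an assumption):

```yaml
services:
  elasticsearch:
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        # open-files limit, raised for both soft and hard
        soft: 16777216
        hard: 16777216
```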
NastyFox63
is there a limit to the search depth for this?
Yes, the Task.init auto package listing covers only the first depth (i.e. directly imported packages);
the reason is that the derivative packages should be resolved by pip when the agent remotely executes that Task.
Then, when the agent is installing the Task, the entire python environment is stored, so that it is always fully reproducible.
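For example (a minimal sketch, assuming a script that directly imports pandas):

```python
from clearml import Task
import pandas as pd  # directly imported -> listed under "installed packages"
# numpy is pulled in by pandas, so it is NOT listed here;
# pip resolves it when the agent reproduces the environment

task = Task.init(project_name="examples", task_name="package depth demo")
```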
Make sense ?
Is the step actually "queued", or is it "queued" only in the pipeline state (i.e. the visualization did not update)?
Hi DilapidatedDucks58
trains-agent tries to resolve the torch package based on the specific cuda version inside the docker (or on the host machine if used in virtual-env mode). It seems to fail finding the specific version "torch==1.6.0.dev20200421+cu101"
I assume this version was automatically detected by trains when running manually. If this version came from a private artifactory you can add it to the trains.conf https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L...
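Something along these lines in trains.conf (a sketch; the artifactory URL is hypothetical):

```
agent {
    package_manager: {
        # additional pip repository to search for the private torch build
        extra_index_url: ["https://my.artifactory.example/api/pypi/pypi/simple"]
    }
}
```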
Thanks!
Hmm, from here: None
Could it be you do not have privileges to the resource, or that you did not provide credentials ?
Did that autoscaler work before ?
Thanks for the logs AdorableDeer85
Notice that the log you attached means the preprocessing is executed and the GPU backend is returning an error.
Could you provide the docker compose log? Specifically, the interesting part is the Triton container; I want to verify it loads the model properly
AttributeError: 'NoneType' object has no attribute 'base_url'
can you print the model object ?
(I think the error is a bit cryptic, but generally it might be that the model is missing an actual URL link?)
print(model.id, model.name, model.url)
TroubledHedgehog16 generally speaking you can expect about 10 API calls per minute if you have many reports, and about 3 per minute with few reports. We just optimized the SDK so that lots of consecutive reports are batched together; I would recommend the latest RC
Hi RoundMosquito25
This is a bit old but probably a good start:
https://clear.ml/blog/stacking-up-against-the-competition/
tl;dr
ClearML advantages (at least a few I can think of):
- Scales way better
- Enables out-of-the-box experiment orchestration (i.e. remote execution etc.)
- Data management
- Nicer UI
- Full RestAPI
- Full MLOps platform
- Model serving
- Query-able model repository
- Probably more 🙂
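As a quick illustration of the query-able model repository (a sketch; the project and tag names are made up):

```python
from clearml import Model

# query the model repository for models in a project, filtered by tag
models = Model.query_models(project_name="examples", tags=["production"])
for m in models:
    print(m.id, m.name)
```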
this is very odd, can you post the log?
it fails during the add_step stage for the very first step, because task_overrides contains invalid keys
I see, yes I guess it makes sense to mark the pipeline as Failed 🙂
Could you add a GitHub issue on this behavior, so we do not miss it ?
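For anyone reading along, a minimal sketch of where task_overrides comes into play (project/task names are made up; an invalid key, e.g. a typo like "scrpt.branch", is what would trigger the failure described above):

```python
from clearml import PipelineController

pipe = PipelineController(name="demo pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="step_one",
    base_task_project="examples",
    base_task_name="base task",
    # task_overrides keys are dot-separated task-field paths;
    # an invalid key here would fail the step
    task_overrides={"script.branch": "main"},
)
```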
Hi Martin, of course not,
Smart!
I was just wondering if it has been patched yet, and if not, what is the expected timeline for patching it
Yes, I believe the target is a patch version 1.15.1 to be released in a couple of weeks. This is not a major issue but it's always better to have it fixed. (btw: the enterprise version never had this issue to begin with, because it is of course authenticated, and it has an additional RBAC layer on top.)
SmarmySeaurchin8
When running in "dev" mode (i.e. writing the code) only packages imported directly are registered under "installed packages". Then, when the agent is executing the experiment, it will update back the entire environment (including derivative packages etc.)
That said you can set detect_with_pip_freeze to true (in trains.conf) and it will basically store the entire pip freeze.
https://github.com/allegroai/trains/blob/f8ba0495fb3af1f99732fdffbbccd2fa992934a4/docs/trains.c...
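i.e. something like this in trains.conf (a sketch):

```
sdk {
    development {
        # store the full `pip freeze` output instead of only directly imported packages
        detect_with_pip_freeze: true
    }
}
```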
GiganticTurtle0 is it in the same repository ?
If it is, it should have detected that it needs to analyze the entire repository (not just the standalone script) and then discover tensorflow
Long story short, the Task requirements analysis is async, so if one sets them after creating the object, it might (at least in theory) be too late.
Make sense ?
Meanwhile check CreateFromFunction(object).create_task_from_function(...)
It might be better suited than execute_remotely for your specific workflow 🙂
FlutteringWorm14 Can you verify that even with the clearml.conf it has no effect?
RoundMosquito25 are you using clearml-agent daemon --stop or are you killing them ?
killing them basically means you lose them in the UI when they time out; the backend does not see them for 10 min so it assumes they died. When you call clearml-agent daemon --stop they will unregister themselves and disappear immediately
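i.e. the graceful way (a sketch):

```bash
# unregisters the daemon from the server, so it disappears from the UI immediately
clearml-agent daemon --stop
```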
how do I set this configuration, and where?
In your clearml.conf on the machine with the agent, just add at the bottom of the file: agent.venvs_cache.path=~/.clearml/venvs-cache
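In full section form it would look something like this (a sketch):

```
agent {
    venvs_cache: {
        # enable virtual environment caching by setting the cache folder
        path: ~/.clearml/venvs-cache
    }
}
```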
Hi SteadyFox10 the way it works is that Trains limits the debug image history by reusing the same file names, so the UI will only present the iterations where the debug images are relevant. With your sample code it looks like it exposes a bug: the generated link should contain the iteration number, but it does not, and so it overwrites the debug images every iteration. Here is the image link: https://demofiles.trains.allegro.ai/Test/test_images.6ed32a2b5a094f2da47e6967bba1ebd0/metrics/Test/te...
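As a side note, the number of unique debug-image file names kept per metric is controlled by a config knob (a sketch, in trains.conf; treat the value as an assumption):

```
sdk {
    metrics {
        # number of history files kept per metric/variant combination
        file_history_size: 100
    }
}
```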
Guys, any chance you can verify the RC solves the issue?
pip install clearml==1.0.2rc0
One suggestion is to make sure all agents have the same configuration. Another is to add pip into the "installed packages" section.
(Notice that in the next release we will specifically include it there, to avoid these kinds of scenarios)
TrickySheep9 is this a conda package or a wheel you are installing manually ?
The other way around: "8011:8008"
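i.e. in docker-compose.yml the mapping is host:container (a sketch):

```yaml
ports:
  - "8011:8008"  # host port 8011 -> container port 8008
```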