TypeError: __init__() got an unexpected keyword argument 'base_pod_num'
Could you post the entire log?
Hmm, that makes sense to me. Any chance you can open a GitHub issue so we do not forget? (I do not think it should be very complicated to fix)
I get the same "white" image in both TB & ClearML
I think it's because the proxy env vars are not passed to the container ...
Yes, this seems correct; the errors point to a network issue, i.e. the container does not seem to be able to connect to the clearml-server
Hi GrotesqueOctopus42
Despite having reuse_last_task_id=True on Task.init, it always creates a new task id. Anyone ever had this issue?
So the way reuse_last_task_id=True works is that if there are no artifacts on the Task it will reuse it, but when running inside Jupyter the Task always has artifacts (the notebook itself), so it starts a new Task.
You can however pass a specific Task ID and it will reuse it: reuse_last_task_id="aabb11". Would that help?
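A minimal sketch, where the project/task names are placeholders and "aabb11" stands in for the real previous task ID:
from clearml import Task

# reuse a specific previous task instead of relying on the artifact heuristic
task = Task.init(
    project_name="my_project",
    task_name="my_task",
    reuse_last_task_id="aabb11",
)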
Hi ThickDove42 ,
Yes, but by the time you are able to access it, it will be in display form (plotly), which is not very convenient.
If this is something you need to reuse, I would argue that it is an artifact and should be stored as an artifact (then accessing it is transparent). Obviously you can both report it as a table and upload it as an artifact, no harm in that.
what do you think?
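A minimal sketch of doing both, assuming a pandas DataFrame and placeholder project/task names:
import pandas as pd
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# report as a table for display in the UI
task.get_logger().report_table(title="results", series="summary", iteration=0, table_plot=df)
# also store it as an artifact so it can be retrieved programmatically later
task.upload_artifact(name="results_table", artifact_object=df)

Retrieving it later is then just Task.get_task(task_id="...").artifacts["results_table"].get()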
MagnificentSeaurchin79
Do notice that the pipeline controller assumes you have an agent running
we run in containers without venv, in the main section, and then delete it or use it for similar experiments
If this is the case then the idea is that the venv creation is actually cached; you can turn it on here (unmark the line):
https://github.com/allegroai/clearml-agent/blob/51eb0a713cc78bd35ca15ed9440ddc92ffe7f37c/docs/clearml.conf#L116
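For reference, the relevant section of clearml.conf looks roughly like this (exact fields may differ between versions; unmarking the path line enables the cache):
venvs_cache: {
    # maximum number of cached venvs
    max_entries: 10
    # minimum required free space to allow for cache entry, disable by passing 0 or negative value
    free_space_threshold_gb: 2.0
    # unmark to enable virtual environment caching
    # path: ~/.clearml/venvs-cache
}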
Hit Ctrl-F5 (reload the page), do you still get the same error? Is it limited to a specific experiment?
ProudMosquito87 I think this is what you are looking for: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L101
I thought this was the issue in the thread you linked, did I miss something?
CloudyHamster42 what's the trains-server version?
CourageousLizard33 VM?! I thought we were talking about a fresh install on Ubuntu 18.04?!
Is the Ubuntu in a VM? If so, I'm pretty sure 8GB will do, maybe less, but I haven't checked.
How much did you end up giving it?
Hm GiganticTurtle0, let me quickly check it
Another point I see is that in the workers & queues view the GPU usage is not being reported
It should be reported; if it is not, maybe you are running the trains-agent
in CPU mode? (try adding --gpus)
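For example, something along these lines (the queue name is a placeholder):
trains-agent daemon --queue default --gpus 0
where --gpus takes the GPU index to use (or a comma-separated list, e.g. 0,1).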
Ohh... I would not delete them then ...
Maybe some kind of heuristic (files created over a week ago can be deleted?!)
In your code, can you print the following:
import os
print(os.environ.keys())
There should be a few keys the Pycharm plugin is sending from the local machine, pointing to the git repo
I located the issue, I'm assuming the fix will be in the next RC
(probably tomorrow or before the weekend)
Is task.parent something that could help?
Exactly, something like:
from clearml import Task

# my step is running here; task.parent holds the pipeline controller's task ID
task = Task.current_task()
the_pipeline_task = Task.get_task(task_id=task.parent)
And the agent continues running.
oh just kill all the processes with clearml-agent in their command line:
pkill -9 -f clearml-agent
Besides that, what are your impressions of these serving engines? Are they much better than just creating my own API + ONNX, or even my own API + normal PyTorch inference?
I would separate ML frameworks from DL frameworks.
With ML frameworks, the main advantage is multi-model serving on a single container, which is more cost-effective when serving multiple models, as well as the ability to quickly update models from the clearml model repository (just tag + publish and the end...
you can also specify additional packages on the decorator:
@PipelineDecorator.component(..., packages=["tqdm>=2.1", "scikit-learn"])
def step_one(...):
    # code here
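If I remember correctly, the packages listed there are installed into the component's execution environment by the agent, instead of being inferred only from the step's imports.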
Why are there indefinitely growing anonymous tasks, even after I've closed the main schedulers?
The anonymous Tasks are the Datasets you are creating (a Dataset version is also a Task of a certain type with artifacts; the idea is that Datasets are usually created from code, hence the need to combine the two).
Make sense ?
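A minimal sketch of the code path that creates such a task, with placeholder names and a hypothetical local data folder:
from clearml import Dataset

# creating a dataset version also creates a backing task behind the scenes
ds = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")
ds.add_files("data/")
ds.upload()
ds.finalize()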
Hi @<1720249421582569472:profile|NonchalantSeaanemone34>
Is it possible to read data directly from server w/o using get_local_copy()?
do you mean an artifact? What is "direct" here?
The latest TAO doesn't use Python for fine-tuning; rather, it uses the CLI entirely
It's a good question, but I think the CLI actually just runs python code (the CLI is their interface). Generally speaking I'm pretty sure it will not be complicated to convert the TLT integration to support TAO (Nvidia helps with that, and I think we had a similar process with Nvidia Clara/MONAI)
BTW: how are you using Nvidia TAO ?
Hi JealousParrot68
spin up the clearml-agent with docker support (i.e. each experiment runs inside its own container):
https://clear.ml/docs/latest/docs/clearml_agent#docker-mode
Basically you can specify a default docker image to use (per agent) and a specific docker container to use per Task (configured in the UI under the Execution section, at the bottom)
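A rough sketch of both, where the queue name and the image are placeholders:
clearml-agent daemon --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04

and setting the per-Task container from code (equivalent to setting it in the UI):
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
# request a specific container image for when an agent executes this task
task.set_base_docker("nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")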
In short, I was not able to do it with Task.clone and Task.create; the behavior differs from what is described in the docs and docstrings (this is another story - I can submit an issue on github later)
The easiest is to use task_overrides
Then pass:
task_overrides = {'script': {'diff': '', 'branch': 'main'}}
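If this is within a pipeline step, a hedged sketch of where such overrides go (names are placeholders; the docs examples typically use dotted keys like "script.branch"):
from clearml import PipelineController

pipe = PipelineController(name="my_pipeline", project="my_project", version="1.0")
pipe.add_step(
    name="step_one",
    base_task_project="my_project",
    base_task_name="base task",
    # override the cloned task's git branch and drop the uncommitted diff
    task_overrides={"script.branch": "main", "script.diff": ""},
)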
Hi @<1524922424720625664:profile|TartLeopard58>
Yes, this is the default; it is designed to serve multiple models and scale horizontally