
What we would like, ideally, is a system where development, training, and deployment are almost one and the same thing, to reduce the lead time from development code to production models.
This is very aligned with the goals of ClearML 🙂
I would like to understand more about what is currently missing in ClearML, so we can better support this approach.
My inexperience in using them until recently. I can see how that is a better solution.
I think I failed to explain myself, I me...
Hi PompousParrot44
Well, this kind of control is tricky. If you don't mind processes "fighting over CPU", you can just spin up two trains-agents in cpu-mode. It will work as long as each has a different TRAINS_WORKER_NAME.
The other option (which might be a bit of overkill) is to use K8s, which will set the CPU % for the entire agent.
What do you think?
IrritableJellyfish76 if this is the case, my question is: what is the reason to use Kubeflow? (Spinning up a JupyterLab server is a good answer, for example; pipelines, in my opinion, are a lot less so.)
Hi OddShrimp85
If you pass output_uri=True to Task.init, it will upload the model automatically, or, as you said, you can do it manually with the OutputModel class.
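A minimal sketch of both options (the project/task names and the weights file are placeholders):
from clearml import Task, OutputModel

# option 1: let ClearML upload models automatically (output_uri=True uses the default files server)
task = Task.init(project_name="examples", task_name="train", output_uri=True)

# option 2: register a weights file manually via the OutputModel class
model = OutputModel(task=task)
model.update_weights(weights_filename="model.pt")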
okay, let me see if I can nail down the issue
So without the flush I got the error apparently at the very end of the script -
Yes... it's a Python thing: background threads might get killed in random order, so when something needs a background thread that has already died you get this error, which basically means you need to do the work in the calling thread.
This actually explains why calling Flush solved the issue.
Nice!
Hmm, so there is a way to add callbacks (somewhat cumbersome, and we would love feedback) so you can filter them out.
What do you think, would that work?
Hmm, I'm assuming something is wrong here:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L119
What's the host machine OS?
Or do you mean the machine I ran the experiment on locally?
Yes this one
Hi PanickyFish98
It verifies it has access to it when actually creating the Task, maybe it should be a warning?!
FYI: you can also change the value from the UI (under Execution > Output) or have a default one set in the clearml.conf used by the agent.
Hi SubstantialElk6
1. clearml-agent was just updated; it should solve the issue.
2. Notice that the "torch" / "torchvision" packages are resolved by the agent based on the PyTorch compatibility table. Is there a way to reproduce the issue where it fails resolving the torch version? Could you send a full log?
3. If you want a specific torch version, you can put a direct link to the torch wheel, for example: https://download.pytorch.org/whl/cu102/torch-1.6.0-cp37-cp37m-linux_x86_64.whl
Is an implementation of this kind interesting for you, or do you suggest forking?
You mean adding a ConfigMap storing a default trains.conf for the agent?
Yes it does. I'm assuming each job is launched using a multiprocessing.Pool (which translates into a subprocess). Let me see if I can reproduce this behavior.
This is definitely a bug; the superclass should have the same condition (the issue is checking whether you are trying to change the "main" task).
Thanks ApprehensiveFox95
I'll make sure we push a fix 🙂
if project_name is None and Task.current_task() is not None:
    project_name = Task.current_task().get_project_name()
This should have fixed it, no?
Regarding the demo app, this is just a default server that allows you to start playing around with ClearML without needing to set up any of your own servers or sign up.
That said, I would recommend signing up (totally free) on the community server:
https://app.community.clear.ml/
TrickyFox41 are you saying that if you add Task.init in the code it works, but when you are calling "clearml-task" it does not work? (In both cases editing the Args/overrides?)
The draft is created successfully, but it doesn't contain the property with the docker command.
Could you help me?
ApprehensiveFox95 could you test with the latest RC? I think there was a fix: pip install clearml==0.17.5rc5
E.g., I'm creating a task using clearml.Task.create, and often it doesn't get the git diff correctly.
ShakyJellyfish91 Task.create does not store any "git diff" automatically; is there a reason not to use Task.init?
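For reference, a minimal sketch of the Task.init route (project and task names are placeholders); since Task.init runs inside the training script itself, it can capture the repository, the git diff / uncommitted changes, and the installed packages automatically:
from clearml import Task

# called from inside the training script, so the git diff is picked up automatically
task = Task.init(project_name="examples", task_name="my experiment")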
Hi @<1657918724084076544:profile|EnergeticCow77>
Can I launch training with the Hugging Face accelerate package using multi-GPU?
Yes,
It detects torch distributed, but I guess I need to set up the main task?
It should
Under the Execution tab, in the script path, you should see something like -m torch.distributed.launch ...
So if any step corresponding to 'inference_orchestrator_1' fails, then 'inference_orchestrator_2' keeps running.
GiganticTurtle0 I'm not sure it makes sense to halt the entire pipeline if one step fails.
That said, how about using the post_execution callback? Then, if the step failed, you could stop the entire pipeline (and any running steps). What do you think?
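Something along these lines, as a sketch, assuming the PipelineController's post_execute_callback is used (the step/project names are placeholders and the failure check is just an illustration):
from clearml import PipelineController

def stop_pipeline_on_failure(pipeline, node):
    # assumption: if this step's job ended in a failed state, abort the whole pipeline
    if node.job and node.job.is_failed():
        pipeline.stop(mark_failed=True)

pipe = PipelineController(name="inference pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="inference_orchestrator_1",
    base_task_project="examples",
    base_task_name="orchestrator",
    post_execute_callback=stop_pipeline_on_failure,  # runs after the step completes
)
pipe.start()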
Follow-up: any ideas how to avoid PEP 517 with the autoscaler?
Takes a long time to build the wheels.
Enable venv caching?
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L116
Nice! 🙂
BTW: clone=True means creating a copy of the running Task, but basically there is no need for that; with clone=False, it will stop the running process and launch it on the remote host, logging everything on the original Task.
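In code this would look roughly like the following sketch (the queue name is a placeholder):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")

# clone=False: stop the local process and relaunch this exact Task on the remote agent,
# logging everything back into the original Task
task.execute_remotely(queue_name="default", clone=False, exit_process=True)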
Can you verify by adding the following to your extra_docker_shell_script:
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L152
extra_docker_shell_script: ["echo machine example.com > ~/.netrc", "echo login MY_USERNAME >> ~/.netrc", "echo password MY_PASSWORD >> ~/.netrc"]
DeterminedToad86
Yes, I think this is the issue: on SageMaker a specific compiled version of torchvision was installed (probably part of the image).
Edit the Task (before enqueuing) and change the torchvision URL to: torchvision==0.7.0
Let me know if it worked
Regarding resetting it via code, if you need it I can write a few lines for you to do that, although it might be a bit hacky.
Maybe we should just add a flag saying "use requirements.txt"?
What do you think?
CloudyHamster42
The RC will probably be out in a few days, but notice that it will just remove the warnings; I still can't reproduce the double-axis issue.
It would be helpful if you could send a small script to reproduce the problem.
Maybe this example code can help? https://github.com/allegroai/trains/blob/master/examples/manual_reporting.py
BTW: basically just call Task.init(...), the rest is magic 🙂
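A minimal sketch in the spirit of that manual reporting example (the names and values are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="manual reporting")
logger = task.get_logger()

# report a scalar explicitly, instead of relying only on the automatic framework logging
logger.report_scalar(title="loss", series="train", value=0.23, iteration=1)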
Does a pipeline step behave differently?
Are you disabling it in the pipeline step?
(disabling it for the pipeline Task has no effect on the pipeline steps themselves)
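Assuming what is being disabled is the automatic framework logging: each pipeline step is its own Task, so the flag has to go into the step's own Task.init call, for example (the framework key here is only an illustration):
from clearml import Task

# inside the pipeline step's script, not in the pipeline controller
task = Task.init(
    project_name="examples",
    task_name="pipeline step",
    auto_connect_frameworks={"matplotlib": False},  # example: disable matplotlib auto-logging
)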
Hi @<1523701304709353472:profile|OddShrimp85>
Is there anywhere I could get a chart that can work with a lower version of k8s? Or any other methods?
I think the solution is to install it manually from the helm chart (basically take it out and build a Job YAML), wdyt?