Is there any way to debug these sessions through clearml? Thanks!
Yes, this is a real problem, AWS does not make it easy to get that data...
Can you check the AWS console, see what you have there ?
In theory this should have worked.
Maybe you are missing some escaping for the "extra_vm_bash_script"?
I'm hoping the console output will tell us
Hi CleanPigeon16
You need to pass the private repository docker credentials to the AWS instance. I would use the custom bash script option of the AWS autoscaler to create the docker credentials file (rough sketch below).
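For illustration only, a minimal sketch of what could go into that option (wrapped here as a Python string, since the script is passed as text); the registry URL and the environment variable names are placeholders, not anything ClearML defines:

# Hedged sketch: content for the autoscaler's "extra_vm_bash_script" option.
# Running "docker login" on the EC2 instance writes the docker credentials file
# (~/.docker/config.json), so the agent can then pull from the private registry.
extra_vm_bash_script = "\n".join([
    "# placeholders below - use your own registry and secret handling",
    "echo $PRIVATE_REGISTRY_TOKEN | docker login my.registry.example.com --username $PRIVATE_REGISTRY_USER --password-stdin",
])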
Thanks SarcasticSparrow10 !
I'll reply on the GitHub issue later (for better visibility)
But my initial thoughts:
(1) I think this was suggested, and hopefully we will get to implementing it, I can definitely see the value. Meanwhile you can achieve some of the functionality with the experiment table and custom columns 🙂
(2) "Don't display the performance metric" -> isn't that important? what am I missing?
(3) Hmm you mean just extra columns?
(4) sounds like a bug
(5) is this a plotly issue?...
Hi PompousParrot44
What do you have in the Execution/"script path" ?
I'm getting a lot of bizarre errors running without a docker image attached
I think there is a mix in terminology
ClearML Agent can run in two different modes:
- virtual env - where it creates a new venv for every Task executed
- docker mode - where it spins up a docker container as the base environment, then inside the docker (in real time) it will fetch the code, install missing python packages etc. There is no need to build a specific docker container, for example you can use the "python:3.10-bullseye" d...
Hi @<1697056701116583936:profile|JealousArcticwolf24> just saw the reply
Image looks okay?! what is the query? basically I'm trying to understand if Grafana is connected to Prometheus, and if Prometheus has any data in it
Secondly, just to make sure, the kafka service should be able to connect directly to the container running the actual inference
Scheduled training is what I'm looking forward to
I'll try to remember to update here once we push it to the GitHub repo, feedback is always appreciated 🙂
If in the next two weeks you hear nothing, please ping here to make sure I did not forget 🙂
it's in the docker image, doesn't the git clone command run in the container
Then this should have worked.
Did you pass in the configuration: force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L25
Then you have to pass the .ssh into the remote server, probably the easiest is to have it in the "extra bash script"
Hi @<1547028116780617728:profile|TimelyRabbit96>
Trying to do model inference on a video, so the first step in the Preprocess class is to extract frames.
Basically this depends on the RestAPI, usually you will be sending a link to the data to be processed and returned synchronously
What you should have is a custom endpoint doing the extraction, which sends the raw data into another endpoint doing the model inference, basically think "pipeline" endpoints (rough sketch below):
[None](https://github.com/allegro...
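As a rough illustration of that pipeline idea, a sketch only, assuming a clearml-serving style Preprocess class; the endpoint name, the use of OpenCV for frame extraction, and the plain HTTP forwarding via requests are illustrative assumptions, not the exact serving API:

import cv2        # assumption: frames are extracted with OpenCV
import requests   # assumption: frames are forwarded with a plain HTTP call


class Preprocess(object):
    """Custom 'extraction' endpoint: turn a video reference into per-frame inference calls."""

    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        # body is the raw request; assume it carries a link to the video to process
        capture = cv2.VideoCapture(body["video_url"])
        predictions = []
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            # forward each raw frame to the second endpoint that runs the actual model
            response = requests.post(
                "http://serving-host:8080/serve/frame_inference",  # hypothetical endpoint
                json={"frame": frame.tolist()},
            )
            predictions.append(response.json())
        capture.release()
        return {"frames": len(predictions), "predictions": predictions}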
The wheel you download from pip, for example this one torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl
is actually both CPU and cuda 117
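A quick way to check what a given install actually shipped with (just a sketch; the printed values are examples):

import torch

print(torch.__version__)          # version string may carry the build, e.g. "+cu117" or "+cpu"
print(torch.version.cuda)         # CUDA version the wheel was built against, None for CPU-only builds
print(torch.cuda.is_available())  # whether the local driver/GPU can actually be used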
@<1523701066867150848:profile|JitteryCoyote63>
I just created a new venv and ran
pip install "torch==1.11.0.*" --extra-index-url
Then started python:
import torch
torch.cuda.is_available()
And I get True
what are you getting?
Thanks a lot. I meant running a bash script after cloning the repository and setting the environment
Hmm that is currently not supported 🙂
The main issue in adding support is where to store this bash script...
Perhaps somewhere inside ClearML there is an order of actions for starting that can be changed?
Not that I can think of,
but let's assume you could have such a thing, what would you have put in the bash script (basically I want to see maybe there is a worka...
seems like pip 20.1.1 has the issue, but >= 22.2.2 do not.
Notice we changed the value there, it now has two versions, one for Python < 3.10 and one for Python >= 3.10
The main reason is that pip changed their resolving algorithm, and the new one can break its own dependencies (i.e. pip freeze > requirements.txt -> pip install might not actually work)
I am not sure what switching back will solve, here the wheel should have been correct, it's just the architecture of the card that is incompatible
So I tested the "old" code that did the parsing and matching, and it did resolve to the correct wheel (i.e. found that there is no 117 only 115 and installed this one)
I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt?
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117.
The thing is, the agent used to do all the heavy parsing because pytorch never actually had a pip compatible artifactory
But now they do, so the agent basically passed the parsing to pip and just added the correct additional pytorch pip repo.
It seems we need to switch back... wdyt?
Hello guys, I have 4 workers (2 in the default queue and 2 in the service queue on the same machine)
Hi @<1526734437587357696:profile|ShaggySquirrel23>
I think what happens is one agent is deleting its cfg file when it is done, but at least in theory each one should have its own cfg
One last request can you try with the agent's latest RC version 1.5.3rc2 ?
can you get the agent to execute the task in the current conda env without setting up a new environment?
Wouldn't that break easily? Is this a way to avoid dockers, or a specific use case?
is there any other way to get a task from the queue running locally in the current conda env?
You mean including cloning the code etc. but not installing any python packages ?
Hi LovelyHamster1
That is a good point, I think the safest / most robust way is to configure both to use the same dns name/s so both (internal/external) are accessible.
Some background: the URL itself on the artifact is basically standalone, once registered on the Task, the UI will not replace it but use it as is (the UI has no "understanding" of which server it is on, it will just fetch the file); see the small sketch below.
Are you also using a diff port on the load balancer?
(because the easiest fix is on your external ...
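To make the "standalone URL" point above concrete, a small sketch (the task ID and artifact name are placeholders):

from clearml import Task

# the URL is stored on the Task at upload time and later used verbatim by the UI/SDK
task = Task.get_task(task_id="<task-id>")  # placeholder ID
artifact = task.artifacts["my_artifact"]   # placeholder artifact name
print(artifact.url)                        # whatever host/port was registered, e.g. the internal DNS name
local_path = artifact.get_local_copy()     # fetches from that exact URL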
So it was definitely related to the symlinks in some form
could it be it actually deleted the cache? How many agents are running on the same machine?
Hmm, notice that it does store symlinks to parent data versions (to save on multiple copies of the same file). If you call get_mutable_local_copy() you will get a standalone copy
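For reference, a minimal sketch of the difference (the dataset ID and target folder are placeholders):

from clearml import Dataset

ds = Dataset.get(dataset_id="<dataset-id>")  # placeholder ID

# default local copy: cached, may contain symlinks to parent versions -> treat it as read-only
readonly_path = ds.get_local_copy()

# mutable copy: a standalone copy written into the target folder, safe to modify
writable_path = ds.get_mutable_local_copy("/tmp/my_dataset_copy")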
Hi FunnyTurkey96
Which pip are you using, basically pip changed the dependency resolver after 20.1
Change: https://github.com/allegroai/clearml-agent/blob/aede6f4bac71c8fc56e7cf982318a48527953a3c/docs/clearml.conf#L57
pip_version: "<20.2"
See if that helps
🙂
I'm trying to create a task that is not in the repository root folder.
JuicyFox94 If the Task is not in a repo folder, you mean in a remote repository, right?
This means the repo should be in the form of "https://github.com/" or "ssh://"
It failed in deducing this is a remote repository (maybe we can improve the auto detection?!)
Yep, found it, the --name is marked as required and the argparser throws an error ...
I'll make sure this is fixed as well 🙂
Nice guys! Notice that clearml-task can auto add the Task.init call on the fly, so you can connect any arbitrary Task and control the argparser arguments (again, as parameters to clearml-task)
BTW: A fix for the --task-type issue will be pushed later today 🙂
Hmm good point, it should probably return the clearml python version. Is this what you mean?
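On the Python side that version is simply the installed package version, e.g. (a trivial sketch):

import clearml

print(clearml.__version__)  # the installed clearml python package version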
maybe you can also check --version, that returns the help menu
What do you mean? --version on clearml-task?
My bad "ssh://" prefix it not valid, let me try and see why it fails deducing this is a remote repo