Reputation
Badges 1
25 × Eureka!BeefyHippopotamus73 are you saying that on a remote machine you cannot set AWS_PROFILE
? or is it the clearml.conf
is missing ? (not sure I follow how / who spins the remote machine)
I still think the issue is getting boto3 credentials
It might be the case
Are you using clearml-agent or are you running it manually ?
Hi BeefyHippopotamus73
. I checked the template task and the list of βInstalled Packagesβ indeed does not have one of my required packages in the list.
Basically the "installed packages" is auto populated based on the directly imported packages n your code base.
Could it be you do not have import snowflake-connector-python
and this is a derivative package (i.e. required from a different package)
BTW: when you clone your Task in the UI you can edit and add the missing packages,...
SarcasticSparrow10 sure see "execute_remotely" it does exactly that:
https://allegro.ai/docs/task.html#trains.task.Task.execute_remotely
It will stop the current process (after syncing everything) and launch itself remotely (i.e. enqueue itself)
When the same code is running by the "trains-agent" the execute_remotely call becomes a no-operation and is basically skipped
For example, ServerA stores file at /opt/clearml but ServeB stores at /some_path/clearml
As long as you adjust your docker-compose yaml file, should be just fine
VictoriousPenguin97 I'm assuming the exact same server version ?
Oh no, you are absolutely correct, it is broken (I mean I have no idea why it lists Hydra, or how it got there). I will let the guys know and fix it.
Bottom line, after you clone it, please edit the installed packages and remove the "Hydra" line and replace with just "hydra-core" (no need for version).
The format is the same as "requirements.txt" and will effect the venv created by the agent
The package detection is done when running the code on your laptop, and this is when it first logs the packages and versions. Following it, what do you have on your laptop? OS/Conda/Python
Wait, it shows "hydra==2.5" not "hydra-core==x.y" ?
It will also allow you to pass them to Hydra (wither as overloaded, or directly edit the entire hydra config)
Hi TroubledJellyfish71
What do you have listed on the Task's execution "installed packages" section ? (of the original Task) ?
How did it end up with an http link of pytorch ?
Usually it would be torch==1.11
...
EDIT:
I'm assuming the original Task was executed on a Mac M1, what are you getting when calling pip freeze
?
And where is the agent running ? (and is it venv or docker mode?)
Thanks for the details TroubledJellyfish71 !
So the agent should have resolved automatically this line:torch == 1.11.0+cu113
into the correct torch version (based on the cuda version installed, or cpu version if no cuda is installed)
Can you send the Task log (console) as executed by the agent (and failed)?
(you can DM it to me, so it's not public)
Thanks TroubledJellyfish71 I manged to locate the bug (and indeed it's the new aarach package support)
I'll make sure we push an RC in the next few days, until then as a workaround, you can put the full link (http) to the torch wheel
BTW: 1.11 is the first version to support aarch64, if you request a lower torch version, you will not encounter the bug
SarcasticSquirrel56 when the process dies (i.e. killed) it does not have time not update the state, then the server watchdog will set the state to aborted after X amount of time of inactivity (default is 2 hours)
You can definitely configure the watchdog to set the timeout to 15min, it should not have any effect on running processes, they basically ping every 30 sec alive message
Hi @<1567321739677929472:profile|StoutGorilla30>
Is it necessary to serve keras model using triton engine?
It is not, but it is the most efficient way to serve keras models, and this is why by default clearml-serving is using Nvidia Triton (we are talking 10x factors)
I would start with the keras example, see that it works and then work your way into your example (notice you always need to provide the layers form the in/out of the model)
[None](https://github.com/allegroai/clearml-s...
think it's because the proxy env var are not passed to the container ...
Yes this seems correct, the errors point to a network issues, i.e. the container does not seem to be able to connect to the clearml-server
StickyLizard47 apologies for the https://github.com/allegroai/clearml-server/issues/140 not being followed (probably slipped through the cracks of backend guys, I can see the 1.5 release happened in parallel). Let me make sure it is followed.
SarcasticSquirrel56 specifically, did you also spin a clearml-k8s glue? or are the agents statically allocated on the helm chart?
JumpyPig73 I think fire
was just added:
https://github.com/allegroai/clearml/pull/550
You can test with the latest RC:pip install clearml==1.2.0rc1
Regrading not finding Hydra-core package, what do you have listed under Execution: "Installed Packages" (it should have auto detected that you are importing hydra and list it there)
so I assume clearml moves them from one queue to the other?
Correct. When it creates the k8s job and launches it on the cluster it moves it into the queue.
Can you see it on your k8s cluster (meaning the job/pod)?
I changed them to the one exposed to the users (the same I have in my local clearml.conf) and things work.
Nice!
But I can't really figure out why that would be the case...
So the thing is, the link to the files are generated by the clients, which means the actual code generated a link an internal link to the file server (i.e. a link that only works inside the k8s cluster). When you wanted to see the image/plot you were accessing it from outside the cluster, and the link simply ...
BTW: any specific reason for going the RestAPI way and not using the python SDK ?
Hi EnchantingOstrich20
You how doe s clearml get it there?
In runtime it analyzes the code you are running looking for imports then checks the version you have actively used (i.e. active venv / python) and lists it there.
You can also override those in code, or edit them after you clone the ask and before you enqueue it for remote execution
how come the previous gitdiff passed ?
hmm that would explain it failing
ConvolutedSealion94 if you do bash:cd ~/work/repo/code/ git status
what are you getting ?
yes, I do, I added a
auxiliary_cfg
and I saw it immediately both in CLI and in the web ui
How many Tasks do you see in the UI in DevOps project with the system Tag SERVING-CONTROL-PLANE
?
TBH our Preprocess class has an import in it that points to a file that is not part of the preprocess.py so I have no idea how you think this can work.
ConvolutedSealion94 actually you can add an entire folder as preprocessing, including multiple files
See example des...