Hmm... That's what happens, with the exception of None/'' if type is str... There is no way to differentiate in the UI.
This is why we opted for type=str: it will "cast" everything to str so you always get str, while not specifying a type will leave the variable as is... If you have an idea on how to support both, feel free to suggest 🙂
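To make it concrete, here's a minimal sketch with plain argparse (no ClearML internals, just illustrating why the two become ambiguous once a value goes through a text field):
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--name", type=str, default=None)

# Locally the default stays None:
print(repr(parser.parse_args([]).name))                   # None

# But a UI field only holds text, so the stored value becomes 'None' or '',
# and with type=str both come back as plain strings:
print(repr(parser.parse_args(["--name", "None"]).name))   # 'None'
print(repr(parser.parse_args(["--name", ""]).name))       # ''
```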
Just run once (from your python console / pycharm etc.):
https://github.com/allegroai/clearml/blob/master/examples/automation/toy_base_task.py
Hi OutrageousSheep60
Do you mean something like:
https://github.com/allegroai/clearml/tree/master/examples/datasets
?
We are working on 1.3.0 so this is right on time
Hmm @<1523701083040387072:profile|UnevenDolphin73> I think this is the reason, None
and this means that even without a full lock file poetry can still build an environment
Hi @<1554275802437128192:profile|CumbersomeBee33>
what do you mean by "will the dependencies will be removed or not" ?
The next time the agent spins a new Task it will create a new venv and delete the previous one
Hi DepressedChimpanzee34
Why do you need to have the configuration added manually? Isn't the clearml.conf easier? If not, I think OS environment variables are easier, no? I ran the above code and everything worked with no exception/warning... What does the try/except solve exactly?
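For reference, a small sketch of the environment-variable route (standard ClearML env vars, values are placeholders):
```python
import os

# Placeholders - use your own server URL and credentials
os.environ["CLEARML_API_HOST"] = "https://api.clear.ml"
os.environ["CLEARML_API_ACCESS_KEY"] = "<access_key>"
os.environ["CLEARML_API_SECRET_KEY"] = "<secret_key>"

from clearml import Task

task = Task.init(project_name="examples", task_name="env configured run")
```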
Hi DeliciousKoala34
Happened when cloning and running a task on an agent on a different machine. I
Sounds like a torch internal issue, can you send the full log of the remote Task?
Hi SkinnyPanda43
No idea what the ImageId actually is.
That's the AMI image string that the new EC2 instance will be started with, makes sense?
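Something like this (illustrative only - key names follow the aws_autoscaler example, values are placeholders):
```python
# Part of the autoscaler's resource configuration (sketch, not a full config)
resource_configurations = {
    "gpu_machine": {
        "instance_type": "g4dn.xlarge",
        "ami_id": "ami-0123456789abcdef0",  # <-- the ImageId / AMI the new EC2 boots from
    }
}
```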
Hi @<1544853695869489152:profile|NonchalantOx99>
I would assume the clearml-server configuration / access key is misconfigured in your copy of example.env
Hi LackadaisicalOtter14
Is it possible to remove this line to stop it from being executed
Everything is possible 🙂 I think the main question is why it is there (which, to the best of my understanding, is to solve for any cuda drivers and installed packages, meaning anything that is installed at runtime)
I think we can suppress the error, wdyt?
`echo "ldconfig" 2>/dev/null >> /etc/profile && `
Both are fully implemented in the enterprise version. I remember a few medical use cases, and I think they are working on publishing a blog post on it, not sure. Anyhow I suggest you contact the sales people and I'm sure they will gladly set up a call/demo/PoC.
https://allegro.ai/enterprise/#contact
Hi, I changed it to 1.13.0, but it still threw the same error.
This is odd, just so we can make the agent better, any chance you can send the Task log ?
Check the log, the container has torch 1.13.0 but the task requires torch==1.13.1
Now, the torch packages inside those NVIDIA prepackaged containers are compiled a bit differently. What I suspect happens is that the torch wheel from PyTorch is not compatible with this container. Easiest fix: change the task requirements to 1.13
Wdyt ?
btw: both should work fine
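If it helps, a small sketch of pinning it from the training script (assuming you call Task.add_requirements before Task.init):
```python
from clearml import Task

# Pin torch to the version already shipped inside the NVIDIA container
Task.add_requirements("torch", "1.13.0")

task = Task.init(project_name="examples", task_name="pinned torch run")
```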
Hi @<1542316991337992192:profile|AverageMoth57>
is this a follow up of this thread? None
BTW: Can you also please test with the latest clearml version, 1.7.2?
Hi @<1523701949617147904:profile|PricklyRaven28>
Sorry, we missed that one
we need to invoke it with
accelerate launch
so we use
subprocess.run
So you have two options, either you change the script entry of the Task from your "script.py" to "-m accelerate launch script.py"
or you manually do that inside your entry point (i.e. call accelerate launch)
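Rough sketch of the second option (script name and args are placeholders):
```python
import subprocess
import sys

# Entry point that just hands off to `accelerate launch` for the real training script
result = subprocess.run(["accelerate", "launch", "script.py", *sys.argv[1:]], check=False)
sys.exit(result.returncode)
```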
BTW, I "think" we added an "auto detect" for it, so that if you launched it manually this wa...
BTW
/home/local/user/.clearml/venvs-builds/3.7/bin/python: can't open file 'train.py': [Errno 2] No such file or directory
This error is from the agent, correct? It seems it did not clone the correct code, is train.py committed to the repository?
Container environment setup overhead?
It may have been killed or evicted or something after a day or 2.
Actually the ideal setup is to have a "services" pod running all these services on a single pod, with clearml-agent --services-mode. This Pod should always be on and pull jobs from a dedicated queue.
Maybe a nice way to do that is to have the single Task serialize itself, then have a Pod run the Task every X hours and spin it down
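As a rough sketch of that idea (Task ID and queue name are placeholders), the always-on pod could just clone + enqueue the Task in a loop:
```python
import time

from clearml import Task

TASK_ID = "my_task_id"   # placeholder: the Task that serializes itself
QUEUE = "default"        # placeholder: the execution queue the agents pull from
EVERY_HOURS = 6

while True:
    cloned = Task.clone(source_task=TASK_ID, name="scheduled run")
    Task.enqueue(cloned, queue_name=QUEUE)
    time.sleep(EVERY_HOURS * 3600)
```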
So I would like to know what it sends to the server to create the task/pipeline, ...
WickedGoat98 are you running the agent with --gpus ?
I would say 4 vCPUs and 512GB storage, but it really depends on the load you will put on it
I'm assuming those errors are from the Triton containers? Were you able to run the simple pytorch mnist example serving from the repo?
Exactly! nice 🎉
I have a timeseries dataset with dimension 1,60,1 where the first dimension is the number of data points and the second one is the timestep
I think it should be `--input-size 1 60` if the last dimension is the batch size?
(BTW: this goes directly to Triton configuration, it is the information Triton needs in order to run the model itself)
DefiantHippopotamus88 you can create a custom endpoint and do that, but it will be running in the same instance, is this what you are after? Notice that Triton actually supports it already, you can check the pytorch example