Hi @<1559711623147425792:profile|PlainPelican41> , you can re-run an existing pipeline using different parameters from the UI. Otherwise, you need to create new pipelines with new code 🙂
I guess that's a good point, but it's really only applicable if your training is CPU intensive. If your training is GPU intensive, most of the load goes to the GPU, so running on a VM (EC2 instances, for example) shouldn't make much of a difference - but this is worth testing.
I found this article talking about performance
https://blog.equinix.com/blog/2022/01/04/3-reasons-why-you-should-consider-running-containers-on-bare-metal/
But it doesn't really say what the difference in performance is...
@<1556812486840160256:profile|SuccessfulRaven86> , I think this is because you don't have the proper permissions 🙂
Hi @<1523701260895653888:profile|QuaintJellyfish58> , can you please provide a standalone snippet that reproduces this?
from src.net import Classifier
ModuleNotFoundError: No module named 'src'
Hi @<1632913959445073920:profile|IratePigeon23> , please look at the following thread - None
That's a nice example of using the API. Once you've handled the login issues, you can use the web UI as a reference for the API (open dev tools with F12 to see what the UI sends to the backend).
Let me know if this helps 🙂
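For example, here's a minimal sketch using the Python APIClient (assuming your credentials are already configured in clearml.conf; the filters are just placeholders mirroring what you'd see in dev tools):

from clearml.backend_api.session.client import APIClient

client = APIClient()
# Same kind of call the UI makes when listing experiments
tasks = client.tasks.get_all(status=["completed"], page=0, page_size=10)
for t in tasks:
    print(t.id, t.name, t.status)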
Hi AbruptWorm50 ,
After cloning the experiment you can actually edit the installed packages and specify which package version you want.
You can also do this via code using this method:
https://clear.ml/docs/latest/docs/references/sdk/task#taskadd_requirements
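For example, a quick sketch (the package name, version and project/task names are just placeholders):

from clearml import Task

# add_requirements() must be called before Task.init() for it to take effect
Task.add_requirements("tensorflow", "2.4.0")
task = Task.init(project_name="my_project", task_name="my_task")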
Hi @<1570583237065969664:profile|AdorableCrocodile14> , is it possible you have some models inside?
Pending means it is enqueued. Check which queue it belongs to by looking at the info tab after clicking on the task :)
With the autoscaler it's also easier to configure a large variety of different compute resources. Although if you're only interested in p4-equivalent instances and need them available quickly on demand, I can understand the issue.
Hi @<1632913939241111552:profile|HighRaccoon77> , the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine, but obviously that would be unpleasant when running locally without Task.execute_remotely()
- None
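Something along these lines might work (just a sketch; it assumes the agent runs with permission to shut the machine down, and the queue and project/task names are placeholders):

import subprocess
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
# Everything below this call runs on the agent machine, not locally
task.execute_remotely(queue_name="default")

# ... training code ...

# Only power off when we're actually on the remote machine
if not Task.running_locally():
    subprocess.call(["sudo", "shutdown", "-h", "now"])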
Are you specifically using Sagemaker? Do you have any api interface you could work with to manipulate shutdown of machines?
I guess you could probably introduce some code into the clearml agent as a configuration in clearml.conf
or even as a flag in the CLI that would send a shutdown command to the machine once the agent finishes running a job
Maybe even make a PR out of it if you want 🙂
How are you launching the agents?
BTW, considering the lower costs of EC2, you could always use longer timeout times for the autoscaler to ensure better availability of machines
Keeping machines up for a longer time at a fairly low cost (especially if you're using spot instances)
Any specific reason not to use the autoscaler? I would imagine it would be even more cost effective
And you use the agent to set up the environment for the experiment to run?
What version of clearml / clearml-agent are you using? Are you running in docker mode? Can you add your agent command here?
Can you compare the installed packages between the original experiment to the cloned one? Do you see anything special or different between the two?
VexedCat68 Hi 🙂
Please try with pip install clearml==1.1.4rc0
Hi @<1543766544847212544:profile|SorePelican79> , ClearML can certainly do that. For this you have the Datasets feature.
None
This will allow you to version and track your data super easily 🙂
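In code it looks roughly like this (a sketch; the dataset/project names and paths are placeholders):

from clearml import Dataset

# Create a new dataset version and upload the files
ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files(path="data/")
ds.upload()
ds.finalize()

# Later, fetch a local copy from anywhere
local_path = Dataset.get(dataset_name="my_dataset", dataset_project="my_project").get_local_copy()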
Why go into the environment variable and not just state it directly?
from clearml import Task

task = Task.init(
    project_name="my_project",
    task_name="my_task",
    output_uri="..."  # your storage URI goes here
)
Hi @<1523721697604145152:profile|YummyWhale40> , what if you specify the output_uri through the code in Task.init()?
Hi @<1688721797135994880:profile|ThoughtfulPeacock83> , can you add a standalone script that reproduces this?
That's a good question. If you're not running in docker mode, the agent machine that runs the experiment needs to have CUDA/cuDNN installed. If you're running in docker mode, you need to select a docker image that already has them installed 🙂
I think the serving engine ip depends on how you set it up
JitteryCoyote63 , heya, yes it is :)
You can save the entire folder as an artifact.
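For example, a rough sketch (names and paths are placeholders):

from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
# Passing a folder path packs the whole directory and uploads it as a single artifact
task.upload_artifact(name="my_folder", artifact_object="path/to/folder")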
UnevenDolphin73 , that's an interesting case. I'll see if I can reproduce it as well. Can you please clarify step 4 a bit? And on step 5, what is "holding" it from spinning down?
Hi @<1523701083040387072:profile|UnevenDolphin73> , looping in @<1523701435869433856:profile|SmugDolphin23> & @<1523701087100473344:profile|SuccessfulKoala55> for visibility 🙂