Hi @<1523701601770934272:profile|GiganticMole91> , how is the task being stopped in your case? Is it aborted via the web UI or through some other method? Is the task running via the agent?
Non-AWS S3-like services (e.g. MinIO):
:port/bucket
Hi @<1792726992181792768:profile|CloudyWalrus66> , from a short read of the docs, it seems to be simply a way to spin up many machines with many different configurations in very few actions.
The autoscaler spins regular EC2 instances and spot instances up and down automatically based on predefined templates, which basically makes the fleet 'feature' redundant.
Or am I missing something?
Python 2 is no longer supported. I'd suggest finding an AMI that already has Python 3 built in (or installing it via the init script, though I wouldn't recommend that). Ideally the AMI should also be CUDA-enabled, so you can skip the extra installation needed to support CUDA images.
Hi @<1670964680270548992:profile|SuperiorOctopus47> , you can manually create experiments and log metrics into them via the REST API - None
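To give a rough idea of what logging a scalar over REST could look like, here is a small sketch. It only builds the request body and doesn't send anything. The event fields (`training_stats_scalar`, `metric`, `variant`, `iter`, `value`, `timestamp`) and the newline-delimited batch format are my assumptions about the events API, and `<task-id>` is a placeholder, so please check them against the API reference before relying on this.

```python
import json
import time


def scalar_event(task_id, metric, variant, iteration, value):
    """Build one scalar event dict (field names are assumed, see note above)."""
    return {
        "type": "training_stats_scalar",
        "task": task_id,
        "metric": metric,
        "variant": variant,
        "iter": iteration,
        "value": value,
        "timestamp": int(time.time() * 1000),  # epoch time in milliseconds
    }


def batch_body(events):
    """Serialize events as newline-delimited JSON, one event per line."""
    return "\n".join(json.dumps(event) for event in events)


# "<task-id>" stands in for the ID of the manually created experiment.
events = [scalar_event("<task-id>", "loss", "train", i, 1.0 / (i + 1)) for i in range(3)]
body = batch_body(events)
print(body)
```

Keeping the payload construction separate from the HTTP call makes it easy to inspect what would be sent before pointing it at a real server.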
You basically have some older runs on your tensorboard that you want to import to ClearML?
Hi FancyTurkey50 , how did you run the agent command?
I suggest watching the following videos to get a better understanding:
Agent - None
Autoscaler - None
Also please review agent docs - None
When a task is enqueued, when does the autoscaler kick in?
You're looking for the polling interval parameter as mentioned in the documentation - [None](https://clear.ml/docs/latest/docs/webapp/appl...
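To illustrate why that parameter matters, here is a toy model (not the autoscaler's actual code). It assumes the autoscaler checks the queue on a fixed schedule, so a task is only noticed at the next poll after it is enqueued. The function name and the 300 second interval are hypothetical.

```python
import math


def next_poll_time(enqueued_at, polling_interval_sec, first_poll_at=0.0):
    """Return the time of the first poll at or after `enqueued_at` (seconds)."""
    elapsed = max(0.0, enqueued_at - first_poll_at)
    polls_needed = math.ceil(elapsed / polling_interval_sec)
    return first_poll_at + polls_needed * polling_interval_sec


interval = 300  # hypothetical polling interval, in seconds
for enqueued_at in (0, 10, 299, 301):
    wait = next_poll_time(enqueued_at, interval) - enqueued_at
    print(f"enqueued at {enqueued_at}s -> noticed after {wait:.0f}s")
```

In this model the worst-case delay before the autoscaler reacts is just under one full polling interval, on top of whatever time the instance itself needs to boot.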
Try running the following script
from clearml import Task
import time
task = Task.init(output_uri="
")
print("start sleep")
time.sleep(20)
print("end sleep")
Please add the logs
Hi @<1535069219354316800:profile|PerplexedRaccoon19> can you please elaborate on the issue?
Also, can you provide the configuration of the autoscaler? You can export it through the web UI; just make sure to scrub any credentials first.
Hi @<1523701083040387072:profile|UnevenDolphin73> , not in the open source version
I'm not sure, will check 🙂
How did you add the parameters to the pipeline? Did you refer to this example?
None
Hi @<1736194540286513152:profile|DeliciousSeaturtle82> , basically all the data is stored in /opt/clearml/data
As long as you migrate that to the input of the k8s deployment, you should be good.
Hi @<1784754456546512896:profile|ConfusedSealion46> , in that case you can simply use add_external_files to register the files that are already in your storage. Or am I missing something?
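Something along these lines, as a minimal sketch. The helper `register_external_files`, the bucket URL, and the project/dataset names are made up for illustration. The helper takes the dataset object as an argument so it can be tried without a server, and `main()` shows how it might be wired to a real `Dataset` once clearml is installed and configured.

```python
def register_external_files(dataset, source_urls):
    """Add each already-uploaded location to the dataset as an external link."""
    for url in source_urls:
        # Registers links to the existing files rather than uploading them again.
        dataset.add_external_files(source_url=url)
    return dataset


def main():
    # Needs the clearml package and a configured server; not called automatically.
    from clearml import Dataset

    ds = Dataset.create(dataset_name="external-data", dataset_project="examples")
    register_external_files(ds, ["s3://my-bucket/path/to/data/"])  # placeholder URL
    # Upload and finalize as with any other dataset version.
    ds.upload()
    ds.finalize()
```

Call `main()` yourself once the placeholder URL and names are replaced with your own.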
You can pin specific package versions yourself in code:
https://clear.ml/docs/latest/docs/references/sdk/task#taskadd_requirements
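For example, a small sketch of how that could look. The helper `pin_requirements`, the package versions, and the project/task names are only illustrative. As far as I know, `Task.add_requirements` has to be called before `Task.init` for the pins to be picked up.

```python
def pin_requirements(task_cls, pins):
    """Call add_requirements for every package/version pair in `pins`."""
    for package_name, package_version in pins.items():
        task_cls.add_requirements(package_name, package_version)


def main():
    # Needs the clearml package and a configured server; not called automatically.
    from clearml import Task

    # Pin the versions first, then initialize the task.
    pin_requirements(Task, {"numpy": "1.26.4", "pandas": "2.2.2"})
    task = Task.init(project_name="examples", task_name="pinned requirements")
    return task
```

Passing the class in as `task_cls` is only there so the helper can be exercised without a server; in a real script you would likely call `Task.add_requirements(...)` directly.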
@<1719524641879363584:profile|ThankfulClams64> , are logs showing up without issue on the 'problematic' machine?
Hi @<1523701523954012160:profile|ShallowCormorant89> , I think you can simply spin down all the containers and copy everything in /opt/clearml/
Hi @<1571308003204796416:profile|HollowPeacock58> , do you have a standalone code snippet that reproduces this behavior?
Hi @<1836213542399774720:profile|ConvincingDragonfly85> , I believe you're looking for the alias parameter of Dataset.get() - None
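A quick sketch of what I mean. The helper `get_training_data` and the project, dataset, and alias names are hypothetical. My understanding is that when `alias` is set, the dataset ID is recorded on the task under that alias, but please double check the SDK reference for the exact behavior.

```python
def get_training_data(dataset_cls, project, name, alias):
    """Fetch a dataset under the given alias and return a local copy path."""
    dataset = dataset_cls.get(dataset_project=project, dataset_name=name, alias=alias)
    return dataset.get_local_copy()


def main():
    # Needs the clearml package and a configured server; not called automatically.
    from clearml import Dataset

    path = get_training_data(Dataset, "examples", "my-dataset", alias="training_data")
    print(path)
```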
I guess that's a good point, but it's only really applicable if your training is CPU intensive. If your training is GPU intensive, I guess most of the load goes to the GPU, so running on a VM (EC2 instances, for example) shouldn't make much of a difference, but this is worth testing.
I found this article talking about performance
https://blog.equinix.com/blog/2022/01/04/3-reasons-why-you-should-consider-running-containers-on-bare-metal/
But it doesn't really say what the difference in performance is...
Hi @<1571308003204796416:profile|HollowPeacock58> , do you have a self contained code snippet that reproduces this?
Hi @<1533619716533260288:profile|SmallPigeon24> , can you provide a snippet that reproduces this? Do you have some more information? What do you mean skip it?
You would also need to somehow edit the links that are connected to the task
ResponsiveHedgehong88 you can try mounting the /tmp/ folder inside the Docker container to a location on the host for later inspection, so the data isn't lost. This could give us a better idea of what's happening
Can you provide the full log?
Hi @<1813745484821434368:profile|SuccessfulPigeon84> , what do you see in the log?
Hi @<1691258549901987840:profile|PoisedDove36> , did you do all the DB migrations during the upgrade, or did you go straight to 1.5 from 1.0?