Maybe it's the Azure upload that has a weird size bug?!
I'm assuming you cannot directly access port 10022 (default ssh port on the remote machine) from your local machine, hence the connection issue. Could that be?
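A quick way to check, as a sketch (host names here are hypothetical):
```bash
# is the remote SSH port reachable from the local machine at all?
nc -zv remote-machine 10022

# if it is only reachable through a jump/bastion host, forward it locally first
ssh -L 10022:remote-machine:10022 user@jump-host
# then point the client at localhost:10022
```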
Hmm, two questions:
1. How come it did not detect the packages when you were running the original task manually?
2. Could it be the poetry manager option is not working correctly?! Can you verify the venv is created with all the packages? If so, can you post the full log?
I want to be able to compare scalars of more than 10 experiments, otherwise there is no strong need yet
Makes sense. In the next version (not the one that will be released next week, but the one after, with reports, shhh don't tell anyone 🙂 ), they tell me this is solved 🎊
Add '/', like you would with a file system:
Task.init(project_name='main_project/sub_project', task_name='test')
MysteriousBee56 that is very strange, but it definitely explains it. Kudos on debugging it!!!
tried it and restarted the agent, but not working properly
What do you mean by "not working"? Can you provide logs?
What do you have under the "installed packages" section? Also you can configure the agent to use poetry to restore the environment (instead of pip)
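For reference, a minimal sketch of the relevant clearml.conf section on the agent machine (assuming the default config layout):
```
agent {
    package_manager {
        # restore the environment with poetry instead of pip
        type: poetry,
    }
}
```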
Hi @<1578193384537853952:profile|MoodyOx45>
I have a task A that creates another task B via subprocess.
So the thing about the agent: when it runs the code, there is only one Task to rule them all. Basically, any fork/spawn of a subprocess will automatically be logged as part of the parent Task
I think that what you want is to build a pipeline from those Tasks? Or create a Task and enqueue it manually directly from Task A?
(btw: you can forcefully cause the subprocess to create its own Task b...
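For the "enqueue it manually directly from Task A" option, a minimal sketch (the template task id and queue name are placeholders):
```python
from clearml import Task

# running inside Task A
task_a = Task.init(project_name="examples", task_name="task A")

# clone a template Task as Task B and push it into an execution queue
task_b = Task.clone(source_task="<template_task_id>", name="task B")
Task.enqueue(task=task_b, queue_name="default")
```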
I mean , the python package, not the trains-server version
(Venv mode makes sense if running inside a container; if you need docker support you will need to mount the docker socket inside)
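Something along these lines, as a sketch (the image name and paths are illustrative):
```bash
# run the agent inside a container but give it access to the host docker daemon,
# so it can spin sibling containers for the tasks it pulls
docker run -v /var/run/docker.sock:/var/run/docker.sock \
    -v $HOME/clearml.conf:/root/clearml.conf \
    allegroai/clearml-agent
```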
What exactly is the error you're getting from clearml? And what do you have in the configuration file?
Hi JealousParrot68
I'll try to shed some light on these modules and use cases.
StorageManager is, generally speaking, a low-level access utility for http/object-storage/files. In most cases there is no need to use it directly if the objects are already stored/managed on clearml (for example artifacts/models/datasets). But it is quite handy to use with your own S3 buckets etc.
Artifacts: Passing an artifact between Tasks will usually be something like:
artifact_object = Task.get_task('task_id').artifa...
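The full pattern is roughly (a sketch, the names/ids are placeholders):
```python
from clearml import Task

# Task A: store an object as an artifact
task_a = Task.init(project_name="examples", task_name="producer")
task_a.upload_artifact(name="processed_data", artifact_object={"rows": 1000})

# Task B: fetch the artifact from Task A by its task id
artifact_object = Task.get_task(task_id="<task_a_id>").artifacts["processed_data"].get()
```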
JitteryCoyote63
I agree that its name is not search-engine friendly,
LOL 😄
It was an internal joke, the guys decided to call it "trains" because, you know, it trains...
It was unstoppable, we should probably do a line of merchandise with AI 🚆 😉
Anyhow, this one definitely backfired...
@<1533619716533260288:profile|SmallPigeon24>, a failed task should not actually be reused (i.e. cached). Are you saying a failed Task is being reused? Or are you saying that you want to "invalidate" the cache in the execution but still leave the Task as completed?
Here, this new entry in the log is 2 min after the env setup completed =>
1702378941039 box132 DEBUG 2023-12-12 11:02:16,112 - clearml.model - INFO - Selected model id: 9be79667ca644d7dbdf26732345f5415
This seems to be something in your code. Just add print("starting") in your entry python file, before any imports (because they might actually do something).
Because from the agent's perspective, after printing "Starting Task Execution:"
it literally calls the python script, nothing else...
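So as a sketch, the top of the entry file would look like this (the placeholder import is just illustrative):
```python
# the print executes before any import side effects, so if "starting" never shows up
# in the console log, the time is being spent before your code even starts running
print("starting")

import time  # noqa: E402  -- your real (potentially slow) imports go here
```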
Hi @<1524560082761682944:profile|MammothParrot39>
By default you have the last 100 iterations there (not sure why you are only seeing the last 3), but this is configurable.
So basically development on a "shared" GPU?
I'm glad it worked out, thanks SmallBluewhale13 🙂
Thanks FlutteringWorm14 , checking 🙂
This means that if something happens with the k8s node the pod runs on,
Actually, if the pod crashed (the pod, not the Task), k8s should re-spin it, no?
I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.
From the k8s perspective, if the task ended (failed/completed) it always returns with exit code 0, i.e. success, because the agent was able to spin the Task. We do not want Tasks with exceptions to litter the k8s with endless r...
Then try to add the missing apt packages
extra_docker_shell_script: ["apt-get install -y ???", ]
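For reference, this goes under the agent section of clearml.conf on the agent machine (a sketch, the package name is a placeholder):
```
agent {
    # shell commands appended to the docker startup script, before the task runs
    extra_docker_shell_script: ["apt-get update", "apt-get install -y <missing-package>"]
}
```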
SubstantialElk6 This seems to be the issue:
cp: failed to access '/root/default_clearml.conf': Permission denied
clearml_agent: ERROR: Could not find task id=024a421c0e174650a1c7ff64af756c26 (for host: )
Notice it seems it just cannot read the clearml.conf, wdyt?
I can see that the data is reloaded each time, even if the machine was not shut down in between.
You can verify by looking into the Task's log; it will contain all the docker arguments, and one of them should be the cache folder mount
BeefyCow3 On the plot itself, click on the JSON download button
I would ideally just want to have NVIDIA drivers and Docker on the on-prem nodes (along with the clearML agents). Would that allow me to get by with basic job scheduling/queues through clearML?
Yes this is fully supported and very easy to setup.
Regarding limiting users' usage: this is doable. I think the easiest solution, both for the users and for the management of the cluster, is introducing priority into the queues; basically any user can push a job into the low-priority queue, and only some users can push into high...
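As a sketch of how that looks on the agent side (queue names are illustrative), the agent pulls from the queues in the order they are listed:
```bash
# jobs waiting in "high_priority" are always picked before jobs waiting in "low_priority"
clearml-agent daemon --queue high_priority low_priority --docker
```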
But every agent is a different pod, so I do not know how to properly share the folder with images.
Can I conclude Kubernetes is running the agents?
What we would like ideally, is a system where development, training, and deployment are almost one and the same thing, to reduce the lead time from development code to production models.
This is very aligned with the goals of ClearML 🙂
I would like to understand more about what is currently missing in ClearML so we can better support this approach
my inexperience in using them a lot until recently. I can see how that is a better solution
I think I failed in explaining myself, I me...