K8s can schedule pods with different priorities.
I'm not sure I agree here, could you refer me to the docs on this ability in k8s ?
So maybe "no real scheduling" means that ClearML does no scheduling once the pod has been applied to k8s.
That is correct
Will it be implemented in the future?
Yes, this is an enterprise feature. In the community version you can specify a --max-pods limit (which will cause it to never pull a job once it hits the max-pods limit)
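To picture what a max-pods limit does, here is a minimal sketch of the gating logic. It is not the k8s glue code; `should_pull_job` and its arguments are hypothetical names used only for illustration.

```python
def should_pull_job(running_pods: int, max_pods: int) -> bool:
    """Return True only while the number of running pods is below the limit."""
    return running_pods < max_pods


# With a limit of 2, further jobs stay in the queue until a pod finishes.
for running in range(4):
    print(running, should_pull_job(running, max_pods=2))
```

The point is that nothing is reordered or prioritised here: a job is either pulled or left in the queue.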
Hi RoundMosquito25
Hi, are there examples of testing in ClearML available somewhere? For example, unit tests that check whether parameters are passed correctly to new tasks, etc.?
What do you mean by "testing in ClearML" ?
For example unit tests that check if parameters are passed correctly
Passed where / how? Are we thinking agents here ?
With k8s glue going, want to finally look at clearml-session and how people are using it.
If used with k8s glue, you will have to run the glue with --ports-mode; then the clearml-session will know how to connect to the container itself, since at runtime the container will register the gateway + port for the clearml-session client to connect to
JitteryCoyote63 so now everything works as expected ?
VexedCat68 yes, you can also pass the parent folder and it will zip the entire folder, including all subfolders, into a single artifact
Thanks for answering, Yes, this is exactly what I wanted
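For anyone wondering what "zip the entire folder into a single artifact" amounts to, here is a rough standard-library sketch. It only illustrates the idea of packing a parent folder and its subfolders into one archive; it is not ClearML's implementation, and `zip_folder` is a hypothetical helper name.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder: str, archive_base: str) -> str:
    """Zip `folder` (including all subfolders) and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=folder)


# Small demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    parent = os.path.join(tmp, "parent")
    os.makedirs(os.path.join(parent, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(parent, relative), "w") as handle:
            handle.write("data")
    archive_path = zip_folder(parent, os.path.join(tmp, "artifact"))
    with zipfile.ZipFile(archive_path) as archive:
        demo_names = set(archive.namelist())
    print(sorted(demo_names))
```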
Hmm, should be possible. How slow is the update, that we want to save the time?
Your git execution needs this file, just like your machine does, to know where the server is and how to authenticate. You have to manually pass it to your git action.
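A small pre-flight check can make a missing configuration fail early in CI. The sketch below assumes the configuration is supplied either as a `clearml.conf` file in the home directory or through the `CLEARML_API_HOST`, `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` environment variables; `missing_clearml_settings` is a hypothetical helper, not part of ClearML.

```python
import os

REQUIRED_VARS = (
    "CLEARML_API_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_clearml_settings(env=None, conf_path="~/clearml.conf"):
    """Return the names of settings that are missing (empty list means OK)."""
    env = os.environ if env is None else env
    # Assumption: a config file at conf_path is enough on its own.
    if os.path.isfile(os.path.expanduser(conf_path)):
        return []
    return [name for name in REQUIRED_VARS if not env.get(name)]


print(missing_clearml_settings())
```

In a git action you would typically fill these from repository secrets and abort the job if the returned list is not empty.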
WackyRabbit7
Cool - so that means the fileserver which comes with the host will stay empty? Or is there anything else being stored there?
Debug Images and artifacts will be automatically stored to the file server.
If you want your models to be automagically uploaded, add the following:
task = Task.init('example', 'experiment', output_uri=' ')
(You can obviously point it to any other http/S3/GS/Azure storage)
I'm trying to queue a task in python but I'd like to reuse the prior task ID.
Is it your own Task, i.e. do you want to enqueue yourself? If this is the case, use task.execute_remotely, it will do just that.
If this is another Task and it is aborted, you can just enqueue it; by definition it will continue with the same Task ID.
try to break it into parts and understand what produces the error
for example:
increase(test12_model_custom:Glucose_bucket[1m])
increase(test12_model_custom:Glucose_sum[1m])
increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m])
and so on
DefeatedOstrich93 what do you mean by "I am wondering why do I need to create files before applying diff ?"
git diff will not list files unless they are added (until then they are marked as "untracked"); think temp files, logs, etc. Until you add a file to git, it will basically ignore that file. Make sense?
DeliciousBluewhale87 fyi, the new version of the pipeline (hopefully pushed towards the end of this week), will allow you to more easily write steps as functions (not only as Tasks, as the current implementation)
Also check the new Trigger and Scheduler both intended to trigger these pipelines:
https://github.com/allegroai/clearml/blob/fe3c481c37e70881c44d67c1cf9bbce00a84747e/clearml/automation/scheduler.py#L457
https://github.com/allegroai/clearml/blob/fe3c481c37e70881c44d67c1cf9bbce00a8...
not sure if this is considered a bug or not! but I'd happily make an issue on github if needed.
I think we should, at least for the sake of transparency and visibility
thanks again for all your help.
My pleasure
So, what I am referring to is the ability of a system to allow some rigor and robustness of tracking of experiments, and also enforcing some thoughts on how things might be deployed, early on in the development process, whilst not being overly prescriptive and cumbersome
I cannot agree more!!
VivaciousPenguin66 We are working on trying to better understand how to solve this very delicate act of balance and offer some sort of "JIRA" for ML.
If this is okay with you, once product pe...
UnevenDolphin73 it's looking for any of the files:
None
I think you are correct, and the first time you spin up the server it is not possible (I mean, you need it up to get the access/secret keys, and only then can you insert them into the helm values) ...
LOL
Make sure that when you train the model or create it manually you set the default "output_uri"
task = Task.init(..., output_uri=True)
or
task = Task.init(..., output_uri="s3://...")
GaudyPig83
I think there is some mismatch between the code creating the pipeline and the actual Task?! Could that somehow be the case? "relaunch_on_instance_failure" is a missing argument somehow
Can you try to launch the entire Pipeline with the latest RC?
pip3 install clearml==1.7.3rc0
I see now.
Let's assume you know which snapshot that was:
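` from clearml import Task

prev_task = Task.get_task(task_id='the_first_training_task_id')
# get the second from last checkpoint
checkpoint_url = prev_task.models['output'][-2].url
prev_scalars = prev_task.get_reported_scalars()
new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
# replay the previous scalars into the new task
for title, series_dict in prev_scalars.items():
    for series, points in series_dict.items():
        for x, y in zip(points['x'], points['y']):
            logger.report_scalar(title, series, value=y, iteration=int(x))
new_task.flush(wait_for_uploads=True)
new_task.set_initial_iteration(22000)
# start the training `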
I start the TaskScheduler, register a task, and stop the scheduler. How do I restart the TaskScheduler in a way that re-registers the tasks?
if it's aborted, just re-enqueue it?
(it serializes itself and stores its state on the Task object, so when re-launched it will deserialize from the last state)
I think you have it on the workers and queues page; when you click on the worker you get its details
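The "serialize on stop, deserialize on relaunch" idea can be sketched generically as below. This is a toy illustration under stated assumptions, not the TaskScheduler implementation: `TinyScheduler`, its fields and the task ID string are all hypothetical.

```python
import json


class TinyScheduler:
    """Toy scheduler that can save and restore its registered jobs."""

    def __init__(self):
        self.jobs = []  # each job is a plain dict so it is JSON-serialisable

    def add_job(self, name: str, task_id: str, minute: int) -> None:
        self.jobs.append({"name": name, "task_id": task_id, "minute": minute})

    def serialize(self) -> str:
        return json.dumps({"jobs": self.jobs})

    @classmethod
    def deserialize(cls, state: str) -> "TinyScheduler":
        scheduler = cls()
        scheduler.jobs = json.loads(state)["jobs"]
        return scheduler


original = TinyScheduler()
original.add_job("nightly", task_id="hypothetical_task_id", minute=30)
saved_state = original.serialize()  # stored somewhere durable
restored = TinyScheduler.deserialize(saved_state)  # after a restart
print(restored.jobs)
```

Because the registered jobs travel with the saved state, nothing has to be registered again after the restart.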
HandsomeCrow5 I see, my bad.
BTW: Did you see this one?
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
And the helper classes here: https://github.com/allegroai/trains/tree/master/trains/automation
HugeArcticwolf77 I think this issue was resolved with the latest version 1.8.0, can you try to rerun the entire pipeline with the latest version?
thought the agent created a new conda env and installed all packages
It does, but I was asking what is written on the original Task (the one created when you executed the code on your laptop, not when the agent was executing it). When the agent executes the Task, it writes back all the packages of the entire venv it created; when the Task is run manually, it lists only the packages you import directly (i.e. from package or import package; it actually analyses the code).
My point...
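To make "lists only the packages you import directly" concrete, here is a minimal sketch of finding direct imports by analysing source code with the standard `ast` module. It is not ClearML's analyser; `directly_imported_packages` is a hypothetical helper, and it ignores details such as mapping import names to pip package names.

```python
import ast


def directly_imported_packages(source: str) -> set:
    """Return the top-level package names imported by `source`."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import (from . import x), i.e. local code
            if node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return packages


example = "import numpy as np\nfrom sklearn.svm import SVC\nfrom . import utils\n"
print(sorted(directly_imported_packages(example)))
```

Anything a package pulls in indirectly never appears in such a list, which is why it is shorter than a dump of the whole venv.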
Hi DisturbedLizard6
the problem may be that get_local_model_file() is returning None
This tracks, it means that the model file cannot be downloaded for some reason,
when you click on the model here: None
what does it say under "MODEL URL:"?
.replace('file://', '', 1)
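As a small illustration of that snippet, the sketch below turns a `file://` URL into a local path and leaves other URLs alone. The helper name and the example URLs are made up; the `startswith` guard is an addition for safety, not part of the original one-liner.

```python
def local_path_from_url(url: str) -> str:
    """Strip a single leading 'file://' scheme; leave other URLs unchanged."""
    if url.startswith("file://"):
        return url.replace("file://", "", 1)
    return url


print(local_path_from_url("file:///tmp/models/model.pkl"))
print(local_path_from_url("s3://bucket/model.pkl"))
```

The count argument of 1 means only the first occurrence is removed, so a path that happens to contain "file://" later on stays intact.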
I have mounted my S3 bucket at /opt/clearml/data/fileserver/, but I can see that my data is not being stored in S3; it is being stored on EBS instead. How so?
I'm assuming the mount was not successful
What you should see is a link to the file server inside clearml, and the actual files in your S3 bucket
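One quick way to test the "mount was not successful" theory is to ask the OS whether the directory is actually a mount point. This is a minimal sketch using the standard library; `describe_mount` is a hypothetical helper, and whether a given S3 mounting tool shows up as a mount point is an assumption to verify on your host.

```python
import os


def describe_mount(path: str) -> str:
    """Say whether `path` is missing, a mount point, or just a directory."""
    if not os.path.isdir(path):
        return "missing"
    return "mount point" if os.path.ismount(path) else "plain directory"


# Path taken from the question above; run this on the server host.
print(describe_mount("/opt/clearml/data/fileserver/"))
```

If this reports a plain directory, writes are landing on the local disk rather than in the bucket.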