Thank you so much!! 🤩
TrickyRaccoon92 actually Click is on the to do list as well ...
However, there is still a delay of approximately 2 minutes between the completion of setup,
Where is that delay in the log?
(btw: it seems your container is missing clearml-agent & git, installing those might add some time)
GrievingTurkey78 Actually it is in progress, see the GitHub issue for details:
https://github.com/allegroai/trains/issues/219
GrievingTurkey78 yes, you are correct on both.
Will the sweep functionality work?
Yes it should, that said, it will not use the trains-agent
so you are limited to the machine running the sweep.
If you want to do HPO on multi-node, check out this example 🙂
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
So if any step corresponding to 'inference_orchestrator_1' fails, then 'inference_orchestrator_2' keeps running.
GiganticTurtle0 I'm not sure it makes sense to halt the entire pipeline if one step fails.
That said, how about using the post_execution callback? There you can check if the step failed, and if it did, stop the entire pipeline (and any running steps). What do you think?
Here this new entry in the log is 2 min after env completed =>
1702378941039 box132 DEBUG 2023-12-12 11:02:16,112 - clearml.model - INFO - Selected model id: 9be79667ca644d7dbdf26732345f5415
This seems to be something in your code. Just add print("starting") in your entry python file, before any imports (because they might actually do something).
Because from the agent's perspective, after printing Starting Task Execution:
it literally calls the python script, nothing else...
Okay so my thinking is, on the pipelinecontroller / decorator we will have: abort_all_running_steps_on_failure=False
(if True, on step failing it will abort all running steps and leave)
Then per step / component decorator we will have: continue_pipeline_on_failure=False
(if True, on step failing, the rest of the pipeline dag will continue)
GiganticTurtle0 wdyt?
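To make the proposed semantics concrete, here is a toy, pure-Python simulation (not ClearML code). It runs steps one at a time, so "abort all running steps" is modelled as "launch nothing further", and it assumes the per-step flag wins when both flags apply. The function name, the status strings and that precedence rule are all assumptions for illustration:

```python
def run_pipeline(steps, parents,
                 abort_all_running_steps_on_failure=False,
                 continue_pipeline_on_failure=()):
    """Toy DAG runner.

    steps:   dict of step name -> callable (raise to fail), in topological order
    parents: dict of step name -> list of parent step names
    Returns a dict of step name -> 'completed' | 'failed' | 'skipped' | 'aborted'.
    """
    status = {}
    aborted = False
    for name, fn in steps.items():
        if aborted:
            status[name] = "aborted"
            continue
        # A step is blocked when a parent did not complete, unless that parent
        # failed and was marked continue_pipeline_on_failure.
        blocked = any(
            status[p] != "completed"
            and not (status[p] == "failed" and p in continue_pipeline_on_failure)
            for p in parents.get(name, [])
        )
        if blocked:
            status[name] = "skipped"
            continue
        try:
            fn()
            status[name] = "completed"
        except Exception:
            status[name] = "failed"
            # Assumption: the per-step flag takes precedence over the global one.
            if (abort_all_running_steps_on_failure
                    and name not in continue_pipeline_on_failure):
                aborted = True
    return status


def ok():
    return None


def fail():
    raise RuntimeError("step failed")


steps = {"orchestrator_1": fail, "orchestrator_2": ok, "report_1": ok}
parents = {"report_1": ["orchestrator_1"]}

default_run = run_pipeline(steps, parents)
abort_run = run_pipeline(steps, parents, abort_all_running_steps_on_failure=True)
continue_run = run_pipeline(steps, parents,
                            continue_pipeline_on_failure={"orchestrator_1"})
```

With both flags off, the independent branch keeps going and only the dependents of the failed step are skipped, which matches the behaviour described for 'inference_orchestrator_1' and 'inference_orchestrator_2'.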
This is odd because the screen grab points to CUDA 10.2 ...
that is because my own machine has 10.2 (not the docker, the machine the agent is on)
No that has nothing to do with it, the CUDA is inside the container. I'm referring to this image https://allegroai-trains.slack.com/archives/CTK20V944/p1593440299094400?thread_ts=1593437149.089400&cid=CTK20V944
Assuming this is the output from your code running inside the docker , it points to cuda version 10.2
Am I missing something ?
Hmmm could you attach the entire log?
Remove any info that you feel is too sensitive :)
Hi ClumsyElephant70
So do you need both requirements.txt combined ?
How will the agent be able to reproduce both repos on the remote machine ?
I see, by default it will look for requirements.txt in the root of the repo (the actual repo).
That said, in code you can specify the requirements.txt:
Task.force_requirements_env_freeze(requirements_file='repo/project-a/requirements.txt')
task = Task.init(...)
Notice, you need to call it prior to the Task.init call
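If the answer to the "both requirements.txt combined" question is yes, one option is to merge them into a single file first. This is only a sketch: the helper name is hypothetical, it drops comments and exact duplicate lines, and it does not try to resolve conflicting version pins.

```python
from pathlib import Path


def merge_requirements(paths, output):
    """Concatenate requirements files, skipping blanks, comments and exact duplicates."""
    seen = set()
    merged = []
    for path in paths:
        for raw in Path(path).read_text().splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            if line not in seen:
                seen.add(line)
                merged.append(line)
    out = Path(output)
    out.write_text("\n".join(merged) + "\n")
    return out
```

The merged file path could then be passed as requirements_file to Task.force_requirements_env_freeze, still before the Task.init call.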
Hi @<1547028074090991616:profile|ShaggySwan64>
I have to admit that personally I do not know pdm, could you share links, and help us understand what is the value over pip/poetry/conda ?
Task.current_task().connect(training_args, name='hugggingface args')
And you should be able to change them when launching remotely 😉
SmallDeer34 btw: "set_parameters_as_dict" will replace all the arguments (and is one way) ...
Hi Team, I'm currently trying to install ClearML-Server on a Powerpc server with RedHat7.
You are a brave man LividCrab90 !
Are there dockerfiles for the ClearML-Server stack somewhere ?
The main issue is replacing the DB containers, do you have elastic/mongo/redis for powerpc ?
Then in theory (since the backend is python based) you just need to find a base docker image to build it on.
do I need to create a brand new dataset with a new name that inherits from the original?
Yes, you just create a new version, specify the parent one, add changes and close it.
If you later need to, you can squash a version (same idea as git squash). Make sense ?
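Roughly, the "new version with a parent" flow looks like the sketch below. The function name and its arguments are placeholders, and clearml is imported inside the function so the snippet can be loaded without it installed:

```python
def create_child_version(parent_id, name, project, new_files_dir):
    """Create a new dataset version on top of parent_id, add files, and close it."""
    # Lazy import: requires the clearml package and a configured server to run.
    from clearml import Dataset

    child = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],  # inherit everything from the parent version
    )
    child.add_files(path=new_files_dir)  # only the changes get stored
    child.upload()
    child.finalize()  # close the version
    return child.id

# Example call (placeholder values):
# create_child_version("<parent dataset id>", "my_dataset", "my_project", "./new_data")
```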
although ideally i'd like to tell it exactly where to unzip it.
Ohh you can use .get_mutable_local_copy()
It will unzip it to a specific folder
Hmm, so currently you can provide help, so users know what they can choose from, but there is no way to limit it.
I know the Enterprise version has something similar that allows users to create a custom "application" from a Task, there you can define a drop-down and such, but that might be an overkill here, wdyt?
I'm not sure I'm the right person to answer that, but yes my understanding is that this is a Scale/Enterprise tier feature, at least for the time being.
UnevenDolphin73 go to the profile page, I think at the bottom right corner you should see it
(Also ctrl-F5 to reload the web application, if you upgraded the server 🙂 )
Looking at this example here, it looks like it only works with tasks:
Aha! Pipeline is a Task 🙂 (a specific type of Task, nonetheless a Task)
Just use the pipeline ID, and make sure you push it into the services queue, voila
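Since the pipeline is a Task, launching it by ID could look something like this. It is a sketch under assumptions: cloning before enqueueing is my choice here (so the original run stays untouched), the function name is made up, and "services" is the queue name mentioned above:

```python
def launch_pipeline(pipeline_id, queue_name="services"):
    """Clone a pipeline controller Task by ID and push the clone into a queue."""
    # Lazy import: requires the clearml package and a configured server to run.
    from clearml import Task

    cloned = Task.clone(source_task=pipeline_id)
    Task.enqueue(task=cloned, queue_name=queue_name)
    return cloned.id

# Example call (placeholder value):
# launch_pipeline("<pipeline task id>")
```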
Okay, I'm pretty sure there is a hack, let me see if there is something "nicer"
think perhaps it came across as way more passive aggressive than I was intending.
Dude, you are awesome for saying that! no worries 🙂 we try to assume people have the best intention at heart (the other option is quite depressing 😉 )
I've been working on an Azure load balancer example, ...
This sounds exciting, let me know if we can help in any way
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything.
Sure man 🙂 no rush, I appreciate the gesture regardless of the outcome
Many thanks!
In Azure VMSS, there is a method called "Custom Data", which is basically a way of passing things to be executed
I know that it is on the to do list to add "azure_autoscaler", which is basically a sibling to the aws_autoscaler.
With the same idea of the "custom data" as initial bash script:
You can check here:
https://github.com/allegroai/clearml/blob/4a2099b53c09d1feaf0e079092c9e075b43df7d2/clearml/automation/aws_auto_scaler.py#L54