The main reason we need the above-mentioned functionality is that there are some experiments that need to run for a long time. Let's say weeks.
Good point!
We need to temporarily pause (kill, or something else) a running HPO task and reassign the resources for other needs.
Oh I see now....
Later, when the more important experiments have been completed, we can continue the HPO task from the same state.
Quick question: when you say the HPO Task, you mean the HPO controller logic Task...
Hmm @<1523701279472226304:profile|SoreHorse95> this is a good point, I think you are correct we need to fix that,
- Could you open a GitHub issue so this is not forgotten ?
- As a workaround I would use clone=True, then after the call I would call task.close() on the original task, wdyt?
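As a rough sketch of that workaround (assuming execute_remotely is the call in question; queue and project names are just examples):
from clearml import Task

task = Task.init(project_name='examples', task_name='remote run')
# clone=True enqueues a clone instead of resetting the current task;
# exit_process=False keeps this process alive so we can close the original
task.execute_remotely(queue_name='default', clone=True, exit_process=False)
task.close()  # close the original task, as suggested above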
Hi JumpyDragonfly13 , just making sure, do you have an agent running on a remote machine ?
Can you have a direct TCP connection to the remote machine (the default port it will use is 10022)
more like testing especially before a pipeline
Hmm yes, that makes sense.
Any chance you can open a github issue on it?
Let me see if I understand, basically, do not limit the clone on execute_remotely, right ?
When did this PipelineDecorator come out? Looks interesting
A few days ago (I think)
It is very cool! Check out the full object proxy interaction on the actual pipeline logic, this might be better for your workflow: https://github.com/allegroai/clearml/blob/c85c05ef6aaca4e...
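For reference, a minimal sketch of the decorator-based pipeline and the proxy objects it returns (names and values here are just examples):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=['data'])
def load_data():
    return [1, 2, 3]

@PipelineDecorator.component(return_values=['total'])
def process(data):
    return sum(data)

@PipelineDecorator.pipeline(name='demo pipeline', project='examples', version='0.1')
def pipeline_logic():
    data = load_data()      # returns a lazy proxy, not the actual list
    total = process(data)   # the proxy is resolved when the value is needed
    print(total)

if __name__ == '__main__':
    PipelineDecorator.run_locally()  # optional: run all steps in the local process
    pipeline_logic()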
JitteryCoyote63, just making sure, does refresh fix the issue?
Programmatically, before importing the package, set os.environ['TRAINS_CONFIG_FILE']='~/my_new_trains.conf'
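A minimal sketch of that (the conf path and names are just examples):
import os
# point trains at a custom configuration file before the package is imported
os.environ['TRAINS_CONFIG_FILE'] = '~/my_new_trains.conf'

from trains import Task  # imported only after the env var is set
task = Task.init(project_name='examples', task_name='custom config')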
BTW: What's the use case for doing so?
thanks for helping again
My pleasure :)
Is there a way to use add_external_files with a very large number of urls that are not in the same S3 folder, without running into a usage limit due to the state.json file being updated a lot?
Hi ShortElephant92
what do you mean the state.json is updated a lot?
I think that every time you call add_external_files it is updated, but add_external_files can get a folder to scan, that would be more efficient. How are you using it ?
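Something like this, as a sketch (bucket/prefix and dataset names are just examples), so state.json is only touched once per call instead of once per url:
from clearml import Dataset

ds = Dataset.create(dataset_name='external_links', dataset_project='examples')

# instead of one add_external_files() call per url, point it at a prefix to scan
ds.add_external_files(source_url='s3://my-bucket/some-prefix/')

ds.upload()
ds.finalize()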
Try:
import os
from clearml import Dataset

...
dataset_path = Dataset.get(
    dataset_name=dataset_name,
    dataset_project=dataset_project,
    alias="0013_Dataset"
).get_local_copy()
dataset_path = os.path.join(dataset_path, "data.yaml")
...
(with matplotlib 3.2+ I get no warning, let me check with 3.1)
We created an account, setup our data pipeline, and now we can't get back in. Nothing is in the project. Can someone from support reach out to help?
Hi @<1545216077846286336:profile|DistraughtSquirrel81>
You mean in the SaaS? (app.clearml.ml) or is it a local installation?
If this is the SaaS, could it be the data is on a different workspace ? (you can switch workspace and refresh the page)
packages are updated, and I don't know which python version I get, + changing the python version of the OS is not really recommended
Wait I'm confused, this is inside a container, no?
and the python version running my code should not depend on the python version running the clearml-agent (especially for experiments running in containers)
Generally speaking you are correct, but some packages will not have the same version for all python versions
Specifically in this case I think...
YummyWhale40 no idea what the pytorch-lightning guys did there, let me check the actual code.
DepressedChimpanzee34
I might have an idea, based on the log you are getting LazyCompletionHelp instead of str.
Could it be you installed hydra bash completion ?
https://github.com/facebookresearch/hydra/blob/3f74e8fced2ae62f2098b701e7fdabc1eed3cbb6/hydra/_internal/utils.py#L483
Thanks JitteryCoyote63 let me double check if there is a reason for that (there might be one, not sure)
If I access the dataset on the same location directly it works fine:
wait, I'm confused, how is it the dataset is there? did it download the dataset?
are you saying this line for example will fail? (assuming you actually have a dataset by that name)
data_path = Dataset.get(dataset_name="002_Datenset_MASAM_for_fintuning", alias="002_Datenset_MASAM_for_fintuning").get_local_copy()
would I have to execute each task in the pipeline locally (but still connected to trains),
Somehow you have to have the pipeline step Task in the system, you can import it from code, or you can run it once, then the pipeline will clone it and reuse it. Am I missing something ?
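For example, something along these lines (a sketch; project and task names are placeholders): run the step once as a regular Task, then reference it from the pipeline so it gets cloned and enqueued:
from clearml import PipelineController

pipe = PipelineController(name='my pipeline', project='examples', version='0.1')
# the step Task must already exist in the system (imported from code or run once);
# the pipeline will clone it and enqueue the clone
pipe.add_step(
    name='preprocess',
    base_task_project='examples',
    base_task_name='preprocess step',
)
pipe.start()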
In Azure VMSS, there is a method called "Custom Data", which is basically a way of passing things to be executed
I know that it is on the to-do list to add an "azure_autoscaler", which is basically a sibling to the aws_autoscaler.
With the same idea of the "custom data" as initial bash script:
You can check here:
https://github.com/allegroai/clearml/blob/4a2099b53c09d1feaf0e079092c9e075b43df7d2/clearml/automation/aws_auto_scaler.py#L54
Those variables are not passed to the remote instance; they are used by the aws autoscaler to launch it, so there is no need to pass them.
I think the easiest is to add them to the "extra_vm_bash_script" as well
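Roughly like this, as a sketch (the variable names and values are placeholders), so the launched instance exports them before the agent starts:
# everything in this string runs on the launched instance before the agent starts
extra_vm_bash_script = '\n'.join([
    'export MY_CUSTOM_VAR="some-value"',
    'export ANOTHER_VAR="another-value"',
])

# then set this string as the extra_vm_bash_script entry mentioned above
# in the autoscaler configuration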
PompousParrot44
It should still create a new venv, but inherit the packages from the system-wide (or specific venv) installed packages. Meaning it will not reinstall packages you already installed, but it will give you the option of just replacing a specific package (or installing a new one) without reinstalling the entire venv
"sub nodes" inside pipeline, in my opinion, makes them much more useful, in sense that all the steps are visible.
Yeah I really like this idea... continuing this thread, would it also make sense to have a Task object per "sub-node" and run the sub-nodes as subprocess of the parent Node? I'm thinking this sounds like a combination of both local pipeline execution and remote pipeline execution.
wdyt?
This is sitting on top of the serving engine itself, acting as a control plane.
Integration with GKE is being worked on (basically KFServing as the serving engine)
WackyRabbit7 this section is what you need, uncomment it, and fill it in
https://github.com/allegroai/trains/blob/c9fac89bcd87550b7eb40e6be64bd19d4384b515/docs/trains.conf#L88
Yes this is Triton failing to load the actual model file
Hi DisgustedDove53
Now for the clearml-session tasks, a port-forward should be done each time if I need to access the Jupyter notebook UI for example.
So basically this is why the k8s glue has --ports-mode.
Essentially you set up a k8s service (handling the ingress TCP ports), then the template.yaml that is used by the k8s glue should specify said service. Then the clearml-session knows how to access the actual pod by the parameters the k8s glue sets on the Task.
Make sense ?
the latter is an ec2 instance
and the agent fails to install on the ec2 machine ?
ShinyWhale52 any time :)
Feel free to followup with more questions
Hi ColossalAnt7 , I think we run into it on a few dockers, I believe the bug was fixed in the latest trains-agent
RC. Could you verify please ?
It will not create another 100 tasks, they will all use the main Task. Think of it as they "inherit" it from the main process. If the main process never created a task (i.e. no call to Task.init) then they will create their own tasks (i.e. each one will create its own task and you will end up with 100 tasks)
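A small sketch of that behaviour (assuming the workers are forked subprocesses of the main process; names are examples):
from multiprocessing import Pool
from clearml import Task

def worker(i):
    # inside a subprocess the main process Task is inherited,
    # so no additional tasks are created
    task = Task.current_task()
    task.get_logger().report_scalar('workers', 'value', value=i, iteration=i)

if __name__ == '__main__':
    task = Task.init(project_name='examples', task_name='main process')  # the single main Task
    with Pool(4) as pool:
        pool.map(worker, range(100))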