PompousParrot44 please try to reply on the thread, so we do not create a mess in the main channel 🙂
What's the "working directory" in the execution section? Do you have package "test" in the installed packages?
PompousParrot44
Check out the task.execute_remotely()
You can call it right after the task init, and it will enqueue your running Task, and leave the process (if you want).
https://github.com/allegroai/trains/blob/65a4aa7aa90fc867993cf0d5e36c214e6c044270/trains/task.py#L1437
Hi @<1600299043865497600:profile|MagnificentSeaurchin90>
Any chance you can provide more info on the error?
if I want to compare two experiments the scalar plots do not load ( loading forever ).
I'm assuming the issue is the Plots tab? or is it the Scalars? what do you have in the Plots? can you send an image of the single experiment ?
Hi CurvedHedgehog15
Yes you are correct, plots are displayed side-by-side in the ui. The reason is that since they are very generic, it is very challenging to actually be able to merge / overlay two arbitrary plots.
I can see two options
- To allow user to combine two plots in the ui (this way the responsibility is on the user to understand this is possible
- Maybe add programmatic interface to more easily access the raw data?
Wdyt?
Check the log to see exactly where it downloaded the torch from. Just making sure it used the right repository and did not default to the pip, where it might have gotten a CPU version...
This code will give you one graph titled "loss" with two series: (1) trains (2) loss
Lol yeah Hydra is great. Notice you still have the ability to override Hydra from the UI so you really have the best of the two worlds
Ad1. yes, think this is kind of bug. Using _task to get pipeline input values is a little bit ugly
Good point, let;s fix it 🙂
new pipeline is built from scratch (all steps etc), but by clicking "NEW RUN" in GUI it just reuse existing pipeline. Is it correct?
Oh I think I understand what happens, the way the pipeline logic is built, is that the "DAG" is created the first time the code runs, then when you re-run the pipeline step it serializes the DAG from the Task/backend.
Th...
making me realize that this may have been optional
I think it is optional, and this is why it was not entered in the first place.
Could you double check and just remove it from your manual pbtxt ?
I had again the same problem but within a remote pipeline setup.
Are you saying the ussue is not fixed? can you verify the pipeline & pipeline components are using the at least 1.104rc0 version?
Hi @<1658281099807166464:profile|SmallCamel52>
Lack of authentication in all versions of the fileserver component
Are you leaving the fileserver open to the world ?
Thanks @<1523703472304689152:profile|UpsetTurkey67>
I'm pretty sure it has!
Let me check how we can merge it into the cleamrl-agent, sounds good?
Hi HandsomeGiraffe70
First:# During pipeline initialisation pipeline_params is empty and we need to use default values. # When pipeline start the run, params are lunched again, and then pipeline_params can be used.
Hmm that should probably be fixed, maybe a function on the pipeline to deal with it ?
When I reduce tune_optime value to just 'recall'. Pipeline execution failed with msg:
ValueError: Node 'tune_et_for_Precision', base_task_id is empty
.
I would...
Woo, what a doozy.
yeah those "broken" pip versions are making our life hard ...
Yes actually that might be it. Here is how it works,
It launch a thread in the background to do all the analysis of the repository, extracting all the packages.
If the process ends (for any reason), it will give the background thread 10 seconds to finish and then it will give up. If the repository is big, the analysis can take longer, and it will quit
logger.report_scalar(title="loss", series="train", iteration=0, value=100)
logger.report_scalar(title="loss", series="test", iteration=0, value=200)
Just so I understand,
scheduler executes main every 60sec
main spins X sub-processes
Each subprocess needs to report scalars ?
the first runs perfectly fine,
Just making sure, running in an agent?
the second crashes
Running inside the same container as the first one ?
Yeah I think this kind of makes sense to me, any chance you can open a GH issue on this feature request?
However, once I extract the zips (or download the dataset through Python API or CLI) not all the files are there.
and all the files are registered in the metadata? coulf you add --verbose
to the sync command to see what it is doing
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure
This is also odd, it should Not flatten the folder structure. What is your OS / Python / clearml version?
Is this reproducible ? if so, how ...
Hi @<1523702932069945344:profile|CheerfulGorilla72>
This is a property on the Model object
model.published
Not sure why we do not have it here...
None
(I'll ask them to fix that)
@<1671689437261598720:profile|FranticWhale40> I might have found something, let me see if I can reproduce it
And you want all of them to log into the same experiment ? or do you want an experiment per 60sec (i.e. like the scheduler)
Hi @<1730396272990359552:profile|CluelessMouse37>
However, the caching doesn't seem to be working correctly. Despite not changing the configuration, the first step runs every time.
How are you creating the cached component?
is this a standalone script or a git repo link?
These parameters are dictionaries of specific configurations (dict of dict) that are the same but might not be taken into account properly by the caching mechanism.
hmm for the component to be cached (or reuse...
using the docker-compose file for the
clearml-serving
pipeline, do we also have to mount it somehow?
oh yes, you are correct the values are passed using environment variables (easier when using docker compose)
You can in addition add a mount from the host machine to a conf file,
volumes:
- ${PWD}/clearml.conf:/root/clearml.conf
wdyt?
It’s only on this specific local machine that we’re facing this truncated download.
Yes that what the log says, make sense
Seems like this still doesn’t solve the problem, how can we verify this setting has been applied correctly?
hmm exec into the container? what did you put in clearml.conf?
Okay we got to the bottom of this. This was actually because of the load balancer timeout settings we had, which was also 30 seconds and confusing us.
Nice!
btw:
in the clearml.conf we put this:
for future reference, you are missing the sdk section:
sdk.http.timeout: 300
.
notation works as well as {}
I can install clearml and clearml-agemt and run the worker inside a docker
oh I see, you should install it inside a docker, then mount the docker socket so it can spin sibling containers , ans lastly make sure the mounts are correct with this env variable:
None
Hi @<1747428509627715584:profile|CumbersomeDuck6>
but is it possible to use ClearML in Rust, without writing a wrapper.
With the RestAPI you can...
noticed the API doesnt cover dataset operations but the CLI can.
Yes the CLI will fetch/create datasets for you,
wdyt?