I see. TrickyFox41, try the following:
--args overrides="param=value"
Notice this will change the Args/overrides argument that will be parsed by Hydra to override its params.
Task.connect is "automagic", i.e. values go to the server when running in manual mode, and come from the server when running in agent mode;
set_parameter is one-way only and should be used to set an external Task's parameters.
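A minimal sketch of the difference (the project/task names and parameter values here are made up for illustration):
` from clearml import Task

task = Task.init(project_name="examples", task_name="connect-demo")

# connect() is bidirectional: running locally the dict values are logged
# to the server; running under an agent, values edited in the UI are
# written back into the dict before use
params = {"learning_rate": 0.001, "batch_size": 32}
task.connect(params)

# set_parameter() only pushes a value to the server (one-way)
task.set_parameter("General/epochs", 10) `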
Hi TrickySheep9
Long story short, clearml-session fully supports k8s (using k8s glue)
The --remote-gateway alongside ports mode will basically allow you to set up a k8s service so that every session registers with a specific port; k8s then does the ingress for you and routes the SSH connection to the pod itself, and everything else is tunneled over the original SSH connection.
Make sense ?
Hi HarebrainedOstrich43
I think I understand what's going on: in order for the pipeline logic to be "aware" of the pipeline component, it needs to be declared in the pipeline logic script file (or scope, if you will).
Try adding
from src.testagentcomponent import step_one
at the global scope of the pipeline script (not just inside the function), as in the sketch below.
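Something along these lines (a sketch only; the decorator arguments and pipeline setup are assumptions, src.testagentcomponent comes from your layout):
` # pipeline_logic.py
from clearml import PipelineDecorator

# module-level import, so the pipeline logic is "aware" of the component
from src.testagentcomponent import step_one

@PipelineDecorator.pipeline(name="demo-pipeline", project="examples", version="1.0")
def pipeline_logic():
    result = step_one()

if __name__ == "__main__":
    pipeline_logic() `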
I think they (DevOps) said something about next week, internal roll-out is this week (I think)
Makes sense, but this means that we are not able to tell clearml-agent where to save on a per-task basis?
The debug samples? or the artifacts/models?
Also, is it not possible to use multiple file servers? E.g. log tasks to different S3 buckets without changing clearml.conf
Yes, change the Task's output destination in the UI (or programmatically)
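For example, programmatically (a sketch; the bucket path is hypothetical):
` from clearml import Task

# per-task output destination: artifacts/models of this task go to S3
task = Task.init(
    project_name="examples",
    task_name="custom-output-destination",
    output_uri="s3://my-bucket/models",
) `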
This means that if something happens with the k8s node the pod runs on,
Actually if the pod crashed (the pod, not the Task) k8s should re-spin it, no?
I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.
From the k8s perspective, if the task ended (failed/completed) it always returns with exit code 0, i.e. success, because the agent was able to spin the Task. We do not want Tasks with exceptions to litter the k8s with endless r...
Well, from 2 to 30 sec is a factor of 15; I think this is a good start
DeliciousBluewhale87 you can try:
` import sqlite3
import pandas as pd

conn = sqlite3.connect('test_database')
sql_query = pd.read_sql_query('''
    SELECT *
    FROM products
''', conn)
sql_query.to_csv(...) `
There seems to be a problem with multiprocessing: Although I stopped the task,
You mean you "aborted the task" from the UI?
- There is a memory leak somewhere, please see the screenshot of datadog memory consumption
I'm assuming from the leftover processes ?
Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
Most likely yes, but I don't see how ClearML would have an impact here; I am more inclined to think it is a PyTorch DataLoader issue, although I don't see why.
These are most certainly DataLoader processes. But clearml-agent, when killing the process, should also kill all subprocesses, and it might be that something prevents it from killing the subprocesses ...
Is this easily reproducible ? Can you verify it is still the case with the latest RC of clearml-agent ?
SarcasticSparrow10 how do I reproduce it?
I tried launching from a sub process that is a daemon and it worked. Are you using ProcessPool ?
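For reference, this is roughly what I tried (a sketch; the project/task names are made up):
` import multiprocessing as mp

from clearml import Task

def run_training():
    # the Task is created inside a daemon subprocess
    task = Task.init(project_name="examples", task_name="daemon-subprocess-test")
    # ... training code ...
    task.close()

if __name__ == "__main__":
    p = mp.Process(target=run_training, daemon=True)
    p.start()
    p.join() `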
LOL love that approach.
Basically here is what I'm thinking,
` from clearml import Task, InputModel, OutputModel

task = Task.init(...)

# run this part once
if task.running_locally():
    my_auxiliary_stuff = OutputModel()
    my_auxiliary_stuff.system_tags = ["DATA"]
    my_auxiliary_stuff.update_weights_package(weights_path="/path/to/additional/files")
    input_my_auxiliary = InputModel(model_id=my_auxiliary_stuff.id)
    task.connect(input_my_auxiliary, "my_auxiliary")

task.execute_remotely()
my_a...
Did you experience any drop in performance using forkserver?
No, seems to be working properly for me.
If yes, did you test the variant suggested in the PyTorch issue? If yes, did it solve the speed issue?
I haven't tested it, that said it seems like a generic optimization of the DataLoader
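If you want to try it, this is roughly the change (a sketch; the toy dataset and worker count are placeholders):
` import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset just to make the example self-contained
my_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# use the 'forkserver' start method for DataLoader workers instead of the
# default 'fork', as suggested in the PyTorch issue
loader = DataLoader(
    my_dataset,
    batch_size=32,
    num_workers=4,
    multiprocessing_context="forkserver",
)

for x, y in loader:
    pass  # training step would go here `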
okay this points to an issue with the k8s glue, I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue ?
(It would be nice to have all the PyPI releases tagged in GitHub, btw)
I wanted to say "we listen" ... and point to the tag, but for some reason it was not pushed, LOL.
Have to get glue setup, which I couldn't understand fully, so that's a different topic
I suggest using the apply-template setup (basically you provide a Job/Service template, and it uses that to set up k8s jobs based on the Tasks coming in from the specific queue).
Sounds good to me. DepressedChimpanzee34 any chance you can add a github feature request, so we do not forget to add it?
Quick update: 1.0.2 will be ready in an hour, apologies
Correct the serving Task ID is the clearml serving session. It is the instance that holds all the information of this specific setup and models
ThickFox50 I also have to point out that there is a free hosted server here: https://app.community.clear.ml
Hi PanickyMoth78
Hmm yes, I think the StorageManager (i.e. the google storage python client) also needs a JSON file with the credentials.
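For reference, usage looks roughly like this once the credentials JSON is configured (a sketch; the bucket and path are hypothetical):
` from clearml import StorageManager

# downloads the remote object to the local cache and returns the local path;
# assumes google storage credentials (service-account JSON) are configured
local_copy = StorageManager.get_local_copy(
    remote_url="gs://my-bucket/path/to/file.bin"
)
print(local_copy) `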
Let me check something
No worries, just found it. Thanks!
I'll make sure to followup on the GitHub issue for better visibility π
Hi SkinnyPanda43
Every "commit" is a new version, so sync changes you need to either create a new version (with parent version as the previous one), and sync the local folder (or manually add/remove files).
If you do not need to actually store the "current" version, you can just reset the Task, and sync it again.
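A minimal sketch of the first route (the project/dataset names and parent ID are placeholders):
` from clearml import Dataset

# new version on top of the previous one
ds = Dataset.create(
    dataset_project="examples",
    dataset_name="my-dataset",
    parent_datasets=["<previous_version_id>"],
)
# sync a local folder against this version (adds/removes changed files)
ds.sync_folder(local_path="/path/to/local/folder")
ds.upload()
ds.finalize() `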
wdyt?
Hi TrickyFox41
Hey since Hydra does not work with
clearml-task
It should, shouldn't it? What does not work?
Hmm TrickyRaccoon92, take a look at the cleanup service; I think you can hack it so that instead of deleting the artifacts, it will archive them somewhere (you can also change the filter, maybe only perform on experiments with a specific user tag).
What do you think?
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py
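For instance, the filter/archive tweak could look something like this (a sketch; the tag name is hypothetical, and archiving here is done by adding the "archived" system tag):
` from clearml import Task

# pick up only finished experiments carrying a specific user tag
tasks = Task.get_tasks(
    task_filter={
        "status": ["completed", "failed"],
        "tags": ["cleanup-candidate"],
    }
)
for task in tasks:
    # archive instead of delete: add the "archived" system tag
    task.set_system_tags((task.get_system_tags() or []) + ["archived"]) `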