I think you are correct, this is odd, let me check ...
You mean the entire organization already has Kubeflow, or that you want to better organize something? (If the latter, what are we organizing, pipelines?)
The confusion matrix shows up under debug samples, but the image is empty, is that correct?
I will TIAS (try it and see), but maybe it's worthwhile to also mention whether it has to be an absolute path or if a relative path is fine too!
Good point! (absolute, but you can use ~, and I "think" also $ENV)
So at the end of an experiment this results in an object saved under a given name, regardless of whether the name was dynamic or not?
Yes, at the end the name of the artifact is what it will be stored under (obviously if you reuse the name you basically overwrite the previous artifact)
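(For example, a minimal sketch; the project/task/artifact names are placeholders:)

from clearml import Task

task = Task.init(project_name="examples", task_name="artifact demo")

# "results" is the name the artifact is stored under
task.upload_artifact(name="results", artifact_object={"acc": 0.90})

# uploading again with the same name overwrites the previous artifact
task.upload_artifact(name="results", artifact_object={"acc": 0.95})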
This is the reason you are getting an error 🙂
Basically, the session asks the agent to set up a new SSH server with credentials on the remote machine. This is not an issue inside a container, as that is an isolated environment, but when running in venv mode the user running the agent is not root, hence it cannot spin up/configure an SSH server.
Make sense?
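(As a workaround sketch, assuming the standard clearml-session CLI flags, you can force container mode so root is available inside the container; the image and queue here are placeholders:)

clearml-session --queue default --docker nvidia/cuda:11.6.2-runtime-ubuntu20.04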
This part is odd:
SCRIPT PATH: tmp.7dSvBcyI7m
How did you end up with this random filename? How are you running this code?
GreasyPenguin14 let me check with the guys when the next version is due.
Are you using the self-hosted server or the community server?
You might also be able to find out exactly what needs to be pickled using the f_code of the function (but that's limited to the CPython implementation of Python).
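(For example, a minimal sketch of poking at the code object; functions expose it as __code__, frames expose the same thing as f_code:)

def make_fn(factor):
    offset = 1.0
    def fn(x):
        return factor * x + offset
    return fn

fn = make_fn(2.0)
code = fn.__code__          # the code object behind the function
print(code.co_freevars)     # names the function closes over: ('factor', 'offset')
print(code.co_names)        # global names referenced by the function
print(fn.__closure__)       # the cell objects that must themselves be picklable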
Nice!
However, once I extract the zips (or download the dataset through the Python API or CLI) not all the files are there.
And all the files are registered in the metadata? Could you add --verbose to the sync command to see what it is doing?
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure
This is also odd, it should Not flatten the folder structure. What is your OS / Python / clearml version?
Is this reproducible? If so, how...
Okay, so the way it works is that it moves all the logging to a background process. But if you have a lot of data, actually pushing the data between Python processes is not very efficient. This line basically tells it to just use a background thread (instead of a background process) for sending the data to the server.
The idea behind using a background process in the first place is to better support PyTorch workers that spin up a lot of subprocesses, and we do not want to add a thread per process and in...
(This code sample should work on your setup with your installed packages without a problem)
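(For reference, in the open-source SDK I believe the toggle meant here lives in clearml.conf; the exact key name is my assumption:)

sdk {
  development {
    # use a background thread instead of a background subprocess for reporting
    report_use_subprocess: false
  }
}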
Hi SkinnyPanda43
cannot schedule new futures after interpreter shutdown
This seems like a strange exception...
What's the setup here? Jupyter notebook? How is the interpreter shut down?
I think the non-master processes are trying to log something, but have no Logger instance because they have no Task instance.
Hmm, is your code calling Logger.current_logger() directly?
Do the logs in the master process include all the training history, or do I need to concatenate logs from different nodes somehow?
So the main problem is that you need to pass the TASK ID that the master node creates to the second node, so it can report to the same Task.
I know that the enterprise version of ClearML support...
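(For instance, a rough sketch; how RANK is set and how MASTER_TASK_ID reaches the workers depend on your launcher, both are assumptions here:)

import os
from clearml import Task

if os.environ.get("RANK", "0") == "0":
    # the master node creates the Task; its ID must be shipped to the workers
    task = Task.init(project_name="examples", task_name="multi-node training")
    print("pass this to the workers as MASTER_TASK_ID:", task.id)
else:
    # worker nodes attach to the master's Task and report into it
    task = Task.get_task(task_id=os.environ["MASTER_TASK_ID"])

task.get_logger().report_scalar(
    title="loss",
    series="rank_%s" % os.environ.get("RANK", "0"),
    value=0.5,
    iteration=1,
)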
SparklingHedgehong28 this is actually quite cool! Still not sure why not just use the built-in autoscaler https://github.com/allegroai/clearml/tree/master/examples/services/aws-autoscaler , but it is a really cool usage of ASG 🤩
I have to specify the full URI path?
No, it should be something like "s3://bucket"
The model files management is not fully managed like it is for the datasets?
They are 🙂
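(For example, a minimal sketch: pass output_uri at Task.init and framework checkpoints are uploaded and registered on the task automatically; the bucket name is a placeholder:)

from clearml import Task

# any model stored by the framework (e.g. torch.save) is uploaded to the
# bucket and registered on the task automatically
task = Task.init(
    project_name="examples",
    task_name="model storage demo",
    output_uri="s3://bucket",  # placeholder bucket
)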
Hi GrievingTurkey78
I think the main issue is the lack of support for jsonargparse, is that correct?
(vanilla PyTorch Lightning is using argparse, which seems to work out of the box)
but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
How so?
Thanks CleanPigeon16
Could you verify Task "d1d361d1059c4f0981200f59d7683773" exists (and is not archived)?
ShaggyHare67 in the HPO the learning rate should be (based on the above):
General/training_config/optimizer_params/Adam/learning_rate
Notice the "General" prefix (and notice it is case-sensitive)
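(A quick sketch of how that name would be used; the base task ID and the objective metric names are placeholders:)

from clearml.automation import HyperParameterOptimizer, UniformParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<your base task id>",  # placeholder
    hyper_parameters=[
        # the case-sensitive "General/" prefix is part of the parameter name
        UniformParameterRange(
            "General/training_config/optimizer_params/Adam/learning_rate",
            min_value=1e-5,
            max_value=1e-1,
        ),
    ],
    objective_metric_title="validation",  # placeholder metric
    objective_metric_series="loss",
    objective_metric_sign="min",
)
optimizer.start()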
Hi DeliciousBluewhale87
clearml-agent 0.17.2 was just released with the fix, let me know if it works
Yes, which looks like a lot, but you only need to do that once.
The auto-scaler would make (1) redundant (as it would spin the instance up/down based on the jobs in the queue)
WackyRabbit7 this is funny, it is not ClearML providing this offering,
some generic company grabbed the open-source and put it there, which they should not 🙂
Hi StickyWhale51
I think this issue is due to some internal race condition. Anyhow, I think we have an RC out solving it, can you try with:
pip install clearml==1.2.0rc2
Hurray conda.
Notice it does include cudatoolkit, but conda ignores it
cudatoolkit~=11.1.1
Can you test the same one, only search and replace ~= with ==?
These might be good features to include in ClearML:
Sure, we should probably add a section to the docs explaining how to do that
Another approach is creating my own API on top of the clearml-serving endpoints, where I control each tenant's authentication.
I have to admit that to me this is a much better solution (than my/bento integrated JWT option). Generally speaking I think this is the best approach, it separates the authentication layer from execution ...
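(To make the separation concrete, a hypothetical sketch of such a proxy; FastAPI, the "tenant" JWT claim, and the serving URL are all assumptions, not part of clearml-serving:)

import jwt  # PyJWT
import requests
from fastapi import FastAPI, Header, HTTPException, Request

SERVING_URL = "http://clearml-serving:8080"  # placeholder
SECRET = "replace-me"                        # placeholder signing key

app = FastAPI()

@app.post("/serve/{endpoint}")
async def proxy(endpoint: str, request: Request, authorization: str = Header(...)):
    # authentication happens here, completely outside the serving layer
    try:
        claims = jwt.decode(authorization.removeprefix("Bearer "), SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="invalid token")
    # route each tenant to its own model endpoint
    tenant = claims["tenant"]  # hypothetical claim
    resp = requests.post(f"{SERVING_URL}/serve/{tenant}/{endpoint}", json=await request.json())
    return resp.json()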