Hmm, let me check, there is a chance the level is dropped when manually reporting (it might be reserved for internal critical reports). Regardless, I can't see any reason we couldn't allow controlling it.
Let me check if we can reproduce it
WackyRabbit7 this is funny, it is not ClearML providing this offering
some generic company grabbed the open-source and put it there, which they should not have 🙂
Hmm can you run the agent in debug mode, and check the specific console log?
```
clearml-agent --debug daemon --foreground ...
```
Did you set `force_git_ssh_protocol: true`?
https://github.com/allegroai/clearml-agent/blob/249b51a31bee97d63f41c6d5542e657962008b68/docs/clearml.conf#L39
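i.e. in the agent's clearml.conf, roughly:
```
agent {
    # convert git http(s) links to ssh so the agent can use your ssh keys
    force_git_ssh_protocol: true
}
```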
Hmm so I guess the actual code adds it into the reporting itself ...
How about we call: `task.set_initial_iteration(0)`
I wonder if using our own containers, which should have most of the deps, will work better than a simpler container.
Why not, it's transparent, just run in --docker mode and provide a default docker image if the Task doesn't specify one.
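e.g. something like (the image name here is just an example):
```
clearml-agent daemon --queue default --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04 --foreground
```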
Hi @<1541954607595393024:profile|BattyCrocodile47>
Can you help me make the case for ClearML pipelines/tasks vs Metaflow?
Based on my understanding (rough ClearML-side sketch after the list):
- Metaflow cannot have custom containers per step (at least I could not find where to push them)
- DAG-only execution, i.e. you cannot have logic-driven flows
- cannot connect git repositories to different components in the pipeline
- Visualization of results / artifacts is rather limited
- Only Kubernetes is supported as underlying prov...
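To illustrate the first three points on the ClearML side, a minimal sketch using the pipeline decorators (the docker images, repo URL and branching logic below are just placeholders):
```
from clearml import PipelineDecorator

# each component can run in its own container and even come from its own git repo
@PipelineDecorator.component(docker="python:3.10", repo="https://github.com/org/preprocess.git")
def preprocess(n: int):
    return n * 2

@PipelineDecorator.component(docker="nvidia/cuda:11.8.0-runtime-ubuntu22.04")
def train(n: int):
    return n + 1

@PipelineDecorator.pipeline(name="demo pipeline", project="examples", version="1.0")
def run(n: int = 1):
    a = preprocess(n)
    # logic-driven flow, not a fixed DAG
    if a > 1:
        a = train(a)
    return a

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    run(3)
```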
TroubledHedgehog16 if you have a preinstalled conda env then why would you need to reinstall it from a yml file? Also if this is the default python env, clearml-agent will inherit from it and use it (no real overhead there)
Notice the reason for "inheriting system" python environments is so that the agent could cache the individual Task requirements, meaning next time it will not need to reinstall anything
wdyt?
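For reference, the relevant knobs in the agent's clearml.conf would be roughly (values here are just examples):
```
agent {
    package_manager {
        # let the Task venv inherit the preinstalled system / conda packages
        system_site_packages: true
    }
    venvs_cache {
        # cache resolved Task environments so nothing is reinstalled next time
        path: ~/.clearml/venvs-cache
        max_entries: 10
    }
}
```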
I'm trying to queue a task in python but I'd like to reuse the prior task ID.
is it your own Task? i.e. you enqueue yourself; if this is the case use task.execute_remotely
it will do just that.
If this is another Task, then if it is aborted you can just enqueue it; by definition it will continue with the same Task ID.
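i.e. something like (the queue name is a placeholder):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="my task")
# stops the local execution and enqueues this exact Task (same Task ID) for an agent
task.execute_remotely(queue_name="default", exit_process=True)
```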
If I point directly to the data.yaml the training starts without any problem
what do you mean? how do you know where the extracted file is?
basically:
data_path = Dataset.get(...).get_local_copy()
then you should be able to open your file with open(data_path + "/data.yaml", "rt")
does that work?
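i.e. roughly (dataset name / project are placeholders):
```
from clearml import Dataset

# fetch a local cached copy of the dataset, then read the yaml from inside it
data_path = Dataset.get(dataset_name="my_dataset", dataset_project="my_project").get_local_copy()
with open(data_path + "/data.yaml", "rt") as f:
    print(f.read())
```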
Thanks for the ping ConvolutedChicken69, I missed it 🙂
From what I see in the docs it's only for Jupyter / VS Code, I didn't see anything about PyCharm
PyCharm is basically SSH, which is supported 🙂
(Maybe we should mention it in the docs?)
Worker just installs by name from pip, and it installs a package that is not mine!
Oh dear ...
Did you configure additional pip repositories in the Agent's clearml.conf? https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L77 It might be that (1) is not enough, as pip will first try to search for the package in the pip repository, and only then in the private one. To avoid that, in your code you can point directly to an https link of your package Ta...
Does it work if I launch the clearml-agent in a docker and pip doesn't know the packages to install?
Not sure I follow... the "detect_with_pip_freeze" flag (when set) will tell clearml (at runtime) to create the "installed packages" directly from pip freeze (instead of analyzing the code)
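i.e. in the client's clearml.conf, something like:
```
sdk {
    development {
        # store the full `pip freeze` output as the Task "installed packages"
        detect_with_pip_freeze: true
    }
}
```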
MoodyCentipede68 seems you did not pass any configuration (os env or conf file), so it does not know how to find the server and authenticate. Make sense?
I want to call that dataset on my local PC without downloading
when you say "call" what do you mean? the dataset itself is a set of files compressed and stored in the clearml file server (or on your S3 bucket etc.)
Hi SmoothSheep78
Do you need to import the previous state of the trains-server, or are you starting from scratch?
Hi @<1555362936292118528:profile|AdventurousElephant3>
I think your issue is that Task supports two types of code,
- single script/jupyter notebook
- git repo + git diff
In your example (if I understand correctly) you have a notebook calling another notebook, which means the first notebook will be stored on the Task, but the second notebook (not being part of a repository) will not be stored on the task, and this is why when the agent is running the code it fails to find the second notebook....
Hmm reading this: None
How are you checking the health of the serving pod ?
Hi SmoggyGoat53
There is a storage limit on the file server (basically a 2GB per-file limit), this is the cause of the error.
You can upload the 10GB to any S3-like solution (or a shared folder). Just set the "output_uri" on the Task (either at Task.init or with Task.output_uri = "s3://bucket")
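For example (bucket name is a placeholder):
```
from clearml import Task

# send models / artifacts to S3 instead of the clearml file server
task = Task.init(project_name="examples", task_name="large artifacts",
                 output_uri="s3://my-bucket/clearml")
# or, after init:
task.output_uri = "s3://my-bucket/clearml"
```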
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments?
Yes, it is supported and should work.
If so, would it be started via python ... or via torchrun ... ?
Yes it should, hence the request for a code snippet to reproduce the issue you are experiencing.
What about remote runs, how will they support the parallel execution?
Supported. You should see in the "script entry" something like "-m torch.di...
CheerfulGorilla72
yes, IP-based access,
hmm so this is the main downside of using an IP-based server: the links (debug images, models, artifacts) store the full URL (e.g. http://IP:8081/... ). This means if you switch the IP they will no longer work. Any chance you can set the new server to use the old IP?
(the other option is somehow edit the DB with the links, I guess doable but quite risky)
Hi CharmingBeetle38
On the base task, do you see those arguments under the Configuration tab?
Also, if they are under the Args section, you should add the "Args/" prefix to the HP optimization (this is how you differentiate between the sections)
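i.e. roughly (the base task id and the parameter range are placeholders):
```
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",
    hyper_parameters=[
        # note the "Args/" section prefix
        UniformIntegerParameterRange("Args/batch_size", min_value=16, max_value=128, step_size=16),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    execution_queue="default",
)
optimizer.start()
```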
Hi ScantChimpanzee51
How are you launching the code ?
Basically the easiest way is to do so with the example you just mentioned,
Can this issue be reproduced ?
CurvedHedgehog15 there is no need for: `task.connect_configuration(configuration=normalize_and_flat_config(hparams), name="Hyperparameters")`
Hydra is automatically logged for you, no?!
CharmingBeetle38 try adding "General/" before the arguments. This means batch_size becomes General/batch_size. This is only because we are accessing the parameters externally; when the task is executed it is resolved automatically
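e.g. when changing the value from the outside (the task id is a placeholder):
```
from clearml import Task

cloned_task = Task.get_task(task_id="<cloned_task_id>")
# section prefix + parameter name when setting values externally
cloned_task.set_parameters({"General/batch_size": 64})
```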
BTW: you can quite easily add an option to set the offline folder, check here:
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/trains/config/init.py#L31
PRs are always appreciated :)