tried it and restarted the agent, but not working properly
What do you mean not working? can you provide logs ?
Another point I see is that in the workers & queues view the GPU usage is not being reported
It should be reported, if it is not, maybe you are running the trains-agent
in cpu mode ? (try adding --gpus)
FierceFly22 wow that is a cool hack! Trains will capture any torch.save , so I think the actual driver here is the 'model.summary' . You can also upload any artifact with task.upload_artifact('name', 'modelsummary.txt')
Touching a file will not trigger Trains, as it does not monitor the files themselves. Make sense?
BTW, how will you get the file when running with the agent? If you are using the connect_configuration it will be downloaded from the trains-server for you. Otherwise you can alw...
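A minimal sketch of the explicit artifact route mentioned above, assuming the summary is already available as a string. The helper name `save_and_upload_summary` and the artifact name are hypothetical; only the `task.upload_artifact(...)` call comes from the thread, and `task` is assumed to be the object returned by `Task.init()`.

```python
from pathlib import Path


def save_and_upload_summary(task, summary_text, path="modelsummary.txt"):
    """Write the summary to disk, then register the file as a Task artifact.

    Writing (or touching) the file alone is not enough, because Trains
    does not monitor the filesystem; the explicit upload call is what
    attaches the file to the Task.
    """
    out = Path(path)
    out.write_text(summary_text)
    # Hypothetical artifact name; pick whatever makes sense in your project.
    task.upload_artifact("modelsummary", artifact_object=str(out))
    return out
```

Typical usage would be `save_and_upload_summary(task, summary_string)` right after the model is built.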
Wait, it shows "hydra==2.5" not "hydra-core==x.y" ?
TrickySheep9
you are absolutely correct 🙂
PompousHawk82 unfortunately this is kind of binary, either you have full tracking of load/save operations or you do not.
This warning message will disappear in the next version as we will be able to log multiple models under the same Task :)
This is so odd,
could you add prints right after the Task.init?
Also could you verify it still gets stuck with the latest RC
clearml==1.16.3rc2
DAG which get scheduled at given interval and
Yes, that is exactly what will be part of the next iteration of the controller/service
an example achieving what i propose would be greatly helpful
Would this help?
from trains.automation import TrainsJob
job = TrainsJob(base_task_id='step1_task_id_here')
job.launch(queue_name='default')
job.wait()
job2 = TrainsJob(base_task_id='step2_task_id_here')
job2.launch(queue_name='default')
job2.wait()
Woot woot, great to hear 🎊
give me a minute to test
Did you experience any drop in performance using forkserver?
No, seems to be working properly for me.
If yes, did you test the variant suggested in the pytorch issue? If yes, did it solve the speed issue?
I haven't tested it, that said it seems like a generic optimization of the DataLoader
The cool thing about using the trains-agent is that you can change any experiment parameters and automate the process, so you get hyper-parameter optimization out of the box, and you can build complicated pipelines
https://github.com/allegroai/trains/tree/master/examples/optimization/hyper-parameter-optimization
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
I see TightElk12
You can always set the OS environment variables CLEARML_API_HOST, CLEARML_WEB_HOST and CLEARML_FILES_HOST with the correct configuration. Or you can simply set CLEARML_NO_DEFAULT_SERVER=1, which will prevent any usage of the default demo server. wdyt?
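A small sketch of both options done from Python rather than the shell. The variable names are the ones from the message above; the helper names and the example host URLs are hypothetical, and the assumption is that these calls run before the clearml SDK is first imported or used in the process.

```python
import os


def point_clearml_at_server(api_host, web_host, files_host):
    """Option 1: set the three server variables for the current process."""
    os.environ["CLEARML_API_HOST"] = api_host
    os.environ["CLEARML_WEB_HOST"] = web_host
    os.environ["CLEARML_FILES_HOST"] = files_host


def forbid_default_server():
    """Option 2: never fall back to the default demo server."""
    os.environ["CLEARML_NO_DEFAULT_SERVER"] = "1"


# Hypothetical self-hosted addresses; replace with your own server.
point_clearml_at_server(
    "http://clearml.example.com:8008",
    "http://clearml.example.com:8080",
    "http://clearml.example.com:8081",
)
forbid_default_server()
```

Exporting the same variables in the shell or container environment has the same effect and avoids any ordering concerns inside the script.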
Hi SubstantialElk6
32 CPU cores, 64GB ram
Should be plenty, this sounds like a network bottleneck issue, I can't imagine the server is actually CPU bound
Parent makes sense if you are changing the data of the parent version but some data is preserved. In that case the delta-based storage will only store the diff.
If everything is different, and you call sync
for example, then it will not reference any previous "snapshot", so there will be no redundancy in storage, but you still get a pointer to the "parent" version.
Make sense ?
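To make the delta idea concrete, here is a purely conceptual sketch using content hashes. It is not ClearML's storage code; the function names and the in-memory `{path: bytes}` representation are assumptions made for illustration only.

```python
import hashlib


def file_digests(files):
    """Map each relative path to a content hash. `files` is {path: bytes}."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def delta_to_store(parent_files, child_files):
    """Return the paths a child version has to store itself.

    Paths whose content is unchanged are left out, because the child can
    reference the parent's copy instead of storing them again.
    """
    parent = file_digests(parent_files)
    child = file_digests(child_files)
    return sorted(p for p, digest in child.items() if parent.get(p) != digest)


parent = {"a.csv": b"1,2,3", "b.csv": b"4,5,6"}
partly_changed = {"a.csv": b"1,2,3", "b.csv": b"4,5,7", "c.csv": b"new"}
all_changed = {"x.csv": b"other"}

print(delta_to_store(parent, partly_changed))  # ['b.csv', 'c.csv']
print(delta_to_store(parent, all_changed))     # ['x.csv']
```

In the second case nothing is shared with the parent, so everything is stored once and the parent link is only a pointer.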
(also could you make sure all posts regarding the same question are put in the thread of the first post to the channel?)
clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
If the user running this command can run "docker run", then you should be fine
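A quick way to check that precondition before starting the agent, as a sketch. It uses `docker info` as a lightweight stand-in for "can run docker run" (an assumption: it only proves the user can reach the Docker daemon), and `can_run_docker` is a hypothetical helper name.

```python
import shutil
import subprocess


def can_run_docker(timeout=20):
    """Return True if the current user can talk to the Docker daemon."""
    if shutil.which("docker") is None:
        return False  # the docker CLI is not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # A non-zero exit code usually means the daemon is unreachable
    # or the user lacks permission on the Docker socket.
    return result.returncode == 0


print("docker usable by this user:", can_run_docker())
```

If this prints False for the account that launches the agent, fix Docker access for that account before debugging the agent itself.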
GreasyPenguin66 Nice !!!
Very cool setup, and kudos on making it work with multiple users!
Quick question, shouldn't the JUPYTERHUB_API_TOKEN env variable be enough to gain access to the server? Why did you need to add it to the 'nbserver-x.json' as well?
neat! please update on your progress, maybe we should add an upgrade section once you have the details worked out
Hi @<1684010629741940736:profile|NonsensicalSparrow35>
But the provided command is missing the url target for the curl so it is not complete.
Not sure I followed. did you specify "NEW_ADDRESS" ?
or is it that in both cases the URL is localhost?
Here this new entry in the log is 2 min after env completed =>
1702378941039 box132 DEBUG 2023-12-12 11:02:16,112 - clearml.model - INFO - Selected model id: 9be79667ca644d7dbdf26732345f5415
This seems to be something in your code, just add print("starting") in your entry python file, Before any imports (because they might actually do something)
Because from the agent's perspective after printing Starting Task Execution:
it literally calls the python script, nothing else...
Hi @<1529633468214939648:profile|CostlyElephant1>
what seems to be the issue? I could not locate anything in the log
"Environment setup completed successfully
Starting Task Execution:"
Do you mean it takes a long time to setup the environment inside the container?
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL and CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL,
It seems to be working, as you can see no virtual environment is created, the only thing that is installed is the clearml-agent that i...
SmugOx94 could you please open a GitHub issue with this request, otherwise we might forget 🙂
We might also get some feedback from other users
Are you saying this component should pull a specific git repo?
PipelineDecorator.component( ..., )
seems like there is no reference to a specific repo (arguments repo and repo_branch etc. are missing), is that correct?
I appended python path with /code/app/flair in my base image and execute
the python path is changing since it installs a new venv into the system.
Let me check what's going on with the PYTHONPATH, because it definitely is changed when running the code (the code base root folder is added to it). Maybe we need to make sure that if you had PYTHONPATH pre-defined we restore it.
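Until that is sorted out, a possible workaround is to re-add the directory from inside the entry script. This is only a sketch: `restore_pythonpath` is a hypothetical helper, not something the agent provides, and `/code/app/flair` is the path from the message above.

```python
import sys


def restore_pythonpath(extra_paths=("/code/app/flair",)):
    """Re-add pre-defined directories to sys.path if they went missing.

    Returns the list of paths that actually had to be added.
    """
    added = []
    for path in extra_paths:
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added


# Call this at the top of the entry script, before importing from those folders.
print("re-added to sys.path:", restore_pythonpath())
```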
Hi CostlyElephant1
What do you mean by "delete raw data"? Data is always fetched to cached folders and clearml takes care of cache cleanup
That said, notice that with get mutable copy the target is one you specify, so in this case you should definitely delete it after usage. Wdyt?
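One way to make the "delete after usage" part hard to forget is a small context manager, sketched here. `temporary_mutable_copy` is a hypothetical helper; `fetch_copy` is assumed to be any callable that fills the target folder, for example a lambda wrapping the dataset's `get_mutable_local_copy(target_folder=...)`.

```python
import shutil
from contextlib import contextmanager


@contextmanager
def temporary_mutable_copy(fetch_copy, target_folder):
    """Fetch a mutable copy into `target_folder` and delete it on exit.

    The cached (read-only) copy is managed by clearml itself; only the
    user-specified target folder is removed here.
    """
    fetch_copy(target_folder)
    try:
        yield target_folder
    finally:
        shutil.rmtree(target_folder, ignore_errors=True)
```

Usage would look like `with temporary_mutable_copy(lambda f: dataset.get_mutable_local_copy(target_folder=f), "/tmp/my_copy") as folder: ...`, where the folder path is a placeholder.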
Ohh yes, if the execution script is not on git and git exists, it will not add it (it will add it if it is in a tracked file via the uncommitted changes section)
ZanyPig66 in order to expand the support to your case, can you explain exactly which files are on git and which are not?