Hi AdventurousRabbit79
Try:"extra_clearml_conf" : "aws { s3 {key: A, secret : B, region: C, }} ",Generally speaking no need for the quotes on the secret/key
You also need the commas to separate the keys.
You can test if it is working by adding the same string to your local clearml.conf and importing the clearml package.
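For example, something along these lines (an untested sketch; the bucket and file paths are placeholders):
```python
from clearml import StorageManager

# Assumes the same `aws { s3 { key / secret / region } }` section from above was
# added to your local clearml.conf; bucket and file paths below are placeholders
remote_url = StorageManager.upload_file(
    local_file="./some_local_file.txt",
    remote_url="s3://my-bucket/debug/some_local_file.txt",
)
print(remote_url)  # a failed upload here usually means the credentials are wrong
```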
So I assume, trains assumes I have nvidia-docker installed on the agent machine?
docker + nvidia-docker-runtime are assumed to be installed
The nvidia/cuda docker image is pulled when requested (like any other container image)
Moreover, since I'm going to use
Task.execute_remotely (and not through the UI), is there any way in code to specify the docker image to be used?
Sure, task.set_base_docker(docker_cmd='nvidia/cuda -v /mnt:/tmp')
Notice that you can not only pass the dock...
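Roughly like this (a minimal sketch; the project, task and queue names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote docker run")

# Request the container the agent should spin up for this Task (image + extra docker args)
task.set_base_docker(docker_cmd="nvidia/cuda -v /mnt:/tmp")

# Stop the local run here and enqueue the Task for a clearml-agent to execute
task.execute_remotely(queue_name="default", exit_process=True)
```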
Hi ConvolutedSealion94
Yes, this seems like the correct curl command.
How did you spin up the clearml-serving containers? Is it with the docker-compose or with the helm chart? (I remember there are some pitfalls with the helm chart, so I would actually start with the local docker-compose to debug it.)
Hi DilapidatedCow43
I'm assuming the returned object cannot be pickled (which is ClearML's way of serializing it)
You can upload it as a model with:
```python
uploaded_model_url = Task.current_task().update_output_model(model_path="/path/to/local/model")
...
return uploaded_model_url
```
wdyt?
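In a pipeline step it could look something like this (a sketch; the function name and local path are placeholders):
```python
from clearml import Task

def train_step(local_model_path="/path/to/local/model"):
    # Instead of returning the (unpicklable) model object, upload it as an
    # output model and return the resulting URL string
    uploaded_model_url = Task.current_task().update_output_model(
        model_path=local_model_path
    )
    return uploaded_model_url  # a plain string serializes fine between steps
```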
EmbarrassedSpider34
sync_folder and upload several times along the code and then
Do notice they overwrite one another...
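Assuming this is about clearml's Dataset (names and paths below are placeholders), the pattern would be something like:
```python
from clearml import Dataset

dataset = Dataset.create(dataset_name="my_dataset", dataset_project="examples")

dataset.sync_folder(local_path="./data")  # first sync
# ... the code changes ./data ...
dataset.sync_folder(local_path="./data")  # a later sync overwrites the earlier state
dataset.upload()
dataset.finalize()
```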
SourLion48 you mean the wraparound?
https://github.com/allegroai/clearml/blob/168074acd97589df58436a3ec122a95a077620c2/docs/clearml.conf#L33
BroadMole98 as one can expect, a long answer as well 🙂
I have a workflow with 19000 job nodes in it.
Wow, 19k job nodes? As in a single pipeline with 19k steps?
The main idea of the trains-agent is to allow multi-node workloads, to create pipelines on top of a scheduler without worrying about docker packaging (done automatically for you), and to have a proper scheduler with priorities (which is missing from k8s).
If the first step is just "logging" all the steps, you can easily add "Task...
Correct (with the port mapping service in it)
So what is the difference? Both running from the same machine?
Hi RoughTiger69
How about using the pipeline decorator as a way to run this logic?
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
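A trimmed-down sketch in the spirit of that example (untested; the pipeline/project names are placeholders):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["data"], cache=True)
def step_one():
    data = [1, 2, 3]
    return data

@PipelineDecorator.component(return_values=["total"])
def step_two(data):
    return sum(data)

@PipelineDecorator.pipeline(name="decorator pipeline", project="examples", version="0.0.1")
def run_pipeline():
    data = step_one()
    print("total:", step_two(data))

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug locally; drop this line to run the steps on agents
    run_pipeline()
```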
I think I'm missing the context of where the code is executed....
btw: you can now set the configuration_objects directly when calling add_step 🙂
https://clearml.slack.com/archives/CTK20V944/p1633355990256600?thread_ts=1633344527.224300&cid=CTK20V944
At the moment I'm querying by paging through the tasks as you recommended, and then filtering with standard python list-comprehension filters...Which is less than ideal.
At least let's do that better:
Use Task._query_tasks:
Task._query_tasks(order_by=['-started'], page_size=10, page=0, only_fields=['id', 'started'])
You will get "lighter" objects returned, then you can filter them with code (but the request will be a lot faster).
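For example (a minimal sketch; assumes the returned lightweight objects expose the requested fields as attributes):
```python
from clearml import Task

# Ask the server only for the 10 most recently started tasks, with just the
# fields we need, then do any remaining filtering on the light objects in python
tasks = Task._query_tasks(
    order_by=["-started"],
    page_size=10,
    page=0,
    only_fields=["id", "started"],
)
recent = [(t.id, t.started) for t in tasks]
print(recent)
```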
SuccessfulKoala55 any suggestions on improving that?
ShakyJellyfish91 can you check if version 1.0.6rc2 can find the changes?
Hi AntsySeagull45
Any chance the original code was running with python2?
Which version of trains-agent are you using?
LazyTurkey38 notice the assumption is that the docker entry-point ends with bash, and only then does the agent take charge. I'm assuming this is not the case, hence the agent spins up the docker, then the docker just ends. Could that be?
SourOx12
Hmmm. So if the last iteration was 75, the next iteration (after we continue) will be 150?
the hack doesn't work if conda is not installed
Of course conda needs to be installed; it is using a pre-existing conda env, no?! What am I missing?
Ideally it would just pull an experiment from a dedicated HPO queue and run it inplace
And the assumption is the code is also there?
Hmm good point, it should probably return the clearml python version. Is this what you mean?
Hi ReassuredTiger98
but I would rather just define a function that returns the task directly
🙂
Check it out:
https://github.com/allegroai/clearml/blob/36ee3d61209e413a917d8a718fb25f389143cfa1/clearml/automation/controller.py#L205
:param base_task_factory: Optional, instead of providing a pre-existing Task, provide a Callable function to create the Task (returns Task object)
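For example, something along these lines (a sketch assuming a recent clearml where PipelineController takes name/project/version; all names are placeholders):
```python
from clearml import Task
from clearml.automation.controller import PipelineController

def make_task(node):
    # Hypothetical factory: called by the controller for this node instead of
    # cloning a pre-existing base Task
    return Task.create(
        project_name="examples",
        task_name="factory_step_{}".format(node.name),
        script="step.py",
    )

pipe = PipelineController(name="factory pipeline", project="examples", version="1.0.0")
pipe.add_step(name="step_one", base_task_factory=make_task)
pipe.start_locally(run_pipeline_steps_locally=True)  # or pipe.start() to enqueue it
```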
SmallDeer34
I think this is somehow related to the JIT compiler torch is using.
My suspicion is that JIT cannot be initialized after something happened (like a subprocess, or a thread).
I think we managed to get around it with 1.0.3rc1.
Can you verify?
Hi StickyMonkey98
I'm (again) having trouble with the lack of documentation regarding Task.get_tasks(task_filter={STUFF}).
Yes we really have to add documentation there... Let me add that to the todo list
How do I filter tasks by time started? It seems there's a "started" property, and the web UI uses "started" as a keyword in the URL query, but task_filter results in an error when I try that... Is there some other filter keyword for filtering by start time?
last 10 started ...
Thanks ShakyJellyfish91 this really helps to narrow it down!
Let me see what I can find
UnevenDolphin73 following the discussion https://clearml.slack.com/archives/CTK20V944/p1643731949324449 , I suggest this change in the pseudo code
```python
# task code
task = Task.init(...)
if not task.running_locally() and task.is_main_task():
    # pre-init stage
    StorageManager.download_folder(...)  # Prepare local files for execution
else:
    StorageManager.upload_file(...)  # Repeated for many files needed
    task.execute_remotely(...)
```
Now when I look at it, it kind of makes sense to h...
BTW: StickyMonkey98 if you feel like writing a few examples I think it will be easy to push into the docs, so that at least we improve iteratively...
TightElk12 I think this message belongs to a diff thread ;)
Hi AstonishingWorm64
I think you are correct, there is no external interface to change the docker.
Could you open a GitHub issue so we do not forget to add an interface for that ?
As a temp hack, you can manually clone "triton serving engine" and edit the container image (under the Execution tab).
wdyt?
Hi ScaryLeopard77
You can probably do:
Task.init(..., continue_last_task='task_id_here')
This will continue a previously executed Task and log both steps in the same place.
Does that help?
BTW: you can also of course manually report to any Task as it is still running with:
aux_task = Task.get_task(task_id_here)
aux_task.get_logger().report_scalar(...)
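Putting both together (a sketch; 'task_id_here' and the names/values are placeholders):
```python
from clearml import Task

# Option 1: continue a previously executed Task from the new run
task = Task.init(
    project_name="examples",
    task_name="resumed run",
    continue_last_task="task_id_here",
)

# Option 2: manually report to another Task that is still running
aux_task = Task.get_task(task_id="task_id_here")
aux_task.get_logger().report_scalar(
    title="loss", series="val", value=0.123, iteration=42
)
```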
Actually this should be a flag
Hi UnsightlyLion90
from my understanding the agent does the job of SLURM,
That is kind of correct (they overlap in some ways 🙂)
Any guide of how to integrate both of them?
The easiest way is to just add the "Task.init()" call to your code, and use SLURM to schedule the job (see the sketch below). This will make sure all jobs are fully logged (this also includes automatically uploading the models, artifact support, etc.).
Full SLURM support (i.e. similar to the k8s glue support) is currently ou...
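A minimal sketch of adding Task.init() to a SLURM-scheduled script (project/task names and the output_uri are placeholders):
```python
# train.py -- scheduled by SLURM, e.g. `sbatch --wrap "python train.py"`
from clearml import Task

task = Task.init(
    project_name="slurm-jobs",
    task_name="training run",
    output_uri="s3://my-bucket/models",  # optional: auto-upload saved models
)

# ... the existing training code runs unchanged below, fully logged by ClearML ...
```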