Reputation
Badges 1
25 × Eureka!GiganticTurtle0 this one worked for me 🙂
` from clearml import Task
from clearml.automation.controller import PipelineDecorator
@PipelineDecorator.component(return_values=["msg"], execution_queue="myqueue1")
def step_1(msg: str):
msg += "\nI've survived step 1!"
return msg
@PipelineDecorator.component(return_values=["msg"], execution_queue="myqueue2")
def step_2(msg: str):
msg += "\nI've also survived step 2!"
return msg
@PipelineDecorator.component(return_values=["m...
If i point directly to the data.yaml the training starts without any problem
what do you mean? how do you know where the extracted file is?
basically:
data_path = Dataset.get(...).get_local_copy()
then you should be able to open your file with open(data_path + "/data.yaml", "rt")
doe that work?
I didn't realise that pickling is what triggers clearml to pick it up.
No, pickling is the only thing that will Not trigger clearml (it is just too generic to automagically log)
Hi UnsightlyLion90
from my understanding agent do the job of SLURM,
That is kind of correct (they overlap in some ways 🙂 )
Any guide of how to integrate both of them?
The easiest way is to just add the "Task.init()" call to your code, and use SLURM to schedule the job. this will make sure all jobs are fully logged (this can also includes automatically uploading the models, and artifact support etc)
Full SLURM support (i.e. similar to the k8s glue support), is currently ou...
well it should fail, but I think the error message should be fixed 🙂
maybe:ValueError: dataset 'tmp_datset' not found in project
lavi-testing' `wdyt?
Sure thing!
BTW: not sure if it helps but the SaaS version integrates with Genesis Cloud I know they provide cheap GPUs might be worth checking
So how can I temporarily fix it?
Try:task.output_uri = task.get_output_destination()
ChubbyLouse32 and this works when running python code and not when the agent is running ?
On the same machine ?
ClearML does not work easily with Google Drive.
Yes, google drive is not google storage (which ClearML supports 🙂 )
Seems like you solved it?
Hi PompousParrot44
Let's stick with a single question per thread, it will make my life a lot easier 🙂
What do you mean by "and not in the terminal directly when executed manually through script"?
trains-agent (usually) executed as a daemon pulling jobs and executing them.
The other options is to use it to manually execute a single task.
What am I missing?
Hi ShortElephant92
No, this is opt-in, so other then checking for updates once in a while, no traffic at all
It can also work by running on multiple known nodes.
Horovod sits on top of openmpi that needs ssh to open multiple nodes, I'm not sure how one would connect it without passing the SSH keys from one node to the other, and making sure they can directly communicate. (Not saying it is not possible, but just a few things to configure before it works, the enterprise edition remove the need for the direct SSH connection between the nodes)
How would i add a glue for multinode?
Basic...
CourageousLizard33 specifically section (4) is the issue (and it's related to any elastic docker, nothing specific to trains-server)echo "vm.max_map_count=262144" > /tmp/99-trains.conf sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf sudo sysctl -w vm.max_map_count=262144 sudo service docker restart
Did you try the above, and you are still getting the same error ?
the optimizer such that the study object of the optimizer keeps track of the results and the next sample will be aware of all previous studies
This is done from the optimizer side, by sampling the scalars reported by any experiment the optimizer created.
I am looking for a way to manually sample and report from and to the optimizer...
.. I can avoid running unnecessary common heavy setup, for a light weight experiment
Maybe it makes sense to inherit from the Optimizer and add ...
I have to commit the YAML with my AWS credentials to git.
CleanPigeon16 please do not 🙂
either put them on the Task itself, or as OS env on the machine/agent running the Task.
Regrading where it is stored (I think the default is DevOps project, need to look at the code)
FileNotFoundError: [Errno 2] No such file or directory
Could it be the file you are trying to run is not in the repository ?
Are you running inside a docker ?
Any chance you can send the full log ?
EFS get downloaded to the k8 pod local volume?
EFS is an Amazon service that mounts a persistent FS into ec2 instances, I believe they have support for k8s as a service as well, which would make it kind of like a PV only as a service.
Does that make sense ?
BoredHedgehog47 that actually depends on the container, are you running as root inside the container ?
if not I think the easiest hack is to always map /etc/hosts as a k8s secret file?
Hmm interesting...
of course you can do:dataset._task.connect(...)
But maybe it should be public?!
How are you using that (I mean in the context of a Dataset)?
You can make reports on experiments with interactive graphs
Yes, I can totally see how this is a selling point. The closest is the Project Overview (full markdown capabilities, with the ability to embed links to specific experiments). You can also add a "leader metric", so you can track the project performance/progress over time.
I have to admit that creating a better reporting tool is always pushed down in priority as I think this is a good selling point to management but the actual ...
but it still not is able to run any task after I abort and rerun another task
When you "run" a task you are pushing it to a queue, so how come a queue is empty? what happens after you push your newly cloned task to the queue ?
Hi BroadSeaturtle49torchvision!=0.13.0,>=0.8.1
is this what you have in the requirements ?
The clearml-agent is parsing the requested version and tries to match it to the version found/supported by the installed cuda
There is the possibility the combinarion wither does not exist or fore some reason the parsing (i.e. clearml-agent's parsing) fails
can you maybe provide the Task's full log?
Basically you should not use Task.create to log the current execution. It is used to create a Task externally and then enqueue it for remote execution. Make sense?
Are you saying you had that odd script entry-point created by calling Task.init? (To clarify this is the problem)
Btw after you clone the experiment you can always manually edit both entry point and working dir, which based on what you said should be "script.py" and "folder"
It runs directly but leads to the above error with clearml
Both manually (i.e. calling Task.init and running it without agent, and with agent ? same exact behavior ?
No worries, you should probably change it to pipe.start(queue= 'queue')
not start locally
s it working when you are calling it with start locally ?
Please attach the log 🙂
GrievingTurkey78 did you open the 8008 / 8080 / 8081 ports on your GCP instance (I have to admit I can't remember where exactly in the admin panel you do that, but I can assure you it is there :)