that's the entire repo link ? not something like https://github.com/ ... ?
Hi @<1523702969063706624:profile|PoisedShark13>
However, INSTALLED PACKAGES of my task is missing many of the installed packages (any idea why?)
It automatically detects the directly imported packages, by literally analyzing your code base and looking for imports.
The derivative packages (i.e. the ones that any of the "main" packages need) will be listed after the first time the agent installs everything.
If something specific is missing, you can manually add it with:
Task.add_requiremen...
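A minimal sketch, assuming this refers to `Task.add_requirements` (the package name/version and project/task names are placeholders; it has to be called before `Task.init`):
```python
from clearml import Task

# manually add a package that the import analysis missed
Task.add_requirements("tensorflow", "2.4.0")
task = Task.init(project_name="examples", task_name="manual requirements")
```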
Hi PanickyMoth78
```python
torch.save(net.state_dict(), PATH)  # auto-uploads to GCS

# get all the models from the Task
output_models = Task.current_task().models["output"]
# get the last one
last_model = output_models[-1]
# set meta-data
last_model.set_metadata(key="my key", value="my value", type="str")
```
HighOtter69
Could you test with the latest RC? I think this fixed it:
https://github.com/allegroai/clearml/issues/306
The problem is that I currently don't have a way to get them "from outside".
Maybe as a hack (until we add the model object)
```python
from clearml.binding.frameworks import WeightsFileHandler

class MyModelCB:
    current_args = dict()

    @classmethod
    def callback(cls, load_save, model_info):
        if load_save != "save":
            return model_info
        # make a name from the stored args
        model_info.name = "my new name " + str(cls.current_args)
        return model_info

WeightsFileHandler.add_pre_callback(MyModelCB.callback)
MyModelCB.current_args = {"args": "value"}
```
wdyt?
Hi RoughTiger69
but still get the semantics of knowing when an (external) file changed?
How would you know it changed?
This implies you have a way to verify the hash, which means you download the data, no?
That works AND the feature works!
YEY
Quick follow up question, is there any way to abort a pipeline and all of the tasks it ran?
Hmm yes, currently if you abort the pipeline it has no "time" to abort the running Tasks (the DAG itself will stop, because the pipeline controller was aborted, but the running Tasks will continue).
In order to have better support, we need to add a previously requested feature for an "abort" callback. This is actually not as straightforward as it sounds...
If i point directly to the data.yaml the training starts without any problem
what do you mean? how do you know where the extracted file is?
basically:
`data_path = Dataset.get(...).get_local_copy()`
then you should be able to open your file with `open(data_path + "/data.yaml", "rt")`
does that work?
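If it helps, here's the full pattern as a hedged sketch (the dataset project/name are placeholders):
```python
from clearml import Dataset

# fetch a local, read-only copy of the dataset and open the yaml inside it
data_path = Dataset.get(dataset_project="examples", dataset_name="my_yolo_dataset").get_local_copy()
with open(data_path + "/data.yaml", "rt") as f:
    print(f.read())
```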
LovelyHamster1 what do you mean by "assume the permissions of a specific IAM Role" ?
In order to spin an ec2 instance (aws autoscaler) you have to have correct credentials, to pass those credentials you must create a key/secret pair to pass to the autoscaler. There is no direct support for IAM Role. Make sense ?
Hmmm maybe
I thought that was expected behavior from poetry side actually
I think this is the expected behavior, hence bug?!
There is a git issue for selecting "pip freeze" / auto analyze, we could add "use requirements.txt"
wdyt?
Docker cmd is basically the docker image name, but you can add parameters as well.
For example "nvidia/cuda" or "nvidia/cuda -v /mnt/data:/mnt/data"
Hmm, maybe the original Task was executed with older versions? (before the section names were introduced)
Let's try: `DiscreteParameterRange('epochs', values=[30])` Does that give a warning ?
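For context, a hedged sketch of where that parameter definition would sit in an optimizer (the base task id, metric names and queue are placeholders):
```python
from clearml.automation import HyperParameterOptimizer, DiscreteParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",        # the template experiment to clone
    hyper_parameters=[
        DiscreteParameterRange('epochs', values=[30]),
    ],
    objective_metric_title="validation",  # placeholder metric title
    objective_metric_series="loss",       # placeholder metric series
    objective_metric_sign="min",
    execution_queue="default",
)
optimizer.start()
```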
Then, in the bash console, after some time, I see some messages being logged from clearml
JitteryCoyote63 Hmm that is strange, let me check something
when I run it on my laptop...
Then yes, you need to set the default_output_uri in your laptop's clearml.conf (just like you set it on the k8s glue)
Make sense ?
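For reference, a hedged clearml.conf fragment (the bucket path is a placeholder):
```
sdk {
    development {
        # artifacts / models from local runs will be uploaded here by default
        default_output_uri: "gs://my-bucket/clearml"
    }
}
```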
Hi SkinnyPanda43
Yes, I think you are right the documentation might be missing it. I'll make sure they know it 🙂
In the meantime: `task.update_output_model`
https://github.com/allegroai/clearml/blob/d3929033c016476c580557639ff44f900e65904a/clearml/backend_interface/task/task.py#L734
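And a hedged usage sketch (the model URI and name are placeholders; the exact argument names may differ between clearml versions, see the linked source):
```python
from clearml import Task

task = Task.current_task()
# register an already-uploaded weights file as the task's output model
task.update_output_model("gs://my-bucket/models/model.pt", name="my model")
```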
FYI matplotlib imshow will create a debug image, and on complex plots the plot might get converted to an image (but shown under the plots section). All in all, you might not be aware of it, but you are uploading images to your files server.
Hi SubstantialElk6
The ClearML session ended up tunneling into the physical machine that my agent is running on,
Yes, that is the correct behavior. Basically clearml-session uses the agent to "schedule" a machine, then spins up a container with JupyterLab/VSCode, and finally connects your CLI directly with that machine.
You can think of it as a way to solve the resource allocation problem.
Make sense ?
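For reference, a hedged sketch of launching a session (the queue name and docker image are placeholders):
```bash
clearml-session --queue default --docker nvidia/cuda:11.0-runtime
```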
what is the best approach to update the package if we have frequent update on this common code?
Since this package has an indirect effect on the model endpoint, I would package it with the preprocess code of the endpoint.
Each server is updating its own local copy, and it will make sure it can take it and deploy it hand over hand without breaking its ability to serve these endpoints.
The "wastefulness" of holding multiple copies is negligible compared to a situation where everyone ...
trains-agent doesn't run the clone, it is pip...
basically calling "pip install git+https://..."
Not sure you can pass extra arguments
Also, this is not a setup problem, otherwise it would have seen consistently failing ... this actually looks like a network issue.
The only thing I can think of is retrying the install if we get a network error (not sure what the exit code of pip is though, maybe 9?)
So you want these two on two different graphs ?
SuperiorDucks36 you mean to manually set an experiment (and the dummy Task is just a way to have an entry to configure), do I understand you correctly ?
Following on that, we are thinking of doing it all for you with a CLI , that will basically create a task from a code/repo you already have on your machine. What do you think?
You mean one machine with multiple clearml-agents ?
(worker is a unique ID of an agent, so you cannot have two agents with the exact same worker name)
Or do you mean two agents pulling from the same queue ? (that is supported)
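For example, two agents on the same machine, each bound to its own GPU, both pulling from one queue (the queue name is a placeholder):
```bash
clearml-agent daemon --queue my_queue --gpus 0 --detached
clearml-agent daemon --queue my_queue --gpus 1 --detached
```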
Is there a way to document these non-standard entry points
@<1541954607595393024:profile|BattyCrocodile47> you should see the "run" in the Args section under Configuration
in case of HF you should see "-m huggingface" and then the rest in the Args section
(if this does not work, then I assume this is a bug 🙂 )
The idea is of course that you can always enqueue and reproduce, so if that part is broken we should fix it 😊
ClumsyElephant70 yes there is 🙂
`clearml-agent build --id <task id> --target <folder>`
(I might have a typo there, but you can basically check the full help with `clearml-agent build --help`)
Ssh is used to access the actual container, all other communication is tunneled on top of it. What exactly is the reason to bind to 0.0.0.0 ? Maybe it could be a flag that you set, but I'm not sure what the scenario is and what we are solving, thoughts?
Let me know if I can be of help 🙂
Hi GiddyPeacock64
If you already have K8s setup, and are already using ClearML.
In your kubeflow Yaml:
`trains-agent execute --id <task_id> --full-monitoring`
This will install everything your Task needs inside the docker. Just make sure that you pass the env variables with the ClearML configuration, see here:
https://github.com/allegroai/clearml-server/blob/6434f1028e6e7fd2479b22fe553f7bca3f8a716f/docker/docker-compose.yml#L127
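A hedged sketch of what that container spec could look like (image, task id, server URL and credentials are placeholders; newer clearml-agent versions read the CLEARML_* variables, older trains-agent releases use the TRAINS_* prefix instead):
```yaml
containers:
  - name: clearml-task
    image: nvidia/cuda:11.0-runtime
    command: ["clearml-agent", "execute", "--id", "<task_id>", "--full-monitoring"]
    env:
      - name: CLEARML_API_HOST
        value: "<api_server_url>"
      - name: CLEARML_API_ACCESS_KEY
        value: "<access_key>"
      - name: CLEARML_API_SECRET_KEY
        value: "<secret_key>"
```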