No worries! Just so I understand fully though: you were already using the patch successfully from my branch. Now that it has been merged into the transformers main branch, you installed it from there and that's when you started having issues with models not being saved? Then installing transformers 4.21.3 fixes it (which should have the old ClearML integration, even before the patch)?
Damn it, you're right 😅
# Allow ClearML access to the training args and allow it to override the arguments for remote execution
args_class = type(training_args)
# ClearML config dicts only accept string keys, so cast any non-string keys first
args, changed_keys = cast_keys_to_string(training_args.to_dict())
Task.current_task().connect(args)
# Cast the keys back to their original types before rebuilding the args object
training_args = args_class(**cast_keys_back(args, changed_keys)[0])
Thanks! I'm checking now, but it might take a little while (meeting in between)
Hi @<1523701949617147904:profile|PricklyRaven28> just letting you know I still have this on my TODO, I'll update you as soon as I have something!
An update: using your code (the snippet above) I was getting no scalars when simply installing the ultralytics and clearml packages using pip, because indeed tensorboard is not installed. When I do install tensorboard, metrics come in like normal, so I can't seem to reproduce the issue when tensorboard is correctly installed. That said, maybe we should look at not having this dependency 🤔
Would you mind posting a pip freeze of your environment that you're using to run yolo?
Just for reference, the main issue is that ClearML does not allow non-string types as dict keys for its configuration. Usually the labeling mapping does have ints as keys. Which is why we need to cast them to strings first, then pass them to ClearML then cast them back.
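For reference, here's a minimal sketch of what such casting helpers could look like (the real helpers in the transformers integration may differ; this simplified version assumes a flat dict whose non-string keys are ints):

# Simplified sketch -- the actual transformers integration may differ
def cast_keys_to_string(d):
    """Stringify non-string keys and remember which ones were changed."""
    new_d, changed_keys = {}, []
    for key, value in d.items():
        if not isinstance(key, str):
            changed_keys.append(str(key))
            key = str(key)
        new_d[key] = value
    return new_d, changed_keys

def cast_keys_back(d, changed_keys):
    """Cast the remembered keys back to int (assumes they were ints)."""
    new_d = {
        (int(key) if key in changed_keys else key): value
        for key, value in d.items()
    }
    return new_d, changed_keys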
It's been accepted in master, but was not released yet indeed!
As for the other issue, it seems like we won't be adding support for non-string dict keys anytime soon. I'm thinking of adding a specific example/tutorial on how to work with Huggingface + ClearML so people can do it themselves.
For now (using the patch) the only thing you need to be careful about is to not connect a dict or object with ints as keys. If you do need to (e.g. usually huggingface models need the id2label dict some...
One more thing: are you running the snippet inside a Jupyter notebook? (Wondering this because you have Jupyter in your environment.)
@<1558986839216361472:profile|FuzzyCentipede59> Would you mind sharing how you're running the training? i.e. a minimal code example so we can reproduce the issue?
Interesting! I'm glad to know it's working now, only I now really want to know what caused it 😄 Let me know if you ever do find out!
In order to prevent these kinds of collisions, it's always necessary to provide a parent dataset ID at creation time, so it's very clear which dataset an updated one is based on. If multiple updates happen at the same time, they won't know of each other and will both use the same dataset as the parent. This will lead to 2 new versions based on the same parent dataset, but not sharing data with each other. If that happens, you could create a 3rd dataset (potentially automatically) that can have bot...
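As a sketch (dataset names and IDs here are hypothetical), explicitly pinning the parent looks like this, and a third merging version can simply list both divergent children as parents:

from clearml import Dataset

# Explicitly pin the parent so the lineage is unambiguous
child = Dataset.create(
    dataset_name='my_dataset_v2',
    dataset_project='my_project',
    parent_datasets=['<parent_dataset_id>'],
)

# If two versions were created concurrently from the same parent,
# a third version can merge them by listing both as parents
merged = Dataset.create(
    dataset_name='my_dataset_v3',
    dataset_project='my_project',
    parent_datasets=['<child_a_id>', '<child_b_id>'],
)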
Hi ReassuredTiger98 !
I'm not sure the above will work. Maybe I can help in another way though: when you want to set agent.package_manager.system_site_packages = true
does that mean you have a docker container with some of the correct packages installed? In case you use a docker container, there is little to no real need to create a virtualenv anyway, and you might use the env var CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1
to just install all packages in the root environment.
Because ev...
I can see 2 kinds of errors: Error: Failed to initialize NVML
and Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
These 2 lines make me think something went wrong with the GPU itself. Chances are you won't be able to run nvidia-smi
this looks like a non-clearml issue 🙂 It might be that triton hogs the GPU memory if not properly closed down (double ctrl-c). It says the driver ver...
Hi NuttyCamel41 !
Your suspicion is correct, there should be no need to specify the config.pbtxt manually; normally this file is made automatically using the information you provide on the command line. It might be somehow silently failing to parse your CLI input to correctly build the config.pbtxt. One difference I see immediately is that you opted for the "[1, 64]" notation instead of the 1 64 notation from the example. Might be worth a try to change the input for...
I'm using image and machine image interchangeably here. It is quite weird that it is still giving the same error, the error clearly asked for "Required 'compute.images.useReadOnly' permission for 'projects/image-processing/global/images/image-for-clearml'"
🤔
Also, now I see your credentials even have the role of compute admin, which I would expect to be sufficient.
I see 2 ways forward:
- Try running the autoscaler with the default machine image and see if it launches correctly
- Dou...
Hmm, I can't really follow your explanation. The removed file SHOULD not exist right? 😅 And what do you mean exactly with the last sentence? An artifact is an output generated as part of a task. Can you show me what you mean with screenshots for example?
Hi Oriel!
If you want to only serve an if-else model, why do you want to use clearml-serving for that? What do you mean by "online featurer"?
Pipelines! 😄
ClearML allows you to create pipelines, with each step either created from code or from pre-existing tasks. Each task, by the way, can have a custom docker container assigned that it should run inside of, so it should fit nicely with your workflow! See the sketch after the links below.
Youtube videos:
https://www.youtube.com/watch?v=prZ_eiv_y3c
https://www.youtube.com/watch?v=UVBk337xzZo
Relevant Documentation:
https://clear.ml/docs/latest/docs/pipelines/
Custom docker container per task:
https://...
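To make that concrete, here's a minimal sketch of a pipeline built from pre-existing tasks (all names and the queue are hypothetical):

from clearml import PipelineController

pipe = PipelineController(name='my_pipeline', project='my_project', version='1.0.0')
pipe.add_step(
    name='preprocess',
    base_task_project='my_project',
    base_task_name='preprocess_task',
)
pipe.add_step(
    name='train',
    parents=['preprocess'],
    base_task_project='my_project',
    base_task_name='train_task',
    execution_queue='gpu_queue',  # each step can run in its own queue/container
)
pipe.start()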
Hi Adib!
I saw this question about the datastores before, and it was answered then with this: "Redis is used for caching so it's fairly 'lightly' used, you don't need many resources for it. Mongo is for artifacts, system info and some metadata. Elastic is for events and logs, this one might require more resources depending on your usage."
Hope it can already help a bit!
Hi William!
1. So if I understand correctly, you want to get an artifact from another task into your preprocessing. You can do this using the Task.get_task() call. So imagine your anomaly detection task is called anomaly_detection, it produces an artifact called my_anomaly_artifact, and it is located in the my_project project; then you can do:
` from clearml import Task
anomaly_task = Task.get_task(project_name='my_project', task_name='anomaly_detection')
threshold = anomaly_ta...
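The message is cut off there, but based on the standard artifacts API the continuation would presumably look something like:

# .get() downloads and deserializes the artifact into a Python object
threshold = anomaly_task.artifacts['my_anomaly_artifact'].get()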
Ok, I checked 3: the command clearml-serving --id <your_id> model add --engine triton --endpoint "test_model_keras" --preprocess "examples/keras/preprocess.py" --name "train keras model" --project "serving examples" --input-size 1 784 --input-name "dense_input" --input-type float32 --output-size -1 10 --output-name "activation_2" --output-type float32
should be
` clearml-serving --id <your_id> model add --engine triton --endpoint "test_model_keras" --preprocess "examples/keras/preprocess.py" ...
I tried answering them as well, let us know what you end up choosing, we're always looking to make clearml better for everyone!
As long as your clearml-agents have access to the redis instance it should work! Cool usecase though, interested to see how well it would work 🙂
I'm not quite sure what you mean here? From the docs it seems like you should be able to simply send an HTTP request to the localhost url to get the metrics. Is this not working for you? Otherwise, all the metrics end up in Prometheus, so you can also query that instead or use something like Grafana to visualize it
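For example, querying Prometheus directly could look something like this (host and port are assumptions; substitute your own deployment's address):

import requests

# Hypothetical address -- replace with your deployment's Prometheus endpoint
resp = requests.get(
    'http://localhost:9090/api/v1/query',
    params={'query': 'up'},  # any PromQL expression works here
)
print(resp.json())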
Hi @<1546303293918023680:profile|MiniatureRobin9> !
Would you mind sending me a screenshot of the model page (incl the model path) both for the task you trained locally as well as the one you trained on the agent?
With what error message did it fail? I would expect it to fail, because you finalized this version of your dataset by uploading it 🙂 You'll need a mutable copy of the dataset before you can remove files from it I think, or you could always remove the file on disk and create a new dataset with the uploaded one as a parent. In that way, clearml will keep track of what changed in between versions.
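A sketch of that second option (names and the file path are hypothetical):

from clearml import Dataset

# Create a mutable child version of the finalized dataset
ds = Dataset.create(
    dataset_name='my_dataset_fixed',
    dataset_project='my_project',
    parent_datasets=['<finalized_dataset_id>'],
)
ds.remove_files(dataset_path='file_to_remove.txt')
ds.upload()
ds.finalize()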
Hello!
What is the usecase here, why would you want to do that? If they're the same dataset, you don't really need lineage, no?
Ok, so I recreated your issue I think. Problem is, HPO was designed to handle more possible combinations of items than is reasonable to test. In this case though, there are only 11 possible parameter "combinations". But by default, ClearML sets the maximum number of jobs much higher than that (check advanced settings in the wizard).
It seems like HPO doesn't check for duplicate experiments though, so that means it will keep spawning experiments (even though it might have executed the exact s...
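For reference, capping the job count in code could look something like this (the task ID, parameter name, and metric are hypothetical):

from clearml.automation import (
    DiscreteParameterRange,
    GridSearch,
    HyperParameterOptimizer,
)

optimizer = HyperParameterOptimizer(
    base_task_id='<template_task_id>',
    hyper_parameters=[
        DiscreteParameterRange('General/my_param', values=list(range(11))),
    ],
    objective_metric_title='validation',
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    optimizer_class=GridSearch,
    max_number_of_concurrent_tasks=2,
    total_max_jobs=11,  # match the number of actual combinations
)
optimizer.start()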
Hi ExuberantParrot61 ! Can you try using a wildcard? E.g. ds.remove_files(dataset_path='folder_to_delete/*')
It's part of the design I think. It makes sense that if we want to keep track of changes, we always build on top of what we already have 🙂 I think of it like a commit: I'm adding files in a NEW commit, not in the old one.