I'm still struggling to reproduce the issue. Trying on my own PC locally as well as on Google Colab yields nothing.
The fact that you do get TensorBoard logs, but none of them are captured by ClearML, means there might be something wrong with our TensorBoard bindings, but it's hard to pinpoint exactly what if I can't get it to fail like yours 😅 Let me try to install exactly your environment using your packages above. Which Python version are you using?
Hi @<1523701949617147904:profile|PricklyRaven28> sorry that this is happening. I tried to run your minimal example, but get an IndexError: Invalid key: 5872 is out of bounds for size 0 error. That said, I get the same error without the code running in a pipeline. There seems to be no difference between simply running the code and the pipeline (for me). Do you have an updated example, maybe also including getting a local copy of an artifact, so I can check?
Damn it, you're right 😅
```python
from clearml import Task

# Allow ClearML access to the training args and allow it to override the arguments for remote execution
# (cast_keys_to_string / cast_keys_back are helper functions, sketched below)
args_class = type(training_args)
args, changed_keys = cast_keys_to_string(training_args.to_dict())
Task.current_task().connect(args)
training_args = args_class(**cast_keys_back(args, changed_keys)[0])
```
Just for reference, the main issue is that ClearML does not allow non-string types as dict keys for its configuration, while the label mapping usually does have ints as keys. That is why we need to cast them to strings first, pass them to ClearML, and then cast them back.
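For illustration, here's a minimal sketch of what cast_keys_to_string and cast_keys_back could look like; the actual implementations in the patch may differ:
```python
def cast_keys_to_string(d, changed_keys=None):
    # Recursively cast non-string dict keys to strings, remembering each key's original type
    if changed_keys is None:
        changed_keys = {}
    new_d = {}
    for key, value in d.items():
        if not isinstance(key, str):
            changed_keys[str(key)] = type(key)
            key = str(key)
        new_d[key] = cast_keys_to_string(value, changed_keys)[0] if isinstance(value, dict) else value
    return new_d, changed_keys


def cast_keys_back(d, changed_keys):
    # Restore the keys that were cast to strings back to their original types
    new_d = {}
    for key, value in d.items():
        if key in changed_keys:
            key = changed_keys[key](key)
        new_d[key] = cast_keys_back(value, changed_keys)[0] if isinstance(value, dict) else value
    return new_d, changed_keys
```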
No worries! And thanks for putting in the time.
effectively making us lose 24 hours of GPU compute
Oof, sorry about that, man 😞
As you can see in the issue comments, for now you can report it using iteration=0 and then add the value as a field in the experiment overview (as a leaderboard). This will give you a quick overview of your metrics per experiment in the main experiment list 🙂
I agree, I came across the same issue too. But your post helps make it clear, so hopefully it can be pushed! 🙂
The web UI won't calculate anything for you, so if you want to see accuracy, precision, etc., you'll have to calculate those in your code and then report them as scalars for the UI to know what they are.
As you said, sklearn models don't work with iterations, so when using the logger to report the scalar you can just set the iteration number to 0 (see the sketch below).
The downside of that is that your scalars will still be plotted, only as a single point in the graph instead of in a table. This is a known open issue f...
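For reference, a minimal sketch of what that could look like (the metric name and value are just illustrative):
```python
from clearml import Task

# Report a single "final" metric at iteration 0; it then shows up as a scalar
# and can be added as a column in the main experiment list.
task = Task.current_task()
task.get_logger().report_scalar(
    title="accuracy", series="validation", value=0.95, iteration=0
)
```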
Hi Fawad, maybe this can help you get started! These are C++ and Python examples of Triton inference. Be careful though: the pre- and postprocessing used is specific to the model (in this case YOLOv4) and you'll have to change it to your own model's needs.
Unfortunately, ClearML HPO does not "know" what is inside the task it is optimizing. It is like that by design, so that you can run HPO with no code changes inside the experiment. That said, this also limits us in not being able to "smartly" optimize.
However, is there a way you could use caching within your code itself, such as functools' LRU cache? This is built into Python and will cache function return values if the function is ever called again with the same input arguments (see the sketch below).
There also see...
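For illustration, a minimal functools example (the function itself is hypothetical):
```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_computation(x: int) -> int:
    # Only runs once per distinct argument; repeat calls hit the cache
    print(f"computing for {x}")
    return x * x

expensive_computation(4)  # computed, prints
expensive_computation(4)  # returned from cache, no print
```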
It's been accepted in master, but indeed it hasn't been released yet!
As for the other issue, it seems like we won't be adding support for non-string dict keys anytime soon. I'm thinking of adding a specific example/tutorial on how to work with Huggingface + ClearML so people can do it themselves.
For now (using the patch) the only thing you need to be careful about is to not connect a dict or object with ints as keys. If you do need to (e.g. usually huggingface models need the id2label dict some...
1 Can you give a little more explanation about your use case? It seems I don't fully understand yet. So you have multiple endpoints, but always the same preprocessing script to go with them? And you need to gather a different threshold for each of the models?
2 Not completely sure of this, but I think an AMD APU simply won't work. ClearML Serving is using Triton as the inference engine for GPU-based models, and that is written by NVIDIA, specifically for NVIDIA hardware. I don't think Triton will ...
Ok, I checked 3: the command
```
clearml-serving --id <your_id> model add --engine triton --endpoint "test_model_keras" --preprocess "examples/keras/preprocess.py" --name "train keras model" --project "serving examples" --input-size 1 784 --input-name "dense_input" --input-type float32 --output-size -1 10 --output-name "activation_2" --output-type float32
```
should be
```
clearml-serving --id <your_id> model add --engine triton --endpoint "test_model_keras" --preprocess "examples/keras/preprocess.py" ...
```
Hi William!
1 So if I understand correctly, you want to get an artifact from another task into your preprocessing. You can do this using the Task.get_task() call. So imagine your anomaly detection task is called anomaly_detection, produces an artifact called my_anomaly_artifact, and is located in the my_project project, then you can do:
```python
from clearml import Task

anomaly_task = Task.get_task(project_name='my_project', task_name='anomaly_detection')
threshold = anomaly_task.artifacts['my_anomaly_artifact'].get()
```
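If the artifact is a file and you want a path on disk rather than the deserialized object, there is also a local-copy getter:
```python
# Download the artifact and get a path to the local copy instead of the object
local_path = anomaly_task.artifacts['my_anomaly_artifact'].get_local_copy()
```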
For the record, this is a minimal reproducible example:
Local folder structure:
```
├── remove_folder
│   ├── batch_0
│   │   ├── file_0_0.txt
│   │   ├── file_0_1.txt
│   │   ├── file_0_2.txt
│   │   ├── file_0_3.txt
│   │   ├── file_0_4.txt
│   │   ├── file_0_5.txt
│   │   ├── file_0_6.txt
│   │   ├── file_0_7.txt
│   │   ├── file_0_8.txt
│   │   └── file_0_9.txt
│   └── batch_1
│       ├── file_1_0.txt
│       ├── file_1_1.txt
│       ├── file_1_2.txt
│       ├── file_1_3.txt
│       ├── fi...
```
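To recreate the same structure, a small helper along these lines could generate it (assuming batch_1 mirrors batch_0 with ten files):
```python
import os

# Recreate the folder structure above: remove_folder/batch_{0,1}/file_{b}_{i}.txt
for batch in range(2):
    batch_dir = os.path.join("remove_folder", f"batch_{batch}")
    os.makedirs(batch_dir, exist_ok=True)
    for i in range(10):
        with open(os.path.join(batch_dir, f"file_{batch}_{i}.txt"), "w") as f:
            f.write(f"file {batch}_{i}\n")
```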
Hey! So several things here:
1 As per the Plotly docs, you have to give which axis you want to format; in your example Plotly can't know. If you look closely at their example, you'll see it's a nested dict, with the key being 'xaxis'.
2 Your numpy data has x values of 1 2 3, but your extra layout has values 1 3 4, which don't match. Plotly took the first element of each subarray to be the x value.
If we fix these 2 things, your example becomes:
```python
task.logger.report_line_plot('this_is_the_title',
                             ...
```
Hi Fawad!
You should be able to get a local mutable copy using Dataset.get_mutable_local_copy and then create a new dataset. But personally I prefer this workflow:
```python
dataset = Dataset.get(
    dataset_project=CLEARML_PROJECT,
    dataset_name=CLEARML_DATASET_NAME,
    auto_create=True,
    writable_copy=True,
)
dataset.add_files(path=save_path, dataset_path=save_path)
dataset.finalize(auto_upload=True)
```
The writable_copy argument gets a dataset and creates a child of it (a new dataset with your ...
Cool! 😄 Yeah, that makes sense.
So (just brainstorming here) imagine you have your dataset with all samples inside. Every time N new samples arrive, they're just added to the larger dataset in an incremental way (with the 3 lines I sent earlier).
So imagine if we could query/filter that large dataset to only include a certain datetime range. That range filter is then stored as a hyperparameter too, so in that case, you could easily rerun the same training task multiple times, on differe...
That makes sense! Maybe something like dataset querying, as is used in the ClearML HyperDatasets, might be useful here? Basically you'd query your dataset to only include the samples you want and have the query itself be a hyperparameter in your experiment?
It's part of the design, I think. It makes sense that if we want to keep track of changes, we always build on top of what we already have 🙂 I think of it like a commit: I'm adding files in a NEW commit, not in the old one.
Wait is it possible to do what i'm doing but with just one big Dataset object or something?
Don't know if that's possible yet, but maybe something like the proposed querying could help here? A rough sketch of the idea is below.
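Purely as a sketch of that brainstorm (the parameter names and filtering logic are made up), the query could be connected to the task so each rerun can use a different range:
```python
from clearml import Task

# Connect the dataset filter as hyperparameters, so it is editable per run
task = Task.init(project_name='my_project', task_name='train_on_range')
query = {'date_from': '2023-01-01', 'date_to': '2023-02-01'}
task.connect(query, name='dataset_query')

# ... then use `query` to select which samples of the big dataset to train on ...
```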
No worries! Just so I understand fully though: you were already using the patch from my branch with success. Now that it has been merged into the transformers main branch, you installed it from there, and that's when you started having issues with models not being saved? Then installing transformers 4.21.3 fixes it (which should have the old ClearML integration, even from before the patch)?
Hi NuttyCamel41!
Your suspicion is correct: there should be no need to specify the config.pbtxt manually; normally this file is generated automatically using the information you provide on the command line. It might be somehow silently failing to parse your CLI input into a correct config.pbtxt. One difference I see immediately is that you opted for the "[1, 64]" notation instead of the 1 64 notation from the example. Might be worth a try to change the input for...
Yes, with docker auto-starting containers is def a thing 🙂 We set the containers to restart automatically (a reboot will do that too), so that when a container crashes it immediately restarts, let's say in a production environment.
So the best thing to do there is to use docker ps to get all running containers and then kill them using docker kill <container_id>. ChatGPT tells me this command should kill all currently running containers:
```
docker rm -f $(docker ps -aq)
```
And I...
Wow, awesome! Really nice find! Would you mind compiling your findings into a GitHub issue? Then we can help you search better :) This info is enough to get us going at least!
I'll update you once I have more!
What might also help is to look inside the Triton docker container while it's running. You can check the example; there should be a pbtxt file in there. Just to double-check that it is also in your own folder.
Doing this might actually help with the previous issue as well, because when there are multiple docker containers running they might interfere with each other 🙂