Is this still an issue? (if you provide a queue name, the default tag is not used so no error should be printed)
Q. Would someone mind outlining what the steps are to configuring the default storage locations, such that any artefacts or data which are pushed to the server are stored by default on the Azure Blob Store?
Hi VivaciousPenguin66
See my reply here on configuring the default output uri on the agent: https://clearml.slack.com/archives/CTK20V944/p1621603564139700?thread_ts=1621600028.135500&cid=CTK20V944
Regarding permission setup:
You need to make sure you have the Azure blob credenti...
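For reference, the Azure section in clearml.conf usually looks roughly like this (all values are placeholders you replace with your own storage account details):
```
sdk {
    azure.storage {
        containers: [
            {
                account_name: "<storage account name>"
                account_key: "<storage account key>"
                container_name: "<container name>"
            }
        ]
    }
}
```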
I suppose the same would need to be done for any client PC running clearml, such that you are submitting dataset upload jobs?
Correct
That is, the dataset is perhaps local to my laptop, or on a development VM that is not in the clearml system, but from there I want to submit a copy of a dataset; then I would need to configure the storage section in the same way as well?
Correct
I assume the account name and key refer to the storage account credentials that you can get from Azure Storage Explorer?
correct
it fails because my_package cannot be installed using pip... so I have to manually edit the section and remove the "my_package"
MagnificentSeaurchin79 did you manually add both "." and my_package ?
If so, what was the reasoning to add my_package if pip cannot install it ?
What exactly do you get automatically on the "Installed Packages" (meaning the "my_package" line)?
Hi BoredGoat1
from this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring" it seems trains failed to load the nvidia .so library that does the GPU monitoring:
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"
Basically saying, if you can run nvidia-smi from inside the container, it should work.
Yes, that means the nvidia drivers are present (as you mentioned the GPU seems to be detected).
Could you check you have libnvidia-ml.so.1 inside the container ?
For example in /usr/lib/nvidia-XYZ/
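For example, something along these lines from inside the container should tell you quickly (exact paths differ between driver packages):
```
nvidia-smi
find / -name "libnvidia-ml.so.1" 2>/dev/null
```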
Hi PanickyMoth78
So the current implementation of the pipeline parallelization is exactly like python async function calls:
```
for dataset_conf in dataset_configs:
    dataset = make_dataset_component(dataset_conf)
    for training_conf in training_configs:
        model_path = train_image_classifier_component(training_conf)
        eval_result_path = eval_model_component(model_path)
```
Specifically here, since you are passing the output of one function to another, imagine what happens is a wait operation, hence it ...
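To illustrate the point, a rough sketch using the same component names (illustrative only): calls whose inputs do not depend on each other can potentially be dispatched in parallel, while a call that consumes a previous step's output implicitly waits for that step.
```
model_a = train_image_classifier_component(training_configs[0])
model_b = train_image_classifier_component(training_configs[1])  # independent of model_a, can run in parallel
eval_a = eval_model_component(model_a)  # consumes model_a, so it waits for that step to finish first
```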
WackyRabbit7 if this is a single script running without git repo, you will actually get the entire code in the uncommitted changes section.
Do you mean get the code from the git repo itself ?
Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the workers share a folder on the host machine?
It might be a syncing issue (not directly related to the trains-agent but to the fact that you have 4 processes trying to simultaneously access the same resource)
BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support
I see, something like:
```
from mystandalone import my_func_that_also_calls_task_init

def task_factory():
    task = Task.create(project_name="my_project", task_name="my_experiment",
                       script="main_script.py", add_task_init_call=False)
    return task
```
if the pipeline and the my_func_that_also_calls_task_init are in the same repo, this should actually work.
You can quickly test this pipeline with:
```
pipe = PipelineController()
pipe.add_step(preprocess, ...)
pipe.add_step(base_task_facto...
```
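Putting it together, a rough end-to-end sketch (step names and project names are just placeholders, and it assumes the factory function lives in the same repo as the pipeline script):
```
from clearml import Task
from clearml.automation.controller import PipelineController

def task_factory():
    # create the step Task without injecting an extra Task.init() call,
    # since my_func_that_also_calls_task_init already calls Task.init() itself
    return Task.create(
        project_name="my_project",
        task_name="my_experiment",
        script="main_script.py",
        add_task_init_call=False,
    )

pipe = PipelineController(name="test_pipeline", project="my_project", version="0.1")
pipe.add_step(name="preprocess", base_task_project="my_project", base_task_name="preprocess step")
pipe.add_step(name="train", parents=["preprocess"], base_task_factory=task_factory)
# run everything in the local process to quickly verify the wiring
pipe.start_locally(run_pipeline_steps_locally=True)
```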
CluelessFlamingo93 I would also fix the pip version requirements to:
```
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
```
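If it helps, in clearml.conf this setting usually sits under the agent's package manager section; a sketch:
```
agent {
    package_manager {
        # pin pip per python version so newer pip releases don't break venv resolution
        pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
    }
}
```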
ContemplativePuppy11
yes, nice move. my question was to make sure that the steps are not run in parallel because each one builds upon the previous one
if they are "calling" one another (or passing data) then the pipeline logic will deduce they cannot run in parallel; basically it is automatic
so my takeaway is that if the funcs are class methods the decorators won't break, right?
In theory, but the idea of the decorator is that it tracks the return value so it "knows" how t...
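A minimal sketch of that behavior with the function-decorator syntax (step and project names are made up): because step_two() consumes step_one()'s return value, the controller deduces the dependency and runs them sequentially, while unrelated steps could be scheduled in parallel.
```
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["data"])
def step_one():
    return list(range(10))

@PipelineDecorator.component(return_values=["total"])
def step_two(data):
    return sum(data)

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def my_pipeline():
    data = step_one()
    total = step_two(data)   # depends on step_one's return value, so it waits for it
    print(total)

if __name__ == "__main__":
    PipelineDecorator.run_locally()   # debug the whole pipeline in the local process
    my_pipeline()
```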
is there something else in the conf that i should change ?
I'm assuming the google credentials?
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/docs/clearml.conf#L113
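Roughly, the relevant block in clearml.conf (the linked section) looks like this, with placeholder values:
```
sdk {
    google.storage {
        # default project and credentials, used when no per-bucket configuration matches
        project: "<gcp project>"
        credentials_json: "/path/to/credentials.json"
    }
}
```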
it should all be logged in the end, as I understand
Hmm let me check the code for a minute
AstonishingRabbit13 so is it working now ?
EnviousPanda91 notice that when passing these arguments to clearml-agent you are actually passing default args, if you want an additional argument to Always be used, set the extra_docker_arguments
here:
https://github.com/allegroai/clearml-agent/blob/9eee213683252cd0bd19aae3f9b2c65939d75ac3/docs/clearml.conf#L170
One additional thing to notice: docker will not actually limit the "view" of the memory, it will just kill the container if you pass the memory limit; this is a limitation of the docker runtime
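A sketch of what that might look like in the agent's clearml.conf (the flag values are just examples):
```
agent {
    # these arguments are added to every docker run the agent launches
    extra_docker_arguments: ["--ipc=host", "--memory=8g"]
}
```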
Are you saying that in the UI you do not see "confusion matrix" at all, only on the GS bucket ?
The confusion matrix shows under debug sample, but the image is empty, is that correct?
execution_queue is not relevant anymore
Correct
total_max_jobs is determined by how many machines I launch the script on
Actually this is the number of concurrent subprocesses that are launched on Your machine. Notice that local execution means all experiments are launched on the machine that started the HPO process.
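For orientation, a rough sketch of where these knobs live when running the optimizer locally (all values are illustrative):
```
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<base experiment task id>",
    hyper_parameters=[
        UniformIntegerParameterRange("General/batch_size", min_value=16, max_value=128, step_size=16),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    max_number_of_concurrent_tasks=4,  # how many experiments run at the same time
    total_max_jobs=20,                 # cap on the total number of experiments
)
# local execution: every experiment is launched as a subprocess on this machine
optimizer.start_locally()
optimizer.wait()
optimizer.stop()
```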
Maybe to clarify, I was looking for something with the more classic Ask-and-Tell interface
so the way to connect "ask" in the model, is to just...
Thanks @<1523702652678967296:profile|DeliciousKoala34> I think I know what the issue is!
The container has 1.3.0a and you need 1.3.0, this is why it is re-downloading (I'll make sure the agent can sort it out, because this is Nvidia's version; in reality it should be a perfect match)
We're wondering how many on-premise machines we'd like to deprecate.
I think you can see that in the queues tab, no?
Hi AstonishingRabbit13
now I'm training yolov5 and I want to save all the info (model and metrics) with clearml to my bucket...
The easiest thing (assuming you are running YOLOv5 with python train.py) is to add the following env variable:
```
CLEARML_DEFAULT_OUTPUT_URI="..." python train.py
```
Notice that you need to pass your GS credentials here:
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/docs/clearml.conf#L113
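As a usage sketch (the bucket name is hypothetical, and it assumes the GS credentials from the linked config section are already set up):
```
CLEARML_DEFAULT_OUTPUT_URI="gs://my-bucket/clearml" python train.py
```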
Basically internally we use psutil to get those stats ...
https://github.com/giampaolo/psutil/issues/1011
See the psutil version that fixed that; what do you see in the Task's "installed packages"?
https://github.com/giampaolo/psutil/blob/master/HISTORY.rst#591
Is there any documentation on versioning for Datasets?
You mean how to select the version name ?
I think you are correct. Let me make sure we add that (docstring and documentation)
why are all defined components shown in the UI Results/Plots/PipelineDetails/ExecutionDetails section? Wouldn't it make more sense to show only the ones that are used in that pipeline?
They are listed there (because of the decorator, you basically "say" these are steps so they are listed), the actual resolving (i.e. which steps are actually being called) is done in "real-time"
Make sense ?
Hi FiercePenguin76
Is catboost actually using TB or is it just writing to .tfevent on its own ?