WackyRabbit7 my apologies for the lack of background in my answer π
Let me start from the top, one of the goal of the trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is proba...
Oh I see your point, that makes sense, it should check the state of the Task and force it to aborted so it can be renequed, the issue with reset it will clear the previous run execution, which I think we do not want, Wdyt?
Hi @<1523708920831414272:profile|SuperficialDolphin93>
The error seems like nvml fails to initialize inside the container, you can test it with nvidia-smi and check if that wirks
Regrading Cuda version the ClearML serving inherits from the Triton container, could you try to build a new one with the latest Triton container (I think 25). The docker compose is in the cleaml serving git repo. wdyt?
Can you reproduce this behavior outside of lightning? or in a toy example (because I could not)
I might have found it, tqdm is sending{ 1b 5b 41 } unicode arrow up?https://github.com/horovod/horovod/issues/2367
Hi @<1539055479878062080:profile|FranticLobster21>
hey, how do I use local files as dependencies?
You mean like a repository ?
Can I specify in task what local files do I use that should be packaged?
In a git repo?
Basically the agent can do two things, either replicate a single script or clone a git repo + uncommitted changes
Nice!!!
Are you aware of a limitation of "/events.get_task_events" preventing from fetching some of the images stored on the server
Are you saying you see them in the UI, but cannot access them via the API ?
(this would be strange as the UI is firing the same API requests to the back end)
Yep, found it, the --name is marked as required and the argparser throws an error ...
I'll make sure this is fixed as well π
so other process can use it
This is why there is a model repository, so you can query the last model created, or by name or tag or query the Task that created it and then via the Task the model and it's location.
This is a stable way to make sure your application code (the one using the model) will get to use stable models regardless of the training processes.
I would add a Tag to the model and then search based on the project and the tag, wdyt?
The additional edges in the graph suggest that these steps somehow contain dependencies that I do not wish them to have.
PanickyMoth78 I think I understand what you are saying, but it is hard to see if there is a "bug" here or a feature...
Can you post the full code of the pipline?
NastySeahorse61 I would try to open in incognito mode (i.e. no cookies etc.), did you also change the address of the server?
Hi MotionlessSeagull22
Hmm I'm not this is possible in the UI.
You can compare multiple experiments and view the images in form of thumbnails one next to the other, But full view will be a single image...
You can however right click on the image and get a direct link, then open a new tab ... :(
Hi @<1545216070686609408:profile|EnthusiasticCow4>
The auto detection of clearml is based on the actual imported packages, not the requirements.txt of your entire python environment. This is why some of them are missing.
That said you can always manually add them
Task.add_requirements("hydra-colorlog") # optional add version="1.2.0"
task = Task.init(...)
(notice to call before Task.init)
is there a way to increase the size of the text input for fields or a better way to handle lists?
No π
Maybe an easier way to use connect_configuration instead ? it will take an entire dict and store it as text (format is hocon, which is YAML/Json compatible, which means it is hard to break when editing)
Should I useΒ
update_weights_package
Yes
BTW, config.pbtxt you should pass when "registering" the endpoint with the CLI
I remember being told that the ClearML.conf on the client will not be used in a remote execution like the above so I think this was the problem.
SubstantialElk6 the configuration should be set on the agent's machine (i.e. clearml.conf that is on the machine running the agent)
- Users have no choice of defining their own repo destination of choice.
In the UI you can specify in the "Execution" tab, Output "destination", a different destination for the models/artifacts. Is this...
MysteriousBee56 Edit in your ~/trains.conf:api_server: http://localhost:8008
toapi_server: http://192.168.1.11:8008
and obliviously the same for web & files
I'll make sure we fix the trains-agent to output an error message instead of trying to silently keep accessing the API server
Getting you machine ip:
just run :ifconfig | grep 'inet addr:'Then you should see a bunch of lines, pick the one that does not start with 127 or 172
Then to verify runping <my_ip_here>
it does appear on the task in the UI, just somehow not repopulated in the remote run if itβs not a part of the default empty dictβ¦
Hmm that is the odd thing... what's the missing field ? Could it be that it is failing to Cast to a specific type because the default value is missing?
(also, is issue present in the latest clearml RC? It seems like a task.connect issue)
Iβll check if I could wrap the code in something that calls the Task.delete if debugging
Whatever you think works best for you, I was genuinely curious π
To me (personally) it is helpful to have a log even while debugging (comparing to previous runs etc, trying to see what went wrong even on a console output level). When I'm done I just search for everything I worked on select all, and archive them. Then a cleanup service in the background clears all the archived Tasks once they ar...
callbacks.append( tensorflow.keras.callbacks.TensorBoard( log_dir=str(log_dir), update_freq=tensorboard_config.get("update_freq", "epoch"), ) )Might be! what's the actual value you are passing there?
Also can you right click on the image and save it on your machine, see if it is cropped, or it is just a UI issue