So the agent on different nodes will probably require different CUDA-version images.
That makes sense SarcasticSquirrel56
I would edit the Helm chart (or deploy manually) based on a selector that will select the different nodes/GPUs and assign the correct containers (i.e. matching CUDA versions to the different GPUs / drivers)
BTW: you can also play around with the k8s glue, which dynamically spins up pods based on ClearML Tasks.
wdyt?
Correct (if this is running on k8s it is most likely passed via env variables, CLEARML_WEB_HOST etc.)
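For reference, the variables the SDK/agent typically pick up look like this (the hostnames below are placeholders):
CLEARML_WEB_HOST=https://app.clearml.example.com
CLEARML_API_HOST=https://api.clearml.example.com
CLEARML_FILES_HOST=https://files.clearml.example.com
CLEARML_API_ACCESS_KEY=<access key>
CLEARML_API_SECRET_KEY=<secret key>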
ReassuredTiger98 can you send the full log?
Also, what's the clearml-agent version?
fyi: we fixed an issue where the default order of the conda repositories caused pytorch to be installed from conda-forge instead of the pytorch repo, making it the CPU version instead of the GPU version:
This is the correct conda repo order:
https://github.com/allegroai/clearml-agent/blob/cb6bdece39751eaef975287609b8bab603f116e5/docs/clearml.conf#L66
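Roughly, the relevant block in the agent's clearml.conf looks like this (check the link above for the exact, up-to-date version):
agent {
    package_manager {
        # order matters: the pytorch channel comes first so the GPU builds are preferred
        conda_channels: ["pytorch", "conda-forge", "defaults"]
    }
}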
Okay, I think I lost you...
DilapidatedDucks58 you mean detect at which "iteration" the max value was reported, and then extract all the other metrics for that iteration?
Okay, that actually makes sense, let me check, I think I know what's going on
Only the dictionary keys are returned as the raw nested dictionary, but the values remain cast.
Using which function? task.get_parameters_as_dict
does not cast the values (the values themselves are stored as strings on the backend), only task.connect
will cast the values automatically
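A minimal sketch of the difference (the project/parameter names here are made up):
from clearml import Task

task = Task.init(project_name="examples", task_name="param casting demo")

defaults = {"lr": 0.001, "epochs": 10}
task.connect(defaults)  # when the agent runs the Task, the stored values are cast back to float/int here

as_dict = task.get_parameters_as_dict()
print(as_dict)  # values come back as strings, e.g. {'General': {'lr': '0.001', 'epochs': '10'}}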
The current implementation (since 1.6.3 I think) creates the issues in the linked comment (with images to visualize).
Understood, basically the moment we add nested project view to the dataset (and pipelines for that matter, and both are already being worked on), it should solve everything. Is that correct?
The worker just installs it by name from pip, and what it installs is not my package!
Oh dear ...
Did you configure additional pip repositories in the Agent's clearml.conf? https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L77 It might be that (1) is not enough, as pip will first try to search for the package in the public pip repository, and only then in the private one. To avoid that, in your code you can point directly to an https URL of your package Ta...
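As a sketch (the package name and wheel URL below are hypothetical), you can add the package as a direct URL requirement before Task.init, so pip never goes looking for it in the public index:
from clearml import Task

# must be called before Task.init()
Task.add_requirements(
    "my_private_package @ https://my-private-repo.example.com/wheels/my_private_package-1.0.0-py3-none-any.whl"
)
task = Task.init(project_name="examples", task_name="private package demo")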
Hmm, and you are getting an empty list for this one:
server_info['url'] = f"http://{server_info['hostname']}:{server_info['port']}/"
they are just neighboring modules to the function I am importing.
So I think that if you specify the repo, on the remote machine you will end up with the code of the component sitting at the root folder of the repo. From there I assume you can import the rest; the root git path should be part of your PYTHONPATH automatically.
wdyt?
So I had to add it explicitly via a docker init script
Oh yes, that makes sense, can't think of a better hack other than sys.path.append(os.path.join(os.path.dirname(__file__), "src"))
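Something along these lines (the repo URL, "src" folder and helper module are hypothetical, and I'm assuming the decorator-based pipeline here):
import os
import sys

from clearml.automation import PipelineDecorator

@PipelineDecorator.component(repo="https://github.com/me/my_repo.git", return_values=["result"])
def my_component(value):
    # on the agent the repo root is the working directory, so add the sibling folder explicitly
    sys.path.append(os.path.join(os.path.dirname(__file__), "src"))
    from my_helpers import do_something  # hypothetical module living under ./src
    return do_something(value)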
BTW: the agent will resolve pytorch based on the installed CUDA version.
(this only works for pytorch because they have different wheels for different CUDA versions)
In that case I suggest you turn on the venv cache, it will accelerate the conda environment building because it will cache the entire conda env.
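Roughly, the venvs cache section in the agent's clearml.conf looks like this (enabling it is a matter of setting/un-commenting the path; check your conf for the exact block):
agent {
    venvs_cache: {
        # set/uncomment the path to enable caching of the entire venv / conda env
        path: ~/.clearml/venvs-cache
        max_entries: 10
        free_space_threshold_gb: 2.0
    }
}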
No (this is deprecated and was removed because it was confusing)
https://github.com/allegroai/clearml-agent/blob/cec6420c8f40d92ab1cd6cbe5ca8f24cf351abd8/docs/clearml.conf#L101
Is it also possible to specify different user/api_token for different hosts? For example I have a github and a private gitlab that I both want to be able to access.
ReassuredTiger98 my apologies, I just realized you can use ~/.git-credentials for that. The agent will automatically map the host's .git-credentials into the docker :)
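For reference, ~/.git-credentials is just one https://<user>:<token>@<host> entry per line, so both hosts can live side by side (the usernames/tokens below are placeholders):
https://github-user:github-token@github.com
https://gitlab-user:gitlab-token@gitlab.mycompany.com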
Hi GrotesqueOctopus42
In theory it can be built, the main hurdle is getting elk/mongo/redis containers for arm64 ...
you can run md5 on the file as stored in the remote storage (nfs or s3)
s3 is implementation specific (i.e. minio, weka, wasabi etc. might not support it) and I'm actually not sure regarding nfs (I mean you can run it, but it actually means you are reading the data; that said, nfs by definition I'm assuming is relatively fast access)
wdyt?
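For the nfs case, a minimal sketch (assuming the share is mounted locally, the path is made up) is just streaming the file through hashlib; for s3 the closest equivalent is usually the object's ETag, which is implementation specific and not always a real MD5:
import hashlib

def md5_of_file(path, chunk_size=8 * 1024 * 1024):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # read in chunks so huge files do not have to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of_file("/mnt/nfs/datasets/my_file.bin"))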
Ohh sorry, you will also need to fix the def _patched_task_function
The parameter order is important as the partial call relies on it.
Hi JitteryCoyote63, is there a callback for that?
Just run once (from your python console / pycharm etc.):
https://github.com/allegroai/clearml/blob/master/examples/automation/toy_base_task.py
WackyRabbit7 hmmm, seems like a non-regular character inside the diff.
Let me check something
that is because my own machine has 10.2 (not the docker, the machine the agent is on)
No, that has nothing to do with it, the CUDA is inside the container. I'm referring to this image https://allegroai-trains.slack.com/archives/CTK20V944/p1593440299094400?thread_ts=1593437149.089400&cid=CTK20V944
Assuming this is the output from your code running inside the docker, it points to CUDA version 10.2
Am I missing something?
This is odd because the screen grab points to CUDA 10.2 ...
We might need to change the default base docker image, but I remember it was there... Let me check again
PompousParrot44 I think the website should address that:
https://allegro.ai/
But the TL;DR is the enterprise version adds Full Dataset Versioning on top, with end-to-end integration from code to DLOps (e.g. data sampling, database query capabilities, data visualization, multi-site support, permissions etc.)
Hmmm could you attach the entire log?
Remove any info that you feel is too sensitive :)
the task is being Aborted rather than being in Draft. Am I missing something?
Yes, the reason is so you don't lose anything that you might have already reported on it.
And usually execute_remotely will get the execution queue as a parameter (i.e. immediately launching the Task)
You can now (starting v1.0) enqueue an aborted Task so it should not make a difference, you can also reset the Task and edit it in the UI
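A minimal sketch of that flow (the project/queue names are made up):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote execution demo")

# everything up to this call runs locally; passing a queue name enqueues the Task right away,
# and exit_process=True stops the local process so the agent takes over from here
task.execute_remotely(queue_name="default", exit_process=True)

# from this point on, the code only runs on the agent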