Reputation
Badges 1
25 × Eureka!ReassuredTiger98 you mean when calling clearml-init ? or default value?
Hi VexedCat68
can you supply more details on the issue ? (probably the best is to open a github issue, and have all the details there, so we have better visibility)
wdyt?
Hi @<1526371965655322624:profile|NuttyCamel41>
How are you creating the model? specifically what do you have in "config.pbtxt"
specifically any python code should be in the pre/post processing code (actually not running on the GPU instance)
Bugs, definitely GitHub, this is the easiest to track.
Documentation, if these are small issues, Slack is fine, otherwise, GitHub issue.
Regrading the documentation, we are working on another iteration of improvement, but if you find inaccuracies/broken links please report 🙂
JitteryCoyote63 in the UI what's the value of "config" ? Is it empty, it a string?
Also, could you check if removing the 'type=str' from the add_argument changes the behavior?
Hi MotionlessCoral18
You can set all mount points here:
https://github.com/allegroai/clearml-agent/blob/6e31171d314a6e9b276c36d45314025783956b00/docs/clearml.conf#L241
Hi OddShrimp85
right place to ask about clearml serving.
It is 🙂
I did not manage to get clearml serving work with my own clearml server and triton setup.
Yes it should have been updated already, apologies.
Until we manage to sync the docs, what seems to be your issue, maybe we can help here?
At the top there should be the URL of the notebook (I think)
Basically when running remotely, the first argument to any configuration (whether object or string, or whatever) is ignored, right?
Correct 🙂
Is there a planned documentation overhaul?
you mean specifically for the connect_configuration ? or in general on the connect approach rationale ?
Different question. How can I pass PYTHONPATH env variable to a task, run by agent (so python can find classes inside m subdirectories)?
Hi HelpfulHare30
By default the working directory will be added to the python path, this means if I have under execution:Working Dir: "." Script: "src/script.py"The root git repo will be added to the python path.
BTW: next RC you could add a flag to the agent to always add the git repo
And what is exactly missing from the "installed packages" ? Is "help_models" an additional wheel you have to install ?
Just making sure here, but remember that if your original code did not have a git repo, the only thing that is "copied" to the trains-server is the initial script, so any accompanying scripts will be missing in the trains-agent environment
Basically it hooks into any torch.save function (monkey patching in realtime)
Oh that makes sense:
` # Create a child process
using os.fork() method
pid = os.fork()
if pid > 0 :
# pid greater than 0 represents
# the parent process
print("I am parent process:")
print("Process ID:", os.getpid())
print("Child's process ID:", pid)
else :
# pid equal to 0 represents
# the created child process
print("\nI am child process - this is still fully auto logged")
print("Process ID:", os.getpid())
print("Parent's process ID:", o...
Hi @<1695969549783928832:profile|ObedientTurkey46>
Why do tags only show on a version level, but not on the dataset-level? (see images)
Tags of datasets are tags on "all the dataset versions" i.e. to help someone locate datasets (think locating projects as an analogy). Dataset Version tags are tags on a specific version of the dataset, helping users to locate a specific version of the dataset. Does that make sense ?
yey 🙂 notice that when executed by the agent the call execute_remotely is skipped, and so does the If statement I added (since running_locally will return False when the process is executed by the agent)
Hi ReassuredTiger98
It's clearml that needs to support subparser, and it does support it.
What are you seeing in the Args section ?
(Notice that at the end all the args parsing are stored on the global "args" variable after you call the pasre_args(), clearml will basically take those variables and put them into Args section)
Could it be you have old OS environment overriding the configuration file ?
Can you change the IP of the server in the conf file, and make sure it has an effect (i.e. the error changed)?
Hi @<1556450111259676672:profile|PlainSeaurchin97>
While testing the migration, we found that all of our models had their
MODEL URL
set to the IP of the old server.
Yes all the artifacts/models/debug-samples are stored "as is" , this means that if you configured your original setup with IP, it is kind of stuck there, this is why it is always preferred to use host-name ...
you apparently also need to rename
all
model URLs
Yes 😞
To summarize: The scheduler should assign tasks the the agent first, which gives a queue the highest priority.
The issue here you assume both are idle and you need global priority based on resource preference. I understand your scenario now, but it will only hold if enqueuing order is "optimal". For example, if machine Y is running a Task B that is about to be completed (e.g. in a minute) then still machine X will pick the new Task B, and again we end up in the scenario where Task A i...
So that agent on different nodes will probably require different cuda-version images.
That makes sense SarcasticSquirrel56
I would edit the helm chart (or deploy manually) based on a selector that will select the different nodes/gpus and assign the correct containers (i.e. matching CUDA versions to the diff GPUs / drivers)
BTW: you can also playaround with k8s glue, which would dynamically spin pods based on clearml Tasks.
wdyt?
Hi ItchyJellyfish73
You can always archive a Task/Model even when published
In the UI you can right-click and choose archive.
From code you need to add a system tag "archived"from clearml import Task t = Task.get_task(task_id='aabb') t.set_system_tags(t.get_system_tags() + ['archived'])And similarly for Model(model_id='aabb')
Just making sure i understand, you are to upload your models with clearml to the Yandex compatible s3 storage?
, but it seems like I can only trigger a task using a Task scheduler but not a pipeline.
@<1523701132025663488:profile|SlimyElephant79> Maybe we should better state it, but Pipeline is "just" another type of Task. so triggering a Task with the Pipeline ID is essentially triggering the pipeline (do notice you need to select the "services" queue to be used so that the pipeline runs on the correct resource). Make sense ?
Hi TightDog77 _
HTTPSConnectionPool(host='
', port=443): Max retries exceeded with url: /upload/storage/v1/b/models/o?uploadType=resumable (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2633)')))
This seems like a network error to GCP, (basically GCP python package thows it)
Are you always getting this error? is this something new ?
Yes, but as you mentioned everything is created inside the lib, which means the python is not able to intercept the metrics so that clearml can send them to the backend.
Could you extend on the use case of #18 ? how would you use it? what problem will it be solving ?
You can always specify diff clearml.conf files with --config-file 🙂