Hi @<1558986821491232768:profile|FunnyAlligator17>
What do you mean by:
We are able to set_initial_iteration to 0 but not get_last_iteration.
Are you saying that if your code looks like:
Task.set_initial_iteration(0)
task = Task.init(...)
and you abort and re-enqueue, you still have a gap in the scalars ?
I have the same offset (it appears after each failure in my scalars).
Hmm, I actually would think this is the "correct" behavior, but I see your point:
Any chance you can open a GH issue ?
LOL AlertBlackbird30 had a PR and pulled it
Major release is due next week; after that we will put a roadmap on the main GitHub page.
Anything specific you have in mind ?
last iteration is not reset and I still have a gap in my scalars
Hmm is this reproducible ? can you check with the latest clearml version (1.10.3) ?
btw: I'm assuming continue_last_task=0
I think I found the issue: the fact that the agent is launching it causes it to ignore the "overridden" set_initial_iteration
I suspect it's the localhost - and the trains-agent is trying too hard to access the port, but for some reason does not report an error ...
As we use a custom CUDA image, we do not want this running on user login, and get ugly error messages about missing symlinks.
You can customize the startup bash script (running inside any container) here:
https://github.com/allegroai/clearml-agent/blob/bf07b7f76d3236c1118b81730c6d9718705a795a/docs/clearml.conf#L145
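For reference, a sketch of the kind of entry that section holds (the exact key and package here are illustrative; check the linked file for your agent version):

agent {
    # bash commands executed inside the container before the task starts
    extra_docker_shell_script: [
        "apt-get update",
        "apt-get install -y <some-package>",
    ]
}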
LackadaisicalOtter14 Would that help?
I imagine that these phantom dependencies will prevent parallelization. Is there a workaround?
yes, they might... the workaround might be a bit ugly: copy-pasting the functions and changing their names
BTW: I'll check when the next RC is scheduled for; maybe it will already contain a fix
Hi @<1523701168822292480:profile|ExuberantBat52>
What do you mean by:
- dataset_1 -> script_2 -> dataset_2
A dataset creates a script ?
The second problem that I am running into now, is that one of the dependencies in the package is actually hosted in a private repo.
Add your private repo to the extra index section in the clearml.conf:
None
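For reference, a sketch of the relevant clearml.conf section (the index URL is a hypothetical placeholder):

agent {
    package_manager {
        # extra PyPI index the agent passes to pip as --extra-index-url
        extra_index_url: ["https://my-private-pypi.example.com/simple"]
    }
}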
I added the following to the clearml.conf file
the conf file that is on the worker machine ?
I would just add git+ None to your requirements (either in the requirements.txt or, even better, as part of the pipeline/component where you also specify the repo to be used)
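For example, a requirements.txt along these lines (the repo URL is a hypothetical stand-in for the elided link above):

# requirements.txt
git+https://github.com/<org>/<private-repo>.git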
The agent will automatically push the credentials when it installs the repo as a wheel.
wdyt?
btw: you might also get away with adding -e . into the requirements.txt (but you will need to test that one)
based on this:
https://clear.ml/docs/latest/docs/references/api/endpoints#post-debugping
" http://localhost:8080/debug.ping β
btw: What's the usage scenario ?
Hi GiddyTurkey39
First, yes you can just edit the "installed packages" section and add any missing package (this is equal to requirements.txt)
I wonder why trains failed detecting the "bigquery" package in the first place... Any thoughts ?
K8s + clearml-agent integration.
Hmm is this an on-prem k8s cluster?
Does it mean I can use clearml-serving helm chart alone
Unrelated, the clearml-serving can be deployed on k8s or with docker-compose regardless of where/how clearml-server is deployed
The downstream stages are rankN scripts, they are waiting for the IP address of the first stage.
Is this like a multi-node training, rather than a pipeline ?
Hi @<1523701868901961728:profile|ReassuredTiger98>
is there something like a clearml context manager to disable automatic logging?
Sure, just do a wildcard with the files you actually want to autolog; the rest will be ignored:
None
task = Task.init(..., auto_connect_frameworks={'pytorch': '*.pt'})
Hi EagerOtter28
Let's say we query another time and get 60k images. Now it is not trivial to create a new dataset B but only upload the diff: ...
Use Dataset.sync (or clearml-data sync) to check which files were changed/added.
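A minimal sketch of that flow (names, ids and paths are hypothetical):

from clearml import Dataset

# create dataset B on top of dataset A, so unchanged files are inherited
dataset = Dataset.create(
    dataset_name="dataset_B",
    dataset_project="examples",
    parent_datasets=["<dataset_A_id>"],
)
# sync_folder diffs the local folder against the parent's file hashes,
# registering only the added/changed files
dataset.sync_folder(local_path="/path/to/images")
dataset.upload()
dataset.finalize()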
All files are already hashed, right? I wonder why clearml-data does not keep files in a semi-flat hierarchy and group them together into datasets?
It kind of does, it has a full listing of all the files with their hash (SHA2) values, ...
PompousBeetle71 you can check this example:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_torch_distributed.py
I think it should help, if you want a more manual approach, you can check the POpen subprocesses here:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_subprocess.py
Notice both need to be str
btw, if you need the entire folder just use StorageManager.upload_folder
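Something along these lines (a sketch; bucket and paths are hypothetical):

from clearml import StorageManager

# uploads the folder recursively and returns the remote destination
remote_url = StorageManager.upload_folder(
    local_folder="/data/my_artifacts",
    remote_url="s3://my-bucket/artifacts",
)
print(remote_url)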
Now I am passing it the same way you have mentioned, but my code still gets stuck as in above screenshot.
The screenshot shows a warning from pyplot (matplotlib), not ClearML, or am I missing something ?
My guess is that it can't resolve credentials. It does not give me any pop up to login also
If it fails, you will get an error; there will never be a popup from code
... We need a more permanent place to store data
FYI you can store the "Dataset" itself on GS (instead of...
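For instance, a minimal sketch (bucket name is hypothetical):

from clearml import Dataset

dataset = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
dataset.add_files(path="/local/data")
# keep the dataset contents on GS instead of the default files server
dataset.upload(output_url="gs://my-bucket/clearml-datasets")
dataset.finalize()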
Each user creates a .env file for their needs or exports them in the shell running the python code. Currently I copy the environment variables to an S3 bucket and download it from there
That is a great hack, but who carries the credentials for the S3 bucket? The reason for asking is I'm thinking maybe the code could directly do that (meaning download the .env file and apply it?!)
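For illustration, a sketch of that idea (the bucket path is hypothetical, and it assumes python-dotenv is installed):

from clearml import StorageManager
from dotenv import load_dotenv  # pip install python-dotenv

# fetch the shared .env file from S3 and apply it to the current process
env_path = StorageManager.get_local_copy(remote_url="s3://my-bucket/config/.env")
load_dotenv(env_path)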
If you passed the correct path it should work (if it fails it would have failed right at the beginning).
BTW: I think it is clearml-agent --config-file <file here> daemon ...
Does this file look familiar to you?
file not found: archive/constants.pkl
Yes this is Triton failing to load the actual model file
So can you verify it can download the model ?