I just need to understand what I should be expecting. I thought going from putting a task into the queue in the UI to "running my code remotely" (especially with packages preloaded) should be a fairly fast turnaround - certainly not three minutes. I'll have to change my whole pipeline design if this is the case.
I can see agent.vcs_cache.enabled = true
as a printout in the Console, but cannot find docs on how to set this via an environment variable, since I'm trying to keep these containers from needing a clearml.conf file (though I can generate one in the entrypoint script with a <<EOF heredoc if need be).
Update: ever since turning off git caching, I've had much more stability. I can't tell whether it's causing a slowdown in task execution though - is the clone a shallow one by default?
Yeah, I ended up figuring it out. I think we're in similar situations (private git repo with a token). I'll take a look at my config tomorrow, but from memory, you have to set your env variables and set an option in your config to force the HTTPS protocol if you're using a token.
I would assume a lot of them are log streaming? So you can try reducing printouts / progress bars. That seems to help for me.
For context: I have noticed the large number of API calls can be a problem when networking is unreliable. It causes a cascade of slow retries and can really hold up task execution. So do be cautious of where work is occurring relative to where the server is, and what connects the two.
For me, the fix was to raise the log level and reduce the number of prints my code was doing. Since I was using a logger instead of prints, it was pretty easy.
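For example, something like this (just a sketch - the logger name is a placeholder):
import logging

# Raise the threshold so only warnings and errors are emitted (and streamed to the server)
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("my_app")      # placeholder name

logger.debug("per-batch details")         # suppressed at WARNING level
logger.warning("something worth keeping") # still emitted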
If you're using some framework that spits out its own progress bars, then I'd look into disabling those via whatever options it exposes.
As for turning off logs entirely, I don't know - I'll let the ClearML people respond to that.
For sure, though, the comms for CPU monitoring and epoch monitoring will lead to a lot of calls... but I'll agree 80k seems exce...
It sounds like you understand the limitations correctly.
As far as I know, it'd be up to you to write your own code that computes the delta between old and new and only re-processes the new entries.
The API would let you search through prior experimental results.
So you could load up the prior task, check the IDs that showed up in its output (maybe you save these as a separate artifact for faster load times), and only evaluate the new inputs. Perhaps you copy the old outputs over to the new task...
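To make that concrete, a rough sketch - the artifact name, project/task names, placeholder task id, and the two helper functions are all made up, not something ClearML provides:
from clearml import Task

def load_all_input_ids():
    return ["a", "b", "c"]       # stand-in for however you enumerate your inputs

def evaluate_one(item_id):
    return {"score": 0.0}        # stand-in for your real evaluation

task = Task.init(project_name="my-project", task_name="incremental-eval")

# Previous run that already processed part of the data (placeholder task id)
prev = Task.get_task(task_id="<previous-task-id>")
old_ids = set(prev.artifacts["processed_ids"].get()) if "processed_ids" in prev.artifacts else set()

new_ids = [i for i in load_all_input_ids() if i not in old_ids]
results = {i: evaluate_one(i) for i in new_ids}

# Store the union so the next run can diff against it, plus the new outputs
task.upload_artifact("processed_ids", artifact_object=sorted(old_ids | set(new_ids)))
task.upload_artifact("new_results", artifact_object=results)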
thanks so much!
I've been running a bunch of tests with timers and seeing an absurd amount of variance. I've seen parameter connect and task creation happen in seconds, and other times take 4 minutes.
Since I see timeout connection errors somewhat regularly, I'm wondering if perhaps I'm having networking errors. Is there a way (at the class level) to control the retry logic on connecting to the API server?
My operating theory is that some sort of backoff / timeout (e.g. 10s) is causing the hig...
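For reference, the timing itself is nothing fancy - roughly this (project and parameter names are placeholders):
import time
from clearml import Task

t0 = time.monotonic()
task = Task.init(project_name="my-project", task_name="timing-test")
print(f"Task.init: {time.monotonic() - t0:.1f}s")

params = {"lr": 0.001, "batch_size": 32}   # placeholder hyperparameters
t0 = time.monotonic()
task.connect(params)
print(f"task.connect: {time.monotonic() - t0:.1f}s")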
Mind-blowing... but somehow, just later in the same day, I got the same pipeline to create its DAG and start running in under a minute.
I don't know what exactly I changed. The pipeline task was run locally (which I've never done before), then cloned to run remotely in my services queue. And then it just flew through the experiment at the pace I expected.
So there's hope. I'll keep stress-testing it and see what causes the differences. I was right to suspect that such a simple DAG should not take...
Thanks for the clarification. Is there any way to bypass it? (a git diff + git rev-parse should take mere milliseconds)
I'm working out of a monorepo and am beginning to suspect it's a cause of the slowness. Next week I'll try moving a pipeline over to a new repo to test whether this theory holds any water.
I understood that part, but noticed that when I put in the code to start remotely, the consequence seems to be that the DAG computation happens twice - once on my machine as it runs, and then again remotely (this is at least part of why it's slower). If I put pipe.start earlier in the code, the pipeline fails to execute the actual steps.
This is unlike tasks, which somehow are smart enough to get registered in draft form when task.execute_remotely is up top.
Do I just leave off pipe.start?
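For context, my controller script is roughly shaped like this (step names and bodies are made up; the question is just where start / start_locally should go):
from clearml import PipelineController

def step_one():
    # placeholder step body; each step runs as its own task remotely
    return 42

def step_two(value):
    print("step_two got", value)

pipe = PipelineController(name="example-pipeline", project="my-project", version="0.0.1")
pipe.add_function_step(name="step_one", function=step_one, function_return=["value"])
pipe.add_function_step(
    name="step_two",
    function=step_two,
    function_kwargs=dict(value="${step_one.value}"),
    parents=["step_one"],
)

# start() enqueues the controller itself (by default to the services queue),
# while start_locally() runs the controller logic on this machine.
pipe.start(queue="services")
# pipe.start_locally(run_pipeline_steps_locally=False)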
Hm, yeah, I do see something like what you have in the screenshot.
{"meta":{"id":"d7d059b69fc14cba9ba6ff52307c9f67","trx":"d7d059b69fc14cba9ba6ff52307c9f67","endpoint":{"name":"queues.get_queue_metrics","requested_version":"2.30","actual_version":"2.4"},"result_code":200,"result_subcode":0,"result_msg":"OK","error_stack":"","error_data":{}},"data":{"queues":[{"avg_waiting_times":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0...
Still no graphs showing up, and still seeing this error in the console logs.
(deployment is localhost)
Everything else returns 200 except these two.
thank you!
By any chance, do you have insights into github.com/allegroai/clearml-server/issues/248? I don't know if it's related to this at all, but it is an issue I experienced after upgrading.
Nope, still dealing with it.
Oddly enough, when I spin up a new instance on the new version, it doesn't seem to happen.
I did manage to figure this out with
docker compose stop agent-services
docker compose up --force-recreate --no-deps -d agent-services
and running an export for the newly generated key.
Still, though, I'm noticing that restarts cause App Credentials to be lost.
This is not about storage access tokens; it's about the App Credentials -
the things you set as CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY so that clients can talk to the API.
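i.e. the thing client code ultimately consumes - roughly this, where the hosts and keys are just placeholders for a localhost deployment:
from clearml import Task

# Programmatic equivalent of a clearml.conf / the env vars above (placeholder values)
Task.set_credentials(
    api_host="http://localhost:8008",
    web_host="http://localhost:8080",
    files_host="http://localhost:8081",
    key="<app-credentials-access-key>",
    secret="<app-credentials-secret-key>",
)
task = Task.init(project_name="my-project", task_name="credentials-check")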
Yup. Once again, rebooted and lost my credentials.
When I do a docker compose down; docker compose up -d
... these disappear.
To be clear... this was not happening before I upgraded to the latest version. That is why I am asking about this.
Starting to. Thanks for your explanation.
Would those containers best be started from something in services mode? Or is it possible to get no overhead with my approach of running the worker inside Docker?
I designed my tasks as different functions, based mostly on what metrics to report and which artifacts are best cached (and how to best leverage comparisons of tasks). They do require CPU, but not a ton.
I'm now experimenting with lumping a lot of stuff into one big task and seeing how this go...
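To give a rough picture of the "tasks as functions" design (just a sketch in the spirit of the pipeline decorators, with made-up names - not my actual code):
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["metrics"], cache=True)
def compute_metrics(dataset_path):
    # placeholder body; each component runs as its own task, cached on identical inputs
    return {"accuracy": 0.0}

@PipelineDecorator.component(return_values=["report"])
def compare_runs(metrics):
    return str(metrics)

@PipelineDecorator.pipeline(name="metrics-pipeline", project="my-project", version="0.0.1")
def run_pipeline(dataset_path="data/input"):
    metrics = compute_metrics(dataset_path)
    return compare_runs(metrics)

if __name__ == "__main__":
    PipelineDecorator.run_locally()   # debug the DAG on this machine
    run_pipeline()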
Yes, I actually have been able to turn on caching since rc2 of the agent! It's been working much better.
Oh, it's there, before running the task.
From task pick-up to "git clone" is now ~30s, much better.
Though as far as I understand, the recommendation is still not to run workers-in-docker like this:
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)
(And FWIW, I have this in my entrypoint.sh:)
cat <<EOF > ~/clearml.conf
agent {
    vcs_cache {
        enabled: true
    }
    package_manager: {
        type: pip,
        ...
FWIW - I'm starting to wonder if there's a difference between me "resetting the task" vs cloning it.
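In SDK terms, the two flows I'm comparing are roughly these (task id and queue name are placeholders):
from clearml import Task

original = Task.get_task(task_id="<task-id>")   # placeholder id

# Flow A: clone the task and enqueue the copy
cloned = Task.clone(source_task=original, name=original.name + " clone")
Task.enqueue(cloned, queue_name="default")

# Flow B: reset the original task (clearing previous outputs/status) and enqueue it again
original.reset()
Task.enqueue(original, queue_name="default")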
But pretty reliably, some proportion of tasks still just take much longer. 1m - 10m is a variance I'd really like to understand.
Damn, I can't believe it. It disappeared again, despite the task's clearml version being 1.15.1.
I'm going to try running the pipeline locally.
(the "magic" of the env detection is nice but man... it has its surprises)