ClearML has a task.set_initial_iteration, I used it as such:
```
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
```
But still the same issue. I am not sure whether I'm using it correctly or whether it's a bug, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
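For context, the full resume flow I have in mind is roughly this (untested sketch; the project/task names, model, and checkpoint path are placeholders, and I'm assuming continue_last_task on Task.init to re-open the previous run):
```python
import torch
from clearml import Task
from ignite.engine import Engine
from ignite.handlers import Checkpoint

# placeholder model and trainer, just to make the sketch self-contained
model = torch.nn.Linear(4, 2)
trainer = Engine(lambda engine, batch: 0.0)
to_save = {"model": model, "trainer": trainer}

# assumption: continue_last_task=True re-opens the previous task so the
# resumed run reports into the same experiment instead of a new one
task = Task.init(project_name="examples", task_name="resume-demo",
                 continue_last_task=True)

checkpoint_fp = "checkpoint.pt"  # placeholder path
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)

# `trainer` is part of to_save, so after load_objects its state.iteration
# reflects the checkpoint; ClearML then offsets its iteration counter with it
task.set_initial_iteration(trainer.state.iteration)
```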
Thanks! Unfortunately still not working, here is the log file:
The task is created using Task.clone() yes
Yea, so I assume that training my models using docker will be slightly slower, so I'd like to avoid it. For the rest, using docker is convenient
Oh nice, thanks for pointing this out!
Something like that?
```
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}
...
```
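(For reference, the same query from Python with the elasticsearch client would look roughly like this — untested sketch, 7.x-style API, same index name and task id as above:)
```python
# Sketch: same ES query/aggregation as the curl call above, via elasticsearch-py (7.x API)
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"variant": "loss_model"}},
                {"match": {"task": "8f88e4b8cff84f23bde74ed4b7213ec6"}},
            ]
        }
    },
    "aggs": {"series": {"terms": {"field": "iter"}}},
}

resp = es.search(
    index="events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b",
    body=body,
)
print(resp["aggregations"]["series"]["buckets"])
```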
mmmh there is no closing of the task happening at that point. Note that just before the task.upload_artifact, I call task.logger.report_table("Metric summary", "Metric summary", 0, df_scores), if that matters
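To be concrete, the call sequence is roughly this (sketch with placeholder names and data; the real df_scores is a pandas DataFrame):
```python
# Sketch of the sequence described above: report_table right before upload_artifact,
# with no task.close() in between. Project/task names and the data are placeholders.
import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="report-then-upload")
df_scores = pd.DataFrame({"metric": ["accuracy", "f1"], "value": [0.91, 0.88]})

# log the dataframe as a table (title, series, iteration, table)
task.logger.report_table("Metric summary", "Metric summary", 0, df_scores)

# then upload the same dataframe as an artifact
task.upload_artifact("df_scores", artifact_object=df_scores)
```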
AgitatedDove14 I think it’s on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you 🙂
I hit F12 to check projects.get_all_ex, but nothing is fired; I guess the web UI is just frozen in some weird state
as it's also based on pytorch-ignite!
I am not sure I understand, what is the link with pytorch-ignite?
We're in the brainstorming phase of what the best approaches to integrate are; we might pick your brain later on
Awesome, I'd be happy to help!
So the new EventsIterator is responsible for the bug.
Is there a way for me to easily force the WebUI to always use the previous endpoint (v1.7)? I saw in the diff between v1.1.0 and v1.2.0 that the ES version was bumped to 7.16.2. I am using an external ES cluster, and its version is still 7.6.2. Could the incompatibility come from there? I'll update the cluster to make sure it's not the case
btw, I tried with alpine instead of ubuntu:18.04, and got:
```
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df20fa9351a1: Pulling fs layer
df20fa9351a1: Verifying Checksum
df20fa9351a1: Download complete
df20fa9351a1: Pull complete
Digest: sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting containe...
```
Ok so the problem was indeed the way docker was installed (with snap)
In my CI tests, I want to reproduce a run in an agent, because the environment changes and some things break in agents but not locally
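Concretely, what I'd like the CI job to do is something like this (untested sketch; the task id and queue name are placeholders):
```python
# Sketch: in CI, clone a known-good task and enqueue it so an agent rebuilds the
# environment and runs it, catching "works locally, breaks in the agent" issues.
from clearml import Task

reference = Task.get_task(task_id="<reference-task-id>")       # placeholder id
ci_task = Task.clone(source_task=reference, name="CI reproduction")
Task.enqueue(ci_task, queue_name="ci-queue")                    # placeholder queue

ci_task.wait_for_status()          # block until the task reaches a final state
print(ci_task.get_status())
```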
AgitatedDove14 I see in https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21= that a key pair is hardcoded in the repo. Is it being used to ssh into the instance?
Hi TimelyPenguin76, any chance this was fixed already? 🙂
Still getting the same error, it is not taken into account 🤔
is it different from Task.set_offline(True)?
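(For comparison, my understanding of Task.set_offline, as a rough untested sketch with placeholder names:)
```python
# Sketch: offline mode records everything locally instead of talking to the server.
from clearml import Task

Task.set_offline(offline_mode=True)   # must be set before Task.init
task = Task.init(project_name="examples", task_name="offline-run")
task.upload_artifact("answer", artifact_object=42)
task.close()
# the session is stored locally as a zip; it can later be imported on a
# connected machine with Task.import_offline_session(<path to that zip>)
```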
AgitatedDove14 So in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping class I see that some info is logged (in the __call__ function), and I would like to have this info logged by clearml
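One way I could imagine doing it (untested sketch, not an official API): subclass EarlyStopping and push its counters to ClearML as scalars:
```python
# Sketch: subclass ignite's EarlyStopping so the patience counter and best score
# are also reported to the current ClearML task as scalars.
from clearml import Task
from ignite.handlers import EarlyStopping

class ClearMLEarlyStopping(EarlyStopping):
    def __call__(self, engine):
        super().__call__(engine)
        logger = Task.current_task().get_logger()
        logger.report_scalar("early_stopping", "counter",
                             self.counter, engine.state.iteration)
        if self.best_score is not None:
            logger.report_scalar("early_stopping", "best_score",
                                 self.best_score, engine.state.iteration)

# usage would be the same as the stock handler, e.g.:
# handler = ClearMLEarlyStopping(patience=5, score_function=score_fn, trainer=trainer)
# evaluator.add_event_handler(Events.COMPLETED, handler)
```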
Not really, because this is difficult to control: I use the AWS autoscaler with an Ubuntu AMI, and when an instance is created packages are updated, so I don't know which Python version I get; plus, changing the Python version of the OS is not really recommended
I want the clearml-agent/instance to stop right after the experiment/training is “paused” (experiment marked as stopped + artifacts saved)
And if you need a very small change, you can also simply monkey-patch it (https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/)
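e.g. something along these lines (untested sketch of the monkey-patching idea, using ignite's EarlyStopping as the target):
```python
# Sketch: monkey-patch EarlyStopping.__call__ so every call also prints the current
# counter; ClearML captures the console output of the task, so it ends up in the logs.
import functools
from ignite.handlers import EarlyStopping

_original_call = EarlyStopping.__call__

@functools.wraps(_original_call)
def _patched_call(self, engine):
    _original_call(self, engine)
    print(f"EarlyStopping: {self.counter}/{self.patience} events with no improvement")

EarlyStopping.__call__ = _patched_call
```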
there is no error from this side, I think the aws autoscaler just waits for the agent to connect, which will never happen since the agent won’t start because the userdata script fails
Sure! Here are the relevant parts:
```
...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
```
and just run the same code I run in production
Hi SuccessfulKoala55, yes it's for the same host/bucket - I'll try with a different browser