
Nice, the preview param will do 🙂 btw, I love the new docs layout!
my docker-compose for the master node of the ES cluster is the following:
```yaml
version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
```
The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with `User aborted: stopping task (3)`
at some point (while installing the packages). The error message is surprising since I did not do anything. And then all following experiments are queued to the services queue and stuck there.
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): `RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd`
-> wrong numpy version
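In case anyone hits the same ABI mismatch: one way to pin the numpy version the agent installs is `Task.add_requirements`. A minimal sketch, assuming it is called before `Task.init()`; the version and project/task names are placeholders:
```python
from clearml import Task

# Force a specific numpy in the environment the agent builds for this task.
# Must be called before Task.init(); "1.23.5" is just an example pin.
Task.add_requirements("numpy", "1.23.5")

task = Task.init(project_name="examples", task_name="pinned numpy")
```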
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler every epoch.
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (see the sketch after this list).
3. Have a service that periodically pulls failed experiments from the queue with the tag TO_RESUME, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param.
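A minimal sketch of the watcher thread from step 2, assuming an AWS spot instance (the metadata endpoint is the standard EC2 spot one; the TO_RESUME tag name is just the convention from above):
```python
import threading
import time

import requests
from clearml import Task

# Standard EC2 endpoint that returns 200 once the spot instance is marked for stop
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_spot_interruption(poll_seconds: int = 30):
    task = Task.current_task()
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # Tag the task so the resume service can pick it up later
                task.add_tags(["TO_RESUME"])
                return
        except requests.RequestException:
            pass  # endpoint unreachable (e.g. not a spot instance yet), keep polling
        time.sleep(poll_seconds)

threading.Thread(target=watch_for_spot_interruption, daemon=True).start()
```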
ok, so there is no way to cache it and detect when the ref changes?
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
Thanks! (Maybe this could be added to the docs?) 🙂
Never mind, I was able to make it work, but I have no idea how.
With 1.1.1 I get `User aborted: stopping task (3)`
mmmmh, I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO it could be related to the size of the repo. Might be a good idea to clone with `--depth 1`
in the agents?
Or more generally, try to catch this error and retry a few times?
Unfortunately this is difficult to reproduce... Nevertheless it would be important to me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they already ran (a mechanism similar to what [ https://luigi.readthedocs.io/en/stable/ ] offers). That would allow restarting the pipeline and skipping tasks up to the point where it previously failed (see the sketch below).
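For reference, a sketch of what this could look like with ClearML's step caching, assuming the steps are cloned from existing template tasks (project and task names here are placeholders):
```python
from clearml.automation import PipelineController

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0")

# cache_executed_step reuses a previously completed run of the step when its
# inputs (code, parameters) are identical, instead of re-running it
pipe.add_step(
    name="preprocess",
    base_task_project="examples",
    base_task_name="preprocess template",
    cache_executed_step=True,
)
pipe.add_step(
    name="train",
    parents=["preprocess"],
    base_task_project="examples",
    base_task_name="train template",
    cache_executed_step=True,
)

pipe.start()
```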
Thanks TimelyPenguin76 and AgitatedDove14! I would like to delete artifacts/models related to the old archived experiments, but they are stored on S3. Would that be possible?
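In case it's useful, a rough sketch of how this could be done from the SDK side with boto3, assuming the archived tasks' artifact URLs point at the bucket (the project name is a placeholder and the URL parsing is simplified):
```python
import boto3
from clearml import Task

s3 = boto3.client("s3")

# Archived tasks carry the "archived" system tag
tasks = Task.get_tasks(
    project_name="my-project",
    task_filter={"system_tags": ["archived"]},
)

for t in tasks:
    for artifact in t.artifacts.values():
        url = artifact.url  # e.g. "s3://my-bucket/project/task/artifact.pkl"
        if url and url.startswith("s3://"):
            bucket, _, key = url[len("s3://"):].partition("/")
            s3.delete_object(Bucket=bucket, Key=key)
```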
Hi AgitatedDove14, sorry, somehow this message got lost 🙂
clearml version is the latest at the time, 1.7.1
Yes, I always see the "model uploaded completed" message for such stuck tasks. I am using Python 3.8.10.
btw CostlyOstrich36, I can see in Profile > Version: 1.1.1-135 • 1.1.1 • 2.14. What do these numbers correspond to?
Alright SuccessfulKoala55, I was able to make it work by downgrading clearml-agent to 0.17.2.
That's why I said "not really" 🙂
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.
I'm not too fond of many user configurations; it's confusing.
100% agree. Nevertheless, how much is too many? Currently there are only two settings in the user preferences category, so one more wouldn't hurt?
however, clearml is open source, nothing stops you from adding the code and sending a PR
I'd be super happy to contribute, yes! Nevertheless, I am not sure where to start: the clearml-server repo? The clearml-web repo?
So the problem comes when I do `my_task.output_uri = "s3://my-bucket"`; trains in the background checks if it has access to this bucket and is not able to find/read the creds.
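For what it's worth, a minimal sketch of one way to make the creds visible before setting the output URI; clearml falls back to boto3's standard credential chain if nothing matches in clearml.conf, and alternatively the keys can go under the `sdk.aws.s3` section of clearml.conf (all values below are placeholders):
```python
import os
from clearml import Task

# Placeholder credentials; in practice these come from your environment
# or from the sdk.aws.s3 section of clearml.conf
os.environ.setdefault("AWS_ACCESS_KEY_ID", "<access-key>")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "<secret-key>")

task = Task.init(project_name="examples", task_name="s3 output")
# This is the point where the SDK verifies it can actually reach the bucket
task.output_uri = "s3://my-bucket"
```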
Yea, so I assume that training my models using docker will be slightly slower, so I'd like to avoid it. For the rest, using docker is convenient.
awesome 🙂
Maybe then we can extend `task.upload_artifact`?
```python
def upload_artifact(..., wait_for_upload: bool = False):
    ...
    if wait_for_upload:
        self.flush(wait_for_uploads=True)
```
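And a hypothetical usage of that proposal from the caller's side (`wait_for_upload` does not exist in the current API; this is only what the call would look like):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact demo")
# With the proposed flag, this call would only return once the upload finished
task.upload_artifact("stats", artifact_object={"accuracy": 0.9}, wait_for_upload=True)
```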
Bottom line is: trains-server uses the elasticsearch image `docker.elastic.co/elasticsearch/elasticsearch:5.6.16`, which does not have an unlimited license (only a free license that expires after some time). From version 6.3, elasticsearch provides an unlimited free license. Trains should use >=6.3, WDYT?