Well, in that case, just changing the order should solve it (I'll make sure we have that as the default):
conda_channels: ["pytorch", "conda-forge", "defaults"]
It should solve the issue 🙂
@<1595587997728772096:profile|MuddyRobin9> are you sure it was able to spin up the EC2 instance? Which clearml autoscaler version are you running?
SmoothArcticwolf58 could you copy-paste the entire query, and what are the expected results vs. reality?
I think you are correct, the env variable is not resolved in "time". It might be that it's resolved at import, not at Task.init
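A minimal sketch of the difference (generic Python, with a hypothetical variable name):

    import os

    # resolved once, at import time - if the variable is only set later
    # (e.g. right before Task.init), this still holds the old/missing value
    BUCKET = os.environ.get("MY_OUTPUT_BUCKET")  # hypothetical env variable

    def get_bucket():
        # resolved lazily, at call time, so it picks up late changes
        return os.environ.get("MY_OUTPUT_BUCKET")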
JitteryCoyote63 nice hack 😄
how come it is not automatically logged as console output ?
Would be very cool if you could include this use case!
I totally think we should, any chance you can open an Issue, so this feature is not lost?
With pleasure 🙂
But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there, because it seems unclear how we could use our existing systems and practices to do that.
Okay, I think this is the issue: handler functions are not "supposed" to fail, they are supposed to trigger Tasks, and those Tasks can fail.
e.g.:
Model Tag Trigger -> handler function creates a Task -> Task does something, like building a container, triggering CI/CD, etc. -> Task...
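A rough sketch of that pattern (assuming the clearml SDK's Task.create / Task.enqueue calls; the handler, project, repo, script and queue names here are hypothetical) - the handler only creates and enqueues a Task, so any real failure surfaces in the Task itself:

    from clearml import Task

    def on_model_tagged(model_id):
        # hypothetical handler wired to a model-tag trigger: it only creates and
        # enqueues a Task, the heavy lifting (and any failure) happens in that Task
        task = Task.create(
            project_name="ci-cd",                       # hypothetical project
            task_name=f"build-container-{model_id}",
            repo="https://github.com/example/ci.git",   # hypothetical repo
            script="build_and_deploy.py",               # hypothetical entry point
        )
        Task.enqueue(task, queue_name="services")       # hypothetical queue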
Check the links that are generated in the UI when you upload an artifact or model
PompousParrot44
you can always manually store/load models, for example: https://github.com/allegroai/trains/blob/65a4aa7aa90fc867993cf0d5e36c214e6c044270/examples/reporting/model_config.py#L35
Sure, you can patch any framework with something similar to what we do for xgboost, any such PR will be greatly appreciated! https://github.com/allegroai/trains/blob/master/trains/binding/frameworks/xgboost_bind.py
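A minimal sketch along the lines of that example (assuming the OutputModel/InputModel interface; in newer versions the import is from clearml rather than trains, and the file name / URL below are hypothetical):

    from trains import Task, OutputModel, InputModel

    task = Task.init(project_name="examples", task_name="manual model store/load")

    # manually register a locally saved weights file as the task's output model
    output_model = OutputModel(task=task)
    output_model.update_weights(weights_filename="model.pkl")  # hypothetical file

    # manually load a previously stored model from its URL
    input_model = InputModel.import_model(weights_url="https://files.example.com/model.pkl")  # hypothetical URL
    task.connect(input_model)
    local_weights = input_model.get_weights()  # local copy of the weights file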
Yep... something went wrong with the Elastic container, I think it lost its indexes (or they got screwed up somehow)
Do you have a backup of the persistence volume attached to the container? Can you try restoring it?
I would restart the entire clearml-server (docker-compose), then could you post the startup logs here? They should provide some info on what's wrong
GorgeousSeagull44 I think this should have worked (basically replacing all the links in the mongo DB with the new IP)
So now, for it to take effect, you need to enqueue the Task and set an agent to pick it up and run it.
When the agent runs the Task, the new parameter will be passed.
Does that make sense?
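A minimal sketch of enqueuing programmatically (assuming the clearml SDK, with a hypothetical task ID and queue name; you can also just hit "Enqueue" in the UI):

    from clearml import Task

    task = Task.get_task(task_id="<cloned-task-id>")  # hypothetical task ID
    Task.enqueue(task, queue_name="default")          # hypothetical queue name

The agent watching that queue will then pull the Task and run it with the updated parameters.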
So we basically have two options: one is that when you call Dataset.get_local_copy() we register it on the Task automatically, the other is more explicit, with something like:
    ds = Dataset.get(...)
    folder = ds.get_local_copy()
    task.connect(ds, name="train")
    ...
    ds_val = Dataset.get(...)
    folder = ds_val.get_local_copy()
    task.connect(ds_val, name="validate")
wdyt?
Hi UnevenDolphin73
In theory it "might" work. I have to admit that personally I'm not a fan of what Amazon did to Mongo, i.e. forking their code base and selling it as a service, just bad open-source practice
(The main issue might be API calls that might not fully match)
wdyt?
Hi DisgustedDove53
Is redis used as permanent data storage or just cache?
Mostly cache (I think)
Would there be any problems if it is restarted and comes up clean?
Pretty sure it should be fine, why do you ask ?
Hi PungentLouse55 ,
Yes, Hydra integration has been on the todo list since it was first released, we actually know the guy behind Hydra, he is awesome!
What are your thoughts on the integration? We would love to get feedback and pointers (Hydra itself is quite capable, and we were waiting until we had multiple-configuration support; with v0.16 it was added, so now it is actually possible)
I'm not sure this is configurable from the outside 😞
ScantWorm7
TensorBoard is automatically captured and sent to the trains server. This is in addition to the local copy of your TB files. Actually, in most cases the local copy is redundant
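A minimal sketch (assuming PyTorch's SummaryWriter, but the same applies to TF/Keras writers): just call Task.init before creating the writer and the reports are captured automatically:

    from trains import Task  # `from clearml import Task` in newer versions
    from torch.utils.tensorboard import SummaryWriter

    task = Task.init(project_name="examples", task_name="tensorboard auto logging")

    writer = SummaryWriter(log_dir="./runs")  # the local copy, usually redundant
    writer.add_scalar("train/loss", 0.5, global_step=0)  # also sent to the server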
JitteryCoyote63 any chance the trains-agent-1 is running in services mode?
Which means it will run more than a single experiment at once
I think the main risk is: ClearML upgrades to MongoDB vX.Y, Mongo changed the API (which they did, because of Amazon), and now the API call (aka the mongo driver) stops working.
Long story short, I would not recommend it 🙂
Hi MinuteWalrus85
This is a great question, and super important when training models. This is why we designed a whole system to manage datasets (including storage querying, balancing data, and caching). Unfortunately this is only available in the paid tier of Allegro... You are welcome to contact the sales guys: https://allegro.ai/enterprise/
🙂
Hmm, you are missing the entry point in the execution (script path).
Also, as I mentioned, you can either have a git repo or a script in the uncommitted changes, but not both (if you have a git repo, then the uncommitted changes are the git diff)
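A rough sketch of filling in the entry point programmatically (assuming Task.create; the project, repo and script names below are hypothetical - you can also edit the "Script path" field directly in the UI):

    from clearml import Task

    task = Task.create(
        project_name="examples",                        # hypothetical project
        task_name="remote run",
        repo="https://github.com/example/project.git",  # hypothetical repo
        branch="main",
        script="src/train.py",                          # the missing entry point (script path)
    )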