These questions were collected in 30 minutes at the end of the workshop. The document highlights and summarizes some of the topics discussed.
“Quantum machine learning” is highly disfavored
“Neural network potentials” is too specific
“Machine learning potentials” is vastly preferred by the audience
3 years = 0%
5 years = 10%
10 years = 50%
Uncertain = 40%
Would "foundation models" (e.g. GPT-3 like models for molecular machine learning, as Jonathan Godwin discussed) be useful? What kinds of models?
Models trained on lots of QM data? For ML potentials or for representation learning? 75%
Models like coarse-graining or enhanced sampling trained on lots of MD trajectory data? 20% (for geometric similarity, for downstream sampling tasks, tied to downstream experiments)
Models trained on lots of biomolecular structures or experimental affinities?
There will be diversification just like we have diversification in simulators
What is the function of foundation models?
Representation for downstream potentials?
Tool to investigate correlations in data?
Are they a source of interpretability or do they take it away?
Representation learning to compare with scarcer, more expensive experimental data?
Will data mining make these datasets? We need an ontology and long-term records (report negatives)
Drive data generation and collection (with autonomous data)
Open Reaction Database for organics (plus the closed Reaxys or SciFinder)
Incentive system and labor of curating and sanitizing one's internal data. What are the tradeoffs of continuous data releases?
Not too much hope in mixing and matching quantum chemistry
How do we create and enforce standards for data?
Data has a longer lifetime than models, better to release it and create more sustained value in the community
Go beyond static datasets (where we can) into dynamic tasks for validation and continuous release/evaluation
General-purpose vs application-driven datasets (i.e. method development is a general application)
What shared resources would best help accelerate the research in this community?
Is there a benefit to collaborating in generating or curating datasets?
Quantum chemical datasets? What should their composition be? How big do they need to be? Do they need to be generated by active learning? What level of theory? Does it make sense to mix and match?
Molecular dynamics trajectory datasets? What kinds of systems? How much data do we need?
Avoid duplicate effort, for instance in the integration of ML into a simulator via one-off packages. How many independent neighbor-list codes or ANI-LAMMPS plug-ins do we need?
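The neighbor list mentioned above is exactly the kind of small utility many groups end up rewriting. A brute-force O(N²) sketch in NumPy (no periodic boundaries; illustrative only, not any package's API):

```python
import numpy as np

def neighbor_list(positions, cutoff):
    """Return unique atom pairs (i, j), i < j, closer than `cutoff`.

    Brute-force O(N^2) distance matrix; real simulators use cell
    lists or Verlet lists and handle periodic boundary conditions.
    """
    deltas = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(deltas, axis=-1)
    i, j = np.where((dists < cutoff) & (dists > 0.0))
    mask = i < j  # keep each pair once
    return list(zip(i[mask].tolist(), j[mask].tolist()))

pos = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [5.0, 0.0, 0.0]])
pairs = neighbor_list(pos, cutoff=2.0)
# only atoms 0 and 1 are within the cutoff: [(0, 1)]
```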
Hardware will follow slowly if at all and needs to have many users
Is it just Coulomb? What about many-body effects?
If long-range interactions are all multipolar, maybe the ML model should predict the multipoles
Do we have a QM9 for long-range interactions? (No.)
They might be important. Maybe they are learnable, but the data are lacking (the simulations are expensive)
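One reading of the suggestion above: have the ML model output per-atom charges and point dipoles, then evaluate the long-range energy analytically instead of learning it. A minimal sketch in atomic units (function and setup are illustrative, not a specific package's API):

```python
import numpy as np

def multipole_energy(positions, charges, dipoles):
    """Pairwise electrostatic energy (atomic units) from per-atom
    charges q_i and point dipoles mu_i, e.g. predicted by an ML model.
    Includes charge-charge, charge-dipole, and dipole-dipole terms."""
    energy = 0.0
    for i in range(len(charges)):
        for j in range(i + 1, len(charges)):
            rij = positions[i] - positions[j]
            r = np.linalg.norm(rij)
            n = rij / r  # unit vector from j to i
            energy += charges[i] * charges[j] / r
            energy += (charges[i] * dipoles[j] - charges[j] * dipoles[i]) @ n / r**2
            energy += (dipoles[i] @ dipoles[j]
                       - 3.0 * (dipoles[i] @ n) * (dipoles[j] @ n)) / r**3
    return energy

pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
q = np.array([1.0, -1.0])
mu = np.zeros((2, 3))
# with zero dipoles only the Coulomb term survives: (1)(-1)/2 = -0.5
print(multipole_energy(pos, q, mu))
```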
And excited states? They pose more transferability challenges because they are non-local and correlated
Most of the time we face low-level kernel challenges and I/O between CPU and GPU, not just raw FLOP limits
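A back-of-the-envelope roofline check illustrates this point; the hardware numbers below are assumptions for illustration, not any specific device:

```python
# Roofline-style estimate: is SAXPY (y = a*x + y) compute- or memory-bound?
flops_per_element = 2          # one multiply, one add
bytes_per_element = 3 * 4      # read x, read y, write y (float32)
arithmetic_intensity = flops_per_element / bytes_per_element  # FLOP/byte

peak_flops = 10e12             # assumed 10 TFLOP/s accelerator
peak_bandwidth = 1e12          # assumed 1 TB/s memory bandwidth
machine_balance = peak_flops / peak_bandwidth  # FLOP/byte to saturate compute

# Intensity far below machine balance: bandwidth, not FLOPs, is the limit.
memory_bound = arithmetic_intensity < machine_balance
print(arithmetic_intensity, machine_balance, memory_bound)
```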
Quantum computers for strongly correlated systems in solid-state physics
Orbital free DFT
TPU? What are the tradeoffs in development and acceleration?
Inductive bias and interpretability
Prevents the model from learning to bypass the simulation altogether
Avoids fitting to the test set, which hurts real-world applicability
Are the physics fields and communities organized and ready to operate in the same metric-driven way as the ML community? We do not have the culture and infrastructure for this in enhanced sampling. PLUMED is doing this for metadynamics.
Financial incentives point to drug discovery
What is the flow between innovation, development, and maintenance?
Not so much for sandboxing but more so for deploying fast methods
General-purpose models will most likely be bigger and thus slower, so bespoke models and active learning will result in faster models
Careful benchmarking of customized vs. bespoke models is needed
Transfer learning, multi-fidelity methods, and fine-tuning can get the best of both
Other domains do it
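One common multi-fidelity pattern is delta-learning: fit only the correction from a cheap level of theory to an expensive one, so far fewer expensive points are needed. A toy sketch (the 1-D functions stand in for real levels of theory and are purely illustrative):

```python
import numpy as np

# Cheap but biased "low-level theory" vs. accurate, expensive "high-level".
def low_fidelity(x):
    return np.sin(x)

def high_fidelity(x):
    return np.sin(x) + 0.3 * x  # low level plus a smooth correction

x_hi = np.linspace(0.0, 3.0, 5)                     # few expensive points
delta = high_fidelity(x_hi) - low_fidelity(x_hi)    # correction to learn

# Delta-learning: the correction is smooth, so a simple model suffices.
coeffs = np.polyfit(x_hi, delta, deg=1)

def predict(x):
    return low_fidelity(x) + np.polyval(coeffs, x)

x_test = np.array([1.5])
err = abs(predict(x_test)[0] - high_fidelity(x_test)[0])
print(err)  # tiny: the correction is much easier to fit than the full target
```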
Fewer weights do not necessarily mean faster inference or a smaller memory footprint, and sparsifying does not necessarily accelerate models
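A quick illustration of the last point: zeroing weights in a dense array changes neither its memory footprint nor the cost of a dense matmul; savings appear only with a sparse storage format and a sparse-aware kernel, and only at sufficient sparsity. The CSR byte count below is a rough hand estimate, not a library measurement:

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.standard_normal((1000, 1000))

# "Sparsify" by zeroing ~90% of the weights.
mask = rng.random(dense.shape) < 0.9
pruned = np.where(mask, 0.0, dense)

# The dense array still stores every zero: identical memory, and a dense
# matmul kernel still does the same number of multiply-adds.
same_memory = pruned.nbytes == dense.nbytes

# A CSR-style layout needs roughly 12 bytes per nonzero
# (8-byte value + 4-byte column index) plus row pointers.
nnz = np.count_nonzero(pruned)
csr_bytes = 12 * nnz + 4 * (pruned.shape[0] + 1)
print(same_memory, csr_bytes < dense.nbytes)
```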