I am back from my short vacation and raring to go again.
I checked in to the Revit API discussion forum today, of course.
I am much more interested in doing more to implement a question answering system or QAS for the Revit API, though.
I have repeatedly mentioned my explorations into the very hip subject of machine learning with that goal in mind.
Now I have finally had a chat with a real live AI expert, a fellow Autodesk employee, who put me on the right track to getting started for real:
I last mentioned this topic when asking quo vadis Revit API QAS?
At the time, I was searching for an expert to talk with and get some real hands-on advice from before getting started for real.
This person materialised in the form of Autodesk colleague Sacha Leprêtre, AI expert in Montreal.
Before explaining what I learned from our conversation, I'll mention the name that I came up with for this: Q4R4, a more distinctive acronym than QARA, short for Question Answering system for the Revit API.
Sacha consulted with a colleague before our chat and came up with the following advice, in three steps:
There is no point diving into deep learning before first implementing and measuring the results of other, more specialised search approaches, optionally enhanced and optimised using machine learning techniques.
The latter can only be applied once the former is up and running, and deep learning makes no sense until the first two steps have been completed.
In summary:
The first thing to do is to customise a search engine. Elasticsearch is a good candidate: it is Java based, built on Lucene, a state-of-the-art search engine, available in several languages, driven through a REST API, and it includes NLP support and semantic analysis. Use it to build your own extractor for Revit API domain information extraction (a minimal indexing and query sketch follows after this three-step summary).
Second step: optimise the searches using machine learning. This is not the same as deep learning. Many algorithms exist, e.g., random forest, XGBoost, etc. You need labelled samples to test and enhance your approach.
Third step: you may want to apply deep learning techniques to further enhance the solution. Deep learning requires a lot of data to produce results.
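To make the first step more tangible, here is a minimal sketch of what indexing and querying Revit API content with Elasticsearch might look like using the Python client. Everything in it is an assumption for illustration only: the index name q4r4, the field names, the sample document and the localhost URL are hypothetical, and the exact client keyword arguments vary between Elasticsearch versions:

  from elasticsearch import Elasticsearch

  # Connect to a locally running Elasticsearch node; hypothetical URL.
  es = Elasticsearch('http://localhost:9200')

  # Hypothetical record describing one Revit API topic; in a real setup, such
  # records would be extracted from the SDK docs, blog posts and forum threads.
  doc = {
      'title': 'Retrieving all walls in the model',
      'body': 'Use FilteredElementCollector with OfClass(typeof(Wall)) '
              'to retrieve all wall elements from the document.',
      'source': 'https://thebuildingcoder.typepad.com'
  }

  # Index the record into the hypothetical 'q4r4' index
  # (newer client versions use the document= keyword instead of body=).
  es.index(index='q4r4', id=1, body=doc)

  # Full-text query: a natural language question matched against title and body,
  # with highlighted snippets of the matching text returned alongside the hits.
  query = {
      'query': {
          'multi_match': {
              'query': 'how do I get all the walls?',
              'fields': ['title^2', 'body']  # boost matches in the title
          }
      },
      'highlight': {'fields': {'body': {}}}
  }

  result = es.search(index='q4r4', body=query)
  for hit in result['hits']['hits']:
      print(hit['_score'], hit['_source']['title'], hit.get('highlight'))

The highlight clause is one simple way to address the UI point mentioned further down, i.e., displaying a portion of the matching text instead of just a link.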
Here is the French original of the advice provided by Sacha's unnamed friend:
Dans un premier temps, une approche par moteur de recherche spécialisé est probablement la plus efficace, la moins risquée et la plus rapide à mettre en œuvre. Cela peut constituer un plan de contingence (contingency plan) ou une référence (baseline) à l'utilisation d'un système plus sophistiqué faisant appel à l'apprentissage statistique conventionnel voire à de l'apprentissage profond. Je verrais très bien un projet réalisé par étapes.
Algorithmes d'apprentissage statistique conventionnels: Ce qui marche le mieux en apprentissage statistique demeure l'apprentissage supervisé à partir d'exemples/observations étiquetés. C'est encore plus vrai pour l'apprentissage profond, où l'apprentissage non-supervisé relève davantage de la recherche que du domaine des applications. Avec des algorithmes d'apprentissage conventionnels, les meilleurs étant les méthodes ensemblistes (explications plus bas), ont peut obtenir de bons résultats avec quelques centaines d'exemples étiquetés (labeled samples). Les algos ensemblistes sont des méta-algorithmes où l'on combine les résultats d'un ensemble de classificateurs simples. Les plus connues sont le bagging (ré-échantillonnage et vote ou moyenne) et le boosting (pondération des classificateurs). Concrètement on parle de forêt d'arbres aléatoires (Random Forest), de Gradient Tree Boosting et surtout de XGBoost (Extreme Gradient Boosting).
Apprentissage profond: Le principal obstacle pour l'utilisation de l'apprentissage profond est de disposer de suffisamment de données. Typiquement, la règle du pouce est qu'un algorithme d'apprentissage profond supervisé atteindra des performances acceptables avec environ 5000 exemples étiquetés (labeled samples) par question et dépassera la performance humaine lorsqu'il sera entraîné avec un ensemble de données contenant au moins 10 millions d'exemples / observations. Dans le cas d'un nombre insuffisants de données, il existe en gros quatre solutions potentielles. L'étiquetage « par la foule » (crowdsourcing), l'apprentissage semi-supervisé (étiquetage automatique ou semi-automatique) à partir de jeux de données importants mais non étiquetés, l'amplification de données (data augmentation) où l'on génère automatiquement des variantes des données disponibles et le transfert d'apprentissage (learning transfer) qui consiste à ajouter à un réseau de neurones profonds pré-entraînés sur un énorme corpus générique une couche applicative spécifique.
Here is my translation of that:
Initially, a specialised search engine approach is probably the most efficient, the least risky and the fastest to implement. It can serve as a contingency plan or as a baseline for a more sophisticated system involving conventional statistical learning or even deep learning. I could well see a project realised in stages.
Conventional statistical learning algorithms: What works best in statistical learning remains supervised learning from labelled examples / observations. This is even truer for deep learning, where unsupervised learning is more a research topic than an application domain. With conventional learning algorithms, the best being the ensemble methods (explained below), you can obtain good results with just a few hundred labelled samples. Ensemble algorithms are meta-algorithms that combine the results of a set of simple classifiers. The best known are bagging (resampling plus voting or averaging) and boosting (weighting the classifiers). Concretely, we are talking about random forests (Random Forest), Gradient Tree Boosting and especially XGBoost (Extreme Gradient Boosting).
Deep learning: The main obstacle to using deep learning is having sufficient data. Typically, the rule of thumb is that a supervised deep learning algorithm will achieve acceptable performance with about 5000 labelled samples per question and will exceed human performance when trained on a data set containing at least 10 million examples / observations. In the case of insufficient data, there are roughly four potential solutions: crowdsourcing the labelling, semi-supervised learning (automatic or semi-automatic labelling) from large but unlabelled datasets, data augmentation, where variants of the available data are generated automatically, and transfer learning, which consists in adding an application-specific layer to a deep neural network pre-trained on a huge generic corpus.
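Purely as an illustration of the ensemble methods mentioned above, and of what the machine learning optimisation in step two might build on, here is a small scikit-learn sketch training a bagging-style and a boosting-style classifier. The synthetic data merely stands in for a few hundred labelled samples, e.g., feature vectors describing question and answer candidate pairs labelled relevant or irrelevant, which is purely my assumption about how such samples might look:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import accuracy_score

  # Synthetic stand-in for a few hundred labelled samples.
  X, y = make_classification(n_samples=500, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.25, random_state=0)

  # Bagging-style ensemble: a forest of randomised decision trees.
  rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

  # Boosting-style ensemble: trees added sequentially, each correcting the last.
  gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

  print('random forest accuracy:', accuracy_score(y_test, rf.predict(X_test)))
  print('gradient boosting accuracy:', accuracy_score(y_test, gb.predict(X_test)))
  # XGBoost offers a similar, highly optimised implementation via xgboost.XGBClassifier.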
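For the transfer learning option, i.e., reusing a network pre-trained on a huge generic corpus, one heavily hedged sketch might use the Hugging Face transformers library for extractive question answering. This library and the hypothetical context snippet are my own illustration, not part of the advice above:

  from transformers import pipeline

  # Load a question-answering pipeline built on a model pre-trained on a large
  # generic corpus and fine-tuned for extractive QA, i.e., transfer learning.
  qa = pipeline('question-answering')

  # Hypothetical context paragraph, e.g., taken from Revit API documentation.
  context = ('FilteredElementCollector is used to search, filter and iterate '
             'through a set of elements in a Revit document.')

  answer = qa(question='How do I iterate over elements in a Revit document?',
              context=context)
  print(answer['answer'], answer['score'])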
In all cases, the most important point remains creating a good knowledge base: it can consist of just documents, even unstructured document content, as long as you know how to connect them and understand how they are related.
The Elasticsearch engine can handle this.
The UI is another challenge, e.g., deciding when to display a result and showing a portion of the matching text right away, not only links.
Creating an ontology is tough, and that is not what works well.
Better: dictionaries, i.e., simple bags of words related to a topic.
First step: create a search engine; after indexing all the words in the content, query the system for term frequencies, analyse what you have, and understand the content. Build on NLP techniques and develop your own extractor that creates topics based on bags of words. Use machine learning to cluster words. But first, just implement manual search.
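Here is a rough sketch of that kind of initial term frequency analysis and word clustering using scikit-learn. The sample texts, the number of clusters and the TF-IDF weighting are arbitrary assumptions for illustration:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.cluster import KMeans

  # Hypothetical stand-ins for indexed Revit API content, e.g., forum threads.
  docs = [
      'retrieve all walls using FilteredElementCollector',
      'filter elements by category with ElementCategoryFilter',
      'start a transaction before modifying the document',
      'commit or roll back the transaction when done',
  ]

  # Index all words and weight them by term frequency / inverse document frequency.
  vectorizer = TfidfVectorizer()
  X = vectorizer.fit_transform(docs)
  terms = vectorizer.get_feature_names_out()

  # Overall weight of each term across the corpus.
  frequencies = X.sum(axis=0).A1
  for term, freq in sorted(zip(terms, frequencies), key=lambda t: -t[1]):
      print(f'{term}: {freq:.2f}')

  # Cluster the terms into two hypothetical topics, i.e., bags of related words;
  # each term is represented by the documents it occurs in.
  labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)
  for cluster in range(2):
      print(cluster, [t for t, l in zip(terms, labels) if l == cluster])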
Many thanks to Sacha and his friend for their invaluable advice!
After the chat with him, I looked further into topics involving NLP and searching, e.g., semantic search with NLP and Elasticsearch, ConceptNet and the maui-indexer.
On Sacha's recommendation, I listened to one of the free online courses on NLP by Prof. Dan Jurafsky and Chris Manning, explaining what question answering is:
An interesting way to go might also be to integrate Q4R4 with DuckDuckGo (cf. DuckDuckHack) or some other Internet search engine.
Before our chat and deciding to get started with Elasticsearch and the implementation of step 1 above, I was still browsing through reams of other interesting stuff that all seems useless now.
Still, here are my notes to self on all that:
Enough of that for now.
I can't wait to get started exploring Elasticsearch in more depth.