I started working on the question answering system Q4R4, Question Answering for Revit API.
The first step is to import The Building Coder blog posts into Elasticsearch and experiment with full-text queries on them.
Furthermore, we are proud to present yet more enhancements to the revamped version of RevitLookup:
- `tbc` blog posts into Elasticsearch
- `tbc` index
- `tbcimport.py` script

One aspect of q4r4 is searching, and another is what results to present and how.
One useful approach that comes to mind might be:
Given a query, return the most relevant results separately from several different resource collections:

`tbc` Blog Posts into Elasticsearch

As mentioned in the last post on q4r4, I should start off implementing a simple but intelligent search engine without worrying about machine learning or AI in any of its forms.
I am still reading about Elasticsearch and figuring out how to set up an experimental system to try this out.
I started with The Building Coder blog posts, since I have them all in handy text format, either HTML or Markdown, publicly accessible in the tbc GitHub repository.
I want to import all posts' full text into Elasticsearch.
A similar topic is discussed in having fun with Python and Elasticsearch, Part 1.
I installed the Elasticsearch Python library and implemented a module `tbcimport.py` to read the `tbc` main blog post index and open each HTML file on the local system.

`tbc` Index

For testing purposes, it is useful to be able to list all posts imported so far and delete the entire collection to clean up and retry; here are two `curl` commands to achieve that:

- List the posts imported so far: curl -XGET 'localhost:9200/tbc/_search?pretty'
- Delete the entire `tbc` index: curl -XDELETE 'localhost:9200/tbc?pretty'

After reading the main blog post index file, I need to extract the text from the HTML contents and put it into a JSON document for Elasticsearch to imbibe.
Some useful hints for this are provided here:
I settled for a very simple HTML text extractor using the `htmllib` `HTMLParser`.
It initially wrote the text to standard output, but I was able to pass a file-like `StringIO` object into the `DumbWriter` constructor to intercept it.
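
To illustrate the idea of collecting the extracted text in an in-memory buffer, here is a minimal sketch. It is not the code in `tbcimport.py`: `htmllib` and `DumbWriter` are Python 2 modules, so this sketch swaps in the Python 3 standard library `html.parser.HTMLParser` instead. The names `TextExtractor` and `html_to_text` are made up for this example.

```python
import io
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes into a StringIO buffer."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._out = io.StringIO()   # file-like object that captures the text
        self._skip_depth = 0        # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only, not script or style content.
        if self._skip_depth == 0:
            self._out.write(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(self._out.getvalue().split())


def html_to_text(html):
    """Return the plain text of an HTML string, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_text(open(path, encoding="utf-8").read())` on a post file would then yield a plain string that can go into the JSON document.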
On the first attempt, I successfully imported the first nine posts.
Post number 10, Selecting all Walls, failed with a `UnicodeDecodeError`:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 2595: invalid start byte
As it turned out, the offending file was stored in a Windows encoding. I converted it to UTF-8.
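
Such a conversion can also be scripted. The sketch below assumes the Windows encoding in question is Windows-1252 (`cp1252`), in which byte `0xa0` is a non-breaking space; the post does not say which code page was actually used, and the function name `to_utf8` is made up for this example.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return the given bytes as UTF-8.

    Bytes that already decode as UTF-8 are returned unchanged;
    otherwise they are decoded with the fallback encoding
    (an assumption, Windows-1252 by default) and re-encoded.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")
```

Note that Python's `cp1252` codec rejects the few byte values that are undefined in that code page, so a file in some other Windows encoding may still raise an error here.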
Next, I went one step further and eliminated all non-ASCII characters by applying `re.sub( r'[^\x00-\x7f]', r'', my_stringio.getvalue() )` to the result of stripping the HTML tags.
This will presumably corrupt some foreign names, expressions, and text passages. I would not expect those passages to be of any major importance for Revit API related queries anyway.
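
A small sketch makes the trade-off concrete. It wraps the same regular expression in a helper; the name `strip_non_ascii` is made up for this example.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def strip_non_ascii(text):
    """Drop every non-ASCII character from the given string."""
    return NON_ASCII.sub("", text)


print(strip_non_ascii("Revit API"))            # unchanged: Revit API
print(strip_non_ascii("na\u00efve caf\u00e9")) # accented letters vanish: nave caf
```

Text written entirely in a non-Latin script is reduced to its spaces and punctuation, which is the corruption mentioned above.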
I also added an assertion to ensure that the filenames listed in `index.html` really do exist.
A surprising number of errors were discovered and fixed in the process.
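
The index reading and the file check can be sketched roughly as follows. This is not the actual `tbcimport.py`: it assumes that the index lists each post as a plain `<a href="...">` link to a local `.htm` or `.html` file, and the names `HrefCollector` and `build_documents` as well as the document fields are made up for this example.

```python
import os
import tempfile
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect local HTML links from the index."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith((".htm", ".html")):
                    self.hrefs.append(value)


def build_documents(index_html, base_dir):
    """Return one JSON-ready dict per post file listed in the index."""
    collector = HrefCollector()
    collector.feed(index_html)
    docs = []
    for nr, href in enumerate(collector.hrefs, start=1):
        path = os.path.join(base_dir, href)
        # Fail early if the index points to a file that does not exist.
        assert os.path.isfile(path), "missing post file: " + path
        with open(path, "rb") as f:
            raw = f.read()
        docs.append({"nr": nr, "filename": href, "html": raw.decode("utf-8")})
    return docs
```

Each resulting dictionary could then be handed to the index call of the Elasticsearch Python client; that call is left out here because its exact arguments depend on the client version.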
Now I have successfully imported all The Building Coder blog posts into Elasticsearch.

`tbcimport.py` Script

I celebrated this first step by creating the q4r4 GitHub repository, adding tbcimport.py to it in its current functional state, and creating q4r4 release 1.0.0.
Here is the script in its current state:
The next thing to do is to start experimenting with queries, and presumably with ways to optimise the resulting hits.
While I am fiddling with q4r4, the Revit API discussion forum and other Revit API related issues remain as vibrant as ever.
Some new enhancements were added to our irreplaceable Revit BIM database exploration tool RevitLookup.
In the last few weeks, it was significantly restructured to use `Reflection` and reduce code duplication:

- `Reflection` for cross-version compatibility

Alexander Ignatovich, @CADBIMDeveloper, aka Александр Игнатович, has now submitted a few new bug fixes in his pull request #29:

- `CollectorExtElement` field initialization to constructor, use LINQ extension methods instead of LINQ syntax
- `AppDomain.CurrentDomain.BaseDirectory`, the Revit.exe directory path. I have a dll with a name that contains the substring "revit". This library depends on another library in another location. I have an `Assembly.Resolve` event subscription to load dependencies correctly. In such a case this code fails, because it can't be aware of the correct paths to load referenced libraries.
- `Application.Documents` when more than one document is opened. The `Close` method must not be called – it successfully closes non-active documents and fails to get information about them.

Many thanks to Alexander for these improvements!
I integrated them into RevitLookup release 2017.0.0.19.
Just a few hours after Alexander's bug fixes, Ehsan @eirannejad Iran-Nejad chipped in with some further important improvements in his pull request #30:

- `Path.GetDirectoryName` throws `System.ArgumentException` if the assembly `Location` is null.

Many thanks to Ehsan for these improvements!
I integrated them into RevitLookup release 2017.0.0.20.
The most up-to-date version is always provided in the master branch of the RevitLookup GitHub repository.
If you would like to access any part of the functionality that was removed when switching to the `Reflection`-based approach, please grab it from release 2017.0.0.13 or earlier.
I am also happy to restore any other code that was removed and that you would like preserved. Simply create a pull request for that, explain your need and motivation, and I will gladly merge it back again.