Q4R4 tbc Import and RevitLookup

I started working on Q4R4, the question answering system for the Revit API.

The first step is to import The Building Coder blog posts into Elasticsearch and experiment with full-text queries on them.

Furthermore, we are proud to present yet more enhancements to the revamped version of RevitLookup:

Q4R4 Question Sources and Result Presentation

One aspect of q4r4 is searching, and another is what results to present and how.

One useful approach that comes to mind might be:

Given a query, return the most relevant results separately from several different resource collections:

Importing tbc Blog Posts into Elasticsearch

As mentioned in the last post on q4r4, I should start off implementing a simple but intelligent search engine without worrying about machine learning or AI in any of its forms.

I am still reading about Elasticsearch and figuring out how to set up an experimental system to try this out.

Elasticsearch

I started with The Building Coder blog posts, since I have them all in handy text format, either HTML or Markdown, publicly accessible in the tbc GitHub repository.

I want to import all posts' full text into Elasticsearch.

A similar topic is discussed in having fun with Python and Elasticsearch, Part 1.

I installed the Elasticsearch Python library and implemented a module tbcimport.py to read the tbc main blog post index and open each HTML file on the local system.

Listing and Clearing the Elasticsearch tbc Index

For testing purposes, it is useful to be able to list all posts imported so far and delete the entire collection to clean up and retry; here are two curl commands to achieve that:

curl -XGET 'localhost:9200/tbc/_search?pretty'
curl -XDELETE 'localhost:9200/tbc?pretty'

Strip and Clean Up HTML for JSON Document

After reading the main blog post index file, I need to extract the text from the HTML contents and put it into a JSON document for Elasticsearch to imbibe.

Some useful hints for this are provided here:

I settled on a very simple HTML text extractor using the htmllib HTMLParser.

It initially wrote the text to standard output, but I was able to pass a file-like StringIO object into the DumbWriter constructor to intercept it.
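For readers on Python 3, where htmllib is no longer available, here is a minimal sketch of the same idea using the standard library html.parser module instead. It is not the code in tbcimport.py; the class and function names, the list of block-level tags and the JSON field names are all assumptions made up for illustration.

```python
import json
from html.parser import HTMLParser

# Tags treated as word boundaries (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose content is not visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of html with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def make_document(number, title, html):
    """Build the JSON string for one post (hypothetical field names)."""
    return json.dumps(
        {"number": number, "title": title, "text": html_to_text(html)})
```

Collecting the text in a list plays the same role as the StringIO object above: nothing is written to standard output, and the caller gets the extracted text back as a string.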

On the first attempt, I successfully imported the first nine posts. Post number 10, Selecting all Walls, failed with a UnicodeDecodeError:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 2595: invalid start byte

As it turned out, the offending file was stored in a Windows encoding. I converted it to UTF-8.
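The conversion can also be done on the fly while reading. The following is a minimal Python 3 sketch, assuming that the Windows encoding in question is Windows-1252 (cp1252); the text above does not say which one it was, and the function name is hypothetical.

```python
def to_utf8(raw):
    """Return raw bytes re-encoded as UTF-8.

    Valid UTF-8 input passes through unchanged; anything else is
    assumed to be Windows-1252 (an assumption, not a detected fact).
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("cp1252")
    return text.encode("utf-8")
```

In Windows-1252, byte 0xa0 is a non-breaking space, which would explain why it turns up in the middle of otherwise plain English text. Note that a few byte values are undefined in cp1252, so the fallback can still raise an error on other input.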

Next, I went one step further and eliminated all non-ASCII characters by applying re.sub( r'[^\x00-\x7f]', r'', my_stringio.getvalue() ) to the text left over after stripping the HTML tags.

This will presumably corrupt some foreign names, expressions, and text passages. I would not expect those passages to be of any major importance for Revit API related queries anyway.
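To make the side effect concrete, here is the same regular expression wrapped in a small helper. The helper name is hypothetical; only the pattern comes from the call quoted above.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def strip_non_ascii(text):
    """Remove every non-ASCII character from text."""
    return NON_ASCII.sub("", text)
```

For instance, strip_non_ascii("caf\u00e9 10\u00a0mm") returns "caf 10mm": the accented letter disappears, and so does the non-breaking space between number and unit.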

I also added an assertion to ensure that the filenames listed in index.html really do exist.
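A check along these lines could look as follows. This is a sketch and not the code in tbcimport.py: it assumes that the index links to the posts through relative href attributes ending in .htm or .html, and all names in it are hypothetical.

```python
import os
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def local_post_files(index_html):
    """Return the relative links to HTML files found in the index."""
    collector = LinkCollector()
    collector.feed(index_html)
    collector.close()
    return [h for h in collector.hrefs
            if "://" not in h and h.lower().endswith((".htm", ".html"))]


def missing_files(index_html, root, exists=os.path.isfile):
    """Return the linked files that do not exist below root."""
    return [h for h in local_post_files(index_html)
            if not exists(os.path.join(root, h))]
```

The assertion itself then becomes a one-liner such as assert not missing_files(index_html, root), and collecting the list first makes it possible to report every broken link at once instead of stopping at the first one.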

A surprising number of errors were discovered and fixed in the process.

Now I have successfully imported all The Building Coder blog posts into Elasticsearch.

Q4R4 GitHub Repo and tbcimport.py Script

I celebrated this first step by creating the q4r4 GitHub repository, adding tbcimport.py to it in its current functional state, and creating q4r4 release 1.0.0.

Here is the script in its current state:

The next thing to do is to start experimenting with queries, and presumably with ways to optimise the resulting hits.

RevitLookup Bug Fixes

While I am fiddling with q4r4, the Revit API discussion forum and other Revit API related issues remain as vibrant as ever.

Some new enhancements were added to our irreplaceable Revit BIM database exploration tool RevitLookup.

In the last few weeks, it was significantly restructured to use Reflection and reduce code duplication:

Alexander Ignatovich, @CADBIMDeveloper, aka Александр Игнатович, has now submitted a few new bug fixes in his pull request #29:

Many thanks to Alexander for these improvements!

I integrated them into RevitLookup release 2017.0.0.19.

RevitLookup Icons

Just a few hours after Alexander's bug fixes, Ehsan @eirannejad Iran-Nejad chipped in with some further important improvements in his pull request #30:

Many thanks to Ehsan for these improvements!

I integrated them into RevitLookup release 2017.0.0.20.

The most up-to-date version is always provided in the master branch of the RevitLookup GitHub repository.

If you would like to access any part of the functionality that was removed when switching to the Reflection-based approach, please grab it from release 2017.0.0.13 or earlier.

I am also happy to restore any other code that was removed and that you would like preserved. Simply create a pull request for that, explain your need and motivation, and I will gladly merge it back again.