JChem Search Engine test environment

news · 5 years ago
by Miklós Vargyas (ChemAxon)
JChem Base
Intended audience: ChemAxon customers

Summary

The aim of this document is to give insight into the test infrastructure and test types used in the development of the JChem Database products. It explains how frequent releases have driven enhancements to the JChem Base test environment, and how this improved test system lets weekly releases reach higher product quality than the earlier quarterly releases achieved. To release a large, complex product every week, our software engineers took on the challenge of improving our test systems to enable much shorter feedback loops. With fast feedback, developers can fix problems immediately, building quality into the product sooner than was possible with the older quarterly release cycle. Since moving to frequent releases, we believe that both the overall quality and the pace at which we add new functionality have greatly improved.

Introduction

ChemAxon moved to weekly releases in July 2014. These frequent releases have been the subject of many long debates in various forums, including the European and US UGMs, personal discussions, and worried emails. The motivations for releasing frequently have been elaborated elsewhere; however, our customers have often expressed concerns about build quality. We would like to address the most important question in this document: can weekly releases match the quality of the quarterly ones? We can proudly say: yes. We have an extensive, solid test system which ensures that the behaviour of our products and tools meets requirements in various scenarios and filters out most possible errors and regressions. When we talked with customers about the quality assurance aspects of frequent releases, we often ended up talking about our test and integration systems.
Though initially we considered such details of our development infrastructure irrelevant to customers, we have come to recognize that insight into these internals has helped our clients understand the processes we have adopted to raise the quality of our products. This article describes how we do releases and how we maintain the high quality of JChem Base, JChem Oracle Cartridge and JChem PostgreSQL Cartridge (JCB hereinafter).

Types of tests

The most common and simplest test type is the unit test. In JCB we have tens of thousands of them. Each usually tests a very small piece of functionality (e.g. does a fluorine atom match a halogen atom list?) and executes a small amount of code. The expectation is fixed in the test case itself.

One level higher, we run the so-called etalon tests. They imitate commands and scripts that our users execute. The expected outputs are fixed in "etalon" files: if the current output is the same as the etalon, the test passes; otherwise there is a change in behavior, possibly a regression. Due to their complexity and rigidity, etalon tests are quite difficult to debug and maintain, so slowly but surely they are being replaced with newer, more flexible test types.

We also have a huge set of integration tests: high-level tests that mostly run queries and functions a real-life user would execute (e.g. a typical search on a larger table). They are similar to unit tests and help us identify problems missed during lower-level testing. A failing integration test shows that something in our code is broken and also indicates that a low-level unit test is missing, which usually results in the addition of a new unit test. In addition to covering the most frequent scenarios, we do our best to cover unusual corner cases (e.g. where one data point has an unusual value, such as being empty) as well.
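The etalon mechanism described above can be sketched in a few lines. This is a language-agnostic illustration, not ChemAxon's actual (Java) framework; `run_tool` and the etalon file name are invented stand-ins for running a real command-line tool such as jcsearch:

```python
from pathlib import Path

def run_tool(command: str) -> str:
    """Hypothetical stand-in for executing a command-line tool and
    capturing its output. Here it is simulated deterministically."""
    return f"output of: {command}\n"

def etalon_test(command: str, etalon_file: Path) -> bool:
    """Pass only if the current output matches the stored etalon exactly."""
    current = run_tool(command)
    expected = etalon_file.read_text()
    return current == expected

# First run records the etalon; later runs compare against it.
etalon = Path("search_case_1.etalon")
etalon.write_text(run_tool("jcsearch -q 'c1ccccc1' targets.sdf"))
assert etalon_test("jcsearch -q 'c1ccccc1' targets.sdf", etalon)
```

The rigidity mentioned above follows directly from the exact-match comparison: any change in output formatting fails the test, even when the behavior is still correct.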
A wide variety of tools have been developed to aid developers in writing integration tests, e.g. importing millions of molecules into a table to set up a search. We have also implemented frameworks that support coherence tests, which work on big datasets and check logical relationships that must hold for all structures. For instance, if a query-target pair is a hit in substructure search, then the target-query pair must be a hit in superstructure search. Another example: every non-tautomer search hit must also be a hit in tautomer search. The advantage of this test type is that there is no need to set expectations one by one, yet it still checks a whole dataset. There are also tests verifying the same query-target pair in different products, at different levels. For example, a test case would execute the following searches with the same match expectation:
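The substructure/superstructure symmetry rule above can be sketched as follows. String containment stands in for real chemical matching (the actual searches use ChemAxon's Java API), so the matcher functions are toy assumptions:

```python
def substructure_hit(query: str, target: str) -> bool:
    """Toy stand-in: a 'substructure' hit iff the query occurs in the target."""
    return query in target

def superstructure_hit(query: str, target: str) -> bool:
    """Toy stand-in: a 'superstructure' hit iff the target occurs in the query."""
    return target in query

def check_coherence(structures: list[str]) -> list[tuple[str, str]]:
    """Return every pair violating the sub/superstructure symmetry rule."""
    violations = []
    for query in structures:
        for target in structures:
            # If (query, target) is a substructure hit, then (target, query)
            # must be a superstructure hit -- no per-pair expectation needed.
            if substructure_hit(query, target) and not superstructure_hit(target, query):
                violations.append((query, target))
    return violations

assert check_coherence(["CC", "CCO", "c1ccccc1"]) == []
```

Note that the expected result is derived from the invariant itself, which is what lets such tests run over a whole dataset without hand-written expectations.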
  • a one-by-one search (like MolSearch)
  • a JChemSearch using the API (inserting the target into a table and searching on the table with the query)
  • a SQL search utilizing the cartridges (doing the same as in the JChem search case)
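The three-level check above could be sketched as a single parametrised test. All three search functions here are hypothetical stand-ins (toy containment matching) for the real MolSearch, JChemSearch and cartridge SQL calls:

```python
def mol_search(query: str, target: str) -> bool:
    """Stand-in for a one-by-one MolSearch-style match."""
    return query in target

def jchem_search(query: str, target: str) -> bool:
    """Stand-in for inserting the target into a table and searching via the API."""
    return query in target

def sql_search(query: str, target: str) -> bool:
    """Stand-in for the same search issued through a cartridge SQL function."""
    return query in target

def cross_level_test(query: str, target: str, expected: bool) -> bool:
    """All three levels must agree with the single match expectation."""
    return all(level(query, target) == expected
               for level in (mol_search, jchem_search, sql_search))

assert cross_level_test("CC", "CCO", True)
assert cross_level_test("N", "CCO", False)
```

The value of this pattern is that one query-target pair and one expectation exercise every product layer at once, so a discrepancy between layers surfaces immediately.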
The snapshot below shows our custom test generation tool.

Besides functional tests we use various non-functional tests, e.g. tests that check our documentation. Most importantly, we have a large number of performance tests. One type of performance test is the speed test. Speed tests check the speed of certain features, such as importing into a table, index creation and search queries. We measure their duration with the help of a small internal framework that runs the tests in a fixed environment. If the duration of a test deviates by more than 10 percent from a previously set baseline, it fails. It is essential to keep the environment of the speed tests unchanged for months or even years, because we want to be informed only about the effects of code changes, not of environment changes. This is a difficult task, but very rewarding when we catch an unexpected slowdown. If a test becomes quicker, or the slowdown is acceptable for some reason, we adjust the baseline accordingly.

We also run install and upgrade tests on our cartridges, testing many combinations of environments and parameters. We run regeneration tests in JChem Base as well: these compare a fresh installation to an old version that has been upgraded; both must contain the same data in the database. Furthermore, we run automated UI tests on JChem Manager, the only UI tool in JChem Base, and command-line tools (e.g. jcsearch, jcman) are also tested in an automated fashion. At ChemAxon we believe in doing as much automated testing as possible. Lastly, we perform manual smoke tests on each release candidate, which means we only need to check the installation and the very basic functionality by hand.

Test framework and the release process

In total we have about a hundred thousand tests for JCB alone, and about the same number for the products integrated into JChem. We only release a new version if all of them pass.
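Returning for a moment to the speed tests described above: the 10-percent baseline rule amounts to a simple tolerance check. This is an illustrative sketch, not the internal framework; the measured feature and baseline store are assumptions:

```python
import time

def measure(feature, *args) -> float:
    """Run a feature once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    feature(*args)
    return time.perf_counter() - start

def speed_test_passes(duration: float, baseline: float,
                      tolerance: float = 0.10) -> bool:
    """Fail if the duration deviates from the baseline by more than 10 percent."""
    return abs(duration - baseline) <= tolerance * baseline

# A run that is 5% slower passes; a 20% slowdown fails.
assert speed_test_passes(1.05, 1.00)
assert not speed_test_passes(1.20, 1.00)
```

Because the check is relative to a stored baseline rather than an absolute limit, keeping the hardware and environment constant over years is what makes the 10-percent threshold meaningful.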
The heart of the test framework is a process called integration. It is a complex process, so the following is a simplified description containing its main elements. The ChemAxon code base is separated into modules. Each module has a "current" (i.e. "work in progress") state containing the latest version; this is the state that the developers of the module work on. It may contain failing test cases, and its tests run automatically after every change to the code base. There is also an "integrated" state of each module. The code base marked as integrated contains no failing cases; it is always a healthy, fail-safe version, and it is the state from which the release candidate is created.

If developers make changes in a module, they use the integrated state of all the other modules and work on the current state of their own module. If the changes don't break tests in the working module, the developer can initiate a process to integrate it. The integration process creates an internal JChem version following the same rule: the current state of the working module is combined with the integrated state of all the others. It then executes the tests of all modules. If every test passes, the current state of the selected module becomes the integrated state. A failing test in any module shows that the changes in the selected module are incompatible with the other modules, and the integration fails. This process guarantees that all tests in all modules pass in the integrated state.

An integration attempt takes about 20 minutes. To keep the duration of integration low, some slower tests (e.g. speed tests) are excluded from the integration process and checked before every release. We have been working on a second-level integration that performs these extra checks right after the integration.
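The promotion rule above (current state becomes integrated only when the combined build is green) can be modelled as a small simulation. The module names and the pass/fail flags are invented for illustration; in reality the test suites of all modules are re-run against the combined build:

```python
class Module:
    def __init__(self, name: str, current_tests_pass: bool = True):
        self.name = name
        self.current_tests_pass = current_tests_pass  # "work in progress" state
        self.integrated_tests_pass = True             # "integrated" is always green

def integrate(target: Module, all_modules: list) -> bool:
    """Combine target's current state with every other module's integrated
    state; promote current -> integrated only if the combination is green."""
    combined_green = target.current_tests_pass and all(
        m.integrated_tests_pass for m in all_modules if m is not target
    )
    if combined_green:
        # The current state of the target becomes the new integrated state.
        target.integrated_tests_pass = True
    return combined_green

modules = [Module("jchem-base"), Module("cartridge", current_tests_pass=False)]
assert integrate(modules[0], modules)      # healthy module integrates
assert not integrate(modules[1], modules)  # module with failing tests is rejected
```

The invariant this enforces is the one stated in the text: the integrated state of every module always passes all tests, so a release candidate cut from it starts from a known-green baseline.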
So the main steps of the release process are these: (1) a release candidate is made from the integrated version; (2) the tests that are not part of the integration are executed; (3) other products using JCB (e.g. IJC) are checked with the release candidate; (4) finally, we run the manual smoke tests. If any of these steps fail, the release is delayed until the bugs are fixed (which is why we can't always release on the same day of the week). When they all pass, we release the candidate. This means that for each and every frequent release, all tests that were used for the quarterly releases are executed. Nothing is missed in JCB.

Bug fixing and adding new features

When we receive a bug report, the first thing we do is try to reproduce the reported issue. If we succeed, the result of the reproduction is a new failing test case. We usually simplify the original case and make some other small changes (e.g. we replace confidential structures with common ones). Then we examine the production code to find out why the test case fails, and finally we fix it, which fixes the original bug as well. The test case will then pass, and its existence ensures that this case won't fail again later.

We follow a similar process when adding new features. In addition to writing the production code, we always add tests to verify our work. If a newly created test fails, we know the feature is not complete yet. If the new tests pass, we run the other tests of the same module to check whether we caused a regression. Finally, we try to integrate the module, and if it succeeds, the new feature becomes part of the next release candidate. These simple steps prevent us from releasing incomplete changes that could cause regressions or malfunctions. If a feature is not ready yet, it can't be integrated, and no one except the developers of the module can use the new code.

Comparison of seasonal and frequent releases

The JCB testing environment was briefly described above.
In essence, before every release all test cases must pass, so we never release versions with failing test cases. This is the same as before, when we had longer release cycles: nothing is missed in the frequent releases, and all former tests are still used (plus many more that we have added in the past 1.5 years).

One might assume that fewer releases a year allow longer hardening of each release and thus ensure higher stability and reliability. However, even if every line of code is tested, it is theoretically impossible to test all possible inputs and combinations. No wonder that, despite our efforts, we received bug reports right after each quarterly release. These error reports (both internal and external) extended the duration of the release process; sometimes the hardening phase took 1-4 months after a 3-4 month development period. At the end of the day our clients were left waiting empty-handed for the new version, and we still could not significantly improve the stability of our products. It also meant that either (1) the long hardening phase of version 'v1' overlapped with the development of the next version 'v2', forcing developers to multi-task, which not only lengthened the delivery of both versions but also increased the risk of human mistakes; or (2) when one team was ready but others were still fixing something for release 'v1', they could not start the new development phase and had to wait for the others to finish. Both situations were chaotic and made development slow.

Are the new versions under the weekly release system less safe, less hardened, or of lower quality? We can confidently say this is not the case. Our data show that we do not receive more error reports or more severe problems now than before. What our customers experienced as a long polishing, stabilizing period was in fact the repetition of automated test runs until no failures were found.
As we showed earlier, we simply execute our tests on smaller change sets more frequently, and the same stability is guaranteed by our system continuously. We can state that every product version under the new release process contains fewer bugs and is delivered sooner to clients (with the same number of features over the medium term). It is now much easier for all of us. From a developer's perspective: if I can't integrate my code, my changes caused a regression, and I can't add them to the release candidate until I fix the code. In that case, I have to check only a few days' worth of work, not weeks (or, in the past, sometimes months), to find the faulty code. We can focus more on development while the same tests guarantee the correctness of the release.

Cooperation with our clients

We still have many ideas, near and longer term, for enhancing our test system. One of the most promising improvements on our radar is cooperating with our clients on testing. No organization can test every case and every workflow of a complex piece of software. If we don't have a test for a case that our clients have in their workflows, we can cause a regression on their side without recognizing it. But if we do have that particular case in our suite, we surely won't release a version that accidentally changes something important in our clients' production environment. It's that simple. So if we unite our efforts and include our clients' test cases in our release process, we can identify and fix regressions before they sneak into a release and thus into mission-critical production systems. Even when changes don't cause regressions, there may be small changes or improvements that affect some of our users' workflows. We could discuss such changes as early as possible and modify the behavior before the release if needed. We believe this would be a leap forward that makes life easier on both sides.
It is promising to see that some of our clients are open to such ideas, and we have started collaborating on defining and implementing shared test systems. We hope you liked this article and we look forward to your questions. We welcome any ideas for improving quality, and we hope we can collaborate on improving our test system. Please contact the JCB team to share your opinion.