We all have taught to choose good names for methods in the code we write. There are various coding style guides which set some ruling around method naming conventions and best practices. Although we tend to follow those in most cases, many times we forget to think will somebody else would be able to completely understand what method really does, by reading method name. Or, will our future self be able to clearly know the purpose of the method by reading this method name in 3 months after we wrote it. This is a short story about learning the hard way that proper method naming does matter.
In one of our projects, there are many data imports/exports going on. We have many different data sources and different formats. Plus we are doing different data consolidations, handling duplicates, extracting meta data etc. The volume of data processed and transformed by import/export functionalities is raising very fast. Some time ago, transforming and saving a document within 100 ms was OK but it does not scale anymore. For example, we have some imports which run few hours. We know that the new import feed will have 10-20 times more data – and it does not make sense that a single import process last the whole day (and night). This was a reason we had to look a bit closer to our core logic and assess the performances and propose solutions how it could be improved.
One of the imports I was looking at, does not have that many documents to process but the transformation logic is very complex. In order to extract some meta data (contextual data) out of the source feed, there are various mappings we have to use. The processing of all documents took 68 minutes. After doing some measurements in the code and narrowing down where the most time is spent, I saw that there are couple of meta data fields, in order to “calculate” them, we had to check the mapping into some big tree structure. For each meta data, for each document, a method named TreeCache.getTreeCacheByImportId()
is called. When you review the code you would assume something like this: hey, there is a getter method in TreeCache
cache instance, which will return me an instance of a cache (Map) with keys represented by importId
s. Logical, right. But, actually, when you look at this method, it is creating the new cache Map (with the different key, in this case importId). So the new cache Map instance was being creating each time TreeCache.getTreeCacheByImportId()
is called.
If this method would be called something like createTreeCacheByImportId()
or generateTreeCacheByImportId()
, most probably one who wrote the import logic would understand that there is no point in recreating the cache map for each processed document. He/she would called it once and reuse it in the code, or maybe extend the TreeCache class so that new cache Map is created only once (and after original map is being updated). This method needed some tens of milliseconds to execute, but when you call it 100K times, the time adds up a lot.
After doing a simple refactoring, the import transformation logic completed within 3 minutes. To remind you, initially it ran for 68 minutes. I’ve also found some other places in the code which used this method in the similar way and did refactoring there, too, which improved the performances there as well.
So, even if we did not “officially” break our method naming conventions (although “get” suffix should be used for field getters… most of the time), not-so-optimal and potentially misleading method name made our code to ran 95% slower then it should. Proper method naming does matter!