Saturday, October 13, 2012

Compare two sets of data


  Overview

  This tutorial will show you different approaches when required to identify if two data sets are the same or, if not, which data rows are different. Some example scenarios where you would need this comparison: source versions (not managed in dedicated systems like CVS, SVN), new versus old version of external files which should be imported into an application (as to spot the differences), data from two tables from different databases (not linked to each other).
  The approach is to think outside the box and use any kind of useful tool which helps with a quick solution. Usually, the solutions should be quick enough as not to interfere with the daily task activity. I propose for you three engines that I have used successfully with many occasions: Excel, an advanced text editor, a database engine.

  Compare using Excel

  Due to the fact that each row has multiple columns which should be compared, we will need a formula which will involve all columns for comparison. As a intermediary step, we create a new column with the concatenation of all column values for any row; this will create for us an useful column uniquely and exactly identifying the row data (this column I have called "derived full key"). Then, using Excel formula vlookup(), we search for the matching of the full key into the other data set ("search for match" column). Finally, using Excel formula isna,() and if(), we create proper text descriptions of our matching result ("improved version" column).

  Download the Excel file or check the following picture.
Click for full size picture

  Compare using an advanced text editor

  Many modern text editors may be accompanied by a complementary comparison tool. One of these is Notepad++ installed with the Compare plugin (which may be also installed from the Notepad++ built-in plugin manager menu).
  Once you open both text files, you can activate the compare feature from the menu, then simply check one by one the reported differences (extra rows, missing rows, different data in similar rows, data which is the same but appears in different places in the text files).
Click for full size picture

  Compare using a database engine

  For this example, I have used Oracle database and Oracle SQL Developer (freeware) client software. First step is to insert the data to be compared in the database tables. You may either use "insert" statements or Oracle SQL*Loader. The simpler import, when suitable, is to use insert statements from Excel files. Once you have the data open in Excel, you can create the insert statement syntax by using concatenation from the data columns.
  
Download the Excel file or check the following picture. SQL file is also available for download.
Click for full picture size
  After you create the tables and insert the data, you can compare the data sets using a single SQL query; the first part selects the rows from one data set which do not exist in the second data set, and the second sql part does the same, but for opposite data sets:
(
select * from data_set_1
minus
select * from data_set_2
)
union
(
select * from data_set_2
minus
select * from data_set_1
)

  Which method do you like best? Feel free to add other options from your experience.

Sunday, August 5, 2012

Find the error

  There are many times when one needs to fix existing errors in any given application. The following are some hints in optimizing the time spent in finding the exact code which triggered the bug.

 

  Always try to reproduce the error

  The theory is quite simple: if want to fix it, you must first see it in action, otherwise you may be spending time which will not get you to the solution as fast as when you actually manage to see the error for yourself.
  If, at first, it looks like it doesn't repeat itself, don't discourage and check every bit of system which could cause the difference: use the same application files or application server (or at least, a very good copy), the same database, test with the same client, transaction, date, amounts and whatsoever. You never know which is the information which triggers exactly that scenario which leads to the error, so your test should be as similar as possible with the initially reported error (if you need to, don't hesitate to ask for enough details about what happened, as these details will help you trigger the error on demand).

 

  Isolate the guilty code

  You have some idea about which application code is triggered, which database settings matter and so on, but you can't really pinpoint the problem. This is the time to confront your expectations about what should happen versus what really happens. Think about blocks which build together the application workflow which you are testing. Think about these blocks as a data chain, where each link has an expected output. For example, if you expect a data field to have a certain value relevant for future processing, then check which is the real value of that. Did everything go as planned? If yes, then the problem is up ahead, in another code block, if not, then take it a step back until you can tell which is the smallest bit which caused the problem.

 

  Debug or "print" the intermediary data

  When you still need more refinement in finding the guilty code at the exact line, there are two methods which give great results: debugging or printing data throughout the workflow.
  Debugging  is a great way for actually seeing the values of what happens behind the scenes, while processing. It isn't a hard way but it isn't an easy way either, if you are new to it. Basically, it involves connecting in real time to the application engine and actually pause it for getting a chance to study data values and intermediary points, as a method to get you to the problem core. More on debugging - in another blog post.
  The other way is an easy one and it is usually applicable to most systems, from application servers to database functions or similar. The trick is to put some lines in the middle of the intermediary code, which will print for you the values of different data used in the calculation phase. The print does not necessarily mean it is about a console window getting text to be displayed: it can be an application log file, a database temporary table or the script output window of a database tool which runs your code. Any place where you can write some data and then check it - is good enough for your goal.

 

  Confirm your findings by changing the behavior

  Once you think you have reached the bottom of your investigation and you have the problematic code at hand, you should do one last thing which will confirm that you are on the right track: change the application behavior. You don't need to think about the final solution if you are still a little bit uncertain about actually finding the problem. Confirm your findings by altering the application code. For example, write a dummy value, make it do something different, bypass it, do anything you like which, when you will rerun your test scenario, it will show you indeed that you have replaced the original error with the output of your intermediary actions.

  Last, but not least...

  Apply the final solution. Of course, this is custom for each of the cases so I can't really give you an opinion about it. At least, not under this blog post.
  What about you, which are your methods in tracking down the error? Fell free to write your experiences in the comment area!

Thursday, August 2, 2012

Keep safe your project's documents

  It is generally believed that the only valuable set of files for a given project are the source code files. In general, programmers will talk to you about keeping it safe in a version and source control system (VCS), usually named CVS or SVN repository. The truth is that the documents that are part of a project are important as well.

  Just think what would happen if you would lose the user guide, the installation guide, or the Client contract latest version. Indeed, the value of keeping this kind of documents in a source control system is less than keeping the source code files: while you may have text based comparison tools available for comparing two versions of a source code file, you usually don't have such tools available for comparing two versions of an Office Word document.

  Still, it is good practice to keep the Office document on the source control system, as you will make sure that you will not lose that content and it will also keep a file history on hand, with review notes, and with the possibility of manually comparing between different document versions.

  From technical point of view, when we talk about versioning the contents of a document, due to VCS way of working for you, you have to keep the same filename between different updates. That means that a document entitled "User guide" will remain named like that no matter if it is version 1.0 or version 2.0; otherwise said, it is wrong to name them "User guide v1.0" and "User guide v2.0", as this will lead to independent files on VCS, with no history whatsoever. Still, you would like to use some version information available for the Client or for your own internal needs. The satisfying solution is to use the version information somewhere inside the document content itself, like on the first or second page of a Word document. By doing so, you can provide information, for the people using the contents, about which version they are working with, while also keeping the same filename for VCS history handling features.