Opinions 5: September 2011

Wednesday, September 28, 2011

SAS: Proc Logistic shows all tied

Logistic regression is used mostly for predicting binary events. I use it very often in my professional life to predict various 0-1 outcomes. For carrying out logistic regression (and other statistical data processing jobs), I primarily use a popular statistical package called SAS. It has been around since the early years of statistics-based marketing, and has established itself as a de facto standard in the risk analytics domain.

When you run a logistic regression in SAS, it shows you a lot of interesting and important parameters in the output. Among the outputs, it will show you the parameter estimates using which you can make a prediction, various statistics involving the data entered, and the statistical confidence of the individual estimates as well as the overall model.

One of the summary reports, which tells you how well the model is doing, is found at the bottom of the main output -

Association of Predicted Probabilities and Observed Responses
Percent Concordant     85.6    Somers' D    0.714
Percent Discordant     14.2    Gamma        0.715
Percent Tied            0.2    Tau-a        0.279
Pairs                  7791    c            0.857

An analyst can look at all the above parameters to make a quick judgement on how well the model will perform when put to the test of predicting the outcome on a new set of data. The 'c' is basically the area under the ROC curve, and Somers' D corresponds to the gini coefficient under certain conditions (in this table, Somers' D = 2c - 1). The c should ideally vary between 0.5 and 1, with 0.5 meaning the model does no better than random guessing.

Therefore I was puzzled when I saw this in one of my outputs -

Association of Predicted Probabilities and Observed Responses
Percent Concordant          0.0    Somers' D    -.000
Percent Discordant          0.0    Gamma        -1.00
Percent Tied              100.0    Tau-a        -.000
Pairs                 183334788    c            0.500

The c is 0.5, and Somers' D is 0 - which means the model is pretty much useless. However, all the other tables in the output (not shown here) told me that the model's performance was good! Why would this happen? It happened a couple of times recently for different models - in some cases the c was not 0.5, but was still much lower than what I would expect from experience. One clue I had was that over time my team was trying to model target events which are more and more rare. I could trace the problem to 'Percent Concordant/Discordant/Tied' being measured incorrectly in this particular table.

To measure these concordance percentages, you need to look at all possible pairs in your pool which have opposite observations (assuming a binary outcome). Then you see in how many of these pairs the model ranks the observations the way they actually turned out (concordant), in how many the model ranks them the other way round and is therefore incorrect (discordant), and in how many the model score is exactly the same (tied). All the other values in this particular table are calculated using this kind of pairing. The table above tells us that all pairs of observations which have different outcomes are predicted to have exactly the same score - effectively translating into the model being completely useless.
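The pairing described above can be sketched in a few lines of Python. This is only an illustration, not SAS's implementation: the function name `association_stats` and the toy scores are made up, and the formulas assume c = (concordant + 0.5 × tied) / pairs and Somers' D = (concordant - discordant) / pairs, which agree with the numbers in the first table.

```python
from itertools import product

def association_stats(scores, outcomes):
    """Count concordant, discordant and tied pairs over all event/non-event pairs."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = discordant = tied = 0
    for e, n in product(events, nonevents):
        if e > n:
            concordant += 1   # the event got the higher score
        elif e < n:
            discordant += 1   # the non-event got the higher score
        else:
            tied += 1         # identical scores
    pairs = concordant + discordant + tied
    return {
        "pairs": pairs,
        "pct_concordant": 100.0 * concordant / pairs,
        "pct_discordant": 100.0 * discordant / pairs,
        "pct_tied": 100.0 * tied / pairs,
        "somers_d": (concordant - discordant) / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
    }

# Made-up toy data: 3 events and 4 non-events give 12 pairs.
scores = [0.9, 0.8, 0.4, 0.7, 0.4, 0.2, 0.1]
outcomes = [1, 1, 1, 0, 0, 0, 0]
stats = association_stats(scores, outcomes)
print(stats)
```

The double loop is quadratic in the number of observations, so for a pool as large as the 183 million pairs above a rank-based calculation would be the more practical choice.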

However, when I checked the data myself, I saw this was not the case. Why would SAS report incorrect statistics in this particular table? The answer was found after a lot of research - to optimize the calculations, SAS assumes two scores are identical if they have a difference of < 0.002. However, there is no mention of this in the output. The documentation was also difficult to find. I suspect they added it recently, along with an option that helps you prevent this from happening (more on this later).

This could lead to consequences of varying degree when the event being predicted is very rare. Since some or all of the pairs are considered tied when the scores are within 0.002, the error will lead to wrong concordance figures. In all cases the result will be an incorrect calculation of c or Somers' D, showing much lower predictive power (even zero in an extreme case, as above) than the model really has.
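A small sketch of how this kind of grouping can wipe out the ranking for a rare event. The binning rule used here (scores falling in the same interval of width 0.002 count as equal) is an assumption made for illustration and may not match SAS's internal rule exactly; the function name and the scores are made up.

```python
import math

def c_statistic(scores, outcomes, binwidth=0.0):
    """c statistic over event/non-event pairs; scores are binned first if binwidth > 0."""
    if binwidth > 0:
        # Assumed rule: scores in the same interval of length binwidth are treated as equal.
        scores = [math.floor(s / binwidth) for s in scores]
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = sum(e > n for e in events for n in nonevents)
    tied = sum(e == n for e in events for n in nonevents)
    return (concordant + 0.5 * tied) / (len(events) * len(nonevents))

# Made-up rare-event scores: every predicted probability is below 0.002.
outcomes = [1, 1, 0, 0, 0, 0]
scores = [0.0015, 0.0012, 0.0009, 0.0006, 0.0004, 0.0002]

exact = c_statistic(scores, outcomes)          # events always score higher
binned = c_statistic(scores, outcomes, 0.002)  # every score lands in the same bin
print(exact, binned)
```

With exact scores the two events outrank all four non-events, while after binning every pair is tied and c collapses to 0.5, which is the pattern seen in the second table.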

Ideally SAS should have done two things -
1. The SAS output should mention that the table is "Estimated Association of Predicted Probabilities" or something along similar lines.
2. The algorithm should vary the 0.002 threshold based on the overall rate of the event in the data.

What are the workarounds for the end users?

First, all analysts should be aware of it and not panic if they see strange reports in this particular table. They should then look at the other standard reports and tables they create to evaluate the model performance, disregarding this table completely.

Second, there are two approaches you can take to get the right values.

Approach 1 is to use the SAS option BINWIDTH=0 with the MODEL statement in PROC LOGISTIC. According to the SAS documentation it can take longer, which should be okay since accuracy in this case should win over time taken. The hitch is that the option is mentioned in the SAS documentation for version 9.22, and it does not work with SAS version 9.1.3, which my organization uses (which makes me feel that the whole explanation on this was added recently).

Approach 2 is to create your own code/macros to calculate and report the correct values of whichever statistics from this table you normally look at (for example, gini or ROC). It should not be difficult for a seasoned analyst, and it is something I would recommend as a good exercise for someone with medium experience. After all, who knows what else could come as a surprise if you use SAS's original procedures?

Review: Dragon Age (Origins)

This is the first time I played a party RPG. And it was awesome.

You arrive at Lothering

It took me a while to get into the interface. When it started (I chose a human mage) the movement controls (ADSW) felt awkward - different from other first person games, even though the scene was drawn in first person. It didn't take long to get used to it, though. But even before I got used to it, the story had already begun. I met the mouse and the bear in the Fade. I was already curious to see how (and if) I would escape the Fade (which is sort of the dreamworld in the game, also where spirits reside) to pass the test posed to me. I think there is something in the storytelling which gets you addicted to it from the start. I encountered the Fade demon and defeated him - I was surprised by who he was and how I could defeat him without using any offensive abilities. Immediately I got hooked.

Though it starts out smoothly (in the first encounter you don't even have to fight to win), the fighting in the game picks up pace surprisingly fast. I soon found out that the game is not like taking a walk in the park - at times, the combat needs to be carefully planned, making full use of the "pause and strategise" feature.

Meanwhile the story continued on. There are some rather surprising turns in the first part of the game, and after that you get a kind of free roaming ability - i.e. you can choose which area (and hence quest) to cover next. When you have covered all the primary quests, you then get to continue the main storyline to the end of the game. It takes a hell of a lot of time to reach that stage, and it never gets boring. That's because all the primary quests are crafted with lots of details, characters, locations, environments - and a very long story unique to them. Each one of them seems like a game on its own. I am pretty sure, looking at the vast variety and spectacular difference in all angles between the primary quests, that BioWare had different teams working on separate primary quests.

Despite the length of the game, it never feels dull. There are a lot of satisfying turns and twists to keep you fully engaged in the story. And in the course of the game you will meet some very memorable characters. Some of them will join your cause and fight with you - depending on your actions. Most of the companions also have their own side quest. You will need to unlock the side quests through the course of the game - mostly by gaining their appreciation towards you by giving them gifts, or doing things that they respect.

The game also has multiple endings depending on what you choose. Perhaps because my character died at the end of the game saving the world, it was a rather emotionally charged ending for my story. And after it ends, there is a treat - which took me by surprise. It has a slide show which details the impact of your actions on Ferelden for years to come. Since most of these are tied to specific choices you made over the course of the entire game, it is quite a feast.

The "Landsmeet"

The game offers many addons, of which two add new capabilities (along with a small quest each) which you can spend your money on. One is Warden's Keep, which adds a stash for storing your equipment - otherwise you will have to sell your stuff to make space for new items, as you cannot carry an unlimited amount of them. The second is The Stone Prisoner, which unlocks the most unique companion - a golem called Shale. He is very well integrated in the rest of the game's storyline - so once you unlock him, you will never feel that he is a companion that comes from an addon.

Let me pause here and mention something, rather an alert if you have not played the game yet so that you don't miss out on content. [Spoiler Alert] There is a character called Leliana in the game. I was aware of this since the game advertises an add on campaign called "Leliana's Song", though I wasn't sure exactly at which point she would join the game. After a lot of the story passed, I became suspicious and one day read more on her. And to my dismay I found that she is supposed to join in a scene that is only triggered if you visit a particular pub in Lothering, a village which gets obliterated early in the game. And once it is gone, there is no way to get her there in your group (unless you'd like to download dev-tools and use a crude hack to modify your save file, which I didn't want to do). So if you play the game, visit all houses in Lothering till you find Leliana, before Lothering becomes inaccessible!

Pros -
+ Engaging story
+ Your actions can change the course of the missions
+ Challenging fight system
+ Lots of spells and abilities, but smooth learning curve
+ Very long game which never became dull
+ Companion characters are good, you are bound to take a liking to some of them

Cons -
- Graphics are solid and nice, but not spectacular
- Sometimes certain important items are not clearly marked, which can cause you to miss certain experiences, as serious as a companion in your quest. It may be impossible to fix later on if you don't want to start from a previous save, which may not be a great option considering the length of the game.
- Party stash needs to be unlocked by buying Warden's Keep, this should have been a native feature in the game

Overall, this is an excellent single player game - definitely one of the best that I have played. You are bound to have hours of fun playing it - in fact I will be surprised if you don't lose a couple of days' sleep due to this game. Excellent and recommended.

Tuesday, September 27, 2011

All about "Information Value"

In statistical data mining, we sometimes need to determine which variables, out of a set, are best at capturing a desired behavior. For example, let's say you have a pool of customers for your credit card company, and you want to determine which of them are about to default (i.e. refuse to pay up after possibly making a huge expense). You then need to identify which of the attributes you have on the customer can potentially identify and alert you to such behavior. One of the popular ways in which analysts do this is by looking at something called 'Information Value'. In the context of data mining it is also sometimes referred to by the short form - InfoVal.

Definition
Information Value of $x$ for measuring $y$ is a number that attempts to quantify the predictive power of $x$ in capturing $y$. Let's assume the target variable $y$, which we are interested in being able to measure, is a 0-1 variable (or an indicator). Let's further assume that it indicates whether an account will go bad in the immediate future. Let's now sort the entire pool by $x$ and divide it into 10 equal parts to create the deciles. Now we are all set to define Information Value -
$$IV_x = \sum_{i=1}^{10}{\left(bad_i-good_i\right)\ln\frac{bad_i}{good_i}}$$
Here,
$i$ runs from 1 to 10 deciles in which we have divided the data,
$bad_i$ is the proportion of bad accounts captured in $i$th decile out of all bad accounts in the population,
$good_i$ similarly is proportion of good (i.e. not bad) accounts in $i$th decile.

Note that the variable $x$ whose effectiveness you want to measure enters the formula only through the sorting: it is the variable by which the entire data is sorted and divided into deciles.
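The definition above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `information_value`, the way bin boundaries are chosen, and the toy data are all assumptions, and ties in $x$ are not handled in any special way.

```python
import math

def information_value(x, y, n_bins=10):
    """IV of x for a 0/1 target y (1 = bad), using equal-count bins after sorting by x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    total_bad = sum(y)
    total_good = len(y) - total_bad
    iv = 0.0
    for b in range(n_bins):
        # Integer arithmetic keeps the bin sizes as equal as possible.
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        bads = sum(y[i] for i in order[start:stop])
        goods = (stop - start) - bads
        bad_share = bads / total_bad      # bad_i in the formula
        good_share = goods / total_good   # good_i in the formula
        # Fails if a bin has no bads (log of zero) or no goods (division by zero).
        iv += (bad_share - good_share) * math.log(bad_share / good_share)
    return iv

# Made-up data: 40 accounts, 4 per decile, bads concentrated at low x.
bads_per_decile = [3, 3, 3, 3, 2, 2, 1, 1, 1, 1]
y = []
for bads in bads_per_decile:
    y += [1] * bads + [0] * (4 - bads)
x = list(range(len(y)))

iv = information_value(x, y)
# A target that alternates bad/good puts 2 bads and 2 goods in every decile.
uninformative = information_value(x, [1, 0] * 20)
print(iv, uninformative)
```

In the toy data eight deciles each contribute $0.1\ln 3$ and two contribute nothing, so the total is $0.8\ln 3$; the alternating target gives exactly zero.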

How does it work?
Why does it work? You can check that if $x$ has no information on $y$ at all, then the $IV$ will turn out to be zero. That's because when you sort by $x$ and create deciles, the deciles are as good as random with respect to $y$. Hence, each decile should capture 10% of total bads and 10% of total goods. So $bad_i-good_i=0$ and $\ln\frac{bad_i}{good_i}=0$, and every term in the sum vanishes.

On the other hand, if after sorting by $x$ some decile has a higher or lower concentration of bads than goods, then that particular decile is different from the overall population, and $x$ lets us create it. The decile will contribute a positive value to the summation which defines $IV$ in the equation above, because $bad_i-good_i$ and $\ln\frac{bad_i}{good_i}$ always have the same sign. So it is clear that for a good $x$ variable, there will be more such deciles where the proportions of goods and bads differ - and by a larger margin as your $x$ is more effective in capturing $y$ - hence $IV$ indeed gives a measure of the predictive power of $x$.

Issues
However, there is something artificial in the definition of $IV$ above - it is the functional form. Indeed there can be many different ways to create the functional form that is being summed up.

To give some examples - $\sum_{i=1}^{10}{\left(bad_i-good_i\right)^2}$, $\sum_{i=1}^{10}{\left|bad_i-good_i\right|\frac{n_i}{n}}$, etc. all should be equally good candidates.

The last one in particular is interesting - because it has the proportion in the equation, making it a consistent measure. That is, if you decide to divide the data into 20 parts or 30 parts and so on, you will go closer and closer to a limit. Incidentally, the limit to which it converges is essentially gini/2 under some assumptions. For $IV$ on the other hand, this leads to a problem - you cannot divide the data indefinitely, as you may hit segments which have no good (or no bad) accounts at all - in which case taking the $\ln$ will bomb.
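The failure mode of $IV$ under finer segmentation can be seen on a small made-up example. The helper `iv_terms` below is hypothetical; it returns the individual terms of the sum and marks a term as `None` whenever a segment has no bads or no goods, since $\ln\frac{bad_i}{good_i}$ is undefined there.

```python
import math

def iv_terms(y_sorted, n_bins):
    """Per-bin IV terms for a 0/1 list already sorted by x; None where the term is undefined."""
    n = len(y_sorted)
    total_bad = sum(y_sorted)
    total_good = n - total_bad
    terms = []
    for b in range(n_bins):
        chunk = y_sorted[b * n // n_bins:(b + 1) * n // n_bins]
        bads = sum(chunk)
        goods = len(chunk) - bads
        if bads == 0 or goods == 0:
            terms.append(None)  # ln(bad_i / good_i) is undefined here
        else:
            bad_i, good_i = bads / total_bad, goods / total_good
            terms.append((bad_i - good_i) * math.log(bad_i / good_i))
    return terms

# Made-up data: 20 accounts sorted by x, bads (1) concentrated at the low end.
y_sorted = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

coarse = iv_terms(y_sorted, 2)   # 10 accounts per segment: every term is defined
fine = iv_terms(y_sorted, 10)    # 2 accounts per segment: some segments are all good or all bad
print(coarse)
print(fine)
```

With two segments both terms are defined, but with ten segments four of them contain only goods or only bads, so the sum cannot be computed without some ad hoc adjustment.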

Also for the same reason, it will be inaccurate and unfair to compare two variables when one of them has ties - which makes it impossible to unambiguously divide the population into 10 equal parts. (This is cautionary since it is a mistake that I have seen many analysts make, even when drawing Lorenz curves for example where the inequality of deciles can and should always be taken into account. For IV, to stress, it however cannot be taken into account.)

Origin
If $IV$ has these problems, how did the definition come about in the first place? Also should you use it?

I can only guess how such a metric came into being, and became popular despite having drawbacks. The concept of information actually arises in several other branches of mathematics, e.g. Information Theory, where you measure a very interesting quantity called entropy (which vaguely resembles $IV$), and Fisher's Information of a variable (which deals with how much information a statistic contains about a particular parameter of a distribution). I think someone was aware of these concepts, and wanted to create something similar for the corporate world to measure "information captured in a variable". The way things work in most places in the corporate world is that you show something works, and it becomes de facto unless someone else challenges it and shows something else works better. Somehow such challenges are rare in most places, which is why senior leadership needs to make a conscious effort to engage employees in thinking rather than following tradition. For $IV$, this was pretty much the case. We have already seen that it works, and after it was shown to work it became a standard - without proper math/statistical backing to explore pros and cons and to see if it can be bettered. And once something becomes a standard, it gains inertia and becomes legacy - which propels it through 'common wisdom'. The employees of that particular organization learned about it (they learned the formula, understood what it tries to measure, learned the SAS code or SQL queries required to compute it). When they migrated, the knowledge migrated with them.

Recommendation
The only reason you might want to use $IV$ is because of legacy - you may have worked at a bank which used it, and may have developed some reference points (i.e. rules of thumb like "if $IV$ is more than $abc$ then the variable is good, otherwise it is bad") for good or bad $IV$ values. Even if you have used $IV$ before, it should not take very long to develop experience on reference points for gini - you will be able to do that over the course of one or two projects. In general, though, I recommend not using $IV$, in favor of other more intuitive measures - ROC, gini, KS to name some. I personally prefer gini - it is easy to understand, is consistent, and is very robust in measuring the power of a variable.

To conclude, though $IV$ is still popular in some places in the banking world, I would recommend not using it, in favor of gini/ROC, which measure the same thing while being more intuitive and without the flaws of $IV$.

How real are you?

Are we real? Of course we are. How real are we, say when compared to characters of a story, movie, or a video game? You might think that's an odd question - reality is reality, fiction is fiction! It may be so, but let's pause and think for a moment.

What differentiates reality and fiction? We can imagine a huge moon overlooking the shores of an urban city surrounded by water - but that won't be real. Or will it? For an event to be real, you need real humans (supposing it involves humans) who have a mind governed by psychology - who think, feel and reason like we do. We need a real environment, made of billions of atoms, so that the repercussion of every action is calculated with the greatest scrutiny. Merely outlining the large scale effects of an event (as is done in fiction) is not enough to make it real. In short we should have a Universe where all the laws of physics are followed just as in ours, and we should be in it.

Before we go further into it, I would like to propose a thought experiment. Let's take a living human brain, and let's scan the configuration of the brain as precisely as possible. Fear not - due to recent advances in modern physiology, this can be done without harming the person at all. Let's say we scan every neuron and its state - that is, which other neurons it is connected to, and where it is currently sending any electric impulse. Then in principle, one can sit down and simulate the brain using pen and paper - by writing down its current state and slowly evolving it following the right biological course of the neurons. Using more sophisticated science, in fact, one may be able to code questions posed to the brain in terms of sound waves - interpreted into electric signals by the relevant portion of the brain. And once the computation has taken its course, following the same biological simulation one may carry out the necessary recoding and end up with the exact vibration of the larynx, from which sound waves can be deduced - pretty much the same sound waves which a real mouth will speak if a real brain is asked a real question. (You can find a variant of this experiment here in Wikipedia.)

However, note that the person who is carrying out these computations does not need to 'understand' the question that was posed in the form of sound waves at all. The scanned brain may even belong to Stephen Hawking - and the question a deep one in astrophysics - and the person who is simulating this may not even understand English - but at the end of the calculations he will still come out with exactly the same answer that a real Stephen Hawking would have given. Just as in the real world the consciousness and understanding that a brain possesses may be an emergent phenomenon, so can it be for a brain meticulously simulated with pen and paper.

So if the brain of a living person is scanned, and then tediously simulated by following mathematical equations with pen and paper, it will give rise to exactly the same answers that the real brain would give (provided no mistakes are made). This is just like how you can predict where Jupiter will be on this day next year - even though you of course cannot see next year's Jupiter in our reality at this moment.

Do you think then the simulated brain is 'real'? Let's see. Let's suppose we note down the configuration of a real brain for simulation, and then destroy the real brain. (Well that's a rather inhuman thing to do, but it is only a thought experiment.)

Will the thought, feeling, knowledge, experience be preserved as the brain is simulated? Yes.

Will the brain answer to any question asked exactly the same way the 'real' person would have answered? Yes.

In general will the brain react in exactly the same way in which the real brain will react, when posed with the same (simulated) conditions? Yes.

So by all accounts the simulated brain's actions will match with the real brain - it will even have a (simulated) consciousness corresponding to the real brain.

The only fact which makes it not real, is that you are not there - you cannot directly see or talk to it or interact with it. What if your brain scan is also taken and is simulated alongside the other brain? Your simulated brain will have no doubt that the other brain is real. For them, since there is no way to detect the real you, your world will be as fictitious - just a 'what if we are being simulated' hypothesis.

Is there any way for the simulated brains to figure out that their world is merely simulated - and is not 'real'? The answer seems to be no (unless we choose to give them hints - e.g. suddenly make a paper appear which has answers to their key scientific questions - which let's say we do not want to). They will see, feel, smell, touch their simulated world just as we do. They will have their own (simulated) thoughts, feelings, experiences, memories. They will experience (simulated) happiness, fear, joy, anger, peace just as we do. For them their (simulated) world is as real as we feel ours is.

It is a slightly disturbing conclusion that there is no way for them to detect they are being simulated, since it means that we could be living in a world which is entirely being simulated, and we would have no way to know about it. But it gets worse.
That's all my schedule permits me to write in the real (or simulated?) world where I live - we will continue the exploration next week. Update: Continued here.

Monday, September 26, 2011

Review of Race Driver GRID coming from NFSMW

I have had Need for Speed: Most Wanted on my computer for a long time, and I recently installed GRID. I will jot down a couple of observations that I can make when I compare the two.

+ GRID has two features which are sorely missing in NFSMW, namely the ability to replay your race, and damage to your car. Both features were there in some earlier versions of NFS, but somehow did not make it through to Most Wanted.

In GRID, the replay is shown through a computer controlled camera (you can change it to one of the standard ones if you want) and, like the rest of the game, is very polished in visual quality. It really makes you look good!

The damage model is good, though coming from NFS where there is no damage, it takes a little time to adapt to handling it. However it does not get too much in the way - as long as you avoid high speed crashes. Your car is immobilized immediately (and the race is over) if you hit the wall at 170mph. The dashboard shows which part of the car is damaged while you are driving. The only other effect I have seen is that if you badly damage one of your wheels, the car will be unbalanced and will have a tendency to steer to one side, which you will have to constantly counter throughout the remaining part of the race.

- There are no civilian vehicles and no cops. All the races are pure races, with closed tracks walled off with concrete blocks or stacks of car tires. The tires (as well as parts of your car) get scattered if you bump into them slowly enough not to total your car. They will then remain on the track till the end of the race.

- There is no nitro boost. This makes winning more a matter of control.

+/- The difficulty is notched up, partly because of the damage model mentioned above. Also unlike NFS, the competitors here are more serious at all levels - it is unlikely that you will find someone driving slower than you on a long road without turns. It will also take some time to get used to the controls and the cars, as the physics is slightly different from NFSMW. Apparently it's a little more 'realistic' or 'sim' like. However after a bit of practice, you will start winning some of the races. The increased difficulty provides a sense of accomplishment when you win.

+ The game has a feature called 'flashback' which you can use a limited number of times (maximum 5) depending on your difficulty. This lets you rewind time to correct your mistakes (much like the Prince of Persia series) and works beautifully in the game. It helps counter the difficulty and balances the game a bit more.

+ The graphics are just a treat for the eyes. The cars and the tracks are gorgeous, they make you want to play on just for the looks. I ended up 'test driving' the cars I own to adapt to the controls, and it was really nice. The menu system is very nice too, it's never static. You will feel that you are almost setting something into motion when you navigate through it.

+ The cars and the race types all have a different feel. It's like multiple racing games in one. You can drive 'Pro Muscle' through the city, or professional motor racing cars in 'Pro Tuned'. The cars for 'Drift' racing have a different weight setting which makes them more susceptible to drifting when you turn.