## Learning from data basics: the Naive Bayes model

By
Kamil Ciemniewski

March 23, 2016

Have you ever wondered what is the machinery behind some of the algorithms for doing seemingly very intelligent tasks? How is it possible that the computer program can recognize faces in photos, turn an image into a text or even classify some emails as legitimate or as spam?

Today, I’d like to present one of the simplest models for performing classification tasks. The model enables extremely fast execution, making it very practical in many use cases. The example I’ll choose will enable us to extend the discussion about the most optimal approach to another blog post.

### The problem

Imagine that you’re working on an e-commerce store for your client. One of the requirements is to present the currently logged in user with a “promotion box” somewhere on the page. The goal is to maximize our chances of having the user put the product from the box into the basket. There’s one promotional box and a couple of different categories of products to choose the actual product from.

### Thinking about the solution—using probability theory

One of the obvious directions we may want to turn towards is to use probability theory. If we could collect the data about the user’s previous choices and his or her characteristics, we can use probability to select the product category best suited for the current user. We would then choose a product from this category that currently has an active promotion.

### Quick theory refresher for programmers

As we’ll be exploring the probability approaches using Ruby code, I’d like to very quickly walk you through some of the basic concepts we will be using from now on.

#### Random variables

The simplest probability scenario many of us are already accustomed with is the coin toss results distribution. Here we’re throwing the coin, noting whether we get heads or tails. In this experiment, we call “got heads” and “got tails” probability events. We can also shift the terminology a bit by calling them: two values of the “toss result” **random variable**.

So in this case we’d have a random variable—let’s call it **T** (for “toss”) that can take values of: “heads” or “tails”. We then define the probability distribution P(T) as a function from the random variable value to a real number between 0 and 1 inclusively on both sides. In real world the probability values after e. g 10000 tosses might look like the following:

```
+-------+---------------------+
| toss | value |
+-------+---------------------+
| heads | 0.49929999999999947 |
| tails | 0.500699999999998 |
+-------+---------------------+
```

These values are nearing 0.5 more and more with the greater number of tosses.

#### Factors and probability distributions

We’ve shown a simple probability distribution. To ease the comprehension of the Ruby code we’ll be working with, let me introduce the notion of the **factor**. We called the “table” from the last example a probability distribution. The table represented a function from a random variable’s value to a real number between [0, 1]. The **factor** is a generalization of that notion. It’s a function from the same domain, but returning any real number. We’ll explore the usability of this notion in some of our next articles.

The probability distribution is a factor that adds two constraints:

- its values are always in the range [0, 1] inclusively
- the sum of all it’s values is exactly 1

### Simple Ruby modeling of random variables and factors

We need to have some ways of computing probability distributions. Let’s define some simple tools we’ll be using in this blog series:

```
# Let's define a simple version of the random variable
# - one that will hold discrete values
class RandomVariable
attr_accessor :values, :name
def initialize(name, values)
@name = name
@values = values
end
end
# The following class really represents here a probability
# distribution. We'll adjust it in the next posts to make
# it match the definition of a "factor". We're naming it this
# way right now as every probability distribution is a factor
# too.
class Factor
attr_accessor :_table, :_count, :variables
def initialize(variables)
@_table = {}
@_count = 0.0
@variables = variables
initialize_table
end
# We're choosing to represent the factor / distribution
# here as a table with value combinations in one column
# and probability values in another. Technically, we're using
# Ruby's Hash. The following method builds the the initial hash
# with all the possible keys and values assigned to 0:
def initialize_table
variables_values = @variables.map do |var|
var.values.map do |val|
{ var.name.to_sym => val }
end.flatten
end # [ [ { name: value } ] ]
@_table = variables_values[1..(variables_values.count)].inject(variables_values.first) do |all_array, var_arrays|
all_array = all_array.map do |ob|
var_arrays.map do |var_val|
ob.merge var_val
end
end.flatten
all_array
end.inject({}) { |m, item| m[item] = 0; m }
end
# The following method adjusts the factor by adding information
# about observed combination of values. This in turn adjusts probability
# values for all the entries:
def observe!(observation)
if !@_table.has_key? observation
raise ArgumentError, "Doesn't fit the factor - #{@variables} for observation: #{observation}"
end
@_count += 1
@_table.keys.each do |key|
observed = key == observation
@_table[key] = (@_table[key] * (@_count == 0 ? 0 : (@_count - 1)) +
(observed ? 1 : 0)) /
(@_count == 0 ? 1 : @_count)
end
self
end
# Helper method for getting all the possible combinations
# of random variable assignments
def entries
@_table.each
end
# Helper method for testing purposes. Sums the values for the whole
# distribution - it should return 1 (close to 1 due to how computers
# handle floating point operations)
def sum
@_table.values.inject(:+)
end
# Returns a probability of a given combination happening
# in the experiment
def value_for(key)
if @_table[key].nil?
raise ArgumentError, "Doesn't fit the factor - #{@varables} for: #{key}"
end
@_table[key]
end
# Helper method for testing purposes. Returns a table object
# ready to be printed to stdout. It shows the whole distribution
# as a table with some columns being random variables values and
# the last one being the probability value
def table
rows = @_table.keys.map do |key|
key.values << @_table[key]
end
table = Terminal::Table.new rows: rows, headings: ( @variables.map(&:name) << "value" )
table.align_column(@variables.count, :right)
table
end
protected
def entries=(_entries)
_entries.each do |entry|
@_table[entry.keys.first] = entry.values.first
end
end
def count
@_count
end
def count=(_count)
@_count = _count
end
end
```

Notice that we’re using here the **terminal-table** gem as a helper for printing out the factors in an easy to grasp fashion. You’ll need the following requires:

```
require 'rubygems'
require 'terminal-table'
```

### The scenario setup

Let’s imagine that we have the following categories to choose from:

```
category = RandomVariable.new :category, [ :veggies, :snacks, :meat, :drinks, :beauty, :magazines ]
```

And the following user features on each request:

```
age = RandomVariable.new :age, [ :teens, :young_adults, :adults, :elders ]
sex = RandomVariable.new :sex, [ :male, :female ]
relation = RandomVariable.new :relation, [ :single, :in_relationship ]
location = RandomVariable.new :location, [ :us, :canada, :europe, :asia ]
```

Let’s define the data model that resembles logically the one we could have in our real e-commerce application:

```
class LineItem
attr_accessor :category
def initialize(category)
self.category = category
end
end
class Basket
attr_accessor :line_items
def initialize(line_items)
self.line_items = line_items
end
end
class User
attr_accessor :age, :sex, :relationship, :location, :baskets
def initialize(age, sex, relationship, location, baskets)
self.age = age
self.sex = sex
self.relationship = relationship
self.location = location
self.baskets = baskets
end
end
```

We want to utilize a user’s baskets in order to infer the most probable value for a category, given a set of user’s features. In our example, we can imagine that we’re offering authentication via Facebook. We can grab info about a user’s sex, location, age and whether she/he is in relationship or not. We want to find a category that’s being chosen the most by users with a given set of features.

As we don’t have any real data to play with, we’ll need a generator to create fake data of certain characteristics. Let’s first define a helper class with a method, that will allow us to choose a value out of a given list of options along with their weights:

```
class Generator
def self.pick(options)
items = options.inject([]) do |memo, keyval|
key, val = keyval
memo << Array.new(val, key)
memo
end.flatten
items.sample
end
end
```

With all the above we can define a random data generation model:

```
class Model
# Let's generate `num` users (1000 by default)
def self.generate(num = 1000)
num.times.to_a.map do |user_index|
gen_user
end
end
# Returns a user with randomly selected traits and baskets
def self.gen_user
age = gen_age
sex = gen_sex
rel = gen_rel(age)
loc = gen_loc
baskets = gen_baskets(age, sex)
User.new age, sex, rel, loc, baskets
end
# Randomly select a sex with 40% chance for getting a male
def self.gen_sex
Generator.pick male: 4, female: 6
end
# Randomly select an age with 50% chance for getting a teen
# (among other options and weights)
def self.gen_age
Generator.pick teens: 5, young_adults: 2, adults: 2, elders: 1
end
# Randomly select a relationship status.
# Depend the chance of getting a given option on the user's age
def self.gen_rel(age)
case age
when :teens then Generator.pick single: 7, in_relationship: 3
when :young_adults then Generator.pick single: 4, in_relationship: 6
else Generator.pick single: 8, in_relationship: 2
end
end
# Randomly select a location with 40% chance for getting a united states
# (among other options and weights)
def self.gen_loc
Generator.pick us: 4, canada: 3, europe: 1, asia: 2
end
# Randomly select 20 basket line items.
# Depend the chance of getting a given option on the user's age and sex
def self.gen_items(age, sex)
num = 20
num.times.to_a.map do |i|
if (age == :teens || age == :young_adults) && sex == :female
Generator.pick veggies: 1, snacks: 3, meat: 1, drinks: 1, beauty: 9, magazines: 6
elsif age == :teens && sex == :male
Generator.pick veggies: 1, snacks: 6, meat: 4, drinks: 1, beauty: 1, magazines: 4
elsif (age == :young_adults || age == :adults) && sex == :male
Generator.pick veggies: 1, snacks: 4, meat: 6, drinks: 6, beauty: 1, magazines: 1
elsif (age == :young_adults || age == :adults) && sex == :female
Generator.pick veggies: 4, snacks: 4, meat: 2, drinks: 1, beauty: 6, magazines: 3
elsif age == :elders && sex == :male
Generator.pick veggies: 6, snacks: 2, meat: 2, drinks: 2, beauty: 1, magazines: 1
elsif age == :elders && sex == :female
Generator.pick veggies: 8, snacks: 1, meat: 2, drinks: 1, beauty: 4, magazines: 1
else
Generator.pick veggies: 1, snacks: 1, meat: 1, drinks: 1, beauty: 1, magazines: 1
end
end.map do |cat|
LineItem.new cat
end
end
# Randomly select 5 baskets depending the traits of the basket on user
# age and sex
def self.gen_baskets(age, sex)
num = 5
num.times.to_a.map do |i|
Basket.new gen_items(age, sex)
end
end
end
```

### Where is the complexity?

The approach described above doesn’t seem that exciting or complex. Usually reading about probability theory applied in the field of machine learning requires going through quite a dense set of mathematical notions. The field is also being actively worked on by researchers. This implies a huge complexity—certainly not the simple definition of probability that we got used to in high school.

The problem becomes a bit more complex if you consider efficiency of computing the probabilities. In our example, the joined probability distribution—to fully describe the scenario—needs to specify probability values for 383 cases:

```
p(:veggies, :teens, :male, :single, :us) # one of 384 combinations
```

Given that the probability distributions have to sum up to 1, the last case can be fully inferred from the sum of all the others. This means that we need 6 * 4 * 2 * 2 * 4 - 1 = 383 parameters in the model: 6 categories, 4 age classes, 2 sexes, 2 relationship kinds and 4 locations. Imagine adding one additional, 4 valued feature (a season). This would grow our number of parameters to **1535**. And this is a very simple training example. We could have a model with close to 100 different features. The number of parameters would clearly be unmanageable even on the biggest servers we could put them in. This approach would also make it very painful to add additional features to the model.

### Very simple but powerful optimization: The Naive Bayes model

In this section I’m going to present you with an equation we’ll be working with when optimizing our example. I’m not going to explain the mathematics behind it as you can easily read about them on e. g. Wikipedia.

The approach is called the **Naive Bayes model**. It is being used e .g. in spam filters. It also has been used in the past in medical diagnosis field.

It allows us to present the full probability distribution as a product of factors:

```
p(cat, age, sex, rel, loc) == p(cat) * p(age | cat) * p(sex | cat) * p(rel | cat) * p(loc | cat)
```

Where e. g. p(age | cat) represents the probability of a user being a certain age given that this user selects cat products most frequently. This is called the “posterior probability”. The above equation states that we can simplify the distribution to be a product of some number of much more easily manageable factors.

The category from our example is often called a **class** and the rest of random variables in the distribution are often called **features**.

In our example, the number of parameters we’ll need to manage when presenting the distribution in this form drops to:

```
(6 - 1) + (6 * 4 - 1) + (6 * 2 - 1) + (6 * 2 - 1) + (6 * 4 - 1) == 73
```

That’s just around 19% of the original amount! Also, adding another variable (season) would only add 23 new parameters (compared to 1152 in the full distribution case).

The Naive Bayes model limits the number of parameters we have to manage but it comes with very strong assumptions about the variables involved: in our example, that the user features are conditionally independent given the resulting category. Later on I’ll show why this isn’t true in this case even though the results will still be quite okay.

### Implementing the Naive Bayes model

As we now have all the tools we need, let’s get back to the probability theory to figure out how best to model the Naive Bayes in terms of the Ruby blocks we now have.

The approach says that under the assumptions we discussed we can approximate the original distribution to be the product of factors:

```
p(cat, age, sex, rel, loc) = p(cat) * p(age | cat) * p(sex | cat) * p(rel | cat) * p(loc | cat)
```

Given the definition of the conditional probability we have that:

```
p(a | b) = p(a, b) / p(b)
```

Thus, we can express the approximation as:

```
p(cat, age, sex, rel, loc) = p(cat) * ( p(age, cat) / p(cat) ) * ( p(sex, cat) / p(cat) ) * ( p(rel, cat) / p(cat) ) * ( p(loc, cat) / p(cat) )
```

And then simplify it even further as:

```
p(cat, age, sex, rel, loc) = p(age, cat) * ( p(sex, cat) / p(cat) ) * ( p(rel, cat) / p(cat) ) * ( p(loc, cat) / p(cat) )
```

Let’s define all the factors we will need:

```
cat_dist = Factor.new [ category ]
age_cat_dist = Factor.new [ age, category ]
sex_cat_dist = Factor.new [ sex, category ]
rel_cat_dist = Factor.new [ relation, category ]
loc_cat_dist = Factor.new [ location, category ]
```

Also, we want a full distribution to compare the results:

```
full_dist = Factor.new [ category, age, sex, relation, location ]
```

Let’s generate 1000 random users and looping through them and their baskets - adjust probability distributions for combinations of product categories and user traits:

```
Model.generate(1000).each do |user|
user.baskets.each do |basket|
basket.line_items.each do |item|
cat_dist.observe! category: item.category
age_cat_dist.observe! age: user.age, category: item.category
sex_cat_dist.observe! sex: user.sex, category: item.category
rel_cat_dist.observe! relation: user.relationship, category: item.category
loc_cat_dist.observe! location: user.location, category: item.category
full_dist.observe! category: item.category, age: user.age, sex: user.sex,
relation: user.relationship, location: user.location
end
end
end
```

We can now print the distributions as tables to have an insight about the data:

```
[ cat_dist, age_cat_dist, sex_cat_dist, rel_cat_dist,
loc_cat_dist, full_dist ].each do |dist|
puts dist.table
# Let's print out the sum of all entries to ensure the
# algorithm works well:
puts dist.sum
puts "\n\n"
end
```

Which yields the following to the console (the full distribution is truncated due to its size):

```
+-----------+---------------------+
| category | value |
+-----------+---------------------+
| veggies | 0.10866 |
| snacks | 0.19830999999999863 |
| meat | 0.14769 |
| drinks | 0.10115999999999989 |
| beauty | 0.24632 |
| magazines | 0.19785999999999926 |
+-----------+---------------------+
0.9999999999999978
+--------------+-----------+----------------------+
| age | category | value |
+--------------+-----------+----------------------+
| teens | veggies | 0.02608000000000002 |
| teens | snacks | 0.11347999999999969 |
| teens | meat | 0.06282999999999944 |
| teens | drinks | 0.0263200000000002 |
| teens | beauty | 0.1390699999999995 |
| teens | magazines | 0.13322000000000103 |
| young_adults | veggies | 0.010250000000000023 |
| young_adults | snacks | 0.03676000000000003 |
| young_adults | meat | 0.03678000000000005 |
| young_adults | drinks | 0.03670000000000045 |
| young_adults | beauty | 0.05172999999999976 |
| young_adults | magazines | 0.035779999999999916 |
| adults | veggies | 0.026749999999999927 |
| adults | snacks | 0.03827999999999962 |
| adults | meat | 0.034600000000000505 |
| adults | drinks | 0.028190000000000038 |
| adults | beauty | 0.03892000000000036 |
| adults | magazines | 0.02225999999999998 |
| elders | veggies | 0.04558000000000066 |
| elders | snacks | 0.009790000000000047 |
| elders | meat | 0.013480000000000027 |
| elders | drinks | 0.009949999999999931 |
| elders | beauty | 0.016600000000000226 |
| elders | magazines | 0.006600000000000025 |
+--------------+-----------+----------------------+
1.0000000000000013
+--------+-----------+----------------------+
| sex | category | value |
+--------+-----------+----------------------+
| male | veggies | 0.03954000000000044 |
| male | snacks | 0.1132499999999996 |
| male | meat | 0.10851000000000031 |
| male | drinks | 0.073 |
| male | beauty | 0.023679999999999857 |
| male | magazines | 0.05901999999999993 |
| female | veggies | 0.06911999999999997 |
| female | snacks | 0.08506000000000069 |
| female | meat | 0.03918000000000006 |
| female | drinks | 0.02816000000000005 |
| female | beauty | 0.22264000000000062 |
| female | magazines | 0.13884000000000046 |
+--------+-----------+----------------------+
1.000000000000002
+-----------------+-----------+----------------------+
| relation | category | value |
+-----------------+-----------+----------------------+
| single | veggies | 0.07722000000000082 |
| single | snacks | 0.13090999999999794 |
| single | meat | 0.09317000000000061 |
| single | drinks | 0.059979999999999915 |
| single | beauty | 0.16317999999999971 |
| single | magazines | 0.13054000000000135 |
| in_relationship | veggies | 0.031440000000000336 |
| in_relationship | snacks | 0.06740000000000032 |
| in_relationship | meat | 0.054520000000000006 |
| in_relationship | drinks | 0.04118000000000009 |
| in_relationship | beauty | 0.08314000000000002 |
| in_relationship | magazines | 0.06732000000000182 |
+-----------------+-----------+----------------------+
1.000000000000003
+----------+-----------+----------------------+
| location | category | value |
+----------+-----------+----------------------+
| us | veggies | 0.04209000000000062 |
| us | snacks | 0.07534000000000109 |
| us | meat | 0.055059999999999984 |
| us | drinks | 0.03704000000000108 |
| us | beauty | 0.09879000000000099 |
| us | magazines | 0.07867999999999964 |
| canada | veggies | 0.027930000000000062 |
| canada | snacks | 0.05745999999999996 |
| canada | meat | 0.04288000000000003 |
| canada | drinks | 0.03078999999999948 |
| canada | beauty | 0.06397999999999997 |
| canada | magazines | 0.053959999999999675 |
| europe | veggies | 0.013110000000000132 |
| europe | snacks | 0.0223200000000001 |
| europe | meat | 0.01730000000000005 |
| europe | drinks | 0.011859999999999964 |
| europe | beauty | 0.025490000000000183 |
| europe | magazines | 0.020920000000000164 |
| asia | veggies | 0.02552999999999989 |
| asia | snacks | 0.04319000000000044 |
| asia | meat | 0.03244999999999966 |
| asia | drinks | 0.02147000000000005 |
| asia | beauty | 0.05805999999999953 |
| asia | magazines | 0.0442999999999999 |
+----------+-----------+----------------------+
1.0000000000000029
+-----------+--------------+--------+-----------------+----------+------------------------+
| category | age | sex | relation | location | value |
+-----------+--------------+--------+-----------------+----------+------------------------+
| veggies | teens | male | single | us | 0.0035299999999999936 |
| veggies | teens | male | single | canada | 0.0024500000000000073 |
| veggies | teens | male | single | europe | 0.0006999999999999944 |
| veggies | teens | male | single | asia | 0.0016699999999999899 |
| veggies | teens | male | in_relationship | us | 0.001340000000000006 |
| veggies | teens | male | in_relationship | canada | 0.0010099999999999775 |
| veggies | teens | male | in_relationship | europe | 0.0006499999999999989 |
| veggies | teens | male | in_relationship | asia | 0.000819999999999994 |
(... many rows ...)
| magazines | elders | male | in_relationship | asia | 0.00012000000000000163 |
| magazines | elders | female | single | us | 0.0007399999999999966 |
| magazines | elders | female | single | canada | 0.0007000000000000037 |
| magazines | elders | female | single | europe | 0.0003199999999999965 |
| magazines | elders | female | single | asia | 0.0005899999999999999 |
| magazines | elders | female | in_relationship | us | 0.0004899999999999885 |
| magazines | elders | female | in_relationship | canada | 0.00027000000000000114 |
| magazines | elders | female | in_relationship | europe | 0.00012000000000000014 |
| magazines | elders | female | in_relationship | asia | 0.00012000000000000014 |
+-----------+--------------+--------+-----------------+----------+------------------------+
1.0000000000000004
```

Let’s define a Proc for inferring categories based on user traits as evidence:

```
infer = -> (age, sex, rel, loc) do
# Let's map through the possible categories and the probability
# values the distibutions assign to them:
all = category.values.map do |cat|
pc = cat_dist.value_for category: cat
pac = age_cat_dist.value_for age: age, category: cat
psc = sex_cat_dist.value_for sex: sex, category: cat
prc = rel_cat_dist.value_for relation: rel, category: cat
plc = loc_cat_dist.value_for location: loc, category: cat
{ category: cat, value: (pac * psc/pc * prc/pc * plc/pc) }
end
# Let's do the same with the full distribution to be able to compare
# the results:
all_full = category.values.map do |cat|
val = full_dist.value_for category: cat, age: age, sex: sex,
relation: rel, location: loc
{ category: cat, value: val }
end
# Here's we're getting the most probable categories based on the
# Naive Bayes distribution approximation model and based on the full
# distribution:
win = all.max { |a, b| a[:value] <=> b[:value] }
win_full = all_full.max { |a, b| a[:value] <=> b[:value] }
puts "Best match for #{[ age, sex, rel, loc ]}:"
puts " #{win[:category]} => #{win[:value]}"
puts "Full pointed at:"
puts " #{win_full[:category]} => #{win_full[:value]}\n\n"
end
```

### The results

We’re ready now to use the model and see how well the Naive Bayes model performs in this particular scenario:

```
infer.call :teens, :male, :single, :us
infer.call :young_adults, :male, :single, :asia
infer.call :adults, :female, :in_relationship, :europe
infer.call :elders, :female, :in_relationship, :canada
```

This gave the following results on the console:

```
Best match for [:teens, :male, :single, :us]:
snacks => 0.016252573282200262
Full pointed at:
snacks => 0.01898999999999971
Best match for [:young_adults, :male, :single, :asia]:
meat => 0.0037455794492659757
Full pointed at:
meat => 0.0017000000000000016
Best match for [:adults, :female, :in_relationship, :europe]:
beauty => 0.0012287311061725868
Full pointed at:
beauty => 0.0003000000000000026
Best match for [:elders, :female, :in_relationship, :canada]:
veggies => 0.002156365730474441
Full pointed at:
veggies => 0.0013500000000000022
```

That’s quite impressive! Even though we’re using a simplified model to approximate the original distribution, the algorithm managed to infer the correct values in all cases. You can notice also that the results differ only by a couple of cases in 1000.

The approximation like that would certainly be very useful in a more complex e-commerce scenario, in the case where the number of evidence variables would be big enough to be unmanageable using the full distribution. There are use cases though, where a couple of errors in 1000 cases would be too many—the traditional example is medical diagnosis. There are also cases where the number of errors would be much greater just because the Naive Bayes assumption of conditional independence of variables is not always a fair an assumption. Is there a way to improve?

The Naive Bayes assumption says that the distribution factorizes the way we did it **only if the features are conditionally independent given the category**. The notion of **conditional independence** (apart from the formal mathematical definition) suggests that if some variables a and b are conditionally independent given c, then if we know the value of c then no additional information about b can alter our knowledge about a. In our example, knowing the category, let say :beauty doesn’t mean that e. g sex is independent from age. In real world examples, it’s often very hard to find a use case for Naive Bayes that would follow the assumption in all the cases.

There are alternative approaches that allow us to apply the assumptions that more rigidly follow the chosen data set. We will explore these in the next articles, building on top of what we saw here.

machine-learning optimization probability ruby

## Comments