TU Wien:Statistik und Wahrscheinlichkeitstheorie UE (Levajkovic)/Übungen 2023W/HW01.4

Spam filter

One way to design a spam filter is to look at the phrases in an email. In particular, some phrases are more frequent in spam emails. Suppose that we have the following information: 30% of emails are spam; 1% of spam emails contain the phrase "filled with joy"; 0.2% of non-spam emails contain the phrase "filled with joy". Suppose that an email is checked and found to contain the phrase "filled with joy". What is the probability that the email is spam?


Proposed solution by Lessi

--Lessi 2024-02-07T13:04:11Z

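s <- 0.3     # P(S): probability that an email is spam
n <- 0.7     # P(N): probability that an email is not spam
js <- 0.01   # P(J|S): probability of "filled with joy" in a spam email
jn <- 0.002  # P(J|N): probability of "filled with joy" in a non-spam email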

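Let $S$ be the event that an email is spam, $N$ the event that it is not spam, and $J$ the event that it contains "filled with joy". By Bayes' theorem, the probability we are interested in is $P(S|J)={\frac {P(J|S)\cdot P(S)}{P(J)}}$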

$P(J)$ is obtained using the law of total probability: $P(J)=P(J|S)\cdot P(S)+P(J|N)\cdot P(N)$

j <- js * s + jn * n  # P(J): overall probability of "filled with joy"
j                     # print P(J)

If 30% of emails are spam and 1% of those contain "filled with joy", then $0.3\cdot 0.01=0.003=0.3\%$ of all emails are spam and contain that phrase. 70% of emails are not spam, and 0.2% of those contain the phrase. Therefore $0.7\cdot 0.002=0.0014=0.14\%$ of all emails are not spam and contain that phrase. Adding the two cases, $0.3\%+0.14\%=0.44\%$ of all emails contain "filled with joy".

Now we can compute $P(S|J)$ :

(js * s) / j
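
As a cross-check on the R result, the same quantity can be worked out by hand by substituting the given numbers into Bayes' theorem. This is only a restatement of the computation above, using $P(N)=1-P(S)=0.7$:

$$P(S|J)={\frac {0.01\cdot 0.3}{0.01\cdot 0.3+0.002\cdot 0.7}}={\frac {0.003}{0.0044}}={\frac {15}{22}}\approx 0.682$$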

The probability that an email containing "filled with joy" is spam is therefore $\approx 68\%$.
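
The whole calculation can also be wrapped into a small reusable function, which makes it easy to try other priors or phrase frequencies. The following is a minimal sketch, not part of the original solution: the function name `posterior_spam` and its argument names are hypothetical, and it assumes there are only two classes (spam and non-spam).

```r
# Hypothetical helper (an illustration, not from the original solution):
# Bayes' theorem for two complementary classes, spam and non-spam.
#   prior       P(S):   probability that an email is spam
#   p_phrase_s  P(J|S): probability of the phrase in a spam email
#   p_phrase_n  P(J|N): probability of the phrase in a non-spam email
posterior_spam <- function(prior, p_phrase_s, p_phrase_n) {
  # Law of total probability: P(J) = P(J|S) P(S) + P(J|N) P(N)
  p_phrase <- p_phrase_s * prior + p_phrase_n * (1 - prior)
  # Bayes' theorem: P(S|J) = P(J|S) P(S) / P(J)
  (p_phrase_s * prior) / p_phrase
}

# Values from the exercise
p_spam_given_joy <- posterior_spam(prior = 0.3, p_phrase_s = 0.01, p_phrase_n = 0.002)
round(p_spam_given_joy, 4)  # 0.6818
```

Calling the function with a different prior, for example `posterior_spam(0.5, 0.01, 0.002)`, shows how strongly the answer depends on the overall share of spam.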
