Let $X$ and $Y$ be two parts of $E$, chosen at random and independently of
each other (absence of an a priori link between these two parts) and of the
same respective cardinals as $A$ and $B$.
Let $\overline{Y}$ and $\overline{B}$ be the respective complements of $Y$ and
$B$ in $E$, of the same cardinal $n_{\overline{b}}= n-n_b$. We will then say:

\definition $a \Rightarrow b$ is acceptable at confidence level
$1-\alpha$ if and only if
$$Pr[Card(X\cap \overline{Y})\leq card(A\cap \overline{B})]\leq \alpha$$

Assumptions:
\begin{itemize}
\item h1: the numbers of events occurring during disjoint time intervals are
  independent;
\item h2: the probability that an event occurs during a time interval
  $[t,~ t+T[$ depends only on $T$;
\item h3: two such events cannot occur simultaneously.
\end{itemize}

It is then demonstrated (for example in~\cite{Saporta}) that the
number of events occurring during a period of fixed duration $n$
follows a Poisson's law of parameter $c.n$, where $c$ is called the
rate of the occurrence process per unit of time.

Now, for each transaction assumed to be random, the event $[a=1]$ has as
probability the frequency $\frac{n_a}{n}$ and the event $[b=0]$ has as
probability the frequency $\frac{n_{\overline{b}}}{n}$; therefore, under the
hypothesis of the absence of an a priori link between $a$ and $b$
(independence), the probability of the joint event $[a=1~ and~ b=0]$ is
estimated by the frequency $\frac{n_a}{n} \cdot \frac{n_{\overline{b}}}{n}$.

We can then estimate the rate $c$ of this event by
$\frac{n_a}{n} \cdot \frac{n_{\overline{b}}}{n}$.

Thus, for a duration of time $n$, the occurrences of the event
$[a~ and~ not~b]$ follow a Poisson's law of parameter:
$$\lambda = \frac{n_a.n_{\overline{b}}}{n}$$

As a result, $Pr[Card(X\cap \overline{Y})= s]= e^{-\lambda}\frac{\lambda^s}{s!}$

Consequently, the probability that chance alone, under the assumption of the
absence of an a priori link between $a$ and $b$, leads to no more
counter-examples than those observed is:

$$Pr[Card(X\cap \overline{Y})\leq card(A\cap \overline{B})] =
\sum^{card(A\cap \overline{B})}_{s=0} e^{-\lambda}\frac{\lambda^s}{s!} $$

But other legitimate drawing processes lead to a binomial law, or even a
hypergeometric law (itself not semantically adapted to the situation because
of its symmetry). Under suitable convergence conditions, these two laws
finally reduce to the Poisson's law above (see Annex to this chapter).

If $n_{\overline{b}}\neq 0$, we center and reduce this Poisson variable
into the variable:

$$Q(a,\overline{b})= \frac{Card(X \cap \overline{Y}) - \frac{n_a.n_{\overline{b}}}{n}}{\sqrt{\frac{n_a.n_{\overline{b}}}{n}}} $$

In the experimental realization, the observed value of
$Q(a,\overline{b})$ is $q(a,\overline{b})$.
It estimates a gap between the contingency $card(A\cap \overline{B})$ and the
value it would have taken if there had been independence between $a$ and $b$.

\definition $$q(a,\overline{b}) = \frac{n_{a \wedge \overline{b}}- \frac{n_a.n_{\overline{b}}}{n}}{\sqrt{\frac{n_a.n_{\overline{b}}}{n}}}$$
is called the implication index, the number used as an indicator of
the non-implication of $a$ to $b$.
In cases where the approximation is properly legitimized (for example
$\frac{n_a.n_{\overline{b}}}{n}\geq 4$), the variable
$Q(a,\overline{b})$ approximately follows the centered and reduced (standard)
normal distribution.
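For purely illustrative counts (not taken from any real corpus), suppose that
$n=400$, $n_a=200$, $n_b=300$ (hence $n_{\overline{b}}=100$) and
$n_{a \wedge \overline{b}}=40$. Then $\frac{n_a.n_{\overline{b}}}{n}=50\geq 4$,
so the normal approximation is legitimate, and
$$q(a,\overline{b}) = \frac{40-50}{\sqrt{50}} \approx -1.41$$
The negative value reflects the fact that fewer counter-examples are observed
than the $50$ expected under independence of $a$ and $b$.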
The intensity of implication, measuring the quality of
$a\Rightarrow b$, for $n_a\leq n_b$ and $n_b \neq n$, is then defined
from the index $q(a,\overline{b})$ by:

\definition
The implication intensity that measures the inductive quality of $a$
over $b$ is:
$$\varphi(a,b)=1-Pr[Q(a,\overline{b})\leq q(a,\overline{b})] =
\frac{1}{\sqrt{2 \pi}} \int^{\infty}_{ q(a,\overline{b})}
e^{-\frac{t^2}{2}} dt,~ if~ n_b \neq n$$
$$\varphi(a,b)=0,~ otherwise$$

As a result, the definition of statistical implication becomes:
\definition
Implication $a\Rightarrow b$ is admissible at confidence level
$1-\alpha $ if and only if:
$$\varphi(a,b)\geq 1-\alpha$$


It should be recalled that this modeling of quasi-implication measures the
astonishment of observing so few counter-examples in view of the number of
instances of the implication.
It is a measure of the inductive and informative quality of the
implication. Therefore, if the rule is trivial, as in the case where
$B$ is very large or coincides with $E$, this astonishment becomes
small.
We also demonstrate~\cite{Grasf} that this triviality results in a
very low or even zero intensity of implication: if, $n_a$ being fixed
and $A$ being included in $B$, $n_b$ tends towards $n$ ($B$ ``grows''
towards $E$), then $\varphi(a,b)$ tends towards $0$. We therefore
define, by ``continuity'': $\varphi(a,b) = 0$ if $n_b = n$. Similarly, if
$A\subset B$, $\varphi(a,b)$ may be less than $1$ in the case where
the inductive confidence, measured by statistical surprise, is
insufficient.

{\bf \remark Total correlation, partial correlation}


We take here the notion of correlation in a more general sense than the one
used in the domain that develops the linear correlation coefficient (a measure
of linear link) or the correlation ratio (a measure of functional link).
In our perspective, there is a total (or partial) correlation between
two variables $a$ and $b$ when the respective events they determine
occur (or almost occur) at the same time, as well as their opposites.
However, we know from numerical counter-examples that correlation and
implication do not reduce to one another: there can be correlation without
implication and vice versa (see~\cite{Grasf} and the example below).
If we compare the implication coefficient and the linear correlation
coefficient algebraically, it is clear that the two concepts do not
coincide and therefore do not provide the same
information\footnote{``More serious is the logical error that infers, from an
  observed correlation, the existence of a causality'', writes Albert
  Jacquard in~\cite{Jacquard}, p.159.}.

The quasi-implication, whose index $q(a,\overline{b})$ is not symmetric, does
not coincide with the correlation coefficient $\rho(a, b)$, which is symmetric
and reflects the relationship between the variables $a$ and $b$. Indeed, we
show~\cite{Grasf} that if $q(a,\overline{b}) \neq 0$ then
$$\frac{\rho(a,b)}{q(a,\overline{b})} = -\sqrt{\frac{n}{n_b n_{\overline{a}}}}$$
Even if correlation and implication generally point in the same direction, the
linear correlation, being symmetrical, does not make the orientation of the
relationship between the two variables transparent, whereas this orientation
is precisely the bias taken in the SIA.
From a statistical relationship given by the correlation, two opposing
empirical propositions can be deduced.
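These three indices are easy to compute on any $2\times 2$ contingency table.
The short Python fragment below is a purely illustrative sketch: the function
\texttt{indices} and its four cell-count arguments are introduced here only
for illustration and do not refer to any existing SIA software; it assumes
$0 < n_a < n$ and $0 < n_b < n$.

\begin{verbatim}
# Illustrative sketch: implication index q, intensity phi and
# linear correlation rho, from the four cells of a 2x2 table.
# Assumes 0 < card(A) < n and 0 < card(B) < n.
from math import sqrt, erf

def indices(n_ab, n_anb, n_nab, n_nanb):
    n = n_ab + n_anb + n_nab + n_nanb
    na, nb = n_ab + n_anb, n_ab + n_nab       # card(A), card(B)
    nnb = n - nb                              # card(not B)
    lam = na * nnb / n                        # expected counter-examples
    q = (n_anb - lam) / sqrt(lam)             # implication index
    phi = 0.5 * (1 - erf(q / sqrt(2)))        # intensity: 1 - P(N(0,1) <= q)
    rho = (n * n_ab - na * nb) / sqrt(na * (n - na) * nb * nnb)
    return q, phi, rho

# Cells of the second table below: a=1,b=1 ; a=1,b=0 ; a=0,b=1 ; a=0,b=0
print(indices(94, 6, 52, 48))
\end{verbatim}

Applied to the second contingency table of the example below, this sketch
gives $q \approx -4.04$, $\varphi \approx 1.00$ and $\rho \approx 0.47$, in
agreement with the values quoted after Table~\ref{chap2tab1}; one also checks
on these values the relation
$\rho(a,b) = -\sqrt{n/(n_b\,n_{\overline{a}})}\; q(a,\overline{b})$.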
The following dual numerical situation clearly illustrates this divergence:


\begin{table}[htp]
\center
\begin{tabular}{|l|c|c|c|}\hline
\diagbox[width=4em]{$a_1$}{$b_1$}&
 1 & 0 & margin\\ \hline
 1 & 96 & 4& 100 \\ \hline
 0 & 56 & 44& 100 \\ \hline
 margin & 152 & 48& 200 \\ \hline
\end{tabular} ~ ~ ~ ~ ~ ~ ~ \begin{tabular}{|l|c|c|c|}\hline
\diagbox[width=4em]{$a_2$}{$b_2$}&
 1 & 0 & margin\\ \hline
 1 & 94 & 6& 100 \\ \hline
 0 & 52 & 48& 100 \\ \hline
 margin & 146 & 54& 200 \\ \hline
\end{tabular}

\caption{Numerical example of the difference between implication and
  correlation}
\label{chap2tab1}
\end{table}

In Table~\ref{chap2tab1}, the following correlations and implication indices
can be computed:\\
Correlation $\rho(a_1,b_1)=0.468$, Implication
$q(a_1,\overline{b_1})=-4.082$\\
Correlation $\rho(a_2,b_2)=0.473$, Implication $q(a_2,\overline{b_2})=-4.041$


Thus, we observe that, on the one hand, $a_1$ and $b_1$ are less
correlated than $a_2$ and $b_2$ while, on the other hand, the
implication intensity of $a_1$ over $b_1$ is higher than that of $a_2$
over $b_2$, since $q(a_1,\overline{b_1}) < q(a_2,\overline{b_2})$.

Let us now examine, for fixed $n$ and $n_a$, how the index
$q(a,\overline{b})$ varies with $n_b$ and $n_{a \wedge \overline{b}}$.
Its partial derivatives are:

$$ \frac{\partial
  q}{\partial n_b} = \frac{n_a}{2n}\;
\frac{n_{a \wedge \overline{b}}+\frac{n_a n_{\overline{b}}}{n}}
     {\left(\frac{n_a n_{\overline{b}}}{n}\right)^{\frac{3}{2}}} > 0 $$


$$ \frac{\partial
  q}{\partial n_{a \wedge
  \overline{b}}} = \frac{1}{\sqrt{\frac{n_a n_{\overline{b}}}{n}}}
= \frac{1}{\sqrt{\frac{n_a (n-n_b)}{n}}} > 0 $$

Thus, if the increases $\Delta n_b$ and $\Delta n_{a \wedge
  \overline{b}}$ are positive, the increase of $q(a,\overline{b})$ is
also positive. This is interpreted as follows: if the number of
examples of $b$ and the number of counter-examples of the implication
increase, then the intensity of implication decreases, for $n$ and $n_a$
constant. In other words, this intensity of implication is maximum at the
observed values $n_b$ and $n_{a \wedge \overline{b}}$ and minimum at the
values $n_b+\Delta n_b$ and $n_{a \wedge \overline{b}}+\Delta n_{a \wedge
  \overline{b}}$.
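For instance, with the purely illustrative counts already used above ($n=400$,
$n_a=200$, $n_b=300$, $n_{a \wedge \overline{b}}=40$, so that
$q(a,\overline{b}) \approx -1.41$ and $\varphi(a,b) \approx 0.92$), the
increases $\Delta n_b = 20$ and $\Delta n_{a \wedge \overline{b}} = 4$ lead to
$n_{\overline{b}} = 80$, $\frac{n_a.n_{\overline{b}}}{n} = 40$ and
$$q(a,\overline{b}) = \frac{44-40}{\sqrt{40}} \approx +0.63,$$
so that the intensity of implication falls from about $0.92$ to about $0.26$.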