X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_chic.git/blobdiff_plain/c9f45698e9a535650b20a516d197dca86ed70d90..HEAD:/chapter2.tex?ds=sidebyside diff --git a/chapter2.tex b/chapter2.tex index ed3ec5b..3b35fe8 100644 --- a/chapter2.tex +++ b/chapter2.tex @@ -313,8 +313,8 @@ $a\Rightarrow b$, for $n_a\leq n_b$ and $nb \neq n$, is then defined from the index $q(a,\overline{b})$ by: \definition -The implication intensity that measures the inductive quality of a -over b is: +The implication intensity that measures the inductive quality of $a$ +over $b$ is: $$\varphi(a,b)=1-Pr[Q(a,\overline{b})\leq q(a,\overline{b})] = \frac{1}{\sqrt{2 \pi}} \int^{\infty}_{ q(a,\overline{b})} e^{-\frac{t^2}{2}} dt,~ if~ n_b \neq n$$ @@ -384,16 +384,16 @@ The following dual numerical situation clearly illustrates this: \center \begin{tabular}{|l|c|c|c|}\hline \diagbox[width=4em]{$a_1$}{$b_1$}& - 1 & 0 & marge\\ \hline + 1 & 0 & margin\\ \hline 1 & 96 & 4& 100 \\ \hline 0 & 50 & 50& 100 \\ \hline - marge & 146 & 54& 200 \\ \hline + margin & 146 & 54& 200 \\ \hline \end{tabular} ~ ~ ~ ~ ~ ~ ~ \begin{tabular}{|l|c|c|c|}\hline \diagbox[width=4em]{$a_2$}{$b_2$}& - 1 & 0 & marge\\ \hline + 1 & 0 & margin\\ \hline 1 & 94 & 6& 100 \\ \hline 0 & 52 & 48& 100 \\ \hline - marge & 146 & 54& 200 \\ \hline + margin & 146 & 54& 200 \\ \hline \end{tabular} \caption{Numeric example of difference between implication and @@ -907,7 +907,8 @@ is constant, independent of the rate of decrease of this number, of the variations of $n$ and $n_b$. This property seems not to satisfy intuition. The gradient of $c$ is expressed only in relation to $n_{a \wedge - \overline{b}}$ and $n_a$:(). {\bf CHECK FORMULA} + \overline{b}}$ and $n_a$: $\displaystyle \binom{ -\frac{1}{n_a}}{\frac{n_{a \wedge b}}{n_a^2}}$ + This may also appear to be a restriction on the role of parameters in expressing the sensitivity of the index. 
@@ -939,3 +940,509 @@ $$\frac{\partial}{\partial n_{a\wedge \overline{b}}}\left( \frac{\partial q}{\partial n_{a\wedge \overline{b}}} \right) $$ and the same for the other variables taken in pairs. However, we have, through the formulas (\ref{eq2.3}) and (\ref{eq2.4})
+
+$$ \frac{\partial}{\partial n_{a \wedge \overline{b}}} \left( \frac{\partial q}{\partial n_b} \right) = \frac{1}{2} \left( \frac{n_a}{n}\right)^{-\frac{1}{2}} \left( \frac{n_{\overline{b}}}{n}\right)^{-\frac{3}{2}} = \frac{\partial}{\partial n_b}\left(
+\frac{\partial q}{\partial n_{a\wedge \overline{b}}} \right)$$
+
+Thus, to the vector field $C$ = ($n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$) of $E$, the nature of which we will specify, corresponds a gradient field $G$ which is said to be derived from the {\bf potential} $q$.
+The gradient grad $q$ is therefore the vector that represents the spatial variation of the field intensity.
+It is directed from low field values to higher values. By following the gradient at each point, we follow the increase in the intensity of the implicative field in space and, in a way, the speed with which it changes as a result of the variation of one or more parameters.
+
+For example, if we fix 3 of the parameters $n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$ given by the realization of the pair ($a$, $b$), the gradient is a vector whose direction indicates the growth or decrease of $q$, and therefore the decrease or increase of $|q|$ and, consequently, that of $\varphi$, under the variations of the 4th parameter.
+We have indicated this above by interpreting formula (\ref{eq2.5}).
+
+
+\subsection{Level or equipotential lines}
+An equipotential (or level) line or surface in the $C$ field is a curve or surface of $E$ along which a variable point $M$ keeps the same value of the potential $q$ (e.g. isotherms on the globe or contour lines on an IGN map).
+
+The equation of this surface\footnote{In differential geometry, it seems that this surface is a (quasi) differentiable manifold with boundary, compact, homeomorphic to the closed box formed by the intervals of variation of the 4 parameters. Note that the point whose component $n_b$ is equal to $n$ (therefore $n_{\overline{b}} = 0$) is a singular point (``catastrophic'' in René Thom's sense) of the surface and $q$, the potential, is not differentiable at this point. Everywhere else, the surface is differentiable and the points are all regular. If time, for example, parametrizes the observations of the process of which ($n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$) is a realization, to each instant corresponds a morphological fiber of the process represented by such a surface in space-time.} is, of course:
+$$ q(a,\overline{b}) - \frac{n_{a \wedge \overline{b}}-
+ \frac{n_a.n_{\overline{b}}}{n}}{\sqrt{\frac{n_a.n_{\overline{b}}}{n}}} = 0$$
+
+
+Therefore, on such a curve, the scalar product $grad~ q. dM$ is zero.
+This expresses the orthogonality of the gradient to the tangent line or tangent hyperplane of the curve, i.e. of the equipotential line or surface.
+In a kinematic interpretation of our problem, the velocity of $M$ along its path on the equipotential surface is orthogonal to the gradient at $M$.
+
+As an illustration, for a potential $F$ depending on only 2 variables, Figure~\ref{chap2fig2} shows the direction of the gradient, orthogonal to the different equipotential lines: the potential $F$ does not vary along each of them, but passes from $F=7$ to $F= 10$ from one line to the next.
+
+\begin{figure}[htbp]
+ \centering
+\includegraphics[scale=1]{chap2fig2}
+ \caption{Illustration of potential of 2 variables}
+\label{chap2fig2} % Give a unique label
+\end{figure}
+
+It is possible in the case of the potential $q$, to build equipotential surfaces as above (two-dimensional for ease of representation).
+Understandably, the more intense the field is, the closer together the surfaces are. To obtain a given value of $q$, 3 variables are fixed, for example $n$, $n_a$, $n_b$, together with a value of $q$ compatible with the field constraints. Take, for instance, $n = 10^4$, $n_a = 1600 \leq n_b = 3600$ and $q = -2$, i.e. $|q| = 2$. We then find $n_{a \wedge \overline{b}}= 960$ using formula~(\ref{eq2.1}), since $\frac{n_a.n_{\overline{b}}}{n} = 1024$ and $1024 - 2\sqrt{1024} = 960$.
+But the points ($10^4$, $1600$, $5100$, $728$) and ($100$, $25$, $64$, $3$) also belong to this surface and to the same equipotential curve.
+The point ($10^4$, $1600$, $3600$, $928$) belongs to the equipotential curve $q=-3$. In fact, on each such surface, we obtain a kind of homeostasis of the implication intensity.
+
+The expression of $q$ as a function of the variables shows that it is convex.
+This property implies that the segment of points $t.M_1 + (1-t).M_2$, for $t \in [0,1]$, which connects two points $M_1$ and $M_2$ of the same equipotential line, is entirely contained in the convex region delimited by this line.
+Figure~\ref{chap2fig3} shows two neighbouring equipotential surfaces $\Sigma_1$ and $\Sigma_2$ of the implicative field, corresponding to two values $q_1$ and $q_2$ of the potential.
+At point $M_1$ the scalar field therefore takes the value $q_1$. $M_2$ is the intersection of the normal from $M_1$ with $\Sigma_2$. Given the direction of the normal vector $\vec{n}$, the difference $\delta = q_2 - q_1$, i.e. the variation of the field when we go from $\Sigma_1$ to $\Sigma_2$, is then equal to the opposite of the norm of the gradient of $q$ at $M_1$, that is $\frac{\partial q}{\partial n}$ if $n_a$, $n_b$ and $n_{a \wedge \overline{b}}$ are fixed.
+
+\begin{figure}[htbp]
+ \centering
+\includegraphics[scale=1]{chap2fig3}
+ \caption{Illustration of equipotential surfaces}
+\label{chap2fig3} % Give a unique label
+\end{figure}
+
+Thus, the space $E$ can be foliated by equipotential surfaces corresponding to successive values of $q$, relative to the cardinalities ($n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$) as they vary.
+This situation corresponds to the one envisaged in the SIA modeling.
+Fixing $n$, $n_a$ and $n_b$, we consider the random sets $X$ and $Y$ of the same cardinalities as $A$ ($n_a$) and $B$ ($n_b$), the cardinality of $X \cap \overline{Y}$ following a Poisson law or a binomial law, according to the choice of the model.
+The different gradient fields, real ``lines of force'', associated with them are orthogonal to the surfaces defined by the corresponding values of $Q$.
+This reminds us, in the theoretical framework of potential, of the premonitory metaphor of ``implicative flow'' that we expressed in~\cite{Grase} and that we will discuss again in Chapter 14 of the book.
+Behind this notion we can imagine a transport of information of variable intensity in a causal universe.
+We illustrate this metaphor with the study of the properties of the two-sheeted implicative cone (see §2.8).
+Moreover, and intuitively, the implication $a\Rightarrow b$ is of all the better quality as the equipotential surface $C$ of the contingency covers more of the random equipotential surfaces depending on the random variable.
+Let us recall the relationship that links the potential $q$ to the intensity:
+$$\varphi(a,b) =\frac{1}{\sqrt{2\pi}}\int_{q(a,\overline{b})}^{\infty}e^{-\frac{t^2}{2}} dt$$
+
+\noindent {\bf remark 1}\\
+It can be seen that the intensity is itself invariant on any equipotential surface.
+The surface portions generated by $q$ and by $\varphi$ are even in one-to-one correspondence.
+In intuitive terms, we can say that when one ``swells'' the other ``deflates''.\\
+
+\noindent {\bf remark 2}\\
+Let us note once again a particularity of the implication intensity.
+While the surfaces generated by the variations of the 4 parameters of the data are not invariant under a common dilation of the parameters, those associated with the indices cited in §2.4 are invariant and have the same undifferentiated geometric shape.
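+
+As a numerical check of the points quoted above for the equipotential surfaces (a sketch only, assuming that $q$ is given by the expression appearing in the equation of the surface, with $n_{\overline{b}} = n - n_b$), the value of $q$ can be recomputed directly. For $(n, n_a, n_b, n_{a \wedge \overline{b}}) = (10^4, 1600, 5100, 728)$ and $(100, 25, 64, 3)$:
+$$ \frac{728 - \frac{1600 \times 4900}{10^4}}{\sqrt{\frac{1600 \times 4900}{10^4}}} = \frac{728 - 784}{28} = -2, \qquad \frac{3 - \frac{25 \times 36}{100}}{\sqrt{\frac{25 \times 36}{100}}} = \frac{3 - 9}{3} = -2, $$
+and for $(10^4, 1600, 3600, 928)$:
+$$ \frac{928 - \frac{1600 \times 6400}{10^4}}{\sqrt{\frac{1600 \times 6400}{10^4}}} = \frac{928 - 1024}{32} = -3. $$
+The first two points, although very different in size, thus lie on the same equipotential surface $q=-2$, while the third lies on the surface $q=-3$.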
+
+\section{Implication-inclusion}
+\subsection{Foundational and problematic situation}
+Three reasons led us to improve the model formalized by the implication intensity:
+\begin{itemize}
+\item when the size of the samples processed, and in particular that of $E$, increases (to around a thousand or more), the intensity $\varphi(a,b)$ tends to be no longer sufficiently discriminating because its values can be very close to 1, while the inclusion whose quality it seeks to model is far from being satisfied (phenomenon reported in~\cite{Bodina} which deals with large student populations through international surveys);
+\item the previous quasi-implication model essentially uses the measure of the strength of the rule $a \Rightarrow b$.
+ However, taking into account the concomitant rule $\neg b \Rightarrow \neg a$ (the contrapositive of the implication) is useful or even essential to reinforce the assertion of a good quality of the quasi-implicative, possibly quasi-causal, relationship of $a$ over $b$\footnote{This phenomenon is reported by Y. Kodratoff in~\cite{Kodratoff}.}.
+ At the same time, it could make it possible to correct the difficulty mentioned above (if $A$ and $B$ are small compared to $E$, their complements will be large and vice versa);
+\item the overcoming of Hempel's paradox (see Appendix 3 of this chapter).
+ \end{itemize}
+
+\subsection{An inclusion index}
+
+The solution\footnote{J. Blanchard provides in~\cite{Blanchardb} an answer to this problem by measuring the ``equilibrium gap''.} we provide uses both the implication intensity and another index that reflects the asymmetry between situations $S_1 = (a \wedge b)$ and $S_1' = (a \wedge \neg b)$ (resp. $S_2 = (\neg a \wedge \neg b)$ and $S_2' = (a \wedge \neg b)$) in favour of the first named.
+The relative scarcity of instances that contradict the rule and its contrapositive is therefore fundamental.
+Moreover, the number of counter-examples $n_{a \wedge \overline{b}}$ to $a \Rightarrow b$ is also the number of counter-examples to its contrapositive.
+To account for the uncertainty associated with a possible bet on belonging to one of the two situations ($S_1$ or $S_1'$, resp. $S_2$ or $S_2'$), we therefore refer to Shannon's concept of entropy~\cite{Shannon}:
+$$H(b\mid a) = - \frac{n_{a\wedge b}}{n_a}\log_2 \frac{n_{a\wedge b}}{n_a} - \frac{n_{a\wedge \overline{b}}}{n_a}\log_2 \frac{n_{a\wedge \overline{b}}}{n_a}$$
+is the conditional entropy relating to the cells $(a \wedge b)$ and $(a \wedge \neg b)$ when $a$ is realized, and
+
+$$H(\overline{a}\mid \overline{b}) = - \frac{n_{a\wedge \overline{b}}}{n_{\overline{b}}}\log_2 \frac{n_{a\wedge \overline{b}}}{n_{\overline{b}}} - \frac{n_{\overline{a} \wedge \overline{b}}}{n_{\overline{b}}}\log_2 \frac{n_{\overline{a} \wedge \overline{b}}}{n_{\overline{b}}}$$
+
+is the conditional entropy relative to the cells $(\neg a \wedge \neg b)$ and $(a \wedge \neg b)$ when $\neg b$ is realized.
+
+These entropies, with values in $[0,1]$, should therefore be simultaneously low, and the asymmetries between situations $S_1$ and $S_1'$ (resp. $S_2$ and $S_2'$) simultaneously strong, if one wishes to have a good criterion for the inclusion of $A$ in $B$.
+Indeed, the entropies represent the average uncertainty of the experiments that consist in observing whether $b$ is realized (or $\neg a$ is realized) when $a$ (or $\neg b$) is observed. The complement to 1 of this uncertainty therefore represents the average information collected by performing these experiments. The greater this information, the stronger the guarantee of the quality of the implication and of its contrapositive. We must now adapt this entropic numerical criterion to the model expected in the different cardinality situations.
+For the model to have the expected meaning, it must satisfy, in our opinion, the following epistemological constraints:
+
+\begin{enumerate}
+\item It shall integrate the entropy values and, to contrast them, use for example the squares of these values.
+\item As this square varies from 0 to 1, and in order to denote the imbalance and therefore the inclusion, in opposition to entropy, the value retained will be the complement to 1 of this square as long as the number of counter-examples is less than half of the observations of $a$ (resp. $\neg b$).
+ Beyond these values, as the implications no longer have an inclusive meaning, the criterion will be assigned the value 0.
+\item In order to take into account the two pieces of information specific to $a\Rightarrow b$ and $\neg b \Rightarrow \neg a$, their product will account for the simultaneous quality of the values retained.
+The product has the property of vanishing as soon as one of its terms vanishes, i.e. as soon as this quality is lost.
+\item Finally, since the product has dimension 4 with respect to entropy, its fourth root will have the same dimension as entropy.
+\end{enumerate}
+
+Let $\alpha=\frac{n_a}{n}$ be the frequency of $a$ and $\overline{\beta}=\frac{n_{\overline{b}}}{n}$ be the frequency of $\neg b$.
+Let $t=\frac{n_{a \wedge \overline{b}}}{n}$ be the frequency of counter-examples. The two terms that express the respective qualities of the implication and of its contrapositive are:
+
+\begin{eqnarray*}
+ h_1(t) = H(b\mid a) = - (1-\frac{t}{\alpha}) \log_2 (1-\frac{t}{\alpha}) - \frac{t}{\alpha} \log_2 \frac{t}{\alpha} & \mbox{ if }t \in [0,\frac{\alpha}{2}[\\
+ h_1(t) = 1 & \mbox{ if }t \in [\frac{\alpha}{2},\alpha]\\
+ h_2(t)= H(\overline{a}\mid \overline{b}) = - (1-\frac{t}{\overline{\beta}}) \log_2 (1-\frac{t}{\overline{\beta}}) - \frac{t}{\overline{\beta}} \log_2 \frac{t}{\overline{\beta}} & \mbox{ if }t \in [0,\frac{\overline{\beta}}{2}[\\
+ h_2(t)= 1 & \mbox{ if }t \in [\frac{\overline{\beta}}{2},\overline{\beta}]
+\end{eqnarray*}
+Hence the definition for determining the entropic criterion:
+\definition: The inclusion index of $A$, support of $a$, in $B$, support of $b$, is the number:
+$$i(a,b) = \left[ (1-h_1^2(t)) (1-h_2^2(t)) \right]^{\frac{1}{4}}$$
+
+which integrates the information provided by the realization of a small number of counter-examples, on the one hand to the rule $a \Rightarrow b$ and, on the other hand, to the rule $\neg b \Rightarrow \neg a$.
+
+\subsection{The implication-inclusion index}
+
+The intensity of implication-inclusion (or entropic intensity), a new measure of inductive quality, is the number:
+
+$$\psi(a,b)= \left[ i(a,b).\varphi(a,b) \right]^{\frac{1}{2}}$$
+which integrates both statistical surprise and inclusive quality.
+
+The function $\psi$ of the variable $t$ admits a representation that has the shape indicated in Figure~\ref{chap2fig4}, for $n_a$ and $n_b$ fixed.
+Note in this figure the difference in the behaviour of the function with respect to the conditional probability $P(B\mid A)$, a fundamental index of other rule measurement models, for example in Agrawal~\cite{Agrawal}.
+In addition to its linear, and therefore not very nuanced nature, this probability leads to a measure that decreases too quickly from the first counter-examples and then resists too long when they become numerous.
+
+
+\begin{figure}[htbp]
+ \centering
+\includegraphics[scale=0.5]{chap2fig4.png}
+\caption{Example of implication-inclusion.}
+
+\label{chap2fig4}
+\end{figure}
+
+In Figure~\ref{chap2fig4}, it can be seen that this representation of the continuous function of $t$ reflects the expected properties of the inclusion criterion:
+\begin{itemize}
+\item ``slow reaction'' to the first counter-examples (noise resistance),
+\item ``acceleration'' of the rejection of inclusion close to the balance, i.e. $\frac{n_a}{2n}$,
+\item rejection beyond $\frac{n_a}{2n}$, which the implication intensity $\varphi(a,b)$ alone did not ensure.
+\end{itemize}
+
+\noindent Example 1\\
+\begin{tabular}{|c|c|c|c|}\hline
+ & $b$ & $\overline{b}$ & margin\\ \hline
+ $a$ & 200 & 400& 600 \\ \hline
+ $\overline{a}$ & 600 & 2800& 3400 \\ \hline
+ margin & 800 & 3200& 4000 \\ \hline
+\end{tabular}
+\\
+\\
+In Example 1, the implication intensity is $\varphi(a,b)=0.9999$ (with $q(a,\overline{b})=-3.65$).
+ Since the counter-examples exceed half of the occurrences of $a$ ($t \geq \frac{\alpha}{2}$), the entropic values of the experiment are $h_1=1$ and $h_2 \approx 0.544$.
+ The value of the moderator coefficient is therefore $i(a,b)=0$.
+ Hence, $\psi(a,b)=0$ whereas $P(B\mid A)=0.33$.
+Thus, the ``entropic'' functions ``moderate'' the implication intensity in this case where inclusion is poor.
+\\
+\\
+\noindent Example 2\\
+ \begin{tabular}{|c|c|c|c|}\hline
+ & $b$ & $\overline{b}$ & margin\\ \hline
+ $a$ & 400 & 200& 600 \\ \hline
+ $\overline{a}$ & 1000 & 2400& 3400 \\ \hline
+ margin & 1400 & 2600& 4000 \\ \hline
+ \end{tabular}
+ \\
+ \\
+ In Example 2, the implication intensity is 1 (for $q(a,\overline{b}) = - 9.62$).
+ The entropic values of the experiment are $h_1 = 0.918$ and $h_2 = 0.391$.
+ The value of the moderator coefficient is therefore $i(a,b) = 0.6035$.
+ As a result $\psi(a,b) = 0.777$ whereas $P(B \mid A) = 0.6666$.
+ \\
+ \\
+{\bf remark}
+ \noindent The correspondence between $\varphi(a,b)$ and $\psi(a,b)$ is not monotonic, as shown in the following example:
+
+\begin{tabular}{|c|c|c|c|}\hline
+ & $b$ & $\overline{b}$ & margin\\ \hline
+ $a$ & 40 & 20& 60 \\ \hline
+ $\overline{a}$ & 60 & 280& 340 \\ \hline
+ margin & 100 & 300& 400 \\ \hline
+\end{tabular}
+\\
+Thus, while $\varphi(a,b)$ decreases from Example 2 to this example, $i(a,b)$ increases, as does $\psi(a,b)$. The opposite situation is, however, the most frequent.
+Note that in both cases, the conditional probability does not change.
+\\
+\\
+{\bf remark}
+\noindent We refer to~\cite{Lencaa} for a very detailed comparative study of association indices for binary variables.
+In particular, the classical and entropic (inclusion) implication intensities presented in this article are compared with other indices from a ``user'' point of view.
+
+\section{Implication graph}
+\subsection{Problem statement}
+
+At the end of the calculations of the implication intensities in both the classical and entropic models, we have a $p \times p$ table that crosses the $p$ variables with each other, whatever their nature, and whose elements are the values of these implication intensities, numbers in the interval $[0,~1]$.
+It must be noted that the underlying structure of all these variables is far from explicit and remains largely hidden.
+The user remains blind to such a square table of size $p^2$.
+He cannot take in simultaneously the possible multiple sequences of rules that underlie the overall structure of all $p$ variables.
+In order to facilitate a clearer extraction of the rules and to examine their structure, we have associated with this table, for a given intensity threshold, a directed acyclic graph, weighted by the implication intensities, whose complexity of representation the user can control by setting the threshold on the implicative quality of the rules.
+Each arc in this graph represents a rule: if $n_a < n_b$, the arc $a \rightarrow b$ represents the rule $a \Rightarrow b$; if $n_a = n_b$, then the arc $a \leftrightarrow b$ represents the double rule $a \Leftrightarrow b$, in other words, the equivalence between these two variables.
+When the threshold of implication intensity varies, the number of arcs obviously varies in the opposite direction: for a threshold set at $0.95$, the number of arcs is less than or equal to the number of arcs of the graph at threshold $0.90$. We will discuss this further below.
+
+\subsection{Algorithm}
+
+The relation defined by statistical implication is reflexive and non-symmetric, but it is obviously not transitive, like induction and unlike deduction.
+However, we want it to model the partial order relationship between two variables (the successes in our initial example).
+We therefore adopt the following convention, which accepts a transitive closure only when the implicative relationship is better than neutrality.
+\\
+
+{\bf Proposition:} By convention, if $a \Rightarrow b$ and $b \Rightarrow c$, the transitive closure $a \Rightarrow c$ is accepted if and only if $\varphi(a,c) \geq 0.5$, i.e. if the implicative relationship of $a$ over $c$, which reflects a certain dependence between $a$ and $c$, is better than its refutation.
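+
+The threshold $0.5$ in this convention can be made explicit (a sketch, assuming the Gaussian expression of $\varphi$ recalled earlier in this chapter and the expression of $q$ used for the equipotential surfaces). Since the standard normal law is symmetric about $0$,
+$$ \varphi(a,c) = \frac{1}{\sqrt{2\pi}}\int_{q(a,\overline{c})}^{\infty} e^{-\frac{t^2}{2}} dt \geq 0.5 \iff q(a,\overline{c}) \leq 0 \iff n_{a \wedge \overline{c}} \leq \frac{n_a . n_{\overline{c}}}{n}. $$
+In other words, the closure $a \Rightarrow c$ is accepted exactly when the observed number of counter-examples does not exceed the number expected under independence. For instance, with the purely illustrative values $n = 100$, $n_a = 20$ and $n_{\overline{c}} = 40$ (hypothetical, not taken from the data of this chapter), the closure is accepted as soon as $n_{a \wedge \overline{c}} \leq 8$.
+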
+Note that for any pair of variables $(x;~ y)$, the arc $x \rightarrow y$ is weighted by the implication intensity $\varphi(x,y)$.
+\\
+Let us take a formal example by assuming that, between the 5 variables $a$, $b$, $c$, $d$, and $e$, the following rules hold at a threshold above $0.5$: $c \Rightarrow a$, $c \Rightarrow e$, $c \Rightarrow b$, $d \Rightarrow a$, $d \Rightarrow e$, $a \Rightarrow b$ and $a \Rightarrow e$.
+
+This set of numerical and graphical relationships can then be translated into the following table and graph:
+
+\begin{tabular}{|C{0.5cm}|c|c|c|c|c|}\hline
+\hspace{-0.5cm}\turn{45}{$\Rightarrow$} & $a$ & $b$ & $c$ & $d$ & $e$\\ \hline
+$a$ & & 0.97& & & 0.73 \\ \hline
+$b$ & & & & & \\ \hline
+ $c$ & 0.82 & 0.975& & & 0.82 \\ \hline
+ $d$ & 0.78 & & & & 0.92 \\ \hline
+ $e$ & & & & & \\ \hline
+\end{tabular}
+
+\begin{figure}[htbp]
+ \centering
+\includegraphics[scale=1]{chap2fig5.png}
+\caption{Implication graph corresponding to the previous example.}
+
+\label{chap2fig5}
+\end{figure}
+
+One of the difficulties related to the graphical representation is that the graph is not planar.
+The algorithm that builds it must take this into account and, in particular, must ``straighten'' the paths of the graph in order to keep it acceptably readable for the expert who will analyze it.
+
+The number of arcs in the graph can be reduced (or increased) if we raise (or lower) the acceptance threshold of the rules, i.e. the level of confidence in the selected rules.
+Correlatively, arcs can appear or disappear depending on the variations of the threshold.
+Let us recall that this graph is necessarily acyclic and that it is not a lattice since, for example, the variable $a$ does not imply the variable ($a$ or $\neg a$) whose support is $E$.
+A fortiori, it cannot be a Galois lattice.
+Options of the CHIC software for automatic data processing with SIA make it possible to delete variables at will, to move their image in the graph in order to untangle the arcs, or to focus on certain variables, called vertices of a kind of ``cone'' whose two ``sheets'' are made up respectively of the ``parent'' variables and the ``child'' variables of this vertex variable.
+We refer to the ends of the arcs as ``nodes''. A node in a given graph carries a single variable or a conjunction of variables.
+The passage from a node $S_1$ to a node $S_2$ is called a ``transition'' and is represented by an arc in the graph.
+The upper sheet of the cone whose vertex is the variable $a$, called the nodal variable, is made up of the ``parents'' of $a$, that is, in the ``causal'' sense, the causes of $a$; the lower sheet, on the other hand, is made up of the ``children'' of $a$ and therefore, still in the causal sense, the consequences or effects of $a$.
+The expert in the field analysed should pay particular attention to these configurations, which are rich in information.
+See, for example,~\cite{Lahanierc} and the two implicative cones below (i.e. Figures~\ref{chap2fig6} and \ref{chap2fig7}).
+
+\begin{figure}[htbp]
+ \centering
+\includegraphics[scale=0.75]{chap2fig6.png}
+\caption{Implicative cone.}
+
+\label{chap2fig6}
+\end{figure}
+
+\begin{figure}[htbp]
+ \centering
+\includegraphics[scale=0.75]{chap2fig7.png}
+\caption{Implicative cone centered on a variable.}
+
+\label{chap2fig7}
+\end{figure}
+
+
+\section{Reduction in the number of variables}
+\subsection{Motivation}
+
+
+As soon as the number of variables becomes excessive, most of the available techniques become impractical\footnote{This paragraph is strongly inspired by paper~\cite{Grask}.}.
+In particular, when an implicative analysis is carried out by calculating association rules~\cite{Agrawal}, the number of rules discovered undergoes a combinatorial explosion with the number of variables, and quickly becomes unmanageable for a decision-maker, especially when conjunctions of variables are requested.
+In this context, it is necessary to make a preliminary reduction in the number of variables.
+
+Thus, \cite{Ritschard} proposed an efficient heuristic to reduce both the number of rows and columns in a table, using an association measure as a quasi-optimal criterion for controlling the heuristic.
+However, to our knowledge, in the various other research studies, the reduction criteria do not take into account the type of situation that creates the need to group rows or columns, i.e. whether the analyst's problem and aim are the search for similarity, dissimilarity, implication, etc., between variables.
+
+Also, to the extent that there are very similar variables in the sense of statistical implication, it might be appropriate to substitute for these variables a single variable that would be their leader, representing an equivalence class of variables that are similar for the implicative purpose.
+We therefore propose, following the example of what is done to define the notion of quasi-implication, to define a notion of quasi-equivalence between variables, in order to build classes from which we will extract a leader.
+We will illustrate this with an example.
+Then, we will consider the possibility of using a genetic algorithm to optimize the choice of the representative for each quasi-equivalence class.
+
+\subsection{Definition of quasi-equivalence}
+
+Two binary variables $a$ and $b$ are logically equivalent for the SIA when the two quasi-implications $a \Rightarrow b$ and $b \Rightarrow a$ are simultaneously satisfied at a given threshold.
+We have developed criteria to assess the quality of a quasi-implication: one is the statistical surprise based on the likelihood of the link~\cite{Lerman}, the other is the entropic form of quasi-inclusion~\cite{Grash2} which is presented in this chapter (§7).
+
+According to the first criterion, we could say that two variables $a$ and $b$ are quasi-equivalent when the implication intensity $\varphi(a,b)$ of $a\Rightarrow b$ differs little from that of $b \Rightarrow a$. However, for large groups (several thousands), this criterion is no longer sufficiently discriminating to validate inclusion.
+
+According to the second criterion, an entropic measure of the imbalance between the numbers $n_{a \wedge b}$ (individuals who satisfy $a$ and $b$) and $n_{a \wedge \overline{b}} $ (individuals who satisfy $a$ and $\neg b$, counter-examples to the implication $a\Rightarrow b$) is used to indicate the quality of the implication $a\Rightarrow b$, on the one hand, and between the numbers $n_{a \wedge b}$ and $n_{\overline{a} \wedge b}$ to assess the quality of the reciprocal implication $b\Rightarrow a$, on the other.
+
+
+Here we will use a method comparable to that used in Chapter 3 to define the entropic implication index.
+
+Denoting by $n_a$ and $n_b$ the respective counts of $a$ and $b$, the imbalance of the rule $a\Rightarrow b$ is measured by a conditional entropy $K(b \mid a=1)$, and that of $b\Rightarrow a$ by $K(a \mid b=1)$ with:
+
+
+\begin{eqnarray*}
+ K(b\mid a=1) = - \left( 1- \frac{n_{a\wedge b}}{n_a}\right) \log_2 \left( 1- \frac{n_{a\wedge b}}{n_a}\right) - \frac{n_{a\wedge b}}{n_a}\log_2 \frac{n_{a\wedge b}}{n_a} & \quad \mbox{if} \quad \frac{n_{a \wedge b}}{n_a} > 0.5\\
+ K(b\mid a=1) = 1 & \quad \mbox{if} \quad \frac{n_{a \wedge b}}{n_a} \leq 0.5\\
+ K(a\mid b=1) = - \left( 1- \frac{n_{a\wedge b}}{n_b}\right) \log_2 \left( 1- \frac{n_{a\wedge b}}{n_b}\right) - \frac{n_{a\wedge b}}{n_b}\log_2 \frac{n_{a\wedge b}}{n_b} & \quad \mbox{if} \quad \frac{n_{a \wedge b}}{n_b} > 0.5\\
+ K(a\mid b=1) = 1 & \quad \mbox{if} \quad \frac{n_{a \wedge b}}{n_b} \leq 0.5
+\end{eqnarray*}
+
+These two entropies must be low enough for it to be possible to bet on $b$ (resp. $a$) with good certainty when $a$ (resp. $b$) is realized. Therefore their respective complements to 1 must be simultaneously strong.
+
+\begin{figure}[htbp]
+ \centering
+\includegraphics[scale=0.5]{chap2fig8.png}
+\caption{Illustration of the functions $K$ and $1-K^2$ on $[0; 1]$.}
+
+\label{chap2fig8}
+\end{figure}
+
+
+\definition A first entropic index of equivalence is given by:
+$$e(a,b) = \left (\left[ 1 - K^2(b \mid a = 1)\right ]\left[ 1 - K^2(a \mid b = 1) \right]\right)^{\frac{1}{4}}$$
+
+When this index takes values in the neighbourhood of $1$, it reflects a good quality of the double implication.
+In addition, in order to better take into account $a \wedge b$ (the examples), we integrate this parameter through a similarity index $s(a,b)$ of the variables, for example in the sense of I.C. Lerman~\cite{Lermana}.
+The quasi-equivalence index is then constructed by combining these two concepts.
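+
+To fix ideas, the index $e(a,b)$ defined above can be evaluated on purely illustrative counts (hypothetical values, not taken from the data of this book): $n_a = 100$, $n_b = 120$ and $n_{a \wedge b} = 90$, so that $\frac{n_{a \wedge b}}{n_a} = 0.9$ and $\frac{n_{a \wedge b}}{n_b} = 0.75$, both greater than $0.5$. Then
+$$ K(b \mid a=1) = -0.1 \log_2 0.1 - 0.9 \log_2 0.9 \approx 0.469, \qquad K(a \mid b=1) = -0.25 \log_2 0.25 - 0.75 \log_2 0.75 \approx 0.811, $$
+$$ e(a,b) \approx \left( \left[ 1 - 0.469^2 \right] \left[ 1 - 0.811^2 \right] \right)^{\frac{1}{4}} \approx (0.780 \times 0.342)^{\frac{1}{4}} \approx 0.72. $$
+Under these assumptions, the weaker of the two implications (here $b \Rightarrow a$) is the one that pulls the index down.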
+
+\definition A second entropic equivalence index is given by the formula
+
+$$\sigma(a,b)= \left [ e(a,b).s(a,b)\right ]^{\frac{1}{2}}$$
+
+From this point of view, we then set out the quasi-equivalence criterion that we use.
+
+\definition The pair of variables $\{a,b\}$ is said to be quasi-equivalent for the selected quality $\beta$ if $\sigma(a,b) \geq \beta$.
+For example, a value $\beta=0.95$ could be considered as indicating a good quasi-equivalence between $a$ and $b$.
+
+\subsection{Algorithm for the construction of quasi-equivalence classes}
+
+Let us assume a set $V = \{a,b,c,...\}$ of $v$ variables with a valued relationship $R$ induced by the measurement of quasi-equivalence on all pairs of $V$.
+We will assume the pairs of variables sorted in decreasing order of quasi-equivalence.
+If we have set the quality threshold for quasi-equivalence at $\beta$, only the first pairs $\{a,b\}$, those satisfying the inequality $\sigma(a,b)\ge \beta$, will be retained.
+In general, only a part $V'$, of cardinality $v'$, of the variables of $V$ will satisfy this inequality.
+If this set $V'$ is empty or too small, the user can lower his requirement to a smaller threshold value.
+The relationship being symmetrical, we will have at most pairs to study.
+As for $V-V'$, it contains only non-reducible variables.
+
+We propose to use the following greedy algorithm:
+\begin{enumerate}
+\item A first potential class $C_1^0= \{e,f\}$ is constituted such that $\sigma(e,f)$ is the largest of the $\beta$-equivalence values.
+ If possible, this class is extended to a new class $C_1$ by taking from $V'$ all the elements $x$ such that any pair of variables within this class has a quasi-equivalence greater than or equal to $\beta$;
+
+\item We continue with:
+ \begin{enumerate}
+ \item If $o$ and $k$, forming the pair $(o,k)$ immediately below $(e,f)$ according to the index $\sigma$, both belong to $C_1$, then we move to the pair immediately below $(o,k)$ and proceed as in 1.;
+ \item If neither $o$ nor $k$ belongs to $C_1$, we proceed as in 1. from the pair they constitute, which forms the basis of a new class;
+ \item If only one of $o$ and $k$ does not belong to $C_1$, this variable can either form a singleton class or belong to a future class, to which we will of course apply the same procedure.
+ \end{enumerate}
+ \end{enumerate}
+
+After a finite number of iterations, a partition of $V$ into $r$ classes of $\sigma$-equivalence is available: $\{C_1, C_2,..., C_r\}$.
+The quality of the reduction may be assessed by a gross or proportional index of $\beta^{\frac{r}{k}}$.
+However, we prefer the criterion defined below, which has the advantage of integrating the choice of representative.
+
+In addition, $k$ variables representing the $k$ classes of $\sigma$-equivalence could be selected on the basis of the following elementary criterion: the quality of connection of this variable with those of its class.
+However, this criterion does not optimize the reduction since the choice of representative is relatively arbitrary and may be a sign of triviality of the variable.
+
+\section{Conclusion}
+
+This overview of the development of statistical implicative analysis shows, if proof were needed, how a data processing theory is built step by step in response to problems presented by experts from various fields and in response to epistemological requirements that respect common sense and intuition.
+It is therefore more than a mere abstraction, since it is directly applicable to the situations that led to its genesis. +The extensions made to the types of data processed, to the modes of representation of their structures, to the relationships between subjects, their descriptors and variables are indeed the result of the experts' demanding questions. +Its respective functions as developer and analyzer seem to operate successfully in multiple application areas. + +We will have noticed that the theoretical basis is simple, which could be the reason for its fertility. +Even if the questioning of the initial theoretical choices is not apparent here, this genesis has not been free of tension between the expected answers and the ease of obtaining them; these tensions have led to revisions or even redesigns, often discussed within the research team. +In any case, this method of data analysis has made it possible and will, Régis hopes, still make it possible to highlight living structures thanks to the non-symmetrical approach on which it is based. + +Among the current or future work proposed to our team, one concerns an extension of the SIA to vector variables in response to problems in proteomics. +Another is more broadly concerned with the relationship between SIA and the treatment of fuzzy sets (see Chapter 7). +The function of the ``implication'' fuzzy logic operator will be illustrated by new applications. +In another line of work, we will revisit our method so that the SIA can handle missing values in data tables, alongside the ongoing work on reducing redundant rules in SIA. +Finally, it is clear that this work will be conducted interactively with applications and, in particular, with the contribution of SIA to the classification rule in the leaves of classification trees.
+ + + +\section{Annex 1: Two models of the classical implication intensity} + +\subsection{Binomial model} + +To examine the quality of the quasi-rule $a \Rightarrow b$, in the case where the variables are binary, is equivalent to measuring the quality of the inclusion of the subset of transactions satisfying $a$ in the subset of transactions satisfying $b$. +The counter-examples relating to inclusion are indeed the same as those relating to the implication expressed by: ``any transaction satisfying $a$ also satisfies $b$''. +From this overall perspective, as soon as $n_a \leq n_b$, the quality of the quasi-rule $a \Rightarrow b$ can only be semantically better than that of $b \Rightarrow a$. +We will therefore assume, from now on, that $n_a \leq n_b$ when studying $a \Rightarrow b$. In this case, the main population is finite and $Card~ E = n$. + +Binomial modelling was the first to be adopted chronologically (see~\cite{Grasb} chap. 2). +It was compared to other models in~\cite{Lermana}. +Let us briefly recall what the binomial model consists of. +With the adopted notations, $X$ and $Y$ are two random subsets, independently chosen from all the parts of $E$, with the same cardinals $n_a$ and $n_b$, respectively, as the subsets of the realizations of $a$ and $b$. +The observed value $n_{a \wedge \overline{b}}$ can be considered as the realization of a random variable $Card(X\cap \overline{Y})$ which represents the random number of counter-examples to the inclusion of $X$ in $Y$, counter-examples observed during $n$ successive independent draws. From there, $Card(X\cap \overline{Y})$ can be considered as a binomial variable of parameters $n$ and $\pi$ where $\pi$ is itself estimated by $p = \frac{n_a}{n}\frac{n_{\overline{b}}}{n}$.
Thus: + +$$Pr[Card(X\cap \overline{Y})= k]= C_n^k\left( \frac{n_an_{\overline{b}}}{n^2} \right)^k \left(1-\frac{n_a n_{\overline{b}}}{n^2} \right)^{n-k} $$ + +The estimated centered and reduced variable $Q(a,~\overline{b})$ then takes as a realization: + +$$q(a,\overline{b}) = \frac{n_{a \wedge \overline{b}}- + \frac{n_a.n_{\overline{b}}}{n}}{\sqrt{\frac{n_a.n_{\overline{b}}}{n}(1-\frac{n_a n_{\overline{b}}}{n^2})} }$$ + +As before, we obtain the estimated intensity of empirical implication: +$$\varphi(a,b)=1-Pr[Q(a,\overline{b})\leq q(a,\overline{b})] = 1 - \sum_{k=0}^{n_{a \wedge \overline{b}}} C_n^k\left (\frac{n_an_{\overline{b}}}{n^2}\right )^k\left (1-\frac{n_an_{\overline{b}}}{n^2}\right )^{n-k}$$ + + +The probability law of $Q(a,\overline{b})$ can be approximated by the standard normal (Laplace-Gauss) law $N(0,1)$. Generally, the intensity calculated in the Poisson model is more ``severe'' than the intensity derived from the binomial model in the sense that $\varphi(a,b)_{Poisson} \leq \varphi(a,b)_{Binomial}$. + +\remark We can note that the implication index is null if and only if the two variables $a$ and $b$ are independent. Indeed, we have +$$ q(a,\overline{b}) = \frac{n_{a \wedge \overline{b}}- + \frac{n_a.n_{\overline{b}}}{n}}{\sqrt{\frac{n_a.n_{\overline{b}}}{n}(1-\frac{n_a n_{\overline{b}}}{n^2})} } =0 \iff n_{a \wedge \overline{b}}- \frac{n_a.n_{\overline{b}}}{n}=0$$ + +$$q(a,\overline{b}) =0 \iff n_{a \wedge \overline{b}}=\frac{n_a.n_{\overline{b}}}{n}~ \mbox{or }~ q(a,\overline{b}) =0 \iff \frac{n_{a \wedge \overline{b}}}{n}=\frac{n_a}{n}\frac{n_{\overline{b}}}{n}$$ + +This last relationship reflects the property of statistical independence. + +\subsection{Hypergeometric model} +Let us briefly recall the third model proposed in \cite{Lermana} and \cite{Grasd}. We repeat the same approach: $A$ and $B$ are the parts of $E$ representing the individuals satisfying $a$ and $b$ respectively and whose cardinals are $card (A)=n_a$ and $card (B)=n_b$.
Then let us consider two independent random parts $X$ and $Y$ such that $card (X)=n_a$ and $card (Y)=n_b$. The random variable $Card(A \cap \overline{Y})$ represents the random number of elements of $E$ which, being in $A$, are not in $Y$. This variable follows a hypergeometric law and we have for all $k \leq n_a$: + +$$Pr[Card(A \cap \overline{Y})=k]=\frac{C_{n_a}^k C_{n-n_a}^{n-n_b-k}}{C_n^{n-n_b}} =\frac{n_a!n_{\overline{a}}! n_b!n_{\overline{b}}! }{k!n!(n_a-k)!(n_{\overline{b}}-k)! (n_b-n_a+k)! }$$ + +$$=\frac{C_{n-n_b}^k C_{n_b}^{n_a-k}}{C_n^{n_a}} = Pr[Card(X \cap \overline{B})=k]$$ + +This shows, by exchanging the roles of $a$ and $b$, that the empirical implication index $Q(a,\overline{b})$ corresponding to the quasi-rule $a \Rightarrow b$ is the same as the one corresponding to the reciprocal, i.e. $Q(b,\overline{a})$. We thus obtain the same intensity for the quasi-rule $a \Rightarrow b$ and for the reciprocal quasi-rule $b \Rightarrow a$. + +\subsection{Choice of models to evaluate the intensity of implication} +While binomial modeling remains compatible with the semantics of implication, a non-symmetric binary relationship, the same cannot be said for hypergeometric modeling, since it does not distinguish the quality of a quasi-rule from that of its reciprocal and is therefore of little pragmatic value. +Consequently, we will only retain the Poisson model and the binomial model as models adapted to the semantics of implication between binary variables. + + +The legitimate coexistence of three different models of our problem of measuring the quality of a quasi-rule is not inconsistent: it stems from the way the drawing is modelled, either transactions drawn one by one (Poisson's law) or sets of transactions drawn as a whole (binomial law or hypergeometric law). In addition, we know that when the total number of transactions becomes very large, all three models converge on the same Gaussian model.
In~\cite{Lallich}, we find, as a generalization, a parameterization of the three indices obtained by these models, which allows us to evaluate the interest of the rules obtained by comparing them to a given threshold. + +\section{Annex 2: Modelling of implication integrating confidence and surprise} + +Recently, in~\cite{Grasab}, we have combined two statistical concepts that we believe are intrinsic to the implicative relationship between two variables $a$ and $b$: +\begin{itemize} +\item on the one hand, the intensity of implication $\varphi(a,b)$, measuring surprise or astonishment at the low number of counter-examples to the implication between these variables; +\item on the other hand, the confidence $C(b \mid a)$, measuring the conditional frequency of $b$ given $a$, which is involved in the majority of the other implication indices as we have seen in §2.5.4. +\end{itemize} + +So, we claim, paraphrasing G. Vergnaud~\cite{Vergnaudd} speaking about aesthetics, that there is no data analysis without {\bf confidence} (psychological level). But there is also no data analysis without {\bf surprise}\footnote{This is also what René Thom says in~\cite{Thoma} p. 130 (translated into English): ``...the problem is not to describe reality, the problem is much more to identify in it what makes sense to us, what is surprising in all the facts. If the facts do not surprise us, they do not bring any new element to the understanding of the universe: we might as well ignore them'' and further on: ``... which is not possible if we do not already have a theory''.} (statistical level), nor without {\bf scale correction} (pragmatic level). The two concepts (confidence and intensity of implication) therefore respond to relatively distinct but not contradictory principles: confidence is based on the subordination of variable $b$ to variable $a$ while intensity of implication is based on counter-examples to the subordination of $b$ to $a$.
+ +It is demonstrated in~\cite{Grasab} that, for any $\alpha$, the ratio + +$$ \frac{Pr[C(b\mid a)\geq \alpha]}{Pr[\varphi(a,b)\geq \alpha]}~\mbox{is close to}~ \frac{Pr[C(b \mid a) \geq \alpha]}{1-\alpha}$$ + + +Under these conditions, this ratio is a good indicator for comparing confidence and intensity of implication: when it is greater than 1, confidence is the better of the two; when it is less than 1, intensity is stronger. Further research could be based on this indicator. + +Finally, as we did for entropic intensity, we take the contrapositive into account by combining two conditional frequencies: that of $b$ given $a$, i.e. $C_1(a,b)$ (for the direct implication $a \Rightarrow b$), and that of $\neg a$ given $\neg b$, i.e. $C_2(a,b)$ (for the contrapositive implication $\neg b \Rightarrow \neg a$). We choose the following formula to define a new measure of implication that we call {\bf implifiance} in French (implication + confidence): + +$$ \phi(a,b)=\varphi(a,b).\left [ C_1(a,b).C_2(a,b) \right ]^{\frac{1}{4}}$$ + +For example, if we extract a rule whose implifiance is equal to $0.95$, its intensity of implication is at least equal to $0.95$ and each of the $C_1$ and $C_2$ confidences is at least equal to $0.81$. If the implifiance is equal to $0.90$, the respective minima are $0.90$ and $0.66$, which preserves the plausibility of the rule. + +The following two figures show the respective variations in intensity of implication, entropic intensity and implifiance on the vertical axis as a function of the number of counter-examples in the cases $n=100$ and $n=1000$ (Figures~\ref{chap2fig9} and~\ref{chap2fig10} respectively).
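+
+The minima quoted above for the implifiance can be checked directly from the definition of $\phi$. The short verification below is a sketch that assumes only that $\varphi(a,b)$, $C_1(a,b)$ and $C_2(a,b)$ all lie in $[0,1]$. Since $\left[ C_1.C_2 \right]^{\frac{1}{4}} \leq 1$ and $\varphi \leq 1$, we have
+$$\varphi(a,b) \geq \phi(a,b) \quad \mbox{and} \quad C_1(a,b).C_2(a,b) \geq \phi(a,b)^4 ,$$
+and, because each confidence is itself at most $1$, $C_1(a,b) \geq \phi(a,b)^4$ and $C_2(a,b) \geq \phi(a,b)^4$. Numerically,
+$$0.95^4 \approx 0.8145 \quad \mbox{and} \quad 0.90^4 = 0.6561 ,$$
+which gives the rounded values $0.81$ and $0.66$.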
+ +\begin{figure}[htbp] + \centering +\includegraphics[scale=1.3]{chap2fig9.png} +\caption{Example of Implifiance with $n=100$.} + +\label{chap2fig9} +\end{figure} + +\begin{figure}[htbp] + \centering +\includegraphics[scale=1.3]{chap2fig10.png} +\caption{Example of Implifiance with $n=1000$.} + +\label{chap2fig10} +\end{figure} + + +\section{Annex 3: SIA and Hempel's paradox} + +If we look at the SIA from the point of view of Knowledge Extraction, we find the main objective of the inductive establishment of rules and quasi-rules between variables $a$ and $b$ observed through instances $x$ of a set $E$ of objects or subjects. A strict rule (or theorem in this case) will be expressed in a symbolic form: $\forall x, (a(x)\Rightarrow b(x))$. A quasi-rule will present counter-examples, i.e. the following statement will be observed: $\exists x, (a(x)\wedge \overline{b(x)})$. + + +The purpose of the SIA is to assign a measure to such rules in order to estimate their quality when the frequency of the last statement above is low. +First, within the framework of the SIA, a quality index is constructed in order, like other indices, to provide a probabilistic response to this problem. +But in seeking among the rules\footnote{$n_{a \wedge \overline{b}}$} those that would express a causality, or at least a causal relationship, it seemed absolutely necessary to us, as we said in point 4, to support the satisfaction of the direct rule by a measure of its contrapositive: $\forall x, (\overline{b(x)} \Rightarrow \overline{a(x)})$. +Indeed, if statistically, whether with confidence measured by conditional frequency or with intensity of implication, the truth of a strict rule is also obtained with its contrapositive, this is no longer necessarily the case with a quasi-rule.
+We have also sought to construct in a new and original way a measure that makes it possible to overcome Hempel's paradox~\cite{Hempel} in order to obtain a measure that confirms the satisfaction of induction in terms of causality. + + +It should be recalled that, according to Carl G. Hempel, in strict logic, this paradox is linked to the irrelevance of contraposition in relation to induction when empirical non-satisfaction (de facto) of premise $a$ is observed. +It is the consequence of the application of Hempel's 3rd principle: ``If an observed object $x$ does not satisfy the antecedent (i.e. $a(x) = false$), it does not count or it is irrelevant in relation to the conditional (= the direct proposition)''. +In other words, the confirmation of the contrapositive contributes nothing to the direct version of the proposition, although it is logically equivalent to it. +For example, it is not the confirmatory observation of the contrapositive of ``All crows are black'' by a red cat (i.e. not black) that confirms the validity of ``All crows are black''. Nor, for that matter, does continuing to observe other non-black objects. Because to confirm this statement and thus validate the induction, we would have to review all the non-black objects, which can be infinite in number. + +In other words, according to Hempel, in the implication truth table, cases where $a(x)$ is false are uninteresting for induction; only the lines [$a(x)=true$ and $b(x)=true$] that confirm the rule and [$a(x)=true$ and $b(x)=false$] that invalidate it, are retained. +\\ + +\underline{However, in SIA, this paradox does not hold for two reasons:} + + +\begin{enumerate} +\item the objects $x$ are part of the same reference set $E$, finite or infinite (countable or even continuous), in which all $x$ are likely, with relevance, to satisfy or not satisfy the variables at stake.
That is, by assigning them a value (truth or numerical), the direct proposition and/or its contrapositive can also be evaluated (for example, proposition $a \Rightarrow b$ is true even if $a(x)$ is false while $b(x)$ is true); +\item Since we are most often dealing with quasi-rules, the equivalence between a proposition and its contrapositive no longer holds, and it is on the basis of the combination of the respective and evaluated qualities of these statements that we induce or not a causal character. Moreover, if the rule is strict, the logical equivalence with its contrapositive is strict and the contrapositive rule is satisfied at the same time. +\end{enumerate}
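+
+As a hedged numerical illustration of the second point, take hypothetical counts, chosen here for illustration only: $n=200$, $n_a=100$, $n_{\overline{b}}=54$ and $n_{a \wedge \overline{b}}=4$. Measuring the direct rule and its contrapositive by the two conditional frequencies used for the implifiance in Annex 2, we obtain
+$$C_1(a,b)=\frac{n_{a \wedge b}}{n_a}=1-\frac{n_{a \wedge \overline{b}}}{n_a}=\frac{96}{100}=0.96, \qquad C_2(a,b)=\frac{n_{\overline{a} \wedge \overline{b}}}{n_{\overline{b}}}=1-\frac{n_{a \wedge \overline{b}}}{n_{\overline{b}}}=\frac{50}{54}\approx 0.93$$
+The two statements thus receive different qualities as soon as $n_{a \wedge \overline{b}} > 0$ and $n_a \neq n_{\overline{b}}$, whereas for a strict rule ($n_{a \wedge \overline{b}}=0$) both frequencies are equal to $1$.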