update

[book_chic.git] / chapter2.tex
diff --git a/chapter2.tex b/chapter2.tex

index 06da12da2f182a02cfebbb8b00de6acba6325f70..3ffdbe5c9dfdbb567b3676ce2d410738dcb8b71d 100644 (file)
--- a/chapter2.tex
+++ b/chapter2.tex
@@ -313,8 +313,8 @@ $a\Rightarrow b$, for $n_a\leq n_b$ and  $nb \neq n$, is then defined
  from the index $q(a,\overline{b})$ by:
  
  \definition 
-The implication intensity  that measures the inductive quality of a
-over b is:
+The implication intensity  that measures the inductive quality of $a$
+over $b$ is:
  $$\varphi(a,b)=1-Pr[Q(a,\overline{b})\leq q(a,\overline{b})] =
  \frac{1}{\sqrt{2 \pi}} \int^{\infty}_{ q(a,\overline{b})}
  e^{-\frac{t^2}{2}} dt,~ if~ n_b \neq n$$
@@ -1069,24 +1069,158 @@ The intensity of implication-inclusion (or entropic intensity), a new measure of
  $$\psi(a,b)= \left[  i(a,b).\varphi(a,b) \right]^{\frac{1}{2}}$$
  which integrates both statistical surprise and inclusive quality.
  
-The function $\psi$ of the variable $t$ admits a representation that has the shape indicated in Figure 4{\bf TO CHANGE}, for $n_a$ and $n_b$ fixed.
+The function $\psi$ of the variable $t$ admits a representation that has the shape indicated in Figure~\ref{chap2fig4}, for $n_a$ and $n_b$ fixed.
  Note in this figure the difference in the behaviour of the function with respect to the conditional probability $P(B\mid A)$, a fundamental index of other rule measurement models, for example in Agrawal.
  In addition to its linear, and therefore not very nuanced nature, this probability leads to a measure that decreases too quickly from the first counter-examples and then resists too long when they become important.
  
  
-{\bf FIGURE 4}
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=0.5]{chap2fig4.png}
+\caption{Example of implication-inclusion.}
+
+\label{chap2fig4}      
+\end{figure}
  
+In Figure~\ref{chap2fig4}, it can be seen that this representation of the continuous function of $t$ reflects the expected properties of the inclusion criterion:
+\begin{itemize}
+\item ``Slow reaction'' to the first counter-examples (noise resistance),
+\item ``acceleration'' of the rejection of inclusion close to the balance i.e. $\frac{n_a}{2n}$,
+\item rejection beyond $\frac{n_a}{2n}$,  the intensity of implication $\varphi(a,b)$ did not ensure it.
+\end{itemize}
  
  \noindent Example 1\\
- \begin{tabular}{|c|c|c|c|}\hline
+\begin{tabular}{|c|c|c|c|}\hline
    & $b$ & $\overline{b}$ & margin\\ \hline
    $a$ & 200 & 400& 600 \\ \hline
    $\overline{a}$ & 600 & 2800& 3400 \\ \hline
    margin & 800 & 3200& 4000 \\ \hline
- \end{tabular}
- \\
- In Example 1, implication intensity is $\varphi(a,b)=0.9999$ (with $q(a,\overline{b})=-3.65$).
+\end{tabular}
+\\
+\\
+In Example 1, implication intensity is $\varphi(a,b)=0.9999$ (with $q(a,\overline{b})=-3.65$).
   The entropic values of the experiment are $h_1=h_2=0$.
   The value of the moderator coefficient is therefore $i(a,b)=0$.
   Hence, $\psi(a,b)=0$ whereas $P(B\mid A)=0.33$.
  Thus, the "entropic" functions "moderate" the intensity of implication in this case where inclusion is poor.
+\\
+\\
+\noindent Example 2\\
+ \begin{tabular}{|c|c|c|c|}\hline
+  & $b$ & $\overline{b}$ & margin\\ \hline
+  $a$ & 400 & 200& 600 \\ \hline
+  $\overline{a}$ & 1000 & 2400& 3400 \\ \hline
+  margin & 1400 & 2600& 4000 \\ \hline
+ \end{tabular}
+ \\
+ \\
+ In Example 2, intensity of implication is 1 (for  $q(a,\overline{b}) = - 8.43$).
+ The entropic values of the experiment are $h_1 = 0.918$ and $h_2 = 0.391$.
+ The value of the moderator coefficient is therefore $i(a,b) = 0.6035$.
+ As a result $\psi(a,b) = 0.777$ whereas $P(B \mid A) = 0.6666$.
+ \\
+ \\
+{\bf remark}
+ \noindent The correspondence between $\varphi(a,b)$ and $\psi(a,b)$ is not monotonous as shown in the following example:
+ 
+\begin{tabular}{|c|c|c|c|}\hline
+  & $b$ & $\overline{b}$ & margin\\ \hline
+  $a$ & 40 & 20& 60 \\ \hline
+  $\overline{a}$ & 60 & 280& 340 \\ \hline
+  margin & 100 & 300& 400 \\ \hline
+\end{tabular}
+\\
+Thus, while $\varphi(a,b)$ decreased from the 1st to the 2nd example, $i(a,b)$ increased as well as $\psi(a,b)$.  On the other hand, the opposite situation is the most frequent.
+Note that in both cases, the conditional probability does not change.
+\\
+\\
+{\bf remark}
+\noindent We refer to~\cite{Lencaa} for a very detailed comparative study of association indices for binary variables.
+In particular, the intensities of classical and entropic (inclusion) implication presented in this article are compared with other indices according to a "user" entry.
+
+\section{Implication graph}
+\subsection{Problematic}
+
+At the end of the calculations of the intensities of implication in both the classical and entropic models, we have a table $p \times p$ that crosses the $p$ variables with each other, whatever their nature, and whose elements are the values of these intensities of implication, numbers of the interval $[0,~1]$.
+It must be noted that the underlying structure of all these variables is far from explicit and remains largely unimportant.
+The user remains blind to such a square table of size $p^2$.
+It cannot simultaneously embrace the possible multiple sequences of rules that underlie the overall structure of all $p$ variables.
+In order to facilitate a clearer extraction of the rules and to examine their structure, we have associated to this table, and for a given intensity threshold, an oriented graph, weighted by the intensities of implication, without a cycle whose complexity of representation the user can control by setting himself the threshold for taking into account the implicit quality of the rules.
+Each arc in this graph represents a rule: if $n_a < n_b$, the arc $a \rightarrow b$ represents the rule $a \Rightarrow b$ ; if $n_a = n_b$, then the arc $a \leftrightarrow b$ will represent the double rule $a \Leftrightarrow b$, in other words, the equivalence between these two variables.
+By varying the threshold of intensity of implication, it is obvious that the number of arcs varies in the opposite direction: for a threshold set at $0.95$, the number of arcs is less than or equal to those that would constitute the graph at threshold $0.90$. We will discuss this further below.
+
+\subsection{Algorithm}
+
+The relationship defined by statistical implication, if it is reflexive and not symmetrical, is obviously not transitive, as is induction and, on the contrary, deduction.
+However, we want it to model the partial relationship between two variables (the successes in our initial example).
+By convention, if $a \Rightarrow b$ and $b \Rightarrow c$, we will accept the transitive closure $a \Rightarrow c$ only if $\psi(a,c) \geq 0.5$, i.e. if the implicit relationship of $a$ to $c$ is better than neutrality by emphasizing the dependence between $a$ and $c$.
+
+
+{\bf VERIFIER PHI PSI}\\
+\\
+{\bf Proposal:} By convention, if $a \Rightarrow b$ and $b \Rightarrow c$, there is a transitive closure $a \Rightarrow c$ if and only if $\psi(a,c) \geq 0.5$, i.e. if the implicit relationship of $a$ over $c$, which reflects a certain dependence between $a$ and $c$, is better than its refutation.
+Note that for any pair of variables $(x;~ y)$, the arc $x \rightarrow y$ is weighted by the intensity of involvement (x,y).
+\\
+Let us take a formal example by assuming that between the 5 variables $a$, $b$, $c$, $d$, and $e$ exist, at the threshold above $0.5$, the following rules: $c \Rightarrow a$, $c \Rightarrow e$, $c \Rightarrow b$, $d \Rightarrow a$, $d \Rightarrow e$, $a \Rightarrow b$ and $a \Rightarrow e$.
+
+This set of numerical and graphical relationships can then be translated into the following table and graph:
+
+\begin{tabular}{|C{0.5cm}|c|c|c|c|c|}\hline
+\hspace{-0.5cm}\turn{45}{$\Rightarrow$}  & $a$ & $b$ & $c$ & $d$ & $e$\\ \hline
+$a$ &  & 0.97& & & 0.73 \\ \hline
+$b$ &  &     & & & \\ \hline  
+  $c$ & 0.82 & 0.975& & & 0.82 \\ \hline
+  $d$ & 0.78 &    &   &  & 0.92 \\ \hline
+  $e$ &  &     & & & \\ \hline  
+\end{tabular}
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=1]{chap2fig5.png}
+\caption{Implication graph corresponding to the previous example.}
+
+\label{chap2fig5}      
+\end{figure}
+
+One of the difficulties related to the graphical representation is that the graph is not planar.
+The algorithm that allows its construction must take it into account and, in particular, must "straighten" the paths of the graph in order to allow an acceptable readability for the expert who will analyze it.
+
+The number of arcs in the graph can be reduced (or increased) if we raise (or lower) the acceptance threshold of the rules, the level of confidence in the selected rules.
+Correlatively, arcs can appear or disappear depending on the variations of the threshold.
+Let us recall that this graph is necessarily without cycle, that it is not a lattice since, for example, the variable $a$ does not imply the variable ($a$ or $\neg a$) whose support is $E$.
+A fortiori, it cannot be a Galois lattice.
+Options of the CHIC software for automatic data processing with SIA, allow to delete variables at will, to move their image in the graph in order to decrease the arcs or to focus on certain variables called vertices of a kind of "cone" whose two "plots" are made up respectively of the variables "parents" and the variables "children" of this vertex variable.
+We refer to the ends of the arcs as "nodes". A node in a given graph has a single variable or a conjunction of variables.
+The transition from a node $S_1$ to a node $S_2$ is also called "transition" which is represented by an arc in the graph.
+The upper slick of the vertex cone the variable $a$, called the nodal variable, is made up of the "fathers" of $a$, either in the "causal" sense the causes of $a$ ; the lower slick, on the other hand, is made up of the "children" of $a$ and therefore, always in the causal sense, the consequences or effects of $a$.
+The expert in the field analysed here must be particularly interested in these configurations, which are rich in information.
+See, for example~\cite{Lahanierc} and the two implicit cones below (i.e. Figures~\ref{chap2fig6} and \ref{chap2fig7}).
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=0.75]{chap2fig6.png}
+\caption{Implicative cone.}
+
+\label{chap2fig6}      
+\end{figure}
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=0.75]{chap2fig7.png}
+\caption{Implicative cone centered on a variable.}
+
+\label{chap2fig7}      
+\end{figure}
+
+
+\section{Reduction in the number of variables}
+\subsection{Motivation}
+
+
+As soon as the number of variables becomes excessive, most of the available techniques become impractical.
+In particular, when an implicitive analysis is carried out by calculating association rules~\cite{Agrawal}, the number of rules discovered undergoes a combinatorial explosion with the number of variables, and quickly becomes inextricable for a decision-maker, provided that variable conjunctions are requested.
+In this context, it is necessary to make a preliminary reduction in the number of variables.
+
+Thus, ~\cite{Ritschard} proposed an efficient heuristic to reduce both the number of rows and columns in a table, using an association measure as a quasi-optimal criterion for controlling the heuristic.
+However, to our knowledge, in the various other research studies, the type of situation at the origin of the need to group rows or columns is not taken into account in the reduction criteria, whether the analyst's problem and aim are the search for similarity, dissimilarity, implication, etc., between variables. 
+