new

[book_chic.git] / chapter2.tex
diff --git a/chapter2.tex b/chapter2.tex

index b50c9dd8205e9aa12b9ba88ad3a2c4eb300c1483..8c52f22a104071efab3de3c67832aff8c4536477 100644 (file)
--- a/chapter2.tex
+++ b/chapter2.tex
@@ -313,8 +313,8 @@ $a\Rightarrow b$, for $n_a\leq n_b$ and  $nb \neq n$, is then defined
  from the index $q(a,\overline{b})$ by:
  
  \definition 
  from the index $q(a,\overline{b})$ by:
  
  \definition 
-The implication intensity  that measures the inductive quality of a
-over b is:
+The implication intensity  that measures the inductive quality of $a$
+over $b$ is:
  $$\varphi(a,b)=1-Pr[Q(a,\overline{b})\leq q(a,\overline{b})] =
  \frac{1}{\sqrt{2 \pi}} \int^{\infty}_{ q(a,\overline{b})}
  e^{-\frac{t^2}{2}} dt,~ if~ n_b \neq n$$
  $$\varphi(a,b)=1-Pr[Q(a,\overline{b})\leq q(a,\overline{b})] =
  \frac{1}{\sqrt{2 \pi}} \int^{\infty}_{ q(a,\overline{b})}
  e^{-\frac{t^2}{2}} dt,~ if~ n_b \neq n$$
@@ -1181,3 +1181,130 @@ $b$ &  &     & & & \\ \hline
  
  \label{chap2fig5}      
  \end{figure}
  
  \label{chap2fig5}      
  \end{figure}
+
+One of the difficulties related to the graphical representation is that the graph is not planar.
+The algorithm that allows its construction must take it into account and, in particular, must "straighten" the paths of the graph in order to allow an acceptable readability for the expert who will analyze it.
+
+The number of arcs in the graph can be reduced (or increased) if we raise (or lower) the acceptance threshold of the rules, the level of confidence in the selected rules.
+Correlatively, arcs can appear or disappear depending on the variations of the threshold.
+Let us recall that this graph is necessarily without cycle, that it is not a lattice since, for example, the variable $a$ does not imply the variable ($a$ or $\neg a$) whose support is $E$.
+A fortiori, it cannot be a Galois lattice.
+Options of the CHIC software for automatic data processing with SIA, allow to delete variables at will, to move their image in the graph in order to decrease the arcs or to focus on certain variables called vertices of a kind of "cone" whose two "plots" are made up respectively of the variables "parents" and the variables "children" of this vertex variable.
+We refer to the ends of the arcs as "nodes". A node in a given graph has a single variable or a conjunction of variables.
+The transition from a node $S_1$ to a node $S_2$ is also called "transition" which is represented by an arc in the graph.
+The upper slick of the vertex cone the variable $a$, called the nodal variable, is made up of the "fathers" of $a$, either in the "causal" sense the causes of $a$ ; the lower slick, on the other hand, is made up of the "children" of $a$ and therefore, always in the causal sense, the consequences or effects of $a$.
+The expert in the field analysed here must be particularly interested in these configurations, which are rich in information.
+See, for example~\cite{Lahanierc} and the two implicit cones below (i.e. Figures~\ref{chap2fig6} and \ref{chap2fig7}).
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=0.75]{chap2fig6.png}
+\caption{Implicative cone.}
+
+\label{chap2fig6}      
+\end{figure}
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=0.75]{chap2fig7.png}
+\caption{Implicative cone centered on a variable.}
+
+\label{chap2fig7}      
+\end{figure}
+
+
+\section{Reduction in the number of variables}
+\subsection{Motivation}
+
+
+As soon as the number of variables becomes excessive, most of the available techniques become impractical\footnote{This paragraph is strongly inspired by paper~\cite{Grask}.}.
+In particular, when an implicitive analysis is carried out by calculating association rules~\cite{Agrawal}, the number of rules discovered undergoes a combinatorial explosion with the number of variables, and quickly becomes inextricable for a decision-maker, provided that variable conjunctions are requested.
+In this context, it is necessary to make a preliminary reduction in the number of variables.
+
+Thus, ~\cite{Ritschard} proposed an efficient heuristic to reduce both the number of rows and columns in a table, using an association measure as a quasi-optimal criterion for controlling the heuristic.
+However, to our knowledge, in the various other research studies, the type of situation at the origin of the need to group rows or columns is not taken into account in the reduction criteria, whether the analyst's problem and aim are the search for similarity, dissimilarity, implication, etc., between variables. 
+
+Also, to the extent that there are very similar variables in the sense of statistical implication, it might be appropriate to substitute a single variable for these variables that would be their leader in terms of representing an equivalence class of similar variables for the implicit purpose.
+We therefore propose, following the example of what is done to define the notion of quasi-implication, to define a notion of quasi-equivalence between variables, in order to build classes from which we will extract a leader.
+We will illustrate this with an example.
+Then, we will consider the possibility of using a genetic algorithm to optimize the choice of the representative for each quasi-equivalence class.
+
+\subsection{Definition of quasi-equivalence}
+
+Two binary variables $a$ and $b$ are logically equivalent for the SIA when the two quasi-implications $a \Rightarrow b$ and $b \Rightarrow a$ are simultaneously satisfied at a given threshold.
+We have developed criteria to assess the quality of a quasi-involvement: one is the statistical surprise based on the likelihood of~\cite{Lerman} relationship, the other is the entropic form of quasi-inclusion~\cite{Grash2} which is presented in this chapter (§7). 
+
+According to the first criterion, we could say that two variables $a$ and $b$ are almost equivalent when the intensity of involvement $\varphi(a,b)$ of $a\Rightarrow b$ is little different from that of $b \Rightarrow a$. However, for large groups (several thousands), this criterion is no longer sufficiently discriminating to validate inclusion.
+
+According to the second criterion, an entropic measure of the imbalance between the numbers $n_{a \wedge b}$ (individuals who satisfy $a$ and $b$) and $n_{a \wedge \overline{b}} $ (individuals who satisfy $a$ and $\neg b$, counter-examples to involvement $a\Rightarrow b$) is used to indicate the quality of involvement $a\Rightarrow b$, on the one hand, and the numbers $n_{a \wedge b}$ and  $n_{\overline{a} \wedge b}$ to assess the quality of mutual implication  $b\Rightarrow a$, on the other. 
+
+
+Here we will use a method comparable to that used in Chapter 3 to define the entropic implication index.
+
+By posing $n_a$ and $n_b$, respectively effective of $a$ and $b$, the imbalance of the rule $a\Rightarrow b$ is measured by a conditional entropy $K(b \mid a=1)$, and that of $b\Rightarrow a$ by $K(a \mid b=1)$ with: 
+
+
+\begin{eqnarray*}
+  K(b\mid a=1)  = - \left( 1- \frac{n_{a\wedge b}}{n_a}\right) log_2  \left( 1- \frac{n_{a\wedge b}}{n_a}\right)   - \frac{n_{a\wedge b}}{n_a}log_2   \frac{n_{a\wedge b}}{n_a} & \quad if \quad \frac{n_{a \wedge b}}{n_a} > 0.5\\
+  K(b\mid a=1)  = 1 &  \quad if \quad \frac{n_{a \wedge b}}{n_a} \leq 0.5\\
+  K(a\mid b=1)  = - \left( 1- \frac{n_{a\wedge b}}{n_b}\right) log_2  \left( 1- \frac{n_{a\wedge b}}{n_b}\right)   - \frac{n_{a\wedge b}}{n_b}log_2   \frac{n_{a\wedge b}}{n_b} & \quad if \quad \frac{n_{a \wedge b}}{n_b} > 0.5\\
+  K(a\mid b=1)  = 1 &  \quad if \quad \frac{n_{a \wedge b}}{n_b} \leq 0.5
+\end{eqnarray*}
+
+These two entropies must be low enough so that it is possible to bet on $b$ (resp. $a$) with a good certainty when $a$ (resp. $b$) is achieved. Therefore their respective complements to 1 must be simultaneously strong.
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=0.5]{chap2fig8.png}
+\caption{Illustration of the functions $K$ et $1-K^2$ on $[0; 1]$ .}
+
+\label{chap2fig7}      
+\end{figure}
+
+
+\definition A first entropic index of equivalence is given by:
+$$e(a,b) = \left (\left[ 1 - K^2(b \mid a = 1)\right ]\left[  1 - K^2(a \mid b = 1)  \right]\right)^{\frac{1}{4}}$$
+
+When this index takes values in the neighbourhood of $1$, it reflects a good quality of a double implication.
+In addition, in order to better take into account $a \wedge b$ (the examples), we integrate this parameter through a similarity index $s(a,b)$ of the variables, for example in the sense of I.C. Lerman~\cite{Lermana}.
+The quasi-equivalence index is then constructed by combining these two concepts.
+
+\definition A second entropic equivalence index is given by the formula
+
+$$\sigma(a,b)= \left [ e(a,b).s(a,b)\right ]^{\frac{1}{2}}$$
+
+From this point of view, we then set out the quasi-equivalence criterion that we use.
+
+\definition  The pair of variables $\{a,b\}$ is said to be almost equivalent for the selected quality $\beta$ if $\sigma(a,b) \geq \beta$.
+For example, a value $\beta=0.95$ could be considered as a good quasi-equivalence between $a$ and $b$.
+
+\subsection{Algorithm of construction of quasi-equivalence classes}
+
+Let us assume a set $V = \{a,b,c,...\}$ of $v$ variables with a valued relationship $R$ induced by the measurement of quasi-equivalence on all pairs of $V$.
+We will assume the pairs of variables classified in a decreasing order of quasi-equivalence.
+If we have set the quality threshold for quasi-equivalence at $\beta$, only the first of the pairs $\{a,b\}$ checking for inequality $\sigma(a,b)\ge \beta$ will be retained.
+In general, only a part $V'$, of cardinal $v'$, of the variables of $V$ will verify this inequality.
+If this set $V'$ is empty or too small, the user can reduce his requirement to a lower threshold value.
+The relationship being symmetrical, we will have at most pairs to study.
+As for $V-V'$, it contains only non-reducible variables.
+
+We propose to use the following greedy algorithm:
+\begin{enumerate}
+\item A first potential class $C_1^0= \{e,f\}$ is constituted such that $\sigma(e,f)$ represents the largest of the $\beta$-equivalence values.
+  If possible, this class is extended to a new class $C_1$ by taking from $V'$ all the elements $x$ such that any pair of variables within this class allows a quasi-equivalence greater than or equal to $\beta$;
+
+\item We continue with:
+  \begin{enumerate}
+  \item If $o$ and $k$ forming the pair $(o,k)$ immediately below $(e,f)$ according to the index $\sigma$, belong to $C_1$, then we move to the pair immediately below (o,k) and proceed as in 1.;
+  \item If $o$ and $k$ do not belong to $C_1$, proceed as in 1. from the pair they constitute by forming the basis of a new class;
+    \item If $o$ or $k$ does not belong to $C_1$, one of these two variables can either form a singleton class or belong to a future class. On this one, we will of course practice as above.
+    \end{enumerate}
+  \end{enumerate}
+
+After a finite number of iterations, a partition of $V$ is available in $r$ classes of $\sigma$-equivalence: $\{C_1, C_2,..., C_r\}$.
+The quality of the reduction may be assessed by a gross or proportional index of $\beta^{\frac{p}{k}}$.
+However, we prefer the criterion defined below, which has the advantage of integrating the choice of representative.
+
+In addition, $k$ variables representing the $k$ classes of $\sigma$-equivalence could be selected on the basis of the following elementary criterion: the quality of connection of this variable with those of its class.
+However, this criterion does not optimize the reduction since the choice of representative is relatively arbitrary and may be a sign of triviality of the variable.
+