From: Raphaël Couturier Date: Fri, 3 May 2019 14:06:10 +0000 (+0200) Subject: new X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_chic.git/commitdiff_plain/d09b02c3768efb932835b5a4789d22e4bb212a5d?hp=c9f45698e9a535650b20a516d197dca86ed70d90 new --- diff --git a/chapter1.tex b/chapter1.tex index e59db46..03650d4 100644 --- a/chapter1.tex +++ b/chapter1.tex @@ -713,7 +713,7 @@ socio-psychology by successive interventions, by interviews or by internships (in collaboration with D. Pasquier~\cite{Pasquierc})? We then formalized these time-indexed variables into {\bf vector variables}, where a variable is modeled by a time-set vector. -Then, Julien Blanchard~\cite{Blanchard}, in collaboration with Fabrice Guillet and +Then, Julien Blanchard~\cite{Blanchardd}, in collaboration with Fabrice Guillet and Régis, defined {\bf sequential variables} modelled by a Poisson process in a different way. So many new concepts and new fields of application born of various diff --git a/chapter2.tex b/chapter2.tex index ed3ec5b..41a299b 100644 --- a/chapter2.tex +++ b/chapter2.tex @@ -907,7 +907,8 @@ is constant, independent of the rate of decrease of this number, of the variations of $n$ and $n_b$. This property seems not to satisfy intuition. The gradient of $c$ is expressed only in relation to $n_{a \wedge - \overline{b}}$ and $n_a$:(). {\bf CHECK FORMULA} + \overline{b}}$ and $n_a$: $\displaystyle \binom{ -\frac{1}{n_a}}{\frac{n_{a \wedge b}}{n_a^2}}$ + This may also appear to be a restriction on the role of parameters in expressing the sensitivity of the index. @@ -939,3 +940,100 @@ $$\frac{\partial}{\partial n_{a\wedge \overline{b}}}\left( \frac{\partial q}{\partial n_{a\wedge \overline{b}}} \right) $$ and the same for the other variables taken in pairs. 
However, we have, through the formulas (\ref{eq2.3}) and (\ref{eq2.4}) + +$$ \frac{\partial}{\partial n_{a \wedge \overline{b}}} \left( \frac{\partial q}{\partial n_b} \right) = \frac{1}{2} \left( \frac{n_a}{n}\right)^{-\frac{1}{2}} \left( \frac{n_{\overline{b}}}{n}\right)^{-\frac{3}{2}} = \frac{\partial}{\partial n_b}\left( +\frac{\partial q}{\partial n_{a\wedge \overline{b}}} \right)$$ + +Thus, to the vector field $C$ = ($n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$) of $E$, the nature of which we will specify, corresponds a gradient field $G$ which is said to be derived from the {\bf potential} $q$. +The gradient grad $q$ is therefore the vector that represents the spatial variation of the field intensity. +It is directed from low field values to higher values. By following the gradient at each point, we follow the increase in the intensity of the implicative field in space and, in a way, the speed with which it changes as a result of the variation of one or more parameters. + +For example, if we fix 3 of the 4 parameters $n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$ given by the realization of the couple ($a$, $b$), the gradient is a vector whose direction indicates the growth or decrease of $q$, hence the decrease or growth of $|q|$ and, as a consequence, of $\varphi$, under the variations of the 4th parameter. +We have indicated this above by interpreting formula (\ref{eq2.5}). + + +\subsection{Level or equipotential lines} +An equipotential (or level) line or surface in the $C$ field is a curve or surface of $E$ along which a variable point $M$ keeps the same value of the potential $q$ (e.g. isothermal lines on the globe or contour lines on an IGN map). + +The equation of this surface\footnote{In differential geometry, it seems that this surface is a (quasi-)differentiable manifold with boundary, compact, homeomorphic to the closed box formed by the intervals of variation of the 4 parameters.
Note that the point whose component $n_b$ is equal to $n$ (therefore $n_{\overline{b}} = 0$) is a singular point ("catastrophic" in René Thom's sense) of the surface and $q$, the potential, is not differentiable at this point. Everywhere else, the surface is differentiable and all its points are regular. If time, for example, parameterizes the observations of the process of which ($n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$) is a realization, then to each instant corresponds a morphological fiber of the process, represented by such a surface in space-time.} is, of course: +$$ q(a,\overline{b}) - \frac{n_{a \wedge \overline{b}}- + \frac{n_a.n_{\overline{b}}}{n}}{\sqrt{\frac{n_a.n_{\overline{b}}}{n}}} = 0$$ + + +Therefore, on such a curve, the scalar product $\mathrm{grad}~q \cdot dM$ is zero. +This expresses the orthogonality of the gradient to the tangent line or tangent hyperplane of the curve, i.e. to the equipotential line or surface. +In a kinematic interpretation of our problem, the velocity of $M$ along its path on the equipotential surface is orthogonal to the gradient at $M$. + +As an illustration, for a potential $F$ depending on only 2 variables, Figure~\ref{chap2fig2} shows the direction of the gradient, orthogonal to the successive equipotential surfaces; along each of them the potential $F$ does not vary, while from one surface to the next it passes from $F=7$ to $F= 10$. + +\begin{figure}[htbp] + \centering +\includegraphics[scale=1]{chap2fig2} + \caption{Illustration of potential of 2 variables} +\label{chap2fig2} % Give a unique label +\end{figure} + +It is possible, in the case of the potential $q$, to build equipotential surfaces as above (two-dimensional for ease of representation). +The more intense the field is, the tighter the surfaces are. To obtain one of these surfaces, 3 variables are fixed, for example $n$, $n_a$, $n_b$, together with a value of $q$ compatible with the field constraints. Let: $n = 10^4$; $n_a = 1600 \leq n_b = 3600$ and $q = -2$, i.e. $|q| = 2$.
+We then find $n_{a \wedge \overline{b}}= 960$ using formula~(\ref{eq2.1}). +But the points ($10^4$, $1600$, $5100$, $728$) and ($100$, $25$, $64$, $3$) also belong to this surface and the same equipotential curve. +The point ($10^4$, $1600$, $3600$, $928$) belongs to the equipotential curve $q=-3$. In fact, on this entire surface, we obtain a kind of homeostasis of the intensity of implication. + +The expression of the function $q$ of the variable shows that it is convex. +This property implies that the segment of points $t.M_1 + (1-t).M_2$, for $t \in [0,1]$, which connects two points $M_1$ and $M_2$ of the same equipotential line is entirely contained in the convex region delimited by this line. +Figure~\ref{chap2fig3} shows two adjacent equipotential surfaces $\Sigma_1$ and $\Sigma_2$ in the implicative field, corresponding to two values of the potential, $q_1$ and $q_2$. +At point $M_1$ the scalar field therefore takes the value $q_1$. $M_2$ is the intersection of the normal from $M_1$ with $\Sigma_2$. Given the direction of the normal vector $\vec{n}$, the difference $\delta = q_2 - q_1$, i.e. the variation of the field when we go from $\Sigma_1$ to $\Sigma_2$, is then equal to the opposite of the norm of the gradient of $q$ at $M_1$, namely $\frac{\partial q}{\partial n}$ if $n_a$, $n_b$ and $n_{a \wedge \overline{b}}$ are fixed. + +\begin{figure}[htbp] + \centering +\includegraphics[scale=1]{chap2fig3} + \caption{Illustration of equipotential surfaces} +\label{chap2fig3} % Give a unique label +\end{figure} + +Thus, the space $E$ can be foliated by equipotential surfaces corresponding to successive values of $q$ as the cardinals ($n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$) vary. +This situation corresponds to the one envisaged in the SIA modeling. +Fixing $n$, $n_a$ and $n_b$, we consider the random sets $X$ and $Y$ of the same cardinals as $A(n_a)$ and $B(n_b)$ and whose cardinal follows a Poisson law or a binomial law, according to the choice of the model.
+The different gradient fields, real "lines of force", associated with them are orthogonal to the surfaces defined by the corresponding values of $Q$. +This reminds us, in the theoretical framework of potential, of the premonitory metaphor of "implicative flow" that we expressed in~\cite{Grase} and that we will discuss again in Chapter 14 of the book. +Behind this notion we can imagine a transport of information of variable intensity in a causal universe. +We illustrate this metaphor with the study of the properties of the two-sheeted implicative cone (see §2.8). +Moreover, and intuitively, the implication $a\Rightarrow b$ is of all the better quality as the equipotential surface $C$ of the contingency covers the random equipotential surfaces that depend on the random variable. +Let us recall the relationship that links the potential $q$ to the intensity: +$$\varphi(a,b) =\frac{1}{\sqrt{2\pi}}\int_{q(a,\overline{b})}^{\infty}e^{-\frac{t^2}{2}} dt$$ + +\noindent {\bf Remark 1}\\ +Since $\varphi$ is a strictly decreasing function of $q$, the intensity is also constant on every equipotential surface of $q$. +The surface portions generated by $q$ and by $\varphi$ are even in one-to-one correspondence. +In intuitive terms, we can say that when one "swells" the other "deflates".\\ + +\noindent {\bf Remark 2}\\ +Let us note once again a particularity of the intensity of implication. +While the surfaces generated by the variations of the 4 parameters of the data are not invariant under a common dilation of the parameters, those associated with the indices cited in §2.4 are invariant and have the same undifferentiated geometric shape.
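+
+\noindent {\bf Worked check}\\
+As a sketch only, the numerical example of the subsection on equipotential lines can be verified by hand. It assumes the usual convention $n_{\overline{b}} = n - n_b$ and reads each point with the coordinates ($n$, $n_a$, $n_b$, $n_{a \wedge \overline{b}}$); both are assumptions of this check, not statements taken from formula~(\ref{eq2.1}).
+$$ (100,\ 25,\ 64,\ 3):\quad \frac{n_a.n_{\overline{b}}}{n} = \frac{25 \times 36}{100} = 9, \qquad q = \frac{3 - 9}{\sqrt{9}} = -2 $$
+$$ (10^4,\ 1600,\ 5100,\ 728):\quad \frac{n_a.n_{\overline{b}}}{n} = \frac{1600 \times 4900}{10^4} = 784, \qquad q = \frac{728 - 784}{\sqrt{784}} = -2 $$
+$$ (10^4,\ 1600,\ 3600,\ 928):\quad \frac{n_a.n_{\overline{b}}}{n} = \frac{1600 \times 6400}{10^4} = 1024, \qquad q = \frac{928 - 1024}{\sqrt{1024}} = -3 $$
+Under these assumptions, the first two points lie on the same equipotential $q=-2$, although their cardinals differ greatly, while the third lies on the equipotential $q=-3$.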
+ +\section{Implication-inclusion} +\subsection{Foundational and problematic situation} +Three reasons led us to improve the model formalized by the intensity of implication: +\begin{itemize} +\item when the size of the samples processed, and in particular that of $E$, increases (to around a thousand or more), the intensity $\varphi(a,b)$ tends to be no longer sufficiently discriminating because its values can be very close to 1, while the inclusion whose quality it seeks to model is far from being satisfied (phenomenon reported in~\cite{Bodina}, which deals with large student populations through international surveys); +\item the previous quasi-implication model essentially uses the measure of the strength of the rule $a \Rightarrow b$. + However, taking into account the concomitance of $\neg b \Rightarrow \neg a$ (the contrapositive of the implication) is useful or even essential to reinforce the affirmation of a good quality of the quasi-implicative, possibly quasi-causal, relationship of $a$ over $b$\footnote{This phenomenon is reported by Y. Kodratoff in~\cite{Kodratoff}.}. + At the same time, it could make it possible to correct the difficulty mentioned above (if $A$ and $B$ are small compared to $E$, their complements will be large, and vice versa); +\item the overcoming of Hempel's paradox (see Appendix 3 of this chapter). + \end{itemize} + +\subsection{An inclusion index} + +The solution\footnote{J. Blanchard provides in~\cite{Blanchardb} an answer to this problem by measuring the "equilibrium gap".} we provide uses both the intensity of implication and another index that reflects the asymmetry between situations $S_1 = (a \wedge b)$ and $S_1' = (a \wedge \neg b)$ (resp. $S_2 = (\neg a \wedge \neg b)$ and $S_2' = (a \wedge \neg b)$) in favour of the first named. +The relative weakness of instances that contradict the rule and its contrapositive is therefore fundamental.
+Moreover, the number of counter-examples $n_{a \wedge \overline{b}}$ to $a \Rightarrow b$ is also the number of counter-examples to its contrapositive. +To account for the uncertainty associated with a possible bet of belonging to one of the two situations ($S_1$ or $S_1'$ (resp. $S_2$ or $S_2'$)), we therefore refer to Shannon's concept of entropy~\cite{Shannon}: +$$H(b\mid a) = - \frac{n_{a\wedge b}}{n_a}\log_2 \frac{n_{a\wedge b}}{n_a} - \frac{n_{a\wedge \overline{b}}}{n_a}\log_2 \frac{n_{a\wedge \overline{b}}}{n_a}$$ +is the conditional entropy relative to the boxes $(a \wedge b)$ and $(a \wedge \neg b)$ when $a$ is realized, and + +$$H(\overline{a}\mid \overline{b}) = - \frac{n_{a\wedge \overline{b}}}{n_{\overline{b}}}\log_2 \frac{n_{a\wedge \overline{b}}}{n_{\overline{b}}} - \frac{n_{\overline{a} \wedge \overline{b}}}{n_{\overline{b}}}\log_2 \frac{n_{\overline{a} \wedge \overline{b}}}{n_{\overline{b}}}$$ + +is the conditional entropy relative to the boxes $(\neg a \wedge \neg b)$ and $(a \wedge \neg b)$ when $\neg b$ is realized. + +These entropies, with values in $[0,1]$, should therefore be simultaneously low, and hence the asymmetries between situations $S_1$ and $S_1'$ (resp. $S_2$ and $S_2'$) should be simultaneously strong, if one wishes to have a good criterion for the inclusion of $A$ in $B$. +Indeed, entropies represent the average uncertainty of experiments that consist in observing whether $b$ is realized (resp. $\neg a$ is realized) when $a$ (resp. $\neg b$) is observed. The complement to 1 of this uncertainty therefore represents the average information collected by performing these experiments. The greater this information is, the stronger is the guarantee of the quality of the implication and of its contrapositive. We must now adapt this entropic numerical criterion to the model expected in the different cardinal situations.
+For the model to have the expected meaning, it must satisfy, in our opinion, the following epistemological constraints: diff --git a/figures/chap2fig2.png b/figures/chap2fig2.png new file mode 100644 index 0000000..9990608 Binary files /dev/null and b/figures/chap2fig2.png differ diff --git a/figures/chap2fig3.png b/figures/chap2fig3.png new file mode 100644 index 0000000..f0f6770 Binary files /dev/null and b/figures/chap2fig3.png differ diff --git a/references.tex b/references.tex index 62d3b12..650bc39 100644 --- a/references.tex +++ b/references.tex @@ -54,8 +54,23 @@ \bibitem{Benzecri} Benzecri, J.P. (1973) L’analyse des données (vol 1), Dunod, Paris. \bibitem{Bernard} Bernard J.-M. and Poitrenaud S. (1999) L'analyse implicative bayesienne d'un questionnaire binaire : quasi-implications et treillis de Galois simplifié", Mathématiques, Informatique et Sciences Humaines, n° 147, 1999, 25-46 + + +\bibitem{Blancharda} Blanchard J., Kuntz P., Guillet F. and Gras R. (2004) Mesure de la qualité des régles d’association par l’intensité d’implication entropique, Mesures de qualité pour la fouille de données, RNTI-E-1, p. 33-44. + +\bibitem{Blanchardb} Blanchard J., Guillet F., Briand H. and Gras R. (2005) Ipee: Indice probabiliste d'écart à l'équilibre pour l'évaluation de la qualité des règles, actes Atelier Qualité des Données et des Connaissances, pp. 26-34. + +\bibitem{Blanchardc} Blanchard J., Guillet F. and Gras R. (2008) Assessing the interestingness of temporal rules, Statistical Implicative Analysis, R.Gras, E. Suzuki, F.Guillet and F.Spagnolo, Eds, Springer-Verlag, Berlin-Heidelberg, ISBN 978-3-540-78982-6. + + + \bibitem{Blanchard} Blanchard J., Guillet F. et Gras R. (2009) Analyse Implicative Séquentielle, Analyse Statistique Implicative, Une méthode d’analyse de données pour la recherche de causalités, dir. R.Gras, eds R.Gras, J.-C. Régnier, Guillet F. Cépaduès, Toulouse, p. 183-194. + +\bibitem{Bodina} Bodin A. 
(1997) Modèles sous-jacents à l'analyse implicative et outils complémentaires. Prépublication IRMAR. n°97-32. + +\bibitem{Bodinb} Bodin A. and Gras R. (1999) Analyse du préquestionnaire enseignants avant EVAPM-Terminales, Bulletin n°425 de l'Association des Professeurs de Mathématiques de l'Enseignement Public, 772-786, Paris. + \bibitem{Couturiera} Couturier R. (2001) Traitement de l’analyse statistique implicative dans CHIC, Actes des Journées sur la @@ -295,6 +310,9 @@ Cépaduès Ed. Toulouse, p. 195-208, ISBN: 978.2.36493.577.8. \bibitem{Simon} Simon A. (2001),Outils classificatoires par objets pour l'extraction de connaissances dans des bases de données, Thèse de l'Université de Nancy 1. + +\bibitem{Shannon} Shannon C. E. and Weaver W. (1949) The mathematical theory of communication, Univ. of Illinois Press. + \bibitem{Shapin} Shapin S. (2014) Une histoire sociale de la vérité, La Découverte, Paris.