From: Raphaël Couturier Date: Wed, 13 Nov 2019 17:01:49 +0000 (+0100) Subject: new X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_chic.git/commitdiff_plain/fb27620322244ea82ffcc72583b1c77b53778e72?hp=ca2470ce26187c7be2c722bc2ff54011525c94f4 new --- diff --git a/book.tex b/book.tex index b384e21..b903859 100644 --- a/book.tex +++ b/book.tex @@ -34,6 +34,7 @@ \usepackage{diagbox} \usepackage{adjustbox} + \newcommand{\turn}[3][10em]{% \turn[]{}{} \rlap{\rotatebox{#2}{\begin{varwidth}[t]{#1}\bfseries#3\end{varwidth}}}% } @@ -77,7 +78,6 @@ \include{references} %\include{chapter} -%\include{appendix} \backmatter%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \include{glossary} diff --git a/chapter2.tex b/chapter2.tex index 8c52f22..3b35fe8 100644 --- a/chapter2.tex +++ b/chapter2.tex @@ -1153,12 +1153,10 @@ By varying the threshold of intensity of implication, it is obvious that the num The relationship defined by statistical implication, if it is reflexive and not symmetrical, is obviously not transitive, as is induction and, on the contrary, deduction. However, we want it to model the partial relationship between two variables (the successes in our initial example). -By convention, if $a \Rightarrow b$ and $b \Rightarrow c$, we will accept the transitive closure $a \Rightarrow c$ only if $\psi(a,c) \geq 0.5$, i.e. if the implicit relationship of $a$ to $c$ is better than neutrality by emphasizing the dependence between $a$ and $c$. - - -{\bf VERIFIER PHI PSI}\\ +By convention, if $a \Rightarrow b$ and $b \Rightarrow c$, we will accept the transitive closure $a \Rightarrow c$ only if $\varphi(a,c) \geq 0.5$, i.e. if the implicit relationship of $a$ to $c$ is better than neutrality by emphasizing the dependence between $a$ and $c$. \\ -{\bf Proposal:} By convention, if $a \Rightarrow b$ and $b \Rightarrow c$, there is a transitive closure $a \Rightarrow c$ if and only if $\psi(a,c) \geq 0.5$, i.e. 
if the implicit relationship of $a$ over $c$, which reflects a certain dependence between $a$ and $c$, is better than its refutation. + +{\bf Proposal:} By convention, if $a \Rightarrow b$ and $b \Rightarrow c$, there is a transitive closure $a \Rightarrow c$ if and only if $\varphi(a,c) \geq 0.5$, i.e. if the implicative relationship of $a$ over $c$, which reflects a certain dependence between $a$ and $c$, is better than its refutation. Note that for any pair of variables $(x;~ y)$, the arc $x \rightarrow y$ is weighted by the intensity of implication $\varphi(x,y)$. \\ Let us take a formal example by assuming that the following rules hold, at a threshold above $0.5$, between the 5 variables $a$, $b$, $c$, $d$, and $e$: $c \Rightarrow a$, $c \Rightarrow e$, $c \Rightarrow b$, $d \Rightarrow a$, $d \Rightarrow e$, $a \Rightarrow b$ and $a \Rightarrow e$. @@ -1258,7 +1256,7 @@ These two entropies must be low enough so that it is possible to bet on $b$ (res \includegraphics[scale=0.5]{chap2fig8.png} \caption{Illustration of the functions $K$ and $1-K^2$ on $[0; 1]$.} -\label{chap2fig7} +\label{chap2fig8} \end{figure} @@ -1302,9 +1300,149 @@ We propose to use the following greedy algorithm: \end{enumerate} After a finite number of iterations, a partition of $V$ is available in $r$ classes of $\sigma$-equivalence: $\{C_1, C_2,..., C_r\}$. -The quality of the reduction may be assessed by a gross or proportional index of $\beta^{\frac{p}{k}}$. +The quality of the reduction may be assessed by a gross or proportional index of $\beta^{\frac{r}{k}}$. However, we prefer the criterion defined below, which has the advantage of integrating the choice of representative. In addition, $k$ variables representing the $k$ classes of $\sigma$-equivalence could be selected on the basis of the following elementary criterion: the quality of connection of this variable with those of its class. 
However, this criterion does not optimize the reduction since the choice of representative is relatively arbitrary and may be a sign of triviality of the variable.
+\section{Conclusion}
+
+This overview of the development of statistical implicative analysis (SIA) shows, if proof were needed, how a data processing theory is built step by step in response to problems raised by experts from various fields and to epistemological requirements that respect common sense and intuition.
+It is therefore more than a mere abstraction, since it is directly applicable to the situations that led to its genesis.
+The extensions made to the types of data processed, to the modes of representation of their structures, and to the relationships between subjects, their descriptors and variables are indeed the result of the experts' eager questions.
+Its respective functions as developer and analyzer seem to operate successfully in multiple application areas.
+
+The theoretical basis is simple, which could be the reason for its fertility.
+Even if the questioning of the initial theoretical choices is not apparent here, this genesis has not been without conflicts between the expected answers and the ease of obtaining them; these conflicts have led to revisions or even redesigns, often discussed within the research team.
+In any case, this method of data analysis has made it possible and will, Régis hopes, still make it possible to highlight living structures thanks to the non-symmetrical approach on which it is based.
+
+Among the current or future work proposed to our team, one project concerns an extension of SIA to vector variables in response to problems in proteomics.
+Another is more broadly concerned with the relationship between SIA and the treatment of fuzzy sets (see Chapter 7).
+The function of the fuzzy logic ``implication'' operator will be illustrated by new applications.
+In another project, we will revise our method so that SIA can handle missing values in data tables; work is also ongoing on reducing redundant rules in SIA.
+Finally, this work will be conducted interactively with applications, in particular regarding the contribution of SIA to the classification rules in the leaves of classification trees.
+
+
+
+\section{Annex 1: Two models of the classical implication intensity}
+
+\subsection{Binomial model}
+
+Examining the quality of the quasi-rule $a \Rightarrow b$, when the variables are binary, amounts to measuring the quality of the inclusion of the subset of transactions satisfying $a$ in the subset of transactions satisfying $b$.
+The counter-examples to the inclusion are indeed the same as those to the implication expressed by: ``any transaction satisfying $a$ also satisfies $b$''.
+From this overall perspective, as soon as $n_a < n_b$, the quality of the quasi-rule $a \Rightarrow b$ can only be semantically better than that of $b \Rightarrow a$.
+We will therefore assume from now on that $n_a \leq n_b$ when studying $a \Rightarrow b$. In this case, the main population is finite and $Card~ E = n$.
+
+Binomial modelling was the first to be adopted chronologically (see~\cite{Grasb} chap. 2).
+It was compared to other models in~\cite{Lermana}.
+Let us briefly recall what the binomial model consists of.
+With the adopted notations, $X$ and $Y$ are two random subsets, independently chosen from all the parts of $E$, with the same respective cardinals $n_a$ and $n_b$ as the subsets of the realizations of $a$ and $b$.
+The observed value $n_{a \wedge \overline{b}}$ can be considered as the realization of a random variable $Card(X\cap \overline{Y})$ which represents the random number of counter-examples to the inclusion of $X$ in $Y$, counter-examples observed during $n$ successive independent draws.
From there, $Card(X\cap \overline{Y})$ can be considered as a binomial variable of parameters $n$ and $\pi$, where $\pi$ is itself estimated by $p = \frac{n_a}{n}\frac{n_{\overline{b}}}{n}$. Thus:
+
+$$Pr[Card(X\cap \overline{Y})= k]= C_n^k\left( \frac{n_an_{\overline{b}}}{n^2} \right)^k \left(1-\frac{n_a n_{\overline{b}}}{n^2} \right)^{n-k} $$
+
+The estimated centered reduced variable $Q(a,~\overline{b})$ then takes as a realization:
+
+$$q(a,\overline{b}) = \frac{n_{a \wedge \overline{b}}-\frac{n_a n_{\overline{b}}}{n}}{\sqrt{\frac{n_a n_{\overline{b}}}{n}\left(1-\frac{n_a n_{\overline{b}}}{n^2}\right)} }$$
+
+As before, we obtain the estimated intensity of empirical implication:
+$$\varphi(a,b)=1-Pr[Q(a,\overline{b})\leq q(a,\overline{b})] = 1 - \sum_{k=0}^{n_{a \wedge \overline{b}}} C_n^k\left (\frac{n_an_{\overline{b}}}{n^2}\right )^k\left (1-\frac{n_an_{\overline{b}}}{n^2}\right )^{n-k}$$
+
+
+The probability law of $Q(a,\overline{b})$ can be approximated by the standard normal (Laplace-Gauss) law $N(0,1)$. Generally, the intensity calculated in the Poisson model is more ``severe'' than the intensity derived from the binomial model, in the sense that $\varphi(a,b)_{Poisson} \leq \varphi(a,b)_{Binomial}$.
+
+\remark We can note that the implication index is null if and only if the two variables $a$ and $b$ are independent. Indeed, we have
+$$ q(a,\overline{b}) = \frac{n_{a \wedge \overline{b}}-\frac{n_a n_{\overline{b}}}{n}}{\sqrt{\frac{n_a n_{\overline{b}}}{n}\left(1-\frac{n_a n_{\overline{b}}}{n^2}\right)} } =0 \iff n_{a \wedge \overline{b}}- \frac{n_a n_{\overline{b}}}{n}=0$$
+
+$$q(a,\overline{b}) =0 \iff n_{a \wedge \overline{b}}=\frac{n_a n_{\overline{b}}}{n} \iff \frac{n_{a \wedge \overline{b}}}{n}=\frac{n_a}{n}\frac{n_{\overline{b}}}{n}$$
+
+This last relationship reflects the property of statistical independence.
+
+\subsection{Hypergeometric model}
+Let us briefly recall the third model proposed in \cite{Lermana} and \cite{Grasd}.
We repeat the same approach: $A$ and $B$ are the parts of $E$ representing the individuals satisfying $a$ and $b$ respectively, with cardinals $card (A)=n_a$ and $card (B)=n_b$. Let us then consider two independent random parts $X$ and $Y$ such that $card (X)=n_a$ and $card (Y)=n_b$. The random variable $Card(A \cap \overline{Y})$ represents the random number of elements of $E$ which are in $A$ but not in $Y$. This variable follows a hypergeometric law and we have, for all $k \leq n_a$:
+
+$$Pr[Card(A \cap \overline{Y})=k]=\frac{C_{n_a}^k C_{n-n_a}^{n-n_b-k}}{C_n^{n-n_b}} =\frac{n_a!n_{\overline{a}}! n_b!n_{\overline{b}}! }{k!n!(n_a-k)!(n_{\overline{b}}-k)! (n_b-n_a+k)! }$$
+
+$$=\frac{C_{n-n_b}^k C_{n_b}^{n_a-k}}{C_n^{n_a}} = Pr[Card(X \cap \overline{B})=k]$$
+
+This shows, by exchanging the roles of $a$ and $b$, that the empirical implication index $Q(a,\overline{b})$ corresponding to the quasi-rule $a \Rightarrow b$ is the same as the one corresponding to the reciprocal, i.e. $Q(b,\overline{a})$. We thus obtain the same intensity for the quasi-rule $a \Rightarrow b$ and for the reciprocal quasi-rule $b \Rightarrow a$.
+
+\subsection{Choice of models to evaluate the intensity of implication}
+While binomial modeling remains compatible with the semantics of implication, a non-symmetric binary relationship, the same cannot be said for hypergeometric modeling, since it does not distinguish the quality of a quasi-rule from that of its reciprocal and therefore has little pragmatic value.
+Consequently, we will only retain the Poisson model and the binomial model as models adapted to the semantics of implication between binary variables.
+
+
+The legitimate coexistence of three different models for our problem of measuring the quality of a quasi-rule is not inconsistent: it is due to the way in which the drawing of transactions (Poisson's law) or of sets of grouped transactions (binomial law or hypergeometric law) is taken into account one by one.
In addition, we know that when the total number of transactions becomes very large, all three models converge to the same Gaussian model. In~\cite{Lallich}, we find, as a generalization, a parameterization of the three indices obtained by these models, which allows us to evaluate the interest of the rules obtained by comparing them to a given threshold.
+
+\section{Annex 2: Modelling of implication integrating confidence and surprise}
+
+Recently, in~\cite{Grasab}, we have combined two statistical concepts that we believe are internal to the implicative relationship between two variables $a$ and $b$:
+\begin{itemize}
+\item on the one hand, the intensity of implication $\varphi(a,b)$, measuring surprise or astonishment at the low number of counter-examples to the implication between these variables;
+\item on the other hand, the confidence $C(b \mid a)$, measuring the conditional frequency of $b$ given $a$, which is involved in the majority of the other implication indices, as we have seen in §2.5.4.
+\end{itemize}
+
+So we claim, paraphrasing G. Vergnaud~\cite{Vergnaudd} speaking about aesthetics, that there is no data analysis without {\bf confidence} (psychological level). But there is also no data analysis without {\bf surprise}\footnote{This is also what René Thom says in~\cite{Thoma} p. 130 (translated into English): ``...the problem is not to describe reality, the problem is much more to identify in it what makes sense to us, what is surprising in all the facts. If the facts do not surprise us, they do not bring any new element to the understanding of the universe: we might as well ignore them'' and further on: ``... which is not possible if we do not already have a theory''.} (statistical level), nor without {\bf scale correction} (pragmatic level).
The two concepts (confidence and intensity of implication) therefore respond to relatively distinct but not contradictory principles: confidence is based on the subordination of variable $b$ to variable $a$, while intensity of implication is based on the counter-examples to the subordination of $b$ to $a$.
+
+It is demonstrated in~\cite{Grasab} that, for any $\alpha$, the ratio
+
+$$ \frac{Pr[C(b\mid a)\geq \alpha]}{Pr[\varphi(a,b)\geq \alpha]}~\mbox{is close to}~ \frac{Pr[C(b \mid a) \geq \alpha]}{1-\alpha}$$
+
+
+Under these conditions, this ratio is a good indicator of the agreement between confidence and intensity of implication: when it is greater than 1, confidence is better than intensity; when it is less than 1, intensity is stronger. Further research could be based on this indicator.
+
+Finally, as we did for entropic intensity, we take the contrapositive into account by associating two conditional frequencies: that of $b$ given $a$, i.e. $C_1(a,b)$ (for the direct implication $a \Rightarrow b$), and that of $\neg a$ given $\neg b$, i.e. $C_2(a,b)$ (for the contrapositive implication $\neg b \Rightarrow \neg a$). We choose the following formula to define a new measure of implication that we call {\bf implifiance} in French (implication + confidence):
+
+$$ \phi(a,b)=\varphi(a,b) \cdot \left [ C_1(a,b) \cdot C_2(a,b) \right ]^{\frac{1}{4}}$$
+
+For example, if we extract a rule whose implifiance is equal to $0.95$, its intensity of implication is at least equal to $0.95$ and each of the confidences $C_1$ and $C_2$ is at least equal to $0.81$. If the implifiance is equal to $0.90$, the respective minima are $0.90$ and $0.66$, which preserves the plausibility of the rule.
+
+The following two figures show the respective variations in intensity of implication, entropic intensity and implifiance (on the ordinate) as a function of the number of counter-examples in the cases $n=100$ and $n=1000$ (respectively in Figures~\ref{chap2fig9} and~\ref{chap2fig10}).
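+
+As a sketch of where the minima quoted above come from (assuming only that $\varphi$, $C_1$ and $C_2$ all lie in $[0;1]$, and writing $s$ for a hypothetical threshold on $\phi$, a symbol not used elsewhere in this chapter), one may reason as follows:
+$$\phi(a,b) \geq s ~\Longrightarrow~ \varphi(a,b) \geq s ~\mbox{and}~ C_1(a,b)\,C_2(a,b) \geq s^4 ~\Longrightarrow~ C_1(a,b) \geq s^4 ~\mbox{and}~ C_2(a,b) \geq s^4$$
+With $s=0.95$ this gives $s^4 \approx 0.8145$, and with $s=0.90$ it gives $s^4 = 0.6561$, in line with the values $0.81$ and $0.66$ after rounding.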
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=1.3]{chap2fig9.png}
+\caption{Example of Implifiance with $n=100$.}
+
+\label{chap2fig9}
+\end{figure}
+
+\begin{figure}[htbp]
+  \centering
+\includegraphics[scale=1.3]{chap2fig10.png}
+\caption{Example of Implifiance with $n=1000$.}
+
+\label{chap2fig10}
+\end{figure}
+
+
+\section{Annex 3: SIA and Hempel's paradox}
+
+If we look at SIA from the point of view of Knowledge Extraction, we find the main objective of inductively establishing rules and quasi-rules between variables $a$ and $b$ observed through instances $x$ of a set $E$ of objects or subjects. A strict rule (or theorem in this case) is expressed in symbolic form: $\forall x, (a(x)\Rightarrow b(x))$. A quasi-rule admits counter-examples, i.e. the following statement is observed: $\exists x, (a(x)\wedge \overline{b(x)})$.
+
+
+The purpose of SIA is to provide a measure for such rules in order to estimate their quality when the frequency of the last statement above is low.
+First, within the framework of SIA, a quality index is constructed in order, like other indices, to provide a probabilistic response to this problem.
+But in seeking among the rules\footnote{$n_{a \wedge \overline{b}}$} those that would express a causality, or at least a causal relationship, it seemed absolutely necessary to us, as we said in point 4, to support the satisfaction of the direct rule by a measure of its contrapositive: $\forall x, (\overline{b(x)} \Rightarrow \overline{a(x)})$.
+Indeed, whereas statistically, whether with confidence measured by conditional frequency or with intensity of implication, the truth of a strict rule is also obtained for its contrapositive, this is no longer necessarily the case with a quasi-rule.
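+
+As a purely illustrative sketch of this last point, take hypothetical counts (not drawn from any data set in this book) and assume, in line with the verbal definitions of Annex 2, that the confidence of the direct rule is $C_1 = 1 - \frac{n_{a \wedge \overline{b}}}{n_a}$ and that of its contrapositive is $C_2 = 1 - \frac{n_{a \wedge \overline{b}}}{n_{\overline{b}}}$. With $n=100$, $n_a=50$ and $n_{\overline{b}}=5$, a strict rule ($n_{a \wedge \overline{b}}=0$) gives $C_1=C_2=1$, whereas a quasi-rule with $n_{a \wedge \overline{b}}=4$ gives
+$$C_1 = 1-\frac{4}{50} = 0.92 \quad \mbox{and} \quad C_2 = 1-\frac{4}{5} = 0.20$$
+so that, in this sketch, four counter-examples barely affect the direct quasi-rule while almost cancelling the confidence of its contrapositive.
+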
+We have also sought to construct, in a new and original way, a measure that makes it possible to overcome Hempel's paradox~\cite{Hempel} and thus confirms the satisfaction of induction in terms of causality.
+
+
+It should be recalled that, according to Carl G. Hempel, in strict logic, this paradox is linked to the irrelevance of the contrapositive with respect to induction when the premise $a$ is empirically (de facto) not satisfied.
+It is the consequence of the application of Hempel's 3rd principle: ``If an observed object $x$ does not satisfy the antecedent (i.e. $a(x) = false$), it does not count or it is irrelevant in relation to the conditional (= the direct proposition)''.
+In other words, the confirmation of the contrapositive contributes nothing to the direct version of the proposition, although it is logically equivalent to it.
+For example, observing a red cat (i.e. a non-black object that is not a crow), which confirms the contrapositive of ``All crows are black'', does not confirm the validity of ``All crows are black''. Nor, for that matter, does continuing to observe other non-black objects: to confirm this statement and thus validate the induction, we would have to review all the non-black objects, which may be infinite in number.
+
+In other words, according to Hempel, in the truth table of the implication, the cases where $a(x)$ is false are uninteresting for induction; only the lines [$a(x)=true$ and $b(x)=true$], which confirm the rule, and [$a(x)=true$ and $b(x)=false$], which invalidate it, are retained.
+\\
+
+\underline{However, in SIA, this paradox does not hold for two reasons:}
+
+
+\begin{enumerate}
+\item the objects $x$ are part of the same reference set $E$, finite or infinite (countable or even continuous), in which all $x$ are likely, with relevance, to satisfy or not satisfy the variables at stake.
That is, by assigning them a value (truth or numerical), the direct proposition and/or its contrapositive can also be evaluated (for example, proposition $a \Rightarrow b$ is true even if $a(x)$ is false while $b(x)$ is true);
+\item since we are most often dealing with quasi-rules, the equivalence between a proposition and its contrapositive no longer holds, and it is on the basis of the combination of the respective evaluated qualities of these statements that we induce or not a causal character. Moreover, if the rule is strict, the logical equivalence with its contrapositive is strict and the contrapositive rule is satisfied at the same time.
+\end{enumerate}
diff --git a/references.tex b/references.tex index d8cae26..38f2536 100644 --- a/references.tex +++ b/references.tex @@ -204,6 +204,9 @@ Cépaduès Ed. Toulouse, p. 195-208, ISBN: 978.2.36493.577.8. \bibitem{Guillet} Guillet, F., Hamilton, H. J. (2007) Quality measures in data mining (Vol. 43). Springer. +\bibitem{Hempel} Hempel C. (1945) Studies in the Logic of Confirmation, Oxford University Press + + \bibitem{Jacquard} Jacquard A. (2001) La science à l’usage des non-scientifiques », p.159, 2001. @@ -328,6 +331,25 @@ Cépaduès Ed. Toulouse, p. 195-208, ISBN: 978.2.36493.577.8. \bibitem{Thoma} Thom R. (1983) Paraboles et catastrophes, Champs Sciences. \bibitem{Thomb} Thom R. (1993) Prédire n'est pas expliquer, Champs Sciences. +\bibitem{Toivonen} Toivonen H., Klemettinen M., Ronkainen P., Hätönen K. and Mannila H. (1995) Pruning and grouping of discovered association rules. Workshop notes of the ECML Workshop on Statistics, Machine Learning and Knowledge Discovering in Databases, p. 47-52. + +\bibitem{Tomasis} Tomatis A. (1977), L’oreille et la vie, Paris: R. Laffont. + +\bibitem{Trinh} Trinh X. T. (1998), Le chaos et l’harmonie, la fabrication du réel, Librairie Arthème Fayard, Paris. + + \bibitem{Vergnauda} Vergnaud G. 
and Durand C.(1976), Structures additives et complexité psychogénétique, Revue Française de Pédagogie n° 36. + + \bibitem{Vergnaudb} Vergnaud, G. (1981). Quelques orientations théoriques et méthodologiques des recherches françaises en didactique des mathématiques – Recherches en Didactique des Mathématiques, 2.2, p. 215-232. + + \bibitem{Vergnaudc} Vergnaud, G. (1990). La théorie des champs conceptuels. Recherches en Didactiques des Mathématiques, 10 (23), p. 133-170. + + \bibitem{Vergnaudd} Vergnaud G. (2007), Activités humaines et conceptualisation, Presses Universitaires du Mirail, p. 29. + + \bibitem{Vygotsky} Vygotsky L. (1997). Pensée et langage (1933) (traduction de Françoise. Sève, avant-propos de Lucien Sève), suivi de Commentaires sur les remarques critiques de Vygotski de J. Piaget,(Collection Terrains, Éditions Sociales, Paris, 1985), Rééditions La Dispute, Paris. + + \bibitem{Zadeha} Zadeh L. A., (1979), A Theory of Approximate Reasoning, J. Hayes, D. Michie, and L.I. Mikulich eds., Machine Intelligence 9, New York: Halstead Press, p. 149-194. + +\bibitem{Zadehb} Zadeh L. A. (1997), Toward a Theory of Fuzzy Information Granulation and its Centrality in Human Reasoning and Fuzzy Logic, Fuzzy Sets and Systems 90, p. 111-127. %% % and use \bibitem to create references.