Quantitative analysis of breast cancer diagnosis using a probabilistic modelling approach

Background: Breast cancer is the most prevalent cancer in women in most countries of the world. Many computer-aided diagnostic methods have been proposed, but there are few studies on quantitative discovery of probabilistic dependencies among breast cancer data features and identification of the contribution of each feature to breast cancer diagnosis. Methods: This study aims to fill this void by utilizing a Bayesian network (BN) modelling approach. A K2 learning algorithm and statistical computation methods are used to construct the BN structure and assess the obtained BN model. The data used in this study were collected from a clinical ultrasound dataset derived from a local hospital in China and a fine-needle aspiration cytology (FNAC) dataset from the UCI machine learning repository.


Introduction
Breast cancer is the most prevalent cancer in women around the world. It has been reported that approximately 1.3 million women worldwide have been diagnosed with breast cancer since 2011, and approximately 465,000 women die from breast cancer each year [1]. In China, 214,360 women had died from breast cancer by 2008. It has been estimated that the number of Chinese women with breast cancer will reach 2.5 million by 2021 [2]. According to a report published by the Chinese National Cancer Centre in 2017, breast cancer is the most common cancer in Chinese women.
Following lung, stomach, liver, oesophageal and colorectal cancers, breast cancer is the sixth largest killer in small- and medium-sized cities, with mortality rates of 8.44% and 9.59%, respectively, while the mortality rate from breast cancer in large cities is 12.78%, making it the fifth most common cause of death among all cancer types in Chinese women [3]. Due to the rapid increase in the number of breast cancer patients, early identification of women at risk of developing breast cancer is currently an international priority [4].
In order to improve diagnostic accuracy and help domain experts to make more effective decisions, many computer-aided diagnosis (CAD) systems have been developed [5]-[7]. They provide new computational algorithms combined with domain knowledge to support clinical diagnosis. Zeng et al. [8] proposed different nonlinear state-space models for lateral flow immunoassay, which have been commonly used in clinical diagnosis. In clinical medicine, breast cancer can also be diagnosed via several different techniques, such as ultrasound, fine-needle aspiration cytology (FNAC) and magnetic resonance imaging (MRI) scanning.
Other CAD algorithms in [12] were also proposed to detect breast cancer. For example, Eltoukhy et al. [12] proposed a feature extraction method based on a statistical t-test for breast cancer diagnosis from a digital mammogram. They used wavelet and curvelet methods to transform digital mammography data into vector coefficients and then employed a support vector machine (SVM) algorithm for breast cancer diagnosis. As a result, the highest diagnostic accuracy based on wavelet and curvelet coefficients was 96.56% with 1238 features and 97.30% with 5663 features [12].

M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT
As a well-established probabilistic classifier, Bayesian network (BN) analysis has been used widely for data analytics and data modelling in many healthcare areas, such as psychotic depression [15], Alzheimer's disease [16], heart disease [17] and social anxiety [18], as well as breast cancer.
Wang et al. [19] proposed a three-layer BN for earlier diagnosis of breast cancer. They assessed the performance of the BNs constructed based on non-imaging features, imaging features and both. They found that the BN built on both imaging and non-imaging features performed well and that imaging features dominated BN performance. In 2007, Nicandro et al. [20] evaluated the performances of seven BN classifiers (i.e. Naïve Bayes classifier, Bayes-N, MP-Bayes, Greedy, MP-Bayes + Greedy, PC, which is a procedure contained in Tetrad, and the CBL2 algorithm in Power Constructor, which is a software package containing the CBL1 and CBL2 algorithms) for breast cancer diagnosis based on fine-needle aspiration from a breast lesion collected by a single observer and multiple observers.
They found that the classifiers learnt from different data performed differently, which indicated that the observations would impact the breast cancer diagnostic result. Furthermore, in 2009 [21], Nicandro and his team made use of two decision trees and four different BNs for breast cancer diagnosis. Their study discovered interobserver variability in breast cancer cytodiagnosis, indicating that different observers would focus on different perspectives while making a diagnostic decision. Kalet et al. [22] designed a Bayesian model to detect a misdiagnosis made at the initial diagnostic stage of a disease such as lung, brain or female breast cancer. The BN model they designed produced a better AUC (0.98) than a decision made by clinical experts (0.90).
Additionally, BNs were also used in other studies of breast cancer, such as risk factor estimation [23] and causal interaction detection [24]. Nicandro et al. [25] employed a score-based BN approach to estimate the power of thermography for breast cancer diagnosis. The BNs were learned using Naïve Bayes, hill-climber and repeated hill-climber algorithms with a minimum description length (MDL) metric. The BN learned by a repeated hill-climber algorithm provided the best accuracy for both cancer and non-cancer diagnosis (75.50 ± 6.99%) and sensitivity of cancer diagnosis (94%). Furthermore, their obtained BN identified five important features for breast cancer diagnosis: 1C (hottest point in only one breast), f unique (total number of hottest points), thermovascular network (number of veins with the highest temperature), curve pattern and asymmetry (temperature difference between the right and left breasts).
Although a BN modelling approach has been used for breast cancer diagnosis, a report on quantitative analysis among different breast cancer features, which is critically important for clinical decision making, is lacking. As a result, some researchers might ignore the relationships between different features, which may lead to a high misdiagnosis rate [19] [26]. A clearly explained BN in a medical area can increase the understanding of disease pathology and provide valuable decision-making assistance to domain experts. This paper employed a BN modelling approach to discover the probabilistic relationships between different data features of breast cancer. We also analysed the contribution of each feature to breast cancer diagnosis. The data were focused on ultrasound and FNAC examinations obtained from The First Affiliated Hospital of Fujian Medical University, China and the Breast Cancer Wisconsin Dataset (BCWD) of the UCI machine learning repository [27].
BN modelling can be deconstructed into two sub-processes: structure learning and parameter learning. In this study, a K2 learning algorithm [28] with an MDL score metric was used to learn the BN structure. Our reasons for using a K2 algorithm were the following: 1) K2 is the most commonly used algorithm for BN structure learning [29], 2) K2 is relatively easy to implement [29], 3) K2 only needs to consider a subset of a directed acyclic graph (DAG) and can quickly find the variable with the local maximal score [30] and 4) a K2 algorithm makes good use of experts' knowledge to learn the BN structure. The contributions of this study are as follows. 1) We discovered the most important features, which can provide uninitiated observers and doctors with objective and quantitative guidance to focus on specific features for early breast cancer diagnosis. 2) We analysed the probabilistic dependencies among different data features and identified the strength of each dependency, which can assist domain experts in making a quantitatively accurate diagnosis, even when fewer features are available. A focus on different features by different observers [21] may cause them to miss some important features, which can significantly influence diagnostic results. The above two contributions are helpful in decreasing the misdiagnosis rate. 3) Our study showed a potential translational application of the BN modelling approach to the breast cancer care pathway.
The remainder of this paper is organized as follows. Section 2 provides the basic theory of BN in detail, as well as a brief introduction to the technique of BN visualization. Section 3 presents the experimental results based on two real-world datasets. Section 4 discusses the results and evaluates the BN modelling approach in comparison with other methods. Finally, Section 5 concludes this paper and discusses potential extensions of the method in future work.

Methods and materials
Numerous approaches have been developed to support breast cancer diagnosis. BN analysis has been used widely to improve diagnostic accuracy and to discover probabilistic relationships among features and the influence of joint probability distribution inference.

Bayesian network
A BN represents a domain by explicitly providing a set of variables belonging to that domain and visualizing the relationships between the variables [31]. It can successfully represent uncertain knowledge in various fields [30]. A BN is usually represented using a DAG, $G = (V, E)$, where $V$ denotes a set of nodes made up of a set of variables, and $E$ denotes a set of edges between the nodes in $V$. No cycles are present in the DAG [32]. Each edge is directed from one node to another and indicates that the corresponding two nodes are mutually dependent. Otherwise, nodes are independent if there is no link between them.
Consider a given dataset $D$ containing a set of variables $X = \{X_1, X_2, \ldots, X_n\}$. The joint probability distribution on $X$ is
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \pi_i), \quad (1)$$
where $\pi_i$ is the set of parents of $X_i$.
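As a minimal illustration of this factorization, the sketch below multiplies one conditional probability per node to obtain the probability of a complete assignment. The three-node network (A -> C <- B), the node names and all probabilities are invented for illustration and are not taken from the datasets in this study.

```python
from itertools import product

# Toy DAG (hypothetical): A -> C <- B. All probabilities are invented.
parents = {"A": (), "B": (), "C": ("A", "B")}

# cpt[node][parent_states][state] = P(node = state | parents = parent_states)
cpt = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(): {0: 0.6, 1: 0.4}},
    "C": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6},
        (1, 1): {0: 0.1, 1: 0.9},
    },
}


def joint_probability(assignment):
    """Multiply P(X_i | parents of X_i) over all nodes of the network."""
    prob = 1.0
    for node, pa in parents.items():
        pa_states = tuple(assignment[p] for p in pa)
        prob *= cpt[node][pa_states][assignment[node]]
    return prob


# P(A=1, B=0, C=1) = P(A=1) * P(B=0) * P(C=1 | A=1, B=0) = 0.3 * 0.6 * 0.6
p = joint_probability({"A": 1, "B": 0, "C": 1})

# Summing the joint probability over all assignments should give 1.
total = sum(
    joint_probability(dict(zip("ABC", states)))
    for states in product((0, 1), repeat=3)
)
```

Storing one table per node, indexed by the parent configuration, mirrors the product form directly: each factor depends only on a node and its parents.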

Structure learning
The BN modelling process comprises structure learning and parameter learning. Structure learning aims at identifying the topology of the network in order to display the relationships among the nodes.
Parameter learning quantitatively finds how a node relates to its parent nodes [33]. BN structure learning methods are normally classified into search-and-score-based, constraint-based and dynamic programming-based categories [30]. Compared with the other two categories, the search-and-score-based category is suitable for large data sets in the whole feature space to assist in finding an exact structural topology [30]. As the name implies, it needs a search strategy and a score metric [33].
The K2 learning algorithm [28] is one of the most commonly used search strategies. It starts with an ordered set of nodes, each of which initially has no parents. It then iteratively adds parent nodes to the node of interest according to this order, i.e. if $X_i$ precedes $X_j$ ($i \neq j$), the edge from $X_j$ to $X_i$ is not allowed. Assume there is an ordered set of nodes $X = \{X_1, X_2, \ldots, X_n\}$, and each node $X_i$ has $r_i$ states. $\pi_i$ is initialized as empty at the beginning. The function $\mathrm{Pred}(X_i)$ represents the set of nodes preceding $X_i$ in $X$. The K2 learning algorithm will add a node $X_m$ of $\mathrm{Pred}(X_i)$ to $\pi_i$ if adding $X_m$ makes the new score $f_{new}$ larger than the old score $f_{old}$. The formula of the scoring function $f$ is given by equation (2) [28]:
$$f(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!, \quad (2)$$
where $q_i$ is the number of all possible values of $\pi_i$; $N_{ijk}$ is the number of cases in the given dataset in which the node $X_i$ is in the $k$th state and its parent set $\pi_i$ is in the $j$th state; and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.
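The greedy search can be sketched as follows. This is an illustrative sketch rather than the Weka implementation used in this study: it assumes the Cooper-Herskovits metric of [28] as the scoring function (computed in log space to avoid overflow), and the function names, the `max_parents` limit and the toy data are hypothetical.

```python
from itertools import product
from math import lgamma


def log_k2_score(data, i, parent_idx, arity):
    """Log of the Cooper-Herskovits metric for node i with the given parents."""
    r_i = arity[i]
    score = 0.0
    # Enumerate every joint state j of the parent set (one empty state if no parents).
    for pa_state in product(*(range(arity[p]) for p in parent_idx)):
        rows = [row for row in data
                if all(row[p] == s for p, s in zip(parent_idx, pa_state))]
        counts = [sum(1 for row in rows if row[i] == k) for k in range(r_i)]
        n_ij = sum(counts)
        # log[(r_i - 1)! / (N_ij + r_i - 1)!] + sum_k log(N_ijk!)
        score += lgamma(r_i) - lgamma(n_ij + r_i)
        score += sum(lgamma(n + 1) for n in counts)
    return score


def k2(data, order, arity, max_parents=2):
    """Greedy K2 search: parents are drawn only from predecessors in `order`."""
    parents = {i: [] for i in order}
    for pos, i in enumerate(order):
        best = log_k2_score(data, i, parents[i], arity)
        candidates = set(order[:pos])
        while candidates and len(parents[i]) < max_parents:
            scored = [(log_k2_score(data, i, parents[i] + [c], arity), c)
                      for c in candidates]
            new, c = max(scored)
            if new <= best:  # no candidate improves the score, so stop
                break
            parents[i].append(c)
            candidates.remove(c)
            best = new
    return parents


# Toy data (invented): column 1 copies column 0, column 2 is unrelated to both.
toy = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 5
structure = k2(toy, order=[0, 1, 2], arity=[2, 2, 2])
```

On the toy data the search links column 1 to column 0 and leaves column 2 without parents, because adding a parent to column 2 lowers its score.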

Parameter learning
The parameters of a BN are denoted as $\theta_{ijk}$, which is the conditional probability of node $X_i$ taking its $k$th value when its parent set $\pi_i$ takes its $j$th value, i.e.
$$\theta_{ijk} = P(X_i = k \mid \pi_i = j).$$
This learning process is implemented by the expectation-maximization (EM) algorithm [34].
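For complete data with no missing values, the maximization step of EM reduces to relative frequencies, i.e. each parameter is the count $N_{ijk}$ divided by $N_{ij}$. The sketch below shows only this counting step under that assumption; it is not a full EM implementation, and the function name and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_cpt(data, child, parents):
    """Return theta[j][k] = N_ijk / N_ij, estimated from complete records."""
    counts = defaultdict(Counter)
    for row in data:
        j = tuple(row[p] for p in parents)  # joint state of the parent set
        counts[j][row[child]] += 1
    return {j: {k: n / sum(c.values()) for k, n in c.items()}
            for j, c in counts.items()}


# Toy records (invented): Y tends to follow X.
records = [
    {"X": 0, "Y": 0}, {"X": 0, "Y": 0}, {"X": 0, "Y": 1},
    {"X": 1, "Y": 1}, {"X": 1, "Y": 1}, {"X": 1, "Y": 0},
    {"X": 1, "Y": 1},
]
theta = estimate_cpt(records, child="Y", parents=["X"])
```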

Strength of influence
A BN structures the probabilistic dependency between different nodes, which can be explained in different ways [35]. Following Koiter [36], the strength of influence (SI) between two directly linked nodes A and B is determined by the mean distance between different posterior probability distributions. In this computation, n is the number of states of a node, $P(A)$ is the a priori probability of A, $P(B \mid A)$ is the posterior probability of B given a certain state of A, and D is the distance function between these two probability distributions.
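To make the idea of a mean distance between posterior distributions concrete, the sketch below averages a distance over all pairs of conditional distributions of B, one per state of A. This is only one plausible reading: the choice of the Euclidean distance, the unweighted average (which ignores the prior of A) and all names and numbers are assumptions made for illustration, not the exact formula of [36].

```python
from itertools import combinations
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two discrete distributions."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def strength_of_influence(cond_dists, distance=euclidean):
    """Average distance between P(B | A = a_i) over all pairs of states of A."""
    pairs = list(combinations(cond_dists, 2))
    return sum(distance(p, q) for p, q in pairs) / len(pairs)


# Invented conditional distributions of a binary node B for two states of A.
weak = strength_of_influence([[0.5, 0.5], [0.55, 0.45]])   # B barely reacts to A
strong = strength_of_influence([[0.9, 0.1], [0.1, 0.9]])   # B depends heavily on A
```

The more the distribution of B changes when the state of A changes, the larger the value, which is what a thicker arc in the visualized BN is meant to convey.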

Materials
We applied a BN modelling approach combined with expert knowledge to analyse the probabilistic relationships among data features with respect to ultrasound and FNAC examinations for breast cancer diagnosis.

Clinical ultrasound dataset
We collected a total of 1993 complete data samples from clinical ultrasonic examination in The First Affiliated Hospital of Fujian Medical University, China. The data are composed of 1494 benign (non-cancerous) and 499 malignant (cancerous) samples. Each sample contains five ultrasonic features of the breast tumour: shape (SH), resistance index (RI), calcification (CA), blood signal (BS) and the diameter-to-width ratio (DW).
The feature SH is either regular or irregular and was labelled as either 0 or 1, respectively. RI has three categories: none, less than 0.7 and greater than or equal to 0.7, labelled as 0, 1 and 2, respectively. The value 0.7 is an important cut-off point to distinguish between benign and malignant tumours in clinical practice. The feature CA has three categories: no calcification, micro-calcification and macro-calcification, labelled as 0, 1 and 2, respectively. Feature BS has two categories, i.e. no and yes, labelled as 0 and 1, which respectively indicate that a blood signal in the breast tumour is absent or present. In clinical practice, the BS in a malignant tumour is richer than that in a benign tumour. The two categories of the feature DW were labelled as 0 (the tumour's diameter is greater than its width) and 1 (the tumour's width is greater than or equal to its diameter). However, DW was excluded from our study because most of its values were missing. The ultrasound data are described in Table 1.

FNAC examination dataset
The BCWD of the UCI machine learning repository contains a total of 683 complete FNAC data samples (444 benign and 239 malignant). The nine scored cytological features are bare nuclei (BANU), uniformity of cell size (UCSI), single epithelial cell size (SECS), uniformity of cell shape (UCSH), normal nucleoli (NN), marginal adhesion (MA), bland chromatin (BC), clump thickness (CT) and mitosis (MITO). All these features were computed from a digital image of a breast tumour FNAC examination, and they can be used to describe the characteristics of the cell nuclei shown in the image. Each feature was scored using an integer value ranging from 1 to 10, where 1 represented the most benign characteristic and 10 represented the most malignant characteristic. According to [27], each sample was classified into either the benign or the malignant diagnostic category, labelled as 0 or 1, respectively. The FNAC data are described in Table 1.

Data pre-processing
As BN structure learning is based on discretized data, we discretized the ultrasound data features according to their corresponding definitions, while each FNAC feature was discretized into two categories according to [37], [38] (see Table 1). We then applied an information gain algorithm [39] to rank each feature's relevance to the diagnostic result (i.e. benign or malignant). The obtained information gain score (IGs) of each feature is listed in descending order in the last column of Table 1. Finally, a K2 learning algorithm was applied to learn the BN structure of the given datasets with IGs-ordered features. All experiments were carried out on a PC using Weka [40] and GeNIe [41] software.
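The ranking step can be illustrated with a short sketch that computes the information gain of each discretized feature with respect to the class label and sorts the features in descending order, which gives the node ordering passed to K2. The feature names and values below are invented; the experiments themselves used Weka.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Invented toy data: 0 = benign, 1 = malignant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly predictive
    "f2": [0, 0, 0, 1, 0, 1, 1, 1],  # partly predictive
    "f3": [0, 1, 0, 1, 0, 1, 0, 1],  # uninformative
}
ranking = sorted(features,
                 key=lambda name: information_gain(features[name], labels),
                 reverse=True)
```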

Results
The BN model in Fig. 2 (a) shows that all four ultrasound features influence the diagnosis.
The links between SH, RI and BS show their relatedness, while CA is independent of the other three.
The BN model in Fig. 2

Discussion
In order to verify and validate the obtained BNs, statistical analysis based on pure data was also carried out.Fig. 5 shows the true-positive rate (TPr, malignant tumour is correctly classified as malignant), the true-negative rate (TNr, benign tumour is correctly classified as benign) and the overall correct diagnostic accuracy, using diamond, square and triangle markers, respectively, in terms of individual ultrasound features and their combinations.
The result suggested that, considering individual features, SH presented the highest diagnostic accuracy and TNr and a higher TPr. Although BS had the highest TPr, its accuracy and TNr were very low. Therefore, the feature SH presented higher diagnostic performance than other features, while BS showed the lowest diagnostic accuracy. With respect to the combinations of any two features, SH combined with RI presented the highest TNr of 0.995, a high TPr of 0.886 and the highest accuracy of 0.967. In terms of combinations of any three features, the combination of SH, RI and CA contributed the highest TNr of 0.999, a high TPr of 0.9 and the highest malignant tumour diagnosis accuracy of 0.978. Furthermore, the combination of all features (i.e. SH + RI + BS + CA) showed the highest performance in terms of TPr (0.934), TNr (0.999) and malignant tumour diagnosis accuracy (0.982). This is very consistent with the obtained BN shown in Fig. 2 (a).
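For reference, the three rates discussed above follow directly from the four cells of a confusion matrix, with malignant taken as the positive class. The counts in this sketch are invented and do not correspond to any row of the reported results.

```python
def diagnostic_rates(tp, fn, tn, fp):
    """TPr (sensitivity), TNr (specificity) and overall accuracy."""
    return {
        "TPr": tp / (tp + fn),        # malignant correctly classified as malignant
        "TNr": tn / (tn + fp),        # benign correctly classified as benign
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Invented counts: 100 malignant and 300 benign cases.
rates = diagnostic_rates(tp=90, fn=10, tn=280, fp=20)
```

Because benign cases outnumber malignant ones in both datasets, accuracy is dominated by TNr, which is why a feature can have a high TPr and still a low accuracy.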
Many CAD methods have been proposed to support breast cancer diagnosis. Some studies [19][26] assumed that the features of breast cancer data should be independent, i.e. the Naïve Bayes (NB) modelling method would be appropriate. In addition, decision tree methods, such as ID3 and J48, are also commonly used [21], while the NBTree is a classifier that combines the Naïve Bayes and decision tree methods. In this paper, we evaluated the performance of these representative and frequently used methods; we then compared them with that of a BN modelling approach based on ultrasound and FNAC datasets using 10-fold cross-validation in order to obtain a fair comparison of classification accuracy for breast cancer diagnosis.
Table 2 lists the sensitivity (TPr) and specificity (TNr) of malignant tumour diagnosis, the overall correct classification accuracy for both benign and malignant classes and the AUC (area under the ROC curve) performance in terms of the five classifiers applied to the two given datasets. For the ultrasound dataset, J48 showed the highest performance in terms of sensitivity (0.9110), specificity (0.9100), accuracy (0.909) and AUC (0.941), followed by ID3 (sensitivity 0.8847, specificity 0.9083, accuracy 0.899 and AUC 0.9258) and lastly, the BN (sensitivity 0.8762, specificity 0.8978, accuracy 0.898 and AUC 0.9298). The BN approach performed competitively with ID3, with an even higher AUC than that of ID3, while NB and NBTree presented relatively worse performance. Although J48 performed the best, it could not successfully represent the probabilistic relationships among different data features (see the J48 tree constructed on ultrasound data in Fig. 6). Moreover, the diagnostic result in terms of the RI feature is unacceptable. According to the definition of RI, if the RI is less than 0.7, the tumour will normally be considered as a benign sign; otherwise, it will be considered as a malignant sign. However, the obtained J48 tree classified the tumour into the malignant category regardless of whether its RI was less than 0.7. We then reviewed our obtained BN model, and it could correctly diagnose a tumour with an RI less than 0.7 as benign. This indicates that the BN model can provide a more accurate and reasonable diagnosis than the J48 tree. For the FNAC dataset, the BN performed best in terms of sensitivity (0.9659) and specificity (0.9617) of malignant tumour diagnosis, correct classification accuracy of 0.965 and an AUC of 0.9887.
Hence, the experimental results verified that the BN model is more reliable than the other models with which it was compared, both for discovering probabilistic dependencies among data features for breast cancer diagnosis and for improving diagnostic classification accuracy. More importantly, with the BN modelling approach, FNAC examination showed more accurate performance in terms of malignant tumour diagnosis and the correct classification of both malignant and benign tumours than ultrasound examination. However, FNAC is more complex than ultrasound examination.
To enhance the above analysis, we also calculated three classification errors, i.e. mean absolute error (MAE) [42], root mean square error (RMSE) [43] and relative absolute error (RAE) [43], of the five classifiers applied to the two datasets. Furthermore, the kappa coefficients (κ) of the five classification models based on ultrasound and FNAC datasets were calculated. The κ is an essential metric to evaluate the agreement between different classifiers [44]. The assessment of κ was based on Landis and Koch [45]. They considered 0.21 ≤ κ ≤ 0.40 to indicate fair agreement, 0.41 ≤ κ ≤ 0.60 to indicate moderate agreement, 0.61 ≤ κ ≤ 0.80 to indicate substantial agreement and 0.81 ≤ κ ≤ 1 to indicate almost perfect agreement. Fig. 8 shows that all five models constructed from ultrasound data achieved stronger agreement than would be expected from a random classifier, which indicates that they all performed well in breast cancer diagnosis. In particular, ID3 showed the highest κ of 0.7434, followed by BN (κ = 0.7378). For FNAC, the five models performed even better: the κ values of BN, NB, NBTree, ID3 and J48 (in descending order) were 0.9222, 0.9114, 0.8941, 0.8881 and 0.8681, respectively. This is further evidence that FNAC would be a more effective examination than ultrasound for breast cancer diagnosis, which is also consistent with our above-mentioned experimental results shown in Table 2.
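As a brief illustration of how κ relates observed agreement to chance agreement, the sketch below computes Cohen's kappa from a 2 x 2 confusion matrix and maps a value to the agreement bands of Landis and Koch's original scale. The confusion-matrix counts are invented; the two κ values passed to `landis_koch` in the checks are the BN values reported above.

```python
def cohen_kappa(tp, fn, tn, fp):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals of each class.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Agreement band of a kappa value on the Landis and Koch scale."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Invented counts: 100 malignant and 300 benign cases.
kappa = cohen_kappa(tp=90, fn=10, tn=280, fp=20)
```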

Conclusions
This study employed a BN modelling approach to support decision making on whether a breast tumour is diagnosed as benign or malignant. A benign breast tumour can be cured by treatment, but a malignant tumour, which is one of the most terrible diseases, cannot be completely cured. For the ultrasound dataset, the most important marker for diagnosis would be SH, followed by RI and CA. CA is independent of other features, but it cannot be ignored when making a diagnostic decision. RI is tightly associated with BS. We can select RI as a measure for diagnosis, but cannot use BS alone due to its weak influence. In addition to the ultrasound features we investigated, other features from ultrasound scanning were accessible, such as echo halo and posterior shadowing; however, they fall into the area of medical image processing, which is outside the scope of this study. In terms of the FNAC dataset, the best-performing marker for breast cancer diagnosis is BANU, followed by UCSI, UCSH, NN, MITO and MA. Clinicians can refer to the importance ranking of these features when making diagnostic decisions. Due to the strong dependencies between UCSI and UCSH, a breast tumour will be considered benign if both the size and shape of the cells are uniform. Malignancy should be associated with a large number of BANU.
This study can provide a valuable guide and assist observers in focusing on the most valuable features when collecting data; it can also indirectly aid breast cancer oncologists in making more accurate diagnoses when few features are known. Moreover, the BN-observed probabilistic relationships between different clinical features of breast tumours, identified from both ultrasound and FNAC examinations, can help oncologists to make more exact inferences, especially for specific patients with some missing feature values. In the meantime, the BN modelling approach can be extended to the diagnosis of other diseases in the healthcare area, which is a potential application of our work.
Although the BN modelling approach performed well in our study, it requires that the data used be discrete [46]. Therefore, we need to discretize continuous data prior to data modelling. However, it is difficult to determine a stable cut-point value for discretization of specific features, which may cause inflated performance estimates. Therefore, a regression method that suits continuous data will be investigated in our future work.
Breast cancer comprises many subtypes at the molecular level [47], and the treatments may differ between subtypes [49]. Our study so far focuses on two datasets collected from different countries, and the data only reflect binary diagnostic categories, which might limit our research in its current stage. However, this study attempts to fill a need by using machine learning and data modelling approaches to discover holistic quantitative relationships among breast cancer data, an area in which few studies exist. We are not searching for a new aetiology, and the experimental results from our study based on available data are consistent with the known aetiology of breast cancer. This can be used to validate our approach for use with potential new data types. Therefore, in the next research stage, we will collect more data types and time-evolved data from our collaborative hospital to extend our study not only to breast cancer diagnosis but also to prognosis.

Koiter [36] proposed a comprehensive technique to visualize the inference in a BN in order to clearly understand the constructed BN. The technique can show the strength of the influence between two directly linked nodes by automatically adjusting the thickness of the corresponding arcs. Koiter made use of a dynamic model to show this influence without considering the direction of the arc in a BN. The proposed dynamic model is illustrated in Fig. 1, where each node represents a variable. The red crossed sign denotes the node as a target node, and the green question mark means that the BN has not yet been updated. The four possible cases in Fig. 1 are as follows: (1) an arc from a target node A to a non-target node B, e.g. the influence B-to-A (Fig. 1 (a)); (2) an arc from a non-target node A to a target node B, e.g. the influence A-to-B (Fig. 1 (b)); (3) an arc between two target nodes (Fig. 1 (c)) and (4) an arc between two non-target nodes (Fig. 1 (d)). Both cases (3) and (4) illustrate the influences in both directions, e.g. the influence A-to-B and B-to-A.
The feature DW was excluded from our study because 1305 out of 1993 DW values were missing. We randomly split both datasets into 10 folds for cross validation. Each fold of the ultrasound dataset contained 1000 training data points (750 benign samples and 250 malignant samples) and 993 testing data points (744 benign samples and 249 malignant samples). For FNAC data, each fold contained 355 training data points (230 benign samples and 125 malignant samples) and 328 testing data points (214 benign samples and 114 malignant samples). After 10-fold cross validation, the obtained BNs were visualized and are shown in Fig. 2, where the rectangular boxes stand for feature nodes. In each box, the feature's name is labelled in the upper cell, the categories of each discretized feature and the corresponding probability distributions are shown in the lower left cell, and the lower right cell shows the probabilities using different coloured bars. The arrows indicate the probabilistic influence between two features, and the thickness of the arrow represents the strength of the corresponding influence. The thicker the arrow, the stronger the influence. Since diagnosis in the obtained BN models (Fig. 2) was set as the target while the other features were non-targets, the influences between diagnosis and other features should be in line with the case in Fig. 1 (a), while the influences among non-targets are in line with Fig. 1 (d).

Fig. 4 (b) clearly shows the strength of the dependencies among FNAC features. The highest triangle

Fig. 7 (a) shows that the BN constructed on the ultrasound dataset obtained the lowest MAE (0.143) and RAE (0.381). ID3 had the lowest RMSE (0.269), and BN had the second lowest RMSE (0.271). For the FNAC dataset, BN presented the lowest MAE (0.044), RMSE (0.179) and RAE (0.098) (see Fig. 7 (b)). This further strengthens the competitiveness of the BN model in comparison with other representative models.
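The three error measures can be sketched as follows, using their usual definitions: MAE is the mean absolute difference between predicted and actual values, RMSE is the square root of the mean squared difference, and RAE divides the total absolute error by the total absolute error of a predictor that always outputs the mean of the actual values. Treating the predictions as class probabilities for the malignant class is an assumption made for illustration, and the numbers are invented.

```python
from math import sqrt


def error_metrics(predicted, actual):
    """Return (MAE, RMSE, RAE) for paired predicted and actual values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for p, a in zip(predicted, actual)]
    mae = sum(abs_err) / n
    rmse = sqrt(sum(e ** 2 for e in abs_err) / n)
    # Error relative to always predicting the mean of the actual values.
    rae = sum(abs_err) / sum(abs(a - mean_actual) for a in actual)
    return mae, rmse, rae


# Invented example: actual labels (1 = malignant) and predicted probabilities.
actual = [1, 0, 0, 1]
predicted = [0.9, 0.2, 0.1, 0.6]
mae, rmse, rae = error_metrics(predicted, actual)
```

RMSE penalizes a few large errors more heavily than MAE does, which is one reason why two classifiers can rank differently on the two measures.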
Early diagnosis and treatment will improve patient survival. A BN modelling approach has been used widely for breast cancer diagnosis, but a report quantitatively analysing the contributions of different breast cancer features to diagnosis is lacking. Moreover, studies about probabilistic relationships between different breast tumour features based on ultrasound examination and FNAC examination, respectively, are few. In this paper, we attempted to meet these two challenges using a BN modelling approach. The BN model we obtained can support clinical decisions in an automated manner by using a set of machine learning algorithms. We used a clinical ultrasound dataset from a local hospital and an FNAC dataset from the openly sourced UCI machine learning repository. Numeric data features were discretized in line with their descriptions. The BNs were structured according to a K2 learning algorithm and validated by 10-fold cross validation. We carried out extensive experiments to evaluate the performance of the employed BN modelling approach and four commonly used methods, i.e. NB, ID3, J48 and NBTree, in terms of sensitivity and specificity of malignant tumour diagnosis, the overall correct classification accuracy, and the AUC metrics. It turned out that the BN model is competitive and promising for breast cancer diagnosis. The BN model can explicitly present the probabilistic relationships between directly linked data features. SI mirrors the importance of different features to diagnosis and the intensity of dependence between features. The most important ultrasound marker with respect to breast cancer diagnosis would be SH.

Fig. 1. Four cases of a dynamic BN model. Each elliptic node stands for a variable. The red crossed

Fig. 3. Strength of the influences between diagnosis and each single feature in terms of ultrasound (a) and FNAC (b) datasets.

Fig. 4. Strength of the dependencies among different features in terms of ultrasound (a) and FNAC (b)

Fig. 5. Performance of each individual feature and their combination in diagnosis based on the