APPLICATION OF PYTHON LIBRARIES FOR VARIANCE, NORMAL DISTRIBUTION AND WEIBULL DISTRIBUTION ANALYSIS IN DIAGNOSING AND OPERATING PRODUCTION SYSTEMS
DIAGNOSTYKA, Vol. 22, No. 4 (2021), Chmielowiec A., Klich L.

The use of statistical methods in the diagnosis of production processes dates back to the beginning of the 20th century. The widespread computerization of processes has confronted enterprises with the challenge of processing large sets of measurement data. The growing number of sensors on production lines requires faster and more effective methods both for process diagnostics and for finding connections between individual systems. This article is devoted to the use of Python libraries to effectively solve selected problems related to the analysis of large data sets. The article is based on experience with data analysis in a large automotive company whose annual production reaches 10 million units. The methods described in this publication were the basis for the initial analysis of production data in the plant, and the obtained results fed the production database and an automatic anomaly detection system based on artificial intelligence algorithms.


INTRODUCTION
Serial production reaching tens or even hundreds of millions of pieces a year is not unusual these days. Modern factories full of robots and automation continuously produce huge amounts of goods. However, this level of automation requires proper control of the production processes. It is usually connected with the necessity of installing a large number of sensors that record the condition of machines and the quality of manufactured products. The number of sensors is as high as $10^3$ in the case of a medium-sized company, and even $10^5$ in the case of very large factories. Therefore, the production state at time $t_j$ can be described as a series of measurement data $s_j = (x_{1,j}, x_{2,j}, \dots, x_{n,j})$, where $x_{i,j}$ are the measurement results from individual sensors. By writing down the state of the company at successive moments of time $t_1 < t_2 < \dots < t_m$ we obtain a matrix that reflects the changes in the production process:

$$X = \begin{pmatrix} x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\ x_{1,2} & x_{2,2} & \cdots & x_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1,m} & x_{2,m} & \cdots & x_{n,m} \end{pmatrix}.$$

Data from the measurement matrix are analyzed in order to detect abnormalities or specific relationships between events. Even an apparently small matrix becomes computationally very demanding if it is necessary to independently analyze the sub-matrices contained in it. Statistical tools are very often used to determine interrelationships between events and to detect irregularities. They make it possible to automatically process the data contained in such a matrix and to indicate interesting relationships between the variables that make up the state of the production process. This article presents basic tools of statistical analysis that can be used to detect dependencies and interrelationships between the values of the presented matrix.
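The measurement matrix described above maps directly onto a two-dimensional numpy array. The sketch below uses synthetic values and illustrative dimensions (4 sensors, 6 time instants); it is only meant to fix the row/column convention used in this article.

```python
import numpy as np

# Illustrative measurement matrix: rows are time instants t_1..t_m,
# columns are the n sensors; the values are synthetic.
rng = np.random.default_rng(0)
n_sensors, n_times = 4, 6
X = rng.normal(loc=100.0, scale=2.0, size=(n_times, n_sensors))
print(X.shape)  # each row is the production state s_j at time t_j
```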
The study of the statistical properties of time series from the production process is today a very common method of assessing the quality of this process. Advanced methods of data analysis make it possible to control the quality of the process, assess reliability or test the robustness of a design. Reliability theory draws a lot from statistical methods, as evidenced, inter alia, by the monographs of Barlow and Proschan [10], Ansell and Phillips [5], Johnson et al. [41], Birolini [11], Woo [94], and Grynchenko and Alfyorov [30]. Its research area is the variability of the quality function over time, which is very well expressed in terms of the probability calculus. Distributions modeling the life cycle of machines and devices allow for effective management of production lines, their operational reliability and the quality of manufactured elements. A particularly important concept in this field is the Weibull distribution [91,92], whose precursors were Fréchet [29] and Fisher and Tippett [27]. This distribution plays a special role in reliability theory, as evidenced, for example, by the publications of Johnson [40] and Lai [49]. A lot of detailed information about it can be found, inter alia, in the monographs of Murthy [65], Lai [50] and McPherson [61]. Statistical process control [63,95,62], initiated by Shewhart [81], is today a highly developed method of managing the production process. It covers both the analysis of single-variable functions [63,95] and the analysis of multivariable functions [56,59]. The robust design methods introduced by Taguchi [83] also make significant use of a variety of statistical tools. Their use in production companies resulted in a two-fold [43], and in some cases even a four-fold, reduction in the variability of the production process [70]. The breakthrough achievements in this field include the results published by Kacker [42], Leon et al. [52], Box [13], Nair [66] and Tsui [87].
Taguchi's methods have also been extended to robust design based on multiple characteristics. These problems are raised, inter alia, by Logothetis and Haigh [55], Pignatiello [71], Elsayed and Chen [23] and Tsui [88]. The common feature of the issues described above is the intensive use of statistical tools on data sets from the production process. It should be emphasized that as the number of sensors and the size of databases increase, more and more emphasis is placed on processing efficiency. Therefore, the main task of this publication is to show how modern IT tools and numerical methods help to deal with selected problems of data analysis.
Statistical process control is part of the much wider problem of statistical analysis that is searching for various types of anomalies in a time series. This problem has been intensively researched over the last 20 years. Many algorithms and techniques for finding anomalies have been proposed, provided that the range of an anomaly is known at least approximately. Examples include the results obtained by Keogh et al. [45,46] and Senin et al. [77]. Nevertheless, it is still a huge challenge to search the entire set of available data. As an example illustrating the level of complexity of the problem, consider a one-dimensional sequence of $10^5$ observations of a single quantity; it may be, for example, one of the dimensions of the manufactured element. Let us emphasize that a production run of this size is nothing extraordinary and is easily achieved in batch production conditions. For such a series there are over $1.6 \cdot 10^{14}$ subseries which may contain various types of anomalies. This example shows how complicated the situation is when there is no information about the potential location of an anomaly. The publication [17] gives some intuitions about trying to solve this problem from the algorithmic point of view, but it can be said that this is only the tip of the iceberg. It should be emphasized that the level of complexity of such problems increases significantly when data matrices appear instead of simple sequences/vectors. A good example of the complexity of this issue is the review article by Ebner and Henze [22], which describes the methods and problems of testing normality in multidimensional spaces.
Contemporary methods of finding anomalies in time series can be divided into three main groups depending on the results they generate. We distinguish algorithms for finding anomalies at a point, structural anomalies, and series anomalies (in the case of multiple series). By an anomaly at a point we understand the deviation of the value of a single measurement from the values in the series [32,25]. In turn, structural anomalies are subsets of a given series whose statistical properties differ from those determined for the entire series [44,97,77]. Series anomalies are related to this issue and consist in finding deviations between entire sequences of measurements [38,51]. Algorithms using artificial intelligence methods have been gaining popularity recently. This is undoubtedly a future direction of research, as evidenced, for example, by Intel's commitment to this area. The experience gained by the company in this area was published by Wang et al. [90]. Also, the publication by Chalapathy and Chawla [15] provides an overview of deep machine learning methods for detecting anomalies in time series.
The article [21] introduces a division of anomaly detection algorithms for time series according to the type of method used. Ding et al. distinguished the following methods: classification, nearest neighborhood, clustering, and statistical methods. It should be emphasized that the last three methods are all more or less based on statistical inference and the calculus of probability. Therefore, the rapid determination of statistics is a very important issue in the context of the performance of anomaly detection systems.
This publication is divided into two parts. The first part presents selected methods of statistical analysis from the theoretical point of view, and the second part presents a practical approach: the implementation of the previously described methods in Python. Of course, the approach to big data analysis described in the following sections does not exhaust the catalog of methods that can be used in such a context. Nevertheless, the three issues described constitute a basis that can be used for more advanced calculations and machine learning.

THEORETICAL FOUNDATIONS OF SELECTED METHODS OF STATISTICAL DIAGNOSTICS
In this section, the methods of variance analysis and two probability distributions will be considered. They play a key role from the point of view of quality management and reliability theory. These distributions are the normal (Gaussian) distribution and the Weibull distribution. The description of the individual methods will be presented with particular emphasis on the implementation aspect. The automation of result evaluation and the calculation speed are extremely important in the case of processing and searching large measurement data sets. At the end of this part, an algorithm for processing a large set of measurement data is presented. Its task is to supplement the database with additional statistics that can be used for graphical analysis and machine learning.

Variance analysis
Variance is a measure of the concentration of data around the mean value. The lower the variance, the more concentrated the results. On the other hand, a high value of the variance indicates statistical dispersion and significant distances between the points of the analyzed data set. Under production conditions, high process variability may indicate quality problems. Therefore, it is worth starting the data analysis with the analysis of variance. At this point, it should be noted that the process analysis cannot be limited to the study of variance only within measurement sequences of one fixed length. Fixing the time window size can give an incomplete picture of the process. Therefore, in further considerations, a chronologically ordered measurement sequence $x_1, \dots, x_n$ representing the values of one of the random variables defined by the production process will be taken into account.
Suppose that for each element $x_i$ a sequence of variances computed for the surroundings of this element will be determined. It means that a certain sequence of radii $r_1, \dots, r_m$ has been established, for which we calculate the variances

$$V_{i,j} = \frac{1}{w} \sum_{k=i-r_j}^{i+r_j} \left( x_k - \frac{1}{w} \sum_{l=i-r_j}^{i+r_j} x_l \right)^2,$$

where $w = 2r_j + 1$. Note that the biased variance estimator was intentionally used in the formula above. This is due to the fact that it is more efficient to implement, and the transition to the unbiased estimator requires only multiplying the obtained quantity by $1 + (2r_j)^{-1}$. Additionally, in this article we will assume that successive radii increase by a certain predetermined amount $d$. It means that $r_j = j \cdot d$. Under these assumptions, it can be shown that the independent determination of all variances $V_{i,j}$ for $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$ requires about $nm^2d/6$ operations. If we assume that $m = n/d$ for some fixed $d$, then the computational complexity of the algorithm determining the set of variances $V_{i,j}$ has order $O(n^3/d)$. In practice, it means that for a small number of $10^4$ measurements and the value $d = 100$ it is necessary to perform about $10^{10}$ operations. Therefore, to determine the entire set of variances, we will use the approach presented in Lemma 1 of [17]. It allows the computational complexity of the problem under consideration to be reduced to the level of $O(n^2/d)$, which significantly improves the efficiency of calculations. For example, for the aforementioned $10^4$ measurements, the computation time will be reduced 10 000 times.
The Welford formula [93,14] in the form given by Knuth [48] will be used for the effective implementation of the algorithm determining the set of variances $V_{i,j}$. It assumes that if $M(s)$, $V(s)$ are the arithmetic mean and the biased variance of the series of measurements $s$, respectively, then for $s = (x_1, \dots, x_n)$ and $s^{+} = (x_1, \dots, x_n, x_{n+1})$ the following dependencies occur:

$$M(s^{+}) = M(s) + \frac{x_{n+1} - M(s)}{n+1}, \qquad V(s^{+}) = V(s) + \frac{(x_{n+1} - M(s))(x_{n+1} - M(s^{+})) - V(s)}{n+1}.$$

This formula will be used in the initial stage of calculations, until the sequence $s$ reaches the size $2r_j + 1$. In the next stage, the window method presented in [17] will be used. It uses the fact that for $s = (x_1, \dots, x_n)$ and $s' = (x_2, \dots, x_{n+1})$ the following equations are satisfied:

$$M(s') = M(s) + \frac{x_{n+1} - x_1}{n}, \qquad V(s') = V(s) + \frac{(x_{n+1} - x_1)(x_{n+1} + x_1 - M(s) - M(s'))}{n}.$$

The above-mentioned relationships allow for the definition of a computationally effective algorithm, which was presented in [17].
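The two recurrences above can be combined into a single sliding-window routine: Welford's formula fills the first window, after which each shift costs O(1). The sketch below is an illustrative implementation (the function and variable names are ours, not the authors'), computing the biased variance in a window of radius r.

```python
import numpy as np

def rolling_variance(x, r):
    """Biased (population) variance in a sliding window of radius r
    (window size w = 2*r + 1). Stage 1 uses Welford's recurrence to
    build up the first window; stage 2 shifts the window in O(1) steps
    using the update equations quoted in the text."""
    x = np.asarray(x, dtype=float)
    n, w = len(x), 2 * r + 1
    out = np.full(n, np.nan)  # NaN where the window does not fit
    if n < w:
        return out
    # Stage 1: Welford's formula over the first w elements.
    mean, var = 0.0, 0.0
    for k in range(w):
        delta = x[k] - mean
        mean += delta / (k + 1)
        var += (delta * (x[k] - mean) - var) / (k + 1)
    out[r] = var
    # Stage 2: slide the window one element at a time.
    for i in range(r + 1, n - r):
        old, new = x[i - r - 1], x[i + r]
        new_mean = mean + (new - old) / w
        var += (new - old) * (new + old - mean - new_mean) / w
        mean = new_mean
        out[i] = var
    return out
```

Multiplying the result by $1 + (2r)^{-1}$ yields the unbiased estimator, as noted above.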

Normal distribution
The normal distribution, as the limit distribution appearing in the central limit theorem, occurs very often in the practice of quality management. Many models assume that the features of a process or a product can be modeled by means of a random variable X = µ + Y, where µ is a fixed value and the random variable Y has a normal distribution. However, this condition is not always met. There are moments when a random variable ceases to have a normal distribution. Detecting such a situation is crucial from the quality control point of view, as it may indicate a disturbance in the production process. Therefore, methods for verifying the normality of a distribution are a very important part of the control of the production process.
Testing the normality of a distribution is already a classic problem in probability theory and statistics. The first analyses of this issue were carried out by Fisher [26] and Pearson [69] in the interwar period. There are many examples of testing a one-dimensional normal distribution. Among the most frequently used tests are the Anderson-Darling [4], Shapiro-Wilk [79], Shapiro-Francia [78] and Kolmogorov-Smirnov [60,54] tests. On the other hand, the analysis of skewness and kurtosis (the third and fourth moments) was used to construct the first normality tests for multivariate distributions [9,57,58]. At the end of the 20th century, Bowman [12] proposed a normality test for multidimensional distributions based on the smoothness of density.
Vasicek, in turn, developed a test using the entropy of the normal distribution [89]. The literature also offers several proposals for tests based on characteristic functions, which include [24,18,37], the BHEP test [8,36] and energy tests [82]. Testing the normality of distribution, especially in the multivariate case, is currently a very intensively researched issue. This is evidenced by the publications of recent years by authors such as Mori et al. [64], Henze and Visagie [35], Tenreiro [84], Thas and Ottoy [85] and Zhu et al. [98]. Extensive reviews of methods for testing normality of distribution can be found in the works of Henze [34] and Das and Imon [19].
Recall that the normal distribution is a two-parameter distribution denoted as $N(\mu, \sigma^2)$, where $\mu$ is the expected value and $\sigma$ is the standard deviation. Figure 1 shows the probability density plot for the parameters $\mu = 0$ and $\sigma = 1$ along with the marked intervals $[-1,1]$, $[-2,2]$ and $[-3,3]$, over which the integral of the density is respectively 0.683, 0.954 and 0.997. This integral determines the probability with which the random variable assumes values from the specified interval. The density function for the normal distribution is given by the formula

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$

It is not the purpose of this article to present a systematic description of all known tests for verifying the normality of a distribution. The extensive literature provided at the beginning of this section allows the reader to get acquainted with specific methods. Examples of the use of selected methods will be presented later in the article as part of the use of specific libraries for statistical analysis. However, in order to outline the complexity of statistical testing, a description of the Shapiro-Wilk test will be presented. This test is today considered to be one of the strongest tests of normality. Its main disadvantage, however, is numerical limitations that make it impossible to test large samples. Currently used libraries for statistical analysis allow the test to be carried out on vectors consisting of about 5000 samples. Nevertheless, it is recommended to verify in the documentation of the numerical package the maximum number of input data that can be properly examined by a given implementation. The essence of the Shapiro-Wilk test is to compare the variance of the sample with the variance it should have if the data actually came from a normal distribution. Thus, this test answers the question to what extent the sample has a chance of representing a normal distribution.
The Shapiro-Wilk test statistic is

$$W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ is the sequence of order statistics (the sequence terms sorted in ascending order) and $\bar{x}$ is the average value of the sequence. The vector of coefficients $a = (a_1, \dots, a_n)$ is given by

$$a = \frac{m^{T} V^{-1}}{\sqrt{m^{T} V^{-1} V^{-1} m}},$$

where $m$ is the vector of expected values of the order statistics for a sample of size $n$ drawn from the standard normal distribution, and the matrix $V$ is the covariance matrix of the order statistics of such a sample. Computational problems related to the determination of the coefficients of the vector $a$ are a fairly current topic. They were already mentioned by Shapiro and Wilk [79], Royston [75,74,76] made a large contribution to this issue, and recently Gunner et al. [31] presented a method using the Shapiro-Wilk test in fast signal processing. It should be noted that operating on approximations of the vector $a$ is also a fairly common approach. One of the most commonly used relationships approximates the coefficients by means of the values $\hat{m}_i$, where

$$\hat{m}_i = \Phi^{-1}\left( \frac{i - 0.375}{n + 0.25} \right)$$

and $\Phi^{-1}$ is the quantile function of the standard normal distribution.
The abbreviated description of the Shapiro-Wilk test presented above is only a sketch of the problems related to this test and is intended to draw the reader's attention to the fact that testing the normality of a distribution is a very extensive topic. Therefore, before using specific functions that test normality, it is worth taking the time to become familiar with the properties of the test you want to use. Particular attention should be paid to the limitations of a given method.
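For illustration, the SciPy implementation of the Shapiro-Wilk test can be applied as below; the samples are synthetic, and scipy.stats.shapiro itself warns for samples larger than about 5000, matching the limitation mentioned above.

```python
import numpy as np
from scipy import stats

# Illustrative use of the Shapiro-Wilk test from SciPy.
rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
uniform_sample = rng.uniform(0.0, 1.0, size=500)

w_norm, p_norm = stats.shapiro(normal_sample)
w_unif, p_unif = stats.shapiro(uniform_sample)
print(w_norm, p_norm)  # W close to 1 for data from a normal distribution
print(w_unif, p_unif)  # very small p-value: normality clearly rejected
```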

Weibull distribution
As mentioned in the introduction, the Weibull distribution is an example of a distribution that models the lifetime of a product. Recent years have shown its intensive use in issues related to modeling: glass strength [47], progressive pitting corrosion [80], adhesive wear of metals [72], failure of coatings [3], failure of brittle materials [28], failure of composite materials [67], wear of concrete elements [53], fatigue life of aluminum alloys [33], fatigue life of Al-Si castings [1], strength of polyethylene terephthalate fibers [96] or failure rate of joints under the influence of shear [7]. This very wide range of applications, however, does not require automatic processing of large data sets. However, it shows how universal the Weibull distribution is when it comes to testing the failure rate and aging of products.
The Weibull distribution was treated in a completely different way in the publication [16], which shows how this distribution can be used to optimize operating costs. The generalization of the approach presented there, for example for all replacement parts used in a large enterprise, is already associated with the statistical and optimization analysis of a large set of data (in general, the number of replacement parts used reaches thousands). Another potential application may be the optimization of warranty servicing costs of cars sold by a given manufacturer. In this context, we are dealing with an even larger database, as generally a car model is sold in hundreds of thousands, and sometimes even millions of copies.
The probability density of the three-parameter Weibull distribution is given by the equation

$$f(x) = \frac{\beta}{\lambda} \left( \frac{x - \theta}{\lambda} \right)^{\beta - 1} \exp\left( -\left( \frac{x - \theta}{\lambda} \right)^{\beta} \right),$$

where $\lambda > 0$, $\beta > 0$, $\theta \ge 0$ and $x \ge \theta$. In the adopted notation, $\lambda$ is called the scale parameter, $\beta$ is the shape parameter, and $\theta$ is the position parameter. Figure 2 shows example plots of the probability density function for the Weibull distribution. An overview of possible estimation methods can be found in the publications of Ross [73] and Jacquelin [39]. It is worth emphasizing here that the form of the Weibull distribution makes it possible to use a substitution that transforms it into a linear relationship: taking the cumulative distribution function $F(x) = 1 - \exp(-((x - \theta)/\lambda)^{\beta})$, one obtains $\ln(-\ln(1 - F(x))) = \beta \ln(x - \theta) - \beta \ln \lambda$. Thanks to this, it is very easy to connect the methods based on probability plots [86,6,20,68,2] with the determination of parameters using linear regression. It can therefore be concluded that this distribution fits exceptionally well into the automatic analysis of large data sets.
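The linearization just mentioned can be sketched as follows: sort the sample, estimate the empirical CDF with median ranks, and fit a straight line. The two-parameter variant (location $\theta$ assumed to be 0), the use of Bernard's median-rank approximation, and all names are illustrative choices of ours, not the authors' exact procedure.

```python
import numpy as np

def fit_weibull_2p(sample):
    """Least-squares fit of a two-parameter Weibull distribution via the
    linearization ln(-ln(1-F)) = beta*ln(x) - beta*ln(lam)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    # Median-rank estimate of the empirical CDF (Bernard's approximation).
    f = (np.arange(1, n + 1) - 0.3) / (n + 0.4)
    X = np.log(x)
    Y = np.log(-np.log(1.0 - f))
    beta, intercept = np.polyfit(X, Y, 1)  # slope is the shape parameter
    lam = np.exp(-intercept / beta)        # recover the scale parameter
    return beta, lam
```

On a synthetic sample with known parameters the fit should land close to the true shape and scale.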

Anomaly detection algorithm
Let us assume that we have the time series $X(t) = (x_1, x_2, \dots, x_n)$ defined on the discrete set of arguments $t_1, t_2, \dots, t_n$. We treat the arguments as the moments of time at which the enterprise's operating parameters, represented by a tuple $(p_1, p_2, \dots, p_k)$, were measured. Let us also assume that the search for anomalies in the production process is carried out based on a time window of size $w$. That is, the parameters corresponding to the arguments $t_{i+1}, t_{i+2}, \dots, t_{i+w}$ are subjected to statistical analysis. Of course, Algorithm 1 presented in this section may be applied to multiple time window widths as needed. Nevertheless, we present it in the version for a fixed $w$.

OUTPUT: the function family $F(x_i) = (c_{1,i}, c_{2,i}, V_i, N_i, W_i)$ defined for each $i \in \{1, 2, \dots, n\}$, where $c_{1,i}$ is the number of tests for which the variance was greater than a given bound value $B$, $c_{2,i}$ is the number of tests for which the window did not pass the normality test, $V_i$ is the variance of the window of which $x_i$ was the center, $N_i$ is the vector of normal distribution parameters of the window of which $x_i$ was the center, and $W_i$ is the vector of Weibull distribution parameters of the window of which $x_i$ was the center.

PROCEDURE:
For $i \in \{1, \dots, n\}$ find $F(x_i) = (c_{1,i}, c_{2,i}, V_i, N_i, W_i)$: determine the statistics of the window of half-width $h$ centered at $x_i$; if the variance of the window exceeds the bound $B$, increase $c_{1,j}$ for all $j \in \{i - h, \dots, i + h\}$; if the window does not pass the normality test, increase $c_{2,j}$ for all $j \in \{i - h, \dots, i + h\}$. It should be noted that it makes sense to determine either the vector of parameters of the normal distribution $N_i$ or the vector of parameters of the Weibull distribution $W_i$. It depends on the type of the measured value and its nature. Nevertheless, the most important quantities in the presented procedure are $c_{1,i}$ and $c_{2,i}$, because they show how often a given sample behaved abnormally during statistical tests.
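A minimal sketch of the windowed counting procedure might look as follows. It covers only the $c_1$ and $c_2$ counters, uses D'Agostino's normality test from SciPy as a stand-in for the normality check, and the names w, B, c1, c2 follow the text only loosely.

```python
import numpy as np
from scipy import stats

def analyze_series(x, w, B, alpha=0.05):
    """For every position of a window of size w, test the window's
    variance against the bound B and its normality; c1/c2 count, per
    sample, how many windows covering it failed each check."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    c1 = np.zeros(n, dtype=int)  # variance-bound violations per sample
    c2 = np.zeros(n, dtype=int)  # failed normality tests per sample
    for start in range(n - w + 1):
        window = x[start:start + w]
        if np.var(window) > B:
            c1[start:start + w] += 1
        _, p = stats.normaltest(window)  # D'Agostino-Pearson test
        if p < alpha:
            c2[start:start + w] += 1
    return c1, c2
```

On a synthetic normal series with one injected spike, both counters flag the samples around the spike.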

PYTHON IMPLEMENTATION OF SELECTED METHODS OF STATISTICAL DIAGNOSTICS
The data analysis methods were implemented in the free Python language using available statistical tools (numpy, the basic library for creating and analyzing multidimensional arrays, and pandas, a library for reading, creating and manipulating data of various types). Using Python with these libraries makes it possible to create simple scripts for data analysis and knowledge discovery. These tools, along with a wide range of libraries for data visualization, form a set of ready-made tools for preparing analyses and visualizing production processes. The practical implementation of the statistical research is based on a real log from the production machines saved in a CSV file. The structure of the log file is: Name; Date; Number; Time; Type; Time; Machine; Source; User; Par1; Par2; Par3; Par4; Par5; Par6; Par7; Par8; Par9; Par10; Par11; Par12; Par13; Par14; Par15; Par16; Par17; Par18; Par19; Par20, where:
− name: process name;
− date: process date;
− number: machine number;
− time: process start time;
− type: process type;
− source: process source;
− user: production worker;
− par1:par20: numerical parameters of the current process.
Not all parameters from the file are analyzed. For this reason, in the part devoted to the implementation of selected studies, methods of selectively retrieving only the columns that will be analyzed will be presented. The data in the file was also filtered in order to analyze a certain range. Data is read from the log file using the read_csv method. The program presented in Listing 1 (Annex A, SOURCE CODES) imports the pandas library and reads the selected columns from the log.csv file into a DataFrame object. The columns, separated by semicolons, contain the observations. The "shape" method of the df object created from the DataFrame class was used to display the number of cases and the parameters of the set. The last line displays the first 10 observations of the set, which facilitates the initial evaluation of the analyzed data set.
This example will be used, with some slight modifications, in subsequent descriptions of the conducted statistical research.
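As a self-contained sketch of the reading step described above (the real log.csv with its 30 columns is not reproduced here), a minimal in-memory CSV with the same semicolon-separated layout can be read with read_csv, limiting the load to selected columns via usecols. Column names beyond those listed in the text, and all values, are illustrative.

```python
import io
import pandas as pd

# A tiny stand-in for the production log described in the text.
csv_text = (
    "Name;Date;Number;Time;Par1;Par2\n"
    "procA;2020-05-01;1;08:00;0.12;0.34\n"
    "procA;2020-05-01;2;08:01;0.11;0.35\n"
)

# usecols limits the read to the columns that will actually be analyzed.
df = pd.read_csv(io.StringIO(csv_text), sep=";",
                 usecols=["Date", "Par1", "Par2"], parse_dates=["Date"])
print(df.shape)     # (number of cases, number of loaded columns)
print(df.head(10))  # first observations, for initial evaluation
```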

Python implementation of variance analysis
The implementation of variance, i.e. examining the distribution of values in a data set around its mean value, can provide valuable information while analyzing data from production logs. The methods available in Python statistics packages provide ready-made tools to calculate variance. You can calculate the variance of a set by calling the var() method on the DataFrame object. The method returns the unbiased variance along the requested axis, with 0 representing the rows and 1 representing the columns. The method takes an optional "ddof" argument, whose default value is 1. This parameter indicates the degrees of freedom that will be used in the calculation. A value of 0 for the "ddof" parameter calculates the variance for the population, and a value of 1 estimates the population variance from the selected sample. Listing 2 shows the variance calculation for all columns in a population. Variance calculation applies only to the numeric columns. For this reason, only columns containing numeric values were loaded into memory. Because the variance results are close to zero, they will be displayed in exponential notation. This notation may make the analysis difficult, therefore the conversion to fixed-point values was performed with the use of a lambda function. Additionally, the number precision parameter was modified, narrowing the result down to 8 decimal places. The result of the program is a list of variances. Special attention should be paid to the display of real numbers. Their display has been truncated to the precision specified by the {:.8f} parameter. In some cases, this can lead to potential errors because the precision is too low. Therefore, the precision should always be selected on the basis of the complete results.
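A short illustration of the ddof parameter and the fixed-point formatting on a toy DataFrame (column names and values are invented, not taken from the real log):

```python
import pandas as pd

# Illustrative DataFrame standing in for the numeric log columns.
df = pd.DataFrame({"P2": [0.10, 0.12, 0.11, 0.13],
                   "P5": [1.00, 1.02, 0.98, 1.01]})

pop_var = df.var(ddof=0)     # population variance
sample_var = df.var(ddof=1)  # sample (unbiased) variance, the default
# Suppress exponential notation, as done in the article via a lambda:
print(pop_var.apply(lambda v: f"{v:.8f}"))
```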
Apart from the standard method, the pandas package also offers calculation of variance with the use of the window method, which is especially useful when analyzing large production data sets. The calculation of the window variance is available through the "rolling" method. The argument in this case is the size of the window. In the example of calculating the window variance, it was assumed that the variance studies will take place in windows of size 4 and 15, and the tested parameters will be columns P2, P5 and P7. In the case of window variance, it is best to assign the calculation results to a new DataFrame object. As a result of the program in Listing 3, the variance for the selected series will be displayed. In the case of a window variance with window size = 4, the rows indexed [0:3] will not be filled with data. The number of blank lines containing NaN (not a number) values will increase as the window grows. For example, increasing the window size to 10 will cause the display of the first 10 results to look as follows:

    vP2 vP5 vP7
0   NaN NaN NaN
1   NaN NaN NaN
2   NaN NaN NaN
3   NaN NaN NaN
4   NaN NaN NaN
5   NaN NaN NaN
6   NaN NaN NaN
7   NaN NaN NaN
8   NaN NaN NaN
9   NaN NaN NaN

A simple solution in these types of cases is to remove the redundant lines containing NaN values with an additional line (before the print function) that calls the dropna method. This time, the listing only contains rows with numeric values, and the index of each case in the object has not changed.
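The rolling-window variance and the removal of the leading NaN rows can be sketched as follows (window size, column name and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"P2": range(20)}, dtype=float)
# Window variance as in Listing 3; rolling(4) means a window of size 4.
dfv = pd.DataFrame({"vP2": df["P2"].rolling(4).var()})
# The first window-1 rows hold NaN; drop them as suggested above.
dfv = dfv.dropna()
print(dfv.head())  # note the original index is preserved
```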
The Seaborn library, which is an extension of the Matplotlib library enriching the standard package with additional types of graphs, was used to visualize the results of calculating the variance. Generating plots for the purpose of visualizing variance poses the problem of legibility when a relatively large amount of data is placed on the plot. Figure 3 shows a graph of variance for several columns, where the time axis covers the entire population. As one can see, the chart is not legible, although it allows for a preliminary evaluation of the set. A graph will look clearer and provide better perception when it is generated with the use of the filtering methods available in the pandas package. In the example in Listing 4, there are filters for the date range from 5/1/2020 to 5/2/2020. Filtering before performing a statistical analysis is a frequent operation, especially in the case of long time series, because a sample which is too large is not always desirable during numerical calculations; additionally, placing too large a time period on the graph may distort perception and, consequently, interfere with the possibility of noticing certain regularities or anomalies. By filtering a certain range (sample) of the population, the plot contains only the subset of the series that is of interest for the analysis. The example uses date range filtering, but the filter can also include set values specified with extreme parameters. Figure 4 shows a graph presenting the variance of the P2 parameter in the selected date range. Thanks to the use of filtering, the chart is much more readable and allows the evaluation of the parameter values from a segment of the production process. Additionally, the chart shows the division into machines on which the production process took place.
The size of the points on the chart and the colors improve readability and thus the observation of potential disturbances in the configuration of the parameters of production machines. As one can see in the graph, there are no major differences in the variance of a single parameter depending on the machine; however, if an abnormality occurs on one of the machines, it will be clearly indicated with a point along with the machine number.

Fig. 4. Parameter P2 variance plot -date range
The filtering capabilities of the DataFrame object are much broader. They also allow the use of logical operators. Thanks to this, greater filtering precision is possible, which is necessary when it is required to get rid of anomalies resulting from errors which occur while saving or transferring log files. In such cases, the erroneous observations should be removed during the set cleaning operation. An additional safeguard before the analysis is the removal, by filtering, of values exceeding the limit values. In this way, invalid lines that clearly deviate from the range of minimum and maximum values in the production process can be eliminated. Filtering will eliminate potentially false anomalies during the analysis. This approach combines both date range filtering and filtering by the extreme values of the manufacturing process.
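Combining a date-range condition with limit values using the & operator might look as follows; the dates, bounds, and the injected erroneous value are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-30", "2020-05-01", "2020-05-01",
                            "2020-05-02", "2020-05-03"]),
    "P2":   [0.11, 0.12, 9.99, 0.13, 0.10],  # 9.99 mimics a logging error
})

# Combine a date-range filter with limit values using & (logical AND).
mask = ((df["Date"] >= "2020-05-01") & (df["Date"] <= "2020-05-02")
        & (df["P2"] >= 0.05) & (df["P2"] <= 0.20))
clean = df[mask]  # the out-of-range row and out-of-date rows are dropped
print(clean)
```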
The analysis of variance can be freely extended, e.g. to search for correlations between the available parameters and other non-numeric parameters. As a result, you can get a broader picture of the studied phenomenon and notice some correlations. For example, comparing the variance of the series parameters on the graph with the division into production machines or the operators who support them, it is possible to analyze the differences in the configuration of individual machines, or errors made by the operators of these machines. The collected and processed results can be of great help in predicting production processes and allow for confirming or rejecting the proposed hypotheses concerning disturbances in production processes.

Python implementation of Normal distribution
When examining data sets, one of the basic steps is a preliminary examination of the distribution of their features. Python statistical libraries offer many methods for testing the normality of a distribution. These methods return a P-value, i.e. the lowest significance level at which the null hypothesis can be rejected for the observed value of the test statistic. If p ≤ 0.05 (the usually adopted significance level), the null hypothesis of normality can be rejected; if p > 0.05, the variable probably has values drawn from a normal distribution. A series can be assessed for normality of its distribution in two ways. The first is a graphical test, i.e. visualization of the variable's distribution using one of the available charts. The second is to perform one of a number of statistical tests and evaluate the obtained P-value.
The following program serves as an example of a graphical test; it generates random values from the normal distribution. For this purpose, the random.normal method from the numpy library was used, which takes the number of generated values as one of its parameters. Generating the series allows the plot of a normal distribution to be visualized. The result of the program in Listing 5 is a histogram of the generated data series, shown in Figure 5. The analysis of the graph suggests that the series may come from a normal distribution. Another way to generate a series from the normal distribution is to use the norm class from the scipy.stats library, as shown in Listing 6. In this case, the numpy library was also used, but this time the linspace method generated the grid of points. In order to perform a statistical test of the normality of the distribution, an actual sample from the population was selected (based on a date range), and additionally, anomalies that could be clearly classified as errors were eliminated from the set. As a result, a time series containing only the correct values of one of the parameters of the production process was obtained (Listing 7).
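Both ways of generating a normal series mentioned above can be sketched as follows. The parameters μ, σ, the sample size and the grid size are arbitrary illustration values, not those of the original listings.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Way 1: draw a random sample from N(mu, sigma), as with random.normal;
# the sample size is one of the parameters.
mu, sigma, n = 0.0, 1.0, 1000
sample = rng.normal(mu, sigma, n)

# Way 2: evaluate the theoretical density on a grid built with linspace,
# using the norm distribution from scipy.stats.
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
pdf = norm.pdf(x, loc=mu, scale=sigma)
```

The first array is what a histogram would be drawn from; the second pair (x, pdf) gives the smooth bell curve that can be overlaid on it.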
Graphical methods for assessing the normality of a distribution often use a histogram of a numerical parameter. In this type of chart, the data is represented by a number of rectangles, each containing the count of observations falling into a given range: histograms divide the values of a continuous variable into discrete bins and show the number of values in each bin, as shown in Figure 7. Additionally, the chart uses the kde=True option to overlay a kernel density estimate, which approximates the probability density function of the random variable. The second graphical method is the quantile (Q-Q) plot, whose interpretation is based on observing how the points concentrate around a straight line. If a variable is normally distributed, its empirical quantiles agree with the theoretical ones. Important information when interpreting a Q-Q plot is an even, alternating distribution of points close to the straight line, which suggests that the data come from a normal distribution. Even if a few points deviate slightly above or below the line, the data series may still be derived from the normal distribution as long as the deviations are small.
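The two graphical checks can be reduced to their underlying numbers as in the sketch below; a synthetic series stands in for the production data. In practice the charts themselves would be drawn with, e.g., sns.histplot(data, kde=True) and a plot of the probplot output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(5.0, 0.5, 500)  # stand-in for the production series

# Histogram counts: the bar heights that the histogram chart displays.
counts, bin_edges = np.histogram(data, bins=20)

# Q-Q plot data: theoretical quantiles (osm) vs. ordered sample
# values (osr). probplot also returns the fitted straight line;
# r close to 1 means the points concentrate around that line.
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm")
```

A markedly lower r, or systematic curvature of (osm, osr), would suggest a departure from normality.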
The visual assessment of a graph in terms of the normality of the distribution always carries a risk of misinterpretation; therefore, statistical tests are needed to verify the null hypothesis when examining the distribution. To this end, we will introduce several algorithms for testing the distribution of a series that are available in Python statistical packages.
The first test is the Shapiro-Wilk test, which is designed to test the normality of distributions in samples of up to 5000 observations. It is available as the shapiro function in the scipy.stats package, and its implementation is shown in Listing 9. The test outputs the number of observations and the results of the statistical test, similarly to the subsequent tests.
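A minimal sketch of the Shapiro-Wilk call is shown below; the series is synthetic (the cleaned production series from Listing 7 is not reproduced here), and an exponential sample is added purely to show the test rejecting a clearly non-normal series.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
series = rng.normal(0, 1, 300)      # stand-in for the cleaned series
skewed = rng.exponential(1.0, 300)  # clearly non-normal comparison

stat, p = shapiro(series)
stat2, p2 = shapiro(skewed)
# p > 0.05: no grounds to reject the null hypothesis of normality;
# for the exponential sample the test should reject it decisively.
print(f"normal:  W={stat:.4f}, p={p:.4g}")
print(f"skewed:  W={stat2:.4f}, p={p2:.4g}")
```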
Observations: 949, parameters: 2
statistic = 0.9902371764183044, p-value = 6.105211468820926e-06
The Shapiro-Wilk test may yield erroneous results when testing larger samples, but within its sample-size range it is a preferred way of testing the normality of a probability distribution due to its statistical power.
Another test that will be used is the D'Agostino-Pearson test, which assumes a sample size of at least 20 observations. Its implementation on the same series is shown in Listing 10. The next test performed is the Kolmogorov-Smirnov test, which belongs to the non-parametric tests for assessing the conformity of a variable's distribution with the normal distribution. In its classical form the reference distribution must be fully specified; when the mean or standard deviation is estimated from the sample, the Lilliefors variant discussed later is more appropriate. The method should be used for samples with n > 100. Like the other tests, it tests the null hypothesis that the distribution is close to the normal distribution (Listing 11).
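The two tests can be sketched as below; the series is again synthetic, and the reference parameters for the Kolmogorov-Smirnov call are estimated from the sample here for simplicity (with the caveat about the Lilliefors correction noted above).

```python
import numpy as np
from scipy.stats import normaltest, kstest

rng = np.random.default_rng(7)
series = rng.normal(10.0, 2.0, 500)  # stand-in for the parameter series
skewed = rng.exponential(1.0, 500)   # clearly non-normal comparison

# D'Agostino-Pearson (requires n >= 20): combines skewness and kurtosis.
k2, p_dagostino = normaltest(series)
_, p_bad = normaltest(skewed)  # expected to be rejected

# Kolmogorov-Smirnov against N(mean, std) taken from the sample.
p_ks = kstest(series, "norm", args=(series.mean(), series.std())).pvalue
```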
The next test is based on the Anderson-Darling method, which checks the consistency of the data with a given reference distribution: it tests the null hypothesis that the sample comes from a population with a specific distribution, and the critical values depend on the distribution being tested. By default the tested distribution is the normal distribution (norm), but the method also supports expon, logistic, gumbel, gumbel_l, gumbel_r and extreme1 (Listing 12). The executed program returns the test statistic together with a list of critical values for several significance levels; the statistic is compared against the critical value at the chosen level. The next statistical test available is the chi-square test. This method is often used to verify the hypothesis that an observed trait in a population has a specific type of distribution (Listing 13).
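Both ideas can be sketched as follows on a synthetic series. The binning scheme for the chi-square test (10 equal-width bins, expected counts from a normal fitted to the sample) is an illustrative assumption, not the article's exact procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
series = rng.normal(0, 1, 400)  # stand-in for the parameter series

# Anderson-Darling: the statistic is compared against critical values
# given for the significance levels 15%, 10%, 5%, 2.5% and 1%.
res = stats.anderson(series, dist="norm")
rejected_at_5pct = res.statistic > res.critical_values[2]

# Chi-square goodness of fit: bin the data and compare observed counts
# with those expected under N(mean, std) fitted to the sample.
edges = np.linspace(series.min(), series.max(), 11)  # 10 bins
observed, _ = np.histogram(series, bins=edges)
cdf = stats.norm.cdf(edges, series.mean(), series.std())
expected = len(series) * np.diff(cdf)
expected *= observed.sum() / expected.sum()  # equalize the totals
chi2, p_chi2 = stats.chisquare(observed, expected)
```

Strictly, the chi-square degrees of freedom should be reduced for the two estimated parameters (the ddof argument of chisquare); this is omitted here for brevity.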
The second to last of the presented statistical tests is the Lilliefors test. It is based on the Kolmogorov-Smirnov test and can be used when the mean value and standard deviation are unknown. The test implementation is presented in Listing 14.
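A minimal sketch, assuming the statsmodels package is available (the Lilliefors test is not part of scipy.stats), on a synthetic series:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(5)
series = rng.normal(0, 1, 400)  # stand-in for the parameter series

# Lilliefors: a Kolmogorov-Smirnov variant whose null distribution
# accounts for the mean and std being estimated from the sample.
ks_stat, p_value = lilliefors(series, dist="norm")
print(f"D={ks_stat:.4f}, p={p_value:.4g}")
```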
The last of the tests is the Jarque-Bera normality test, which is frequently used in econometrics due to the uncomplicated form of its asymptotic distribution. The test statistic is constructed from the moments of the distribution of the random variable calculated from the empirical sample, which are compared with the theoretical moments of the normal distribution. This test verifies the hypothesis of univariate normality of a random variable against any other distribution and is presented in Listing 15.
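A sketch of the Jarque-Bera call on synthetic data; the exponential sample is included only to show the moment-based statistic reacting to skewness.

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(11)
series = rng.normal(0, 1, 1000)     # stand-in for the parameter series
skewed = rng.exponential(1.0, 1000) # strongly skewed comparison

# The statistic combines sample skewness and excess kurtosis and is
# asymptotically chi-square with 2 degrees of freedom under normality.
jb_stat, p = jarque_bera(series)
jb_bad, p_bad = jarque_bera(skewed)
```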

Python implementation of Weibull distribution
Reliability engineering and survival analysis are an important part of research into the prediction of production processes. Several reliability analysis libraries exist for Python; in this case, the reliability library was used, which contains a set of functions useful in this type of analysis. The library is an extension of scipy.stats and contains additional tools useful in reliability testing.
With the help of the library, distribution fits can be created for both complete and censored data, and the available fitting functions are named after the number of their parameters: for example, Fit_Weibull_2P uses α and β, while Fit_Weibull_3P uses α, β and γ. A distribution is fitted by calling the requested function with the failure data passed to it; a minimum of 4 samples is recommended, as the accuracy of the fit depends on the sample size. For the purpose of the experiment, the Fit_Weibull_2P method was used. The program in Listing 16 imports the log from the production machine and flags parameter errors in an additional logical column; in this case, the errors result from exceeding the allowed production values. In the next step, the Weibull fitting function is called, which displays the results of the calculations. Analyzing the output, one can read the number of error samples; the output also includes confidence intervals and standard errors for the parameter estimates. The probability plot is generated automatically using the plt.show() method and is presented in Figure 9.
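The article uses the reliability library's Fit_Weibull_2P; as a library-agnostic sketch of the same two-parameter fit, scipy.stats.weibull_min can be used, as below. The failure times are synthetic (drawn from a known Weibull), and the true α and β are assumptions chosen so the recovered estimates can be checked.

```python
import numpy as np
from scipy.stats import weibull_min

# Synthetic failure times; in the article these come from the machine
# log after flagging out-of-limit values. Here we draw them from a
# Weibull with known scale alpha and shape beta.
rng = np.random.default_rng(2)
alpha_true, beta_true = 50.0, 1.8
failures = weibull_min.rvs(beta_true, scale=alpha_true, size=200,
                           random_state=rng)

# Two-parameter fit: the location is fixed at 0, mirroring the
# alpha/beta parameterization of Fit_Weibull_2P.
beta_hat, loc, alpha_hat = weibull_min.fit(failures, floc=0)
print(f"beta (shape) = {beta_hat:.3f}, alpha (scale) = {alpha_hat:.3f}")
```

Unlike Fit_Weibull_2P, this plain MLE sketch does not report confidence intervals or draw the probability plot automatically.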
On the generated graph, one can observe how the data is modeled; the interpretation of the visualization consists in analyzing whether the points lie along a straight line. In this case, however, they do not: a misfit occurs when the line or curve formed by the points deviates significantly from the straight line. When interpreting the plot, slight deviations from the straight line are tolerated at the ends of the distribution, but most points should follow it. To display the failure points alongside the PDF, CDF, SF, HF or CHF curves without axis scaling, one can use the plot_points function to generate the plot (Listing 17).

CONCLUSIONS
Systems for the analysis of big data are playing an increasingly important role in industry, and this trend applies to all sectors. A strong impulse for the development of this field is the constantly growing importance of information and the need to compete in global markets. Mining data sets for specific knowledge is a way of automating the delivery of answers to previously posed questions. Information from production systems, analyzed and prepared in a human-readable way, can be used for faster prediction of failures and component wear, or for process monitoring.
It is worth using commonly available tools such as Python and its many libraries for this purpose. While only the plain Python environment was used in this research, there are a number of convenient dedicated environments, such as Jupyter and Spyder. These tools are equipped with convenient interfaces and a built-in help system, which greatly facilitates and speeds up work on the data set.
This article presents selected issues of automatic analysis of data from production processes. In this way, the authors wanted to show how extensive the subject is. The second goal was to demonstrate how modern programming tools can be used to support data analysis in enterprises. The desire to show specific possibilities results from the fact that many modern IT systems make very little use of these modern technologies.