Data Classification

Statistics | By Safi Ur Rehman Qamar | October 9, 2024

Definition

As we examine these data, it is difficult to tell, without lengthy scrutiny, just how they are distributed. We find, after some searching, that the smallest observation is 426 and the largest observation is 740. Also, it becomes apparent that there are few observations below 500 or above 700. But we cannot quickly tell whether there are as many observations between 500 and 550 as between 650 and 700. We need to arrange the data so that the main features will be clear. This arrangement is called data classification.


Array in data classification

When data are arranged in order from smallest to largest, we have what is known as an array. After a brief examination of the array, it is clear that the observations in the 500s make up about half of the 100 observations, that the observations in the 600s account for about another 40 percent, and that observations less than 500 or greater than 700 account for only about 10 percent. We are able to learn more, with less effort, than we could when the data were not arranged.

But still, the data must be studied in order to draw these conclusions. Many people do not like to examine a mass of numbers, and many others don’t have the time to do so. Therefore, it would be advantageous if the information present in the array of observations could somehow be ‘compressed’ so that the distribution of the observations could be seen at a glance.

Table: SAT-Verbal data arranged in order for data classification

426	536	572	605	644
457	536	578	609	645
464	541	578	612	645
483	541	579	618	650
489	541	586	618	656
490	546	586	618	663
496	546	591	619	663
496	546	592	619	663
502	546	592	622	666
502	546	597	622	666
503	547	599	624	669
515	547	599	624	673
515	547	599	631	689
515	549	599	631	689
528	555	599	635	695
528	555	599	637	695
530	557	602	641	708
531	560	602	641	721
534	560	603	644	734
534	567	605	644	740
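
The proportions read off the array can be checked mechanically. The following is a minimal Python sketch, assuming the 100 scores are typed in row by row from the table above; the names `scores`, `array`, and `by_hundred` are illustrative and do not come from the original text.

```python
from collections import Counter

# SAT-Verbal scores, copied row by row from the table above.
scores = [
    426, 536, 572, 605, 644,
    457, 536, 578, 609, 645,
    464, 541, 578, 612, 645,
    483, 541, 579, 618, 650,
    489, 541, 586, 618, 656,
    490, 546, 586, 618, 663,
    496, 546, 591, 619, 663,
    496, 546, 592, 619, 663,
    502, 546, 592, 622, 666,
    502, 546, 597, 622, 666,
    503, 547, 599, 624, 669,
    515, 547, 599, 624, 673,
    515, 547, 599, 631, 689,
    515, 549, 599, 631, 689,
    528, 555, 599, 635, 695,
    528, 555, 599, 637, 695,
    530, 557, 602, 641, 708,
    531, 560, 602, 641, 721,
    534, 560, 603, 644, 734,
    534, 567, 605, 644, 740,
]

# An array is the set of observations sorted from smallest to largest.
array = sorted(scores)

# Count how many observations fall in the 400s, 500s, 600s, and 700s.
by_hundred = Counter((x // 100) * 100 for x in array)

print("smallest:", array[0], "largest:", array[-1])
for hundred in sorted(by_hundred):
    share = 100 * by_hundred[hundred] / len(array)
    print(f"{hundred}s: {by_hundred[hundred]} observations ({share:.0f}%)")
```

On these data the sketch counts 48 observations in the 500s and 40 in the 600s, with the remaining 12 below 500 or above 700, in line with the rough proportions described above.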


Class intervals or classes

Data classification is the device used to ‘compress’ the data. The range of the observations (in this case 740 − 426 = 314) is divided into a number of class intervals, or simply classes. Although the class intervals do not have to be equal, there are important advantages if they are. Consequently, we will use equal class intervals exclusively.

Importance of classifying data into classes

We must decide how many classes we wish to have. For large samples (over fifty observations, say), from ten to twenty classes will usually do nicely. For smaller samples, fewer classes can be used, perhaps as few as five or six. It should be emphasized that the number of classes is arbitrary. Given the same data, one person might classify them into twelve classes, another into fourteen, and yet another into only nine. In most problems (assuming a large number of observations), fewer than ten classes will result in too much information being lost. And if more than twenty are used, the work involved in analysing the data becomes more and more lengthy.

But let us return to the problem of deciding what the value of k, the number of classes, should be here. The range is 314 units. If we use ten classes, the width of each class interval would be 31.4 units; if twenty classes are used, the width of each class interval would be 15.7 units. Any convenient number between 15.7 and 31.4 will do for the width of the class interval. We will use 13 classes, each of width 25 units.
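
The arithmetic behind this choice can be written out as a short Python sketch. The variable names are illustrative, and the use of `math.ceil` to count the classes is an assumption that holds when the first class begins at or just below the smallest observation.

```python
import math

smallest, largest = 426, 740
data_range = largest - smallest      # 740 - 426 = 314

# Widths implied by the rule-of-thumb limits of ten and twenty classes.
widest = data_range / 10             # width if only ten classes are used
narrowest = data_range / 20          # width if twenty classes are used

# Any convenient width between the two will do; the text picks 25.
width = 25

# Number of classes of that width needed to span the whole range.
k = math.ceil(data_range / width)

print("range:", data_range)
print("width between", narrowest, "and", widest)
print("chosen width:", width, "-> classes:", k)
```

With a width of 25 the sketch arrives at 13 classes, which together span 13 × 25 = 325 units, slightly more than the range of 314.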

Boundaries in data classification

Just as the number of classes and the width of the class intervals are arbitrary, so is the point at which to begin the lowest class. We could begin the first class at 425. The first class would then run from 425 to 450, the second from 450 to 475, the third from 475 to 500, and so on. The numbers 425, 450, 475, and 500 are known as class boundaries: they separate one class from another. These boundaries are not well chosen, however, because it is not clear what should be done with a value such as 475 that falls exactly on a boundary.

Should we put 475 into the lower class, into the upper class, or into both? The difficulty is not serious, and can be avoided if we specify the classes like this: 425 up to but not including 450, 450 up to but not including 475, 475 up to but not including 500, and so on.

Another way around this difficulty is to use class boundaries that are stated more precisely than the observations. If the observations are given to the nearest integer, the boundaries should be given to the nearest half. If the observations are given correct to tenths, the boundaries should be given correct to twentieths, and so on. Using this procedure, and arbitrarily beginning at 424.5, the boundaries for the first three classes would be 424.5 to 449.5, 449.5 to 474.5, and 474.5 to 499.5.
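
As a rough illustration, the half-unit boundaries for all 13 classes can be generated from the starting point and the class width. The helper `class_boundaries` below is a hypothetical name introduced for this sketch only.

```python
def class_boundaries(start, width, k):
    """Return the k + 1 boundaries of k equal-width classes beginning at `start`."""
    return [start + i * width for i in range(k + 1)]


# 13 classes of width 25, arbitrarily beginning at 424.5 as in the text.
boundaries = class_boundaries(424.5, 25, 13)

# Pair each boundary with the next one to get (lower, upper) for every class.
classes = list(zip(boundaries[:-1], boundaries[1:]))

for lower, upper in classes[:3]:
    print(lower, "to", upper)
```

Because every boundary ends in .5 and every score is a whole number, no observation can fall exactly on a boundary.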

Class limits in data classification

The smallest and largest possible measurements in each class are called the class limits. Classes are sometimes specified in terms of the class limits. If this is done, there is no overlap as there was in the first example of selecting class boundaries, because the largest possible observation in one class cannot be the smallest possible observation in another. Specified in terms of their class limits, the first three classes would be 425 to 449, 450 to 474, and 475 to 499. If the scores had been reported to the nearest tenth of a unit, then scores of 449.9 as well as 425.0 would be possible.

With this more accurate measurement, the class limits of the first three classes would be 425.0 to 449.9, 450.0 to 474.9, and 475.0 to 499.9. When the classes are described in terms of the class limits, each boundary is understood to be half-way between the upper class limit of the lower class and the lower class limit of the upper class. For the class limits 425 to 449, 450 to 474, and 475 to 499, the class boundaries are 424.5, 449.5, 474.5, and 499.5.
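
The half-way rule can be sketched in Python as follows. The function name `boundaries_from_limits` is hypothetical, and the sketch assumes at least two classes with the same gap between every pair of adjacent classes; the two outermost boundaries are obtained by extending the first and last limits by that same half-gap.

```python
def boundaries_from_limits(limits):
    """Turn a list of (lower limit, upper limit) pairs into class boundaries."""
    # Interior boundaries: half-way between the upper limit of one class
    # and the lower limit of the next class.
    interior = [
        (lower_class[1] + upper_class[0]) / 2
        for lower_class, upper_class in zip(limits, limits[1:])
    ]
    # Distance from a class limit to its boundary (0.5 for integer scores).
    half_gap = interior[0] - limits[0][1]
    # Extend the outermost limits by the same half-gap.
    return [limits[0][0] - half_gap] + interior + [limits[-1][1] + half_gap]


limits = [(425, 449), (450, 474), (475, 499)]
print(boundaries_from_limits(limits))
```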

Class mark

The midpoint of a particular class interval is the point half-way between the class boundaries of that class and is called the class mark. If the class boundaries are 424.5 to 449.5, 449.5 to 474.5, and 474.5 to 499.5, then the class marks are 437, 462, and 487. If the class boundaries are 424.95 to 449.95, 449.95 to 474.95, and 474.95 to 499.95, then the class marks would be 437.45, 462.45, 487.45, and so on.
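
A minimal sketch of the class-mark calculation is given below. It uses Python's `decimal.Decimal` so that values such as 424.95 are handled exactly; this is a choice made for the sketch, not something the original text prescribes, and `class_marks` is an illustrative name.

```python
from decimal import Decimal


def class_marks(boundaries):
    """Return the midpoint of each pair of consecutive class boundaries."""
    return [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]


# Boundaries for integer scores and for scores reported to the nearest tenth.
half_unit = [Decimal(s) for s in ("424.5", "449.5", "474.5", "499.5")]
finer = [Decimal(s) for s in ("424.95", "449.95", "474.95", "499.95")]

print(class_marks(half_unit))
print(class_marks(finer))
```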

Summarizing these results for the first class:

Class interval: 425 to 449
Class limits: lower limit 425, upper limit 449
Class boundaries: 424.5, 449.5
Class mark: (424.5 + 449.5) / 2 = 437

Class frequency in data classification

The number of observations in any particular class is called the class frequency of that class. The class frequency of the ith class (there are k classes, so i can be any integer from 1 to k) is denoted by f_i. Thus f_1 is the class frequency of the first class, f_2 that of the second class, and so on. Since there are k classes, the class frequency of the last class is denoted f_k.
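
To make the notation concrete, the class frequencies f_1 to f_k for the SAT-Verbal data can be tallied with a short Python sketch. It assumes the 13 classes of width 25 beginning at 424.5 that were chosen earlier; the tally itself is not shown in the original text, and the variable names are illustrative.

```python
from bisect import bisect_right

# SAT-Verbal scores, copied row by row from the table.
scores = [
    426, 536, 572, 605, 644,
    457, 536, 578, 609, 645,
    464, 541, 578, 612, 645,
    483, 541, 579, 618, 650,
    489, 541, 586, 618, 656,
    490, 546, 586, 618, 663,
    496, 546, 591, 619, 663,
    496, 546, 592, 619, 663,
    502, 546, 592, 622, 666,
    502, 546, 597, 622, 666,
    503, 547, 599, 624, 669,
    515, 547, 599, 624, 673,
    515, 547, 599, 631, 689,
    515, 549, 599, 631, 689,
    528, 555, 599, 635, 695,
    528, 555, 599, 637, 695,
    530, 557, 602, 641, 708,
    531, 560, 602, 641, 721,
    534, 560, 603, 644, 734,
    534, 567, 605, 644, 740,
]

k, width, first_boundary = 13, 25, 424.5
boundaries = [first_boundary + i * width for i in range(k + 1)]

# frequencies[i - 1] holds f_i, the number of observations in the ith class.
frequencies = [0] * k
for x in scores:
    # Index of the class whose boundaries enclose x (no ties: boundaries end in .5).
    i = bisect_right(boundaries, x) - 1
    frequencies[i] += 1

for i, f in enumerate(frequencies, start=1):
    print(f"f_{i}: {boundaries[i - 1]} to {boundaries[i]} -> {f}")
```

The frequencies add up to 100, the total number of observations, which is a quick check that every score has been assigned to exactly one class.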
