Data Classification

Definition

As we examine these data, it is difficult to tell, without lengthy scrutiny, just how they are distributed. We find, after some searching, that the smallest observation is 426 and the largest observation is 740. Also, it becomes apparent that there are few observations below 500 or above 700. But we cannot quickly tell whether there are as many observations between 500 and 550 as between 650 and 700. We need to arrange the data so that the main features will be clear. This arrangement is called data classification.

Array in data classification

When data are arranged in order from smallest to largest, we have what is known as an array. After a brief examination of the array, it is clear that the observations in the 500s make up about half of the 100 observations, that the observations in the 600s account for about another 40 percent, and that observations less than 500 or greater than 700 account for only about 10 percent. We are able to learn more, with less effort, than we could when the data were not arranged.

But still, the data must be studied in order to draw these conclusions. Many people do not like to examine a mass of numbers, and many others don’t have the time to do so. Therefore, it would be advantageous if the information present in the array of observations could somehow be ‘compressed’ so that the distribution of the observations could be seen at a glance.

Table: SAT-Verbal Data, Arranged in Order

426 536 572 605 644
457 536 578 609 645
464 541 578 612 645
483 541 579 618 650
489 541 586 618 656
490 546 586 618 663
496 546 591 619 663
496 546 592 619 663
502 546 592 622 666
502 546 597 622 666
503 547 599 624 669
515 547 599 624 673
515 547 599 631 689
515 549 599 631 689
528 555 599 635 695
528 555 599 637 695
530 557 602 641 708
531 560 602 641 721
534 560 603 644 734
534 567 605 644 740
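
The proportions quoted for the array can be checked mechanically. The following is a minimal Python sketch, not part of the original discussion: the list name `scores` and the grouping by hundreds are illustrative choices, and the values are the 100 observations from the table, read column by column.

```python
from collections import Counter

# The 100 SAT-Verbal observations from the table, read down each column.
scores = [
    426, 457, 464, 483, 489, 490, 496, 496, 502, 502,
    503, 515, 515, 515, 528, 528, 530, 531, 534, 534,
    536, 536, 541, 541, 541, 546, 546, 546, 546, 546,
    547, 547, 547, 549, 555, 555, 557, 560, 560, 567,
    572, 578, 578, 579, 586, 586, 591, 592, 592, 597,
    599, 599, 599, 599, 599, 599, 602, 602, 603, 605,
    605, 609, 612, 618, 618, 618, 619, 619, 622, 622,
    624, 624, 631, 631, 635, 637, 641, 641, 644, 644,
    644, 645, 645, 650, 656, 663, 663, 663, 666, 666,
    669, 673, 689, 689, 695, 695, 708, 721, 734, 740,
]

# An array is simply the data in ascending order.
is_array = scores == sorted(scores)

# Group the observations by their hundreds digit (4 -> the 400s, and so on).
by_hundreds = Counter(x // 100 for x in scores)

print("smallest:", min(scores), "largest:", max(scores))
for hundred in sorted(by_hundreds):
    print(f"{hundred}00s: {by_hundreds[hundred]} observations")
```

With these data the tally is 8 observations in the 400s, 48 in the 500s, 40 in the 600s, and 4 in the 700s, which matches the rough proportions read off the array by eye.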

Class intervals or classes

Data classification ‘compresses’ the data by grouping the observations. The range of the observations (in this case 740 - 426 = 314) is divided into a number of class intervals, or simply classes. Although the class intervals do not have to be equal, there are important advantages if they are. Consequently, we will use equal class intervals exclusively.

Importance of classifying data into classes

We must decide how many classes we wish to have. For large samples (over fifty observations, say), from ten to twenty classes will usually do nicely. For smaller samples, fewer classes can be used, perhaps as few as five or six. It should be emphasized that the number of classes is arbitrary. Given the same data, one person might classify them into twelve classes, another into fourteen, and yet another into only nine. In most problems (assuming a large number of observations), fewer than ten classes will result in too much information being lost. And if more than twenty are used, the work involved in analysing the data becomes more and more lengthy.

But let us return to the problem of deciding what the value of k, the number of classes, should be here. The range is 314 units. If we use ten classes, the width of each class interval would be 31.4 units; if twenty classes are used, the width of each class interval would be 15.7 units. Any convenient number between 15.7 and 31.4 will do for the width of the class interval. We will use 13 classes, each of width 25 units, which together span 325 units and so cover the whole range.
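
The arithmetic behind this choice can be written out as a short Python sketch. The variable names are illustrative, and taking the ceiling of range / width to get k is an assumption that works here because the first class starts just below the smallest observation.

```python
import math

smallest, largest = 426, 740
data_range = largest - smallest        # 314 units

# Class width implied by the two ends of the "ten to twenty classes" rule.
width_if_10 = data_range / 10          # 31.4
width_if_20 = data_range / 20          # 15.7

# Pick a convenient width between those two values.
chosen_width = 25

# Number of classes needed to cover the range at that width.
k = math.ceil(data_range / chosen_width)

print(data_range, width_if_10, width_if_20, k)
```

A width of 25 gives k = 13, which falls inside the recommended ten to twenty classes.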

Boundaries in data classification

Just as the number of classes and the width of the class intervals are arbitrary, so is the point at which to begin the lowest class. We could begin the first class at 425. Thus the first class would be from 425 to 450, the second from 450 to 475, the third from 475 to 500, and so on. The numbers 425, 450, 475, and 500 are known as class boundaries: they separate one class from another. These boundaries are not well chosen, however, because it is not clear what should be done with values that fall exactly on a boundary, such as 475.

Should we put 475 into the lower class, into the upper class, or into both? The difficulty is not serious, and can be avoided if we specify the classes like this: 425 up to but not including 450, 450 up to but not including 475, 475 up to but not including 500, and so on.
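
This "lower boundary included, upper boundary excluded" convention maps naturally onto integer division. The sketch below is illustrative only: `class_index` is a hypothetical helper, and the starting point 425 and width 25 are the values chosen in the text.

```python
def class_index(x, start=425, width=25):
    """Return the 0-based index of the class that contains x.

    Each class includes its lower boundary and excludes its upper
    boundary, so class i covers start + i*width up to, but not
    including, start + (i+1)*width.
    """
    return (x - start) // width


for score in (474, 475, 499, 500):
    i = class_index(score)
    lower = 425 + i * 25
    print(f"{score} falls in the class {lower} but not {lower + 25}")
```

Under this convention the troublesome value 475 belongs to exactly one class, the third one (index 2).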

Another way around this difficulty is to use class boundaries that are more precise than the observations. If the observations are given to the nearest integer, the boundaries should be given to the nearest half. If the observations are given correct to tenths, the boundaries should be given correct to twentieths, and so on. Using this procedure, and arbitrarily beginning at 424.5, the boundaries for the first three classes would be 424.5-449.5, 449.5-474.5, and 474.5-499.5.

Class limit in data classification

The smallest and largest possible measurements in each class are called the class limits. Classes are sometimes specified in terms of the class limits. If this is done, there is no overlap as there was in the first example of selecting class boundaries, because the largest possible observation in one class cannot be the smallest possible observation in another. Specified in terms of their class limits, the first three classes would be 425-449, 450-474, and 475-499. If the scores had been reported to the nearest tenth of a unit, then scores of 449.9 as well as 425.0 would be possible.

With this more precise measurement, the class limits of the first three classes would be 425.0-449.9, 450.0-474.9, and 475.0-499.9. When the classes are described in terms of the class limits, each boundary is understood to be half-way between the upper class limit of the lower class and the lower class limit of the upper class. For the class limits 425-449, 450-474, and 475-499, the class boundaries are 424.5, 449.5, 474.5, and 499.5.
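
The half-way rule can be sketched in a few lines of Python. The names `limits` and `unit` are illustrative. Extending the outermost boundaries by half a unit is an assumption consistent with the values given in the text.

```python
# Class limits for the first three classes, as (lower, upper) pairs.
limits = [(425, 449), (450, 474), (475, 499)]
unit = 1  # the observations are recorded to the nearest integer

# The lowest boundary lies half a unit below the first lower limit.
boundaries = [limits[0][0] - unit / 2]

# Each interior boundary is half-way between one class's upper limit
# and the next class's lower limit.
for (_, upper), (next_lower, _) in zip(limits, limits[1:]):
    boundaries.append((upper + next_lower) / 2)

# The highest boundary lies half a unit above the last upper limit.
boundaries.append(limits[-1][1] + unit / 2)

print(boundaries)
```

For these limits the sketch reproduces the boundaries 424.5, 449.5, 474.5, and 499.5.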

Class mark

The midpoint of a particular class interval is the point half-way between the class boundaries of that class and is called the class mark. If the class boundaries are 424.5-449.5, 449.5-474.5, 474.5-499.5, then the class marks are 437, 462, 487. If the class boundaries are 424.95-449.95, 449.95-474.95, 474.95-499.95, then the class marks would be 437.45, 462.45, 487.45, and so on.
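
As a quick check of both sets of class marks, here is a minimal Python sketch. `class_marks` is a hypothetical helper, and `Decimal` is used only so that values such as 424.95 are handled exactly rather than as binary floating-point approximations.

```python
from decimal import Decimal


def class_marks(boundaries):
    """Return the midpoint of each pair of consecutive class boundaries."""
    return [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]


# Boundaries for observations recorded to the nearest integer.
half_unit = [Decimal(s) for s in ("424.5", "449.5", "474.5", "499.5")]

# Boundaries for observations recorded to the nearest tenth.
twentieths = [Decimal(s) for s in ("424.95", "449.95", "474.95", "499.95")]

marks_integer = class_marks(half_unit)
marks_tenths = class_marks(twentieths)

print([str(m) for m in marks_integer])
print([str(m) for m in marks_tenths])
```

The first list is numerically equal to 437, 462, 487 and the second to 437.45, 462.45, 487.45.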

Summarizing these results for the first class, we have:

  • Class interval: 425-449
  • Class limits: lower limit 425, upper limit 449
  • Class boundaries: 424.5, 449.5
  • Class mark: (424.5 + 449.5) / 2 = 437

Class frequency in data classification

The number of observations in any particular class is called the class frequency of that class. The class frequency of the ith class (there are k classes, so i can be any integer from 1 to k) is denoted by f_i. Thus f_1 is the class frequency of the first class, f_2 that of the second class, and so on. Since there are k classes, the class frequency of the last class is denoted f_k.
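
To make the notation concrete, the class frequencies for the SAT-Verbal data can be tallied with a short Python sketch. It assumes the 13 classes of width 25 starting at 425 chosen earlier, with each class including its lower limit. The names `scores`, `START`, `WIDTH`, and `K` are illustrative.

```python
from collections import Counter

# The 100 SAT-Verbal observations from the table, in ascending order.
scores = [
    426, 457, 464, 483, 489, 490, 496, 496, 502, 502,
    503, 515, 515, 515, 528, 528, 530, 531, 534, 534,
    536, 536, 541, 541, 541, 546, 546, 546, 546, 546,
    547, 547, 547, 549, 555, 555, 557, 560, 560, 567,
    572, 578, 578, 579, 586, 586, 591, 592, 592, 597,
    599, 599, 599, 599, 599, 599, 602, 602, 603, 605,
    605, 609, 612, 618, 618, 618, 619, 619, 622, 622,
    624, 624, 631, 631, 635, 637, 641, 641, 644, 644,
    644, 645, 645, 650, 656, 663, 663, 663, 666, 666,
    669, 673, 689, 689, 695, 695, 708, 721, 734, 740,
]

START, WIDTH, K = 425, 25, 13  # first lower limit, class width, number of classes

# Count how many observations fall in each class (0-based class index).
counts = Counter((x - START) // WIDTH for x in scores)

# frequencies[i - 1] holds f_i, the class frequency of the ith class.
frequencies = [counts[i] for i in range(K)]

for i, f in enumerate(frequencies, start=1):
    lower = START + (i - 1) * WIDTH
    print(f"f_{i}: class {lower}-{lower + WIDTH - 1} has {f} observations")
```

The frequencies f_1 through f_13 sum to 100, the total number of observations. The busiest class is 525-549, with 20 observations.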