Chapter 10


Data Dictionaries


“Dictionaries are like watches; the worst is better than none, and the best cannot be expected to go quite true.”

-- Mrs. Piozzi, Anecdotes of Samuel Johnson, 1786




Introduction

IN THIS CHAPTER, YOU WILL LEARN:


  1. Why we need a data dictionary in a systems development project;
  2. The notation for data dictionary definitions;
  3. How a data dictionary should be presented to the user; and
  4. How to implement a data dictionary.


The second important modeling tool that we will discuss is the data dictionary. Though it doesn’t have the glamour and graphical appeal of dataflow diagrams, entity-relationship diagrams, and state-transition diagrams, the data dictionary is crucial. Without it, your model of the user’s requirements cannot possibly be considered complete; all you will have is a rough sketch, an “artist’s rendering” of the system.

The importance of a data dictionary is often lost on many adults, for they have not used a dictionary for 10 or 20 years. Try to think back to your elementary school days, when you were constantly besieged with new words in your schoolwork. Think back also to your foreign language courses, particularly the ones that required you to read books and magazines. Without a dictionary, you would have been lost. The same is true of a data dictionary in systems analysis: without it, you will be lost, and the user won’t be sure you have understood the details of the application.

The phrase data dictionary is almost self-defining. The data dictionary is an organized listing of all the data elements that are pertinent to the system, with precise, rigorous definitions so that both user and systems analyst will have a common understanding of all inputs, outputs, components of stores, and intermediate calculations. The data dictionary defines the data elements by doing the following:


  • Describing the meaning of the flows and stores shown in the dataflow diagrams.
  • Describing the composition of aggregate packets of data moving along the flows, that is, complex packets (such as a customer address) that can be broken into more elementary items (such as city, state, and postal code).
  • Describing the composition of packets of data in stores.
  • Specifying the relevant values and units of elementary chunks of information in the dataflows and data stores.
  • Describing the details of relationships between stores that are highlighted in an entity-relationship diagram. This aspect of the data dictionary will be discussed in more detail in Chapter 12 after we have introduced the entity-relationship notation.




The need for data dictionary notation

In most real-world systems that you will work on, the packets, or data elements, will be sufficiently complex that you will need to describe them in terms of other things. Complex data elements are defined in terms of simpler data elements, and simple data elements are defined in terms of the legitimate units and values they can take on.

Think, for example, about the way you would respond to the following question from a Martian (which is the way many users think of systems analysts!) about the meaning of a person’s name:

Martian: “So what is this thing called a name?”
You (shrugging impatiently): “Well, you know, it’s just a name. I mean, like, well, it’s what we call each other.”
Martian (puzzled): “Does that mean you can call them something different when you’re angry than when you’re happy?”
You (slightly amazed at the ignorance of this alien): “No, of course not. A name is the same all the time. A person’s name is what we use to distinguish him or her from other people.”
Martian (suddenly understanding): “Ahh, now I understand. We do the same thing on my planet. My name is 3.141592653589793238462643.”
You (incredulous): “But that’s a number, not a name.”
Martian: “And a very good name it is, too. I’m proud of it. Nobody has anything close.”
You: “But what about your first name? Or is your first name 3, and your last name 1415926535?”
Martian: “What’s this about first name and last name? I don’t understand. I have only one name, and it’s always the same.”
You: “Well, that’s not the way it works here. We have a first name, and a last name, and sometimes we have a middle name too.”
Martian: “Does that mean you could be called 23 45 99?”
You: “No, we don’t allow numbers in our names. You can only use the alphabetic characters A through Z.”


As you can imagine, the conversation could continue for a very long time. You might think the example is contrived, because we rarely run into Martians who have no concept of the meaning of a name. But it is not too far from the discussions that take place (or should take place) between a systems analyst and a user, in which the following questions might be raised:


  • Must everyone have a first name? What about the character “Mr. T” on the old TV series, “The A-Team”?
  • What about punctuation characters in a person’s last name; for example, “D’Arcy”?
  • Are abbreviated middle names allowed, for example, “John X James”?
  • Is there a minimal length required of a person’s name? For example, is the name “X Y” legal? (One could imagine that it would wreak havoc with many computer systems throughout the country, but is there any legal/business reason why a person couldn’t give himself a first name of X and a last name of Y?)
  • How should we treat the suffixes that sometimes follow a last name? For example, the name “John Jones, Jr.” is presumably legitimate, but is the Jr. to be considered part of the last name or a special new category? And if it is a new category, shouldn’t we allow numeric digits, too; for example, Sam Smith 3rd?

Note, by the way, that none of these questions has anything to do with the way we will eventually store the information on a computer; we are simply trying to determine, as a matter of business policy, what constitutes a valid name.<ref>On the other hand, it is likely that the business policy presently in place has been strongly influenced by the computer systems that the organization has been using for the past 30 years. Fifty years ago, someone might have been considered eccentric if he decided to call himself “Fre5d Smi7th” but it probably would have been accepted by most organizations, because names were transcribed onto pieces of paper by human hands. Early computer systems (and most of the ones in place today) have a lot more trouble with such nonstandard names.</ref>


As you can imagine, it gets rather tedious describing the composition of data elements in a rambling narrative form. We need a concise, compact notation, just as a standard dictionary like Webster’s has a compact, concise notation for defining the meaning of ordinary words.


Data dictionary notation

There are many common notational schemes used by systems analysts. The one shown below is among the more common, and it uses a number of simple symbols:


= is composed of
+ and
( ) optional (may be present or absent)
{ } iteration
[ ] select one of several alternative choices
* * comment
@ identifier (key field) for a store
| (or ;) separates alternative choices in the [ ] construct


As an example, we might define name for our friendly Martian as follows:


name = courtesy-title + first-name + (middle-name) + last-name
courtesy-title = [Mr. | Miss | Mrs. | Ms. | Dr. | Professor]
first-name = {legal-character}
middle-name = {legal-character}
last-name = {legal-character}
legal-character = [A-Z|a-z|0-9|'|-| | ]


As you can see, the symbols look rather mathematical; you may be worried that the notation is far too complicated to understand. As we will soon see, though, it is quite easy to read. The experience of several thousand IT development projects and tens of thousands of users has shown us that the notation is also quite understandable to almost all users if it is presented properly; we will discuss this in Section 10.3.
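To see how much policy the short definition of name above actually pins down, it can help to turn it into a check. The following Python sketch is only an illustration: the function name `is_valid_name`, the choice to pass the components separately, and the requirement that each name part contain at least one character are assumptions of this example (the notation {legal-character} on its own would also permit an empty name).

```python
import re

# Choices taken from the courtesy-title selection in the definition above.
COURTESY_TITLES = {"Mr.", "Miss", "Mrs.", "Ms.", "Dr.", "Professor"}

# legal-character = [A-Z|a-z|0-9|'|-| ]
# The "+" (one or more characters) is an assumption of this sketch.
LEGAL_CHARACTERS = re.compile(r"[A-Za-z0-9' -]+")


def is_valid_name(courtesy_title, first_name, last_name, middle_name=None):
    """Check the components of a name against the dictionary definition."""
    if courtesy_title not in COURTESY_TITLES:
        return False
    parts = [first_name, last_name]
    if middle_name is not None:  # (middle-name) is optional
        parts.append(middle_name)
    return all(LEGAL_CHARACTERS.fullmatch(part) for part in parts)
```

A call such as `is_valid_name("Mr.", "John", "D'Arcy")` is accepted, while a title outside the selection (for example "Sir") is rejected. Whether that is the right policy is, of course, a question for the user.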


Definitions

A definition of a data element is introduced with the symbol “=”; in this context, the “=” is read as “is defined as,” or “is composed of,” or simply “means.” Thus, the notation


A = B + C


could be read in any of the following ways:


  • Whenever we say A, we mean a B and a C
  • A is composed of B and C
  • A is defined as B and C


To completely define a data element, our definition will include the following:


  • The meaning of the data element within the context of this user’s application. This is usually provided as a comment, using the “* *” notation.
  • The composition of the data element, if it is composed of meaningful elementary components.
  • The legal values that the data element can take on, if it is an elementary data element that cannot be decomposed any further.


Thus, if we were building a medical system that kept track of patients, we might define the terms weight and height in the following way:


weight =
*patient’s weight upon admission to the hospital*
*units: kilograms; range: 1-200*

height =
*patient’s height upon admission to the hospital*
*units: centimeters; range: 20-200*


Note that we have described the relevant units and the relevant range within matching “*” characters. Again, this is a notational convention that many IT organizations find useful, but it can be changed if necessary.

In addition to the units and range, you may also need to specify the accuracy or precision with which the data element is measured. For a data element like price, for example, it is important to indicate whether the values will be expressed in whole dollars, to the nearest penny, and so on.<ref>Not only that, we need to specify whether we’re dealing in U.S. dollars, Canadian dollars, Australian dollars, Hong Kong dollars, etc.</ref> And in many engineering and scientific applications, it is important to indicate the number of significant digits in the value of data elements.
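The meaning, units, and range recorded for an elementary data element map naturally onto a small record. The Python sketch below restates the weight and height entries from the hospital example; the class name `ElementaryElement`, its field names, and the decision to treat both ends of the range as inclusive are assumptions made for illustration, not part of the notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementaryElement:
    """One elementary data dictionary entry (hypothetical layout)."""
    name: str
    meaning: str
    units: str
    low: float
    high: float

    def accepts(self, value):
        """Return True if value lies within the documented range.

        Treating the range as inclusive is an assumption of this sketch.
        """
        return self.low <= value <= self.high


weight = ElementaryElement(
    "weight", "patient's weight upon admission to the hospital",
    "kilograms", 1, 200)
height = ElementaryElement(
    "height", "patient's height upon admission to the hospital",
    "centimeters", 20, 200)
```

Note that the record says nothing about precision; if fractional kilograms matter to the user, that would need to be added as a further field.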

Elementary Data Elements

Elementary data elements are those for which there is no meaningful decomposition in the context of the user’s environment. This is often a matter of interpretation and one that you must explore carefully with the user. For example, we have seen in the discussion above that the term name could be decomposed into last-name, first-name, middle-name, and courtesy-title. But perhaps in some user environments no such decomposition is necessary, relevant, or even meaningful (i.e., where the terms last-name, etc., have no meaning to the user).

When we have identified elementary data items, they must be entered in the data dictionary. As indicated above, the data dictionary should provide a brief narrative comment, enclosed within “*” characters, describing the meaning of the term within the user’s context. Of course, there will be some terms that are self-defining, that is, terms whose meaning is universally the same for all information systems, or where the systems analyst might agree that no further elaboration is necessary. For example, the following might be considered self-defining terms in a system that maintains information about people:


current-height
current-weight
date-of-birth
sex
home-phone-number


In these cases, no narrative comment is necessary; many systems analysts will use the notation “**” to indicate a “null comment” when the data element is self-defining. However, it is important to specify the values and units of measure that the elementary data item can take on. For example:


current-weight =
**
*units: pounds; range: 1-400*


current-height =
**
*units: inches; range: 1-96*


date-of-birth =
**
*units: days since Jan 1, 1900; range: 0-36500*


sex =
*values: [M | F]*


Optional Data Elements

An optional data element, as the phrase implies, is one that may or may not be present as a component of a composite data element. There are many examples of optional data elements in information systems:


  • A customer’s name may or may not include a middle name.
  • A customer’s street address may or may not include such secondary information as an apartment number.
  • A customer’s order may contain a billing address, a shipping address, or possibly both.


Situations like the last one must be carefully verified with the user and must be accurately documented in the data dictionary. For example, the notation


customer-address = (shipping-address) + (billing-address)


means, quite literally, that the customer-address might consist of:


  • just a shipping-address; or
  • just a billing-address; or
  • a shipping-address and a billing-address; or
  • neither a shipping-address nor a billing-address


This last possibility is rather dubious. It is far more likely that the user really means that the customer-address must consist of a shipping-address or a billing-address or both. This could be expressed in the following way:


customer-address = [shipping-address | billing-address | shipping-address + billing-address]


One could also argue that, in a mail-order business, one always needs a billing address to ensure that the order will be paid for; a separate shipping address (e.g., if the customer’s accounting department is in a separate location) is optional. Thus, it is possible that the user’s real business policy is better expressed by


customer-address = billing-address + (shipping-address)


But of course the only way to know this is to ask the user and to carefully explain the implications of the different notations shown above.<ref>There is one possibility that might explain the absence of both shipping address and billing address in a customer order: the walk-in customer who wishes to purchase an item and carry it away with him. It is likely that we would want to explicitly identify such a customer (by defining a new data element called walk-in that could have a value of true or false) because (1) walk-in customers may need to be treated differently (for example, their orders won’t have any shipping charges), and (2) it’s a good way to double-check and ensure that the missing shipping-address or billing-address was not a mistake.</ref>
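One way to explain the implications of the three notations to a user is to list the combinations each one permits. The Python sketch below does this for the three customer-address definitions discussed above; the names `ADDRESS_RULES` and `allowed_combinations` are hypothetical, and each rule simply restates the corresponding definition as a true/false test on whether the shipping and billing addresses are present.

```python
from itertools import product

# Each rule takes (shipping_present, billing_present) and says whether
# that combination satisfies the definition used as its key.
ADDRESS_RULES = {
    "(shipping-address) + (billing-address)":
        lambda shipping, billing: True,
    "[shipping-address | billing-address | shipping-address + billing-address]":
        lambda shipping, billing: shipping or billing,
    "billing-address + (shipping-address)":
        lambda shipping, billing: billing,
}


def allowed_combinations(rule):
    """List the (shipping_present, billing_present) pairs a rule accepts."""
    return [pair for pair in product([True, False], repeat=2) if rule(*pair)]


for definition, rule in ADDRESS_RULES.items():
    print(definition, "->", allowed_combinations(rule))
```

The first definition accepts all four combinations, including the dubious case with no address at all; the second accepts three; the third accepts only the two in which a billing address is present.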


Iteration

The iteration notation is used to indicate the repeated occurrence of a component of a data element. It is read as “zero or more occurrences of.” Thus, the notation


order = customer-name + shipping-address + { item }


means that an order must always contain a customer-name, and must always contain a shipping-address, and will also contain zero or more occurrences of an item. Thus, we may be dealing with a customer who places an order involving only one item or two items, or someone on a shopping binge who decides to order 397 different items.<ref>Keep in mind once again that we are defining the intrinsic business meaning of a data element without regard to the technology that will eventually be used to implement it. Eventually, for example, our systems designers are likely to ask for a reasonable upper limit on the number of different items that can be contained in a single order. “In order to make things work efficiently with our SUPERWHIZ database management system, we’ll have to restrict the number of items to 64. It’s unlikely that anyone would want to order more than 64 different items anyway, and if they do, they can simply place multiple orders.” And the user may have his own limitations, based on the paper forms or printed reports that he deals with; this is part of the user implementation model, which we will discuss in Chapter 21.</ref>

In many real-world situations, the user will want to specify upper and lower limits to the iteration. For instance, in the example above, the user will probably point out that it does not make sense for a customer to place an order with zero items; there must be at least one item in the order. And the user may want to specify an upper limit; perhaps 10 items is the most that will be allowed. We can indicate upper and lower limits in the following way:


order = customer-name + shipping-address + 1{item}10


It’s okay to specify just a lower limit, or just an upper limit, or both or neither. Thus, all of the following are allowable:


a = 1{b}
a = {b}10
a = 1{b}10
a = {b}
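The four forms above differ only in which limits are stated. As a minimal Python sketch (the function name `satisfies_iteration` and its parameters are assumptions of this example), a missing lower limit can be treated as zero and a missing upper limit as "no limit":

```python
def satisfies_iteration(occurrences, lower=0, upper=None):
    """Check a list of occurrences against the notation lower{item}upper.

    A missing lower limit means zero; a missing upper limit (None)
    means there is no maximum.
    """
    count = len(occurrences)
    if count < lower:
        return False
    return upper is None or count <= upper
```

With this reading, an empty order satisfies {item} but not 1{item}10, and an order with 11 items fails 1{item}10 while still satisfying 1{item}.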


Selection

The selection notation indicates that a data element consists of exactly one of a set of alternative choices. The choices are enclosed by the square brackets “[” and “]” and separated by the vertical bar “|” character. Typical examples are:


sex = [Male | Female]
customer-type = [Government | Industry | University | Other]


It is important to review the selection choices with the user to ensure that all possibilities have been identified. In the last example, the user might tend to concentrate her or his attention on the “government,” “industry” and “university” customers, and might require some prodding to remember that some customers fall into the “none of the above” category.
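A selection corresponds closely to an enumerated type: a value must be exactly one of the listed choices. The following Python sketch restates the customer-type definition above; the names `CustomerType` and `parse_customer_type` are hypothetical and chosen only for this illustration.

```python
from enum import Enum


class CustomerType(Enum):
    """customer-type = [Government | Industry | University | Other]"""
    GOVERNMENT = "Government"
    INDUSTRY = "Industry"
    UNIVERSITY = "University"
    OTHER = "Other"


def parse_customer_type(text):
    """Return the matching choice; raise ValueError for anything else."""
    return CustomerType(text)
```

If the “Other” choice had been forgotten, every customer outside the first three categories would be rejected as invalid, which is exactly the kind of omission the review with the user is meant to catch.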


Aliases

An alias, as the term implies, is an alternative name for a data element. It is a common occurrence when dealing with a diverse group of users, often in different departments or different geographical locations (and sometimes with different nationalities and different languages), who insist on using different names to mean the same thing. The alias is included in the data dictionary for completeness, and it is cross-referenced to the primary or official data name. For example:


client =
*alias for customer*


Note that the definition of client does not show the composition (i.e., it does not show that a client consists of a name, address, telephone number, etc.). All this detail should be provided only for the primary data name, in order to minimize the redundancy in the model.<ref>You may wish to ignore this advice if you are using a computerized data dictionary package that can manage and control the redundancy; however, this is fairly uncommon. The crucial thing to remember is that if we change the definition of a primary data element (e.g., if we decide that the definition of a customer should no longer include the phone-number) then the change must apply to all the aliases as well.</ref>

Even though the data dictionary correctly cross-references the aliases to the primary data name, you should avoid using aliases whenever possible. This is because the data names are usually first seen, and are most visible to all users, on the dataflow diagrams, where it may not be obvious that customer and client are aliases for one another. It is far better, if at all possible, to get the users to agree on one common name.<ref>An alternative is to annotate the flow on the dataflow diagram to indicate that it is an alias for something else; an asterisk, for example, could be appended to the end of alias names. For example, the notation client* could be used to indicate that client is an alias for something else. But even this is cumbersome.</ref>
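The cross-referencing of aliases described in this section can be sketched in a few lines of Python. The tables `PRIMARY_DEFINITIONS` and `ALIASES`, the function names, and the composition shown for customer are all assumptions made for illustration. The point is that the composition is stored once, under the primary name, so a change to it automatically applies to every alias.

```python
# Composition is recorded only for primary data names (hypothetical entry).
PRIMARY_DEFINITIONS = {"customer": "name + address + phone-number"}

# Each alias is cross-referenced to its primary data name.
ALIASES = {"client": "customer"}


def resolve(name):
    """Follow alias cross-references until a primary data name is reached."""
    seen = set()
    while name in ALIASES:
        if name in seen:
            raise ValueError(f"circular alias: {name}")
        seen.add(name)
        name = ALIASES[name]
    return name


def composition(name):
    """Look up the composition of a data name or of any of its aliases."""
    return PRIMARY_DEFINITIONS[resolve(name)]
```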

Showing the data dictionary to the user

The data dictionary is created by the systems analyst during the development of the system model, but the user must be capable of reading and understanding the data dictionary in order to verify the model. This raises some obvious questions:


  • Will the users be able to understand the data dictionary notation?
  • How should the users verify that the dictionary is complete and correct?
  • How is the dictionary created?


The question of user acceptance of the dictionary notation is a “red herring” in most cases. Yes, the dictionary notation looks somewhat mathematical; but, as we have seen, the number of symbols that the user has to learn is very small. Users are accustomed to a variety of formal notations in their work and personal life; consider, for example, the notation for musical scores, which is far more complex.


Figure 10.1: Musical score notation


Similarly, the notation for bridge, chess, and a variety of other activities is at least as complex as that of the data dictionary notation shown in this chapter.


Figure 10.2: Chess notation


The question of user verification of the data dictionary usually leads to this question: “Should the users read through the entire dictionary, item by item, to ensure that it is correct?” It is difficult to imagine that any user would be willing to do this! It is more likely that the user will verify the correctness of the data dictionary in conjunction with the dataflow diagram, entity-relationship diagram, state-transition diagram, or process specification that he or she is reading.

There are a number of “correctness” checks that the systems analyst can carry out on his own, without the assistance of the user: he can ensure that the dictionary is complete, consistent, and non-contradictory. Thus, he can examine the dictionary on his own and ask the following questions:


  • Has every flow on the dataflow diagram been defined in the data dictionary?
  • Have all the components of composite data elements been defined?
  • Has any data element been defined more than once?
  • Has the correct notation been used for all data dictionary definitions?
  • Are there any data elements in the data dictionary that are not referenced in the dataflow diagrams, entity-relationship diagrams, or state-transition diagrams?
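Most of the questions above are mechanical enough to automate. The following Python sketch covers four of them (it does not check that the notation itself is well formed). It is an illustration under stated assumptions: the dictionary is a list of (name, definition) pairs, data names are written in lowercase with hyphens while literal values are capitalized, and the function names `component_names` and `check_dictionary` are hypothetical.

```python
import re


def component_names(definition):
    """Extract the data names used in a definition, ignoring *comments*."""
    without_comments = re.sub(r"\*[^*]*\*", "", definition)
    # Assumption: data names start with a lowercase letter.
    return set(re.findall(r"[a-z][a-z0-9-]*", without_comments))


def check_dictionary(entries, diagram_names):
    """Run completeness and consistency checks on a data dictionary.

    entries: list of (name, definition) pairs.
    diagram_names: names of flows and stores appearing on the diagrams.
    """
    defined = [name for name, _ in entries]
    defined_set = set(defined)
    used = set()
    for _, definition in entries:
        used |= component_names(definition)
    return {
        "undefined_flows": set(diagram_names) - defined_set,
        "undefined_components": used - defined_set,
        "duplicates": {n for n in defined if defined.count(n) > 1},
        "unreferenced": defined_set - used - set(diagram_names),
    }
```

Each key in the result holds the set of offending names, so an empty set under every key means the dictionary has passed these particular checks.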


Implementation of the data dictionary

On a medium- or large-sized system, the data dictionary can represent a formidable amount of work. It is not uncommon to see a data dictionary with several thousand entries, and even a relatively simple system will have several hundred entries. Thus, some thought must be given to the way the dictionary will be developed, or the task is likely to overwhelm the systems analyst.

The easiest approach is to make use of an automated (computerized) facility to enter dictionary definitions, check them for completeness and consistency, and produce appropriate reports. If your organization is using any modern database management system (e.g., DB2, Oracle, Sybase, Microsoft Access), a dictionary facility is already available. In this case, you should take advantage of the facility and use it to build your data dictionary. However, beware of the following potential limitations:


  • You may be forced to limit your data names to a certain length (e.g., 15 or 32 characters). This probably won’t be a major problem, but you may find that your user may insist on a name such as destination-of-customer-shipment and that your data dictionary package forces you to abbreviate the name to dest-of-cust-ship.
  • Other artificial limitations may be placed on the name. For example, the hyphen character “-” may not be allowed, and you may be forced to use the underscore “_” character instead. Or you may be forced to prefix (or suffix) all your names with a project code indicating the name of the systems development project, leading to such names as


acct.pay.GHZ345P14.vendor_phone_number.


  • You may be forced to assign physical attributes (e.g., the number of bytes, or blocks of disk storage, or such data representations as packed decimal) to an item of data, even though it is not a matter of user policy. The data dictionary discussed in this chapter should be an analysis dictionary and should not require unnecessary or irrelevant implementation decisions.


Some systems analysts are also beginning to use automated toolkit packages that include graphic facilities for dataflow diagrams, and the like, as well as data dictionary capabilities. Again, if such a facility exists, you should make use of it. Automated toolkits are discussed in more detail in Appendix A.

If you have no automated facility for building the data dictionary, you should at least be able to use a conventional word-processing system to build a text file of data dictionary definitions. Or, if you have access to a personal computer, you can use any of the common file-management and database management programs (e.g., Microsoft Access for Windows-based computers, or FileMaker for Macintosh computers) to construct and manage your data dictionary.

Only in the most extreme case should you resort to a manual data dictionary, that is, a separate 3-by-5 index card for each dictionary entry. This was often necessary prior to the 1990s; even when PCs were already widely deployed, it was discouraging to see how many organizations kept their programmers and systems analysts in the Dark Ages. The cobbler’s children, as the saying goes, are usually the last to get shoes. But today, it is unforgivable; if you are working on a project where you do not have access to a data dictionary package or an automated analyst’s toolkit or a personal computer or a word processing system, then you should (1) quit and find a better job, or (2) get your own personal computer, or (3) both of the above.

Summary

Building a data dictionary is one of the more tedious, time-consuming aspects of systems analysis. But it is also one of the more important aspects: without a formal dictionary that defines the meaning of all the terms, there can be no hope for precision.

In the next chapter, we will see how to use the data dictionary and the dataflow diagram to build process specifications for each of the bottom-level processes.


REFERENCES

  1. J.D. Lomax, Data Dictionary Systems. Rochelle Park, N.J.: NCC Publications, 1977.
  2. Tom DeMarco, Structured Analysis and Systems Specification. New York: YOURDON Press, 1979.
  3. D. Kroenke, Database Processing. Chicago: Science Research Associates, 1977.
  4. Shaku Atre, Data Base: Structured Techniques for Design, Performance, and Management. New York: Wiley, 1980.


QUESTIONS AND EXERCISES

  1. Give a definition of data dictionary.
  2. Why is a data dictionary important in systems analysis?
  3. What information does a data dictionary provide about a data element?
  4. What is the meaning of the “=” notation in a data dictionary?
  5. What is the meaning of the “+” notation in a data dictionary?
  6. What is the meaning of the “( )” notation in a data dictionary?
  7. What is the meaning of the “{ }” notation in a data dictionary?
  8. What is the meaning of the “[ | | ]” notation in a data dictionary?
  9. Do you think the users you work with can understand the standard data dictionary notation provided in this chapter? If not, can you suggest an alternative?
  10. Give an example of an elementary data item.
  11. Give three examples of optional data elements.
  12. What are the possible meanings of the following:
    • (a) address = (city) + (state)
    • (b) address = street-address + city + (state) + (zipcode)
  13. Give an example of the use of the iteration {} notation.
  14. What is the meaning of each of the following notations:
    • (a) a = 1{b}
    • (b) a = {b}10
    • (c) a = 1{b}10
    • (d) a = 10{b}10
  15. Does it make sense to have an order defined in the following way? Why or why not?
    • order = customer-name + shipping-address + 6{item}
  16. Give an example of the selection (“[ ]”) construct.
  17. What is the meaning of an alias in a data dictionary?
  18. Why should the use of aliases be minimized wherever possible?
  19. What kind of annotation can be used on a DFD to indicate that a data element is an alias?
  20. What are the three major issues when a user looks at a data dictionary?
  21. Do you think the users in your organization will be able to understand data dictionary notation?
  22. Do you think that the data dictionary notation shown in this chapter is more complex or less complex than musical notation?
  23. What are the three error-checking activities that the systems analyst can carry out on a data dictionary without the user?
  24. What are the likely limitations of an automated data dictionary package?
  25. Give a data dictionary definition of customer-name based on the following verbal specification from a user: “When we record a customer’s name, we’re very careful to include a courtesy title. This can be either “Mr.,” “Miss,” “Ms.,” “Mrs.,” or “Dr.” (There are lots of other titles like “Professor,” “Sir,” etc., but we don’t bother with them.) Every one of our customers has a first name, but we allow a single initial if they prefer. Middle names are optional. And of course, the last name is required; we allow a pretty broad range of last names, including names that have hyphens (“Smith-Frisby,” for example) and apostrophes (“D’Arcy”) and so forth. We even allow an optional suffix, to allow for things like “Tom Smith, Jr.” or “Harvey Shmrdlu 3rd.””
  26. What is wrong with the following data dictionary definitions:
    • (a) a = b c d
    • (b) a = b + + c
    • (c) a = {b
    • (d) a = 4{b}3
    • (e) a = {x)
    • (f) x = ((y))
    • (g) p = 4{6{y}8}6
  27. In the hospital example of Section 10.2, what are the implications of the definition of height and weight? Comment: It would imply that we are only measuring in integral units and are not keeping track of fractional centimeters, and so on.
  28. Write a data dictionary definition of the information contained on your driver’s license. If you don’t have a driver’s license, find a friend who does.
  29. Write a data dictionary definition of the information contained on a typical bank credit card (e.g., MasterCard or Visa).
  30. Write a data dictionary definition of the information contained in a passport.
  31. Write a data dictionary definition of the information contained in a lottery ticket.


ENDNOTES

<references/>
