Articles

1: Introducing Data

1.1 Migraine and acupuncture. A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture. To determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study in which 89 females diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control. The 43 patients in the treatment group received acupuncture specifically designed to treat migraines. The 46 patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). Twenty-four hours after patients received acupuncture, they were asked if they were pain free. Results are summarized in the contingency table below. 52

Figure from the original paper displaying the appropriate area (M) versus the inappropriate area (S) used in the treatment of migraine attacks.

  1. What percent of patients in the treatment group were pain free 24 hours after receiving acupuncture? What percent in the control group?
  2. At first glance, does acupuncture appear to be an effective treatment for migraines? Explain your reasoning.
  3. Do the data provide convincing evidence that there is a real pain reduction for those patients in the treatment group? Or do you think that the observed difference might just be due to chance?

1.2 Sinusitis and antibiotics, Part I. Researchers studying the effect of antibiotic treatment for acute sinusitis, compared to symptomatic treatments, randomly assigned 166 adults diagnosed with acute sinusitis to one of two groups: treatment or control. Study participants received either a 10-day course of amoxicillin (an antibiotic) or a placebo similar in appearance and taste. The placebo consisted of symptomatic treatments such as acetaminophen, nasal decongestants, etc. At the end of the 10-day period, patients were asked if they experienced significant improvement in symptoms. The distribution of responses is summarized below. 53

  1. What percent of patients in the treatment group experienced a significant improvement in symptoms? What percent in the control group?
  2. At first glance, which treatment appears to be more effective for sinusitis?
  3. Do the data provide convincing evidence that there is a difference in the improvement rates of sinusitis symptoms? Or do you think that the observed difference might just be due to chance?

52 G. Allais et al. "Ear acupuncture in the treatment of migraine attacks: a randomized trial on the efficacy of appropriate versus inappropriate acupoints". In: Neurological Sciences 32.1 (2011), pp. 173-175.

53 J.M. Garbutt et al. "Amoxicillin for Acute Rhinosinusitis: A Randomized Controlled Trial". In: JAMA: The Journal of the American Medical Association 307.7 (2012), pp. 685-692.


1.2 How this book is organised

The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you’ll iterate through them multiple times). In our experience, however, this is not the best way to learn them:

Starting with data ingest and tidying is sub-optimal because 80% of the time it’s routine and boring, and the other 20% of the time it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualisation and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.

Some topics are best explained with other tools. For example, we believe that it’s easier to understand how models work if you already know about visualisation, tidy data, and programming.

Programming tools are not necessarily interesting in their own right, but do allow you to tackle considerably more challenging problems. We’ll give you a selection of programming tools in the middle of the book, and then you’ll see how they can combine with the data science tools to tackle interesting modelling problems.

Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you’ve learned. While it’s tempting to skip the exercises, there’s no better way to learn than practicing on real problems.


Benefits of effective data governance include:

  • Reduction in duplication and waste created by information silos
  • Increased data sharing through improved trust and standardisation
  • Reduction in costs by improving resource and process efficiencies
  • Reduction in time spent by employees finding, acquiring and processing data
  • Reduction of risk and costs as data is better managed to support regulatory compliance
  • More robust consideration of ethical and privacy issues to avoid reputational damage

Guiding principles of data governance

The NSW Information Management Framework principles should guide agencies in governing and managing their data:


Customer reviews


Loved this book! If I could have given 6 stars, I would have.

This book provides a very well-rounded approach to Data Science, and by that I mean it truly gives you a ride through all the aspects of this field, versus showing you some regression algorithm using Python and calling it Data Science.

The book has it all: not only does it use probably the most popular language (Python) for its examples, it also goes into detail on supporting tools and ecosystems. For example, Spark: why create something new when Spark is already here and we can just use it in our work?

It covers NoSQL technologies in enough depth to get readers started and weighs the pros and cons of each. I especially enjoyed reading the ACID, BASE, and CAP theorem sections. I am familiar with them and gave a presentation on the exact same topic a few years ago, and I enjoyed the read since it covered the important key points, leaving me with a nice warm feeling that unaware readers will be in good hands!

During the discussion of NoSQL, Elasticsearch is introduced, and an entire chapter is devoted to how to leverage search capabilities to provide valuable results. Search is something that Elasticsearch does best! The section about Damerau-Levenshtein distance was great: it makes you think about the dirty data present in the real world and how to deal with it (versus giving you examples with perfectly clean, ready-to-use data).

Speaking of real-world experience: this book takes a step back and, instead of just throwing cool Python libraries at you, talks about the general approach to real-world data science projects by making you think about the project's research goals. Why are we doing this? That framing helps you think and pick the right solutions.

Another example of real-world problems is the chapter on dealing with big, and I mean truly big, data. In a sample program you can surely play with a few thousand sample records, but what do you do with gigabytes or more of data? When running production servers you are not dealing with 2-3 lines of log entries; you sometimes deal with gigabytes! So I was very happy to see a section on how to tackle problems like that.

In my opinion, the authors did a great job by cloning the pywebhdfs package and making available a version that works with their example code (they did use the now-outdated Hortonworks sandbox, which made a few chapters harder to follow, but it was not hard to figure out where the menus/buttons had moved).

A nice final touch was the section on results visualization. How do you communicate what you found to others? Will you point them at some hard-to-read printout, or show them a picture/graph that makes your findings easy to read?

So many gems in this book. It really gives you a great overview of the field of data science and gets you started not only in a strictly academic/demo way, but also in a real-life production environment.

I will definitely be re-reading this book and recommending it to my colleagues!


A little more on subsetting

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like
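    # In the cdc data set used in this lab, such expressions might look
    # like the following (the variable names gender and age, and the
    # coding "m" for men, are assumptions):
    cdc$gender == "m"
    cdc$age > 30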

These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 30 (second command).

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command
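    mdata <- subset(cdc, gender == "m")   # a sketch; the "m" coding is assumed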

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual
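    head(mdata)   # first several rows, as with any data frame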

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read "and", so that
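    m_and_over30 <- subset(cdc, gender == "m" & age > 30)   # illustrative object name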

will give you the data for men over the age of 30. The | character is read “or” so that
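    m_or_over30 <- subset(cdc, gender == "m" | age > 30)   # illustrative object name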

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.

  1. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 who have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
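One possible answer is sketched below; it assumes the data set's smoking variable is called smoke100 and is coded 1 for respondents who have smoked at least 100 cigarettes (check names(cdc) to confirm):

    # Assumed variable name and coding:
    under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)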

1.1 Introducing The Graph Database

For most types of data storage, there is the concept of some elements of data (whether they be, for example, data nodes or data tables) taking precedence, or importance, over other elements.

For example, take an XML document. An XML document typically contains nodes of information each with a parent node. At the root of the document is the highest level node, which has no parent.
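For instance, consider a small hypothetical document (the element names are illustrative):

    <animals>                  <!-- root element: has no parent -->
      <dog>                    <!-- parent: animals -->
        <name>Bengie</name>    <!-- parent: dog -->
      </dog>
    </animals>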

In a data graph, by contrast, there is no concept of roots (or a hierarchy). A graph consists of resources related to other resources, with no single resource having any particular intrinsic importance over another.

An Example Of A Data Graph

It's easiest first to look at a series of statements about how things relate to each other and to visualize these as a graph before looking at how these relationships might be expressed in RDF. Look at the following statements describing the relationship between a dog (called Bengie) and a cat (called Bonnie):

Bengie is a dog.
Bonnie is a cat.
Bengie and Bonnie are friends.

Using these three simple statements, let's turn this into a data graph:

The relationships implied by this graph are fairly intuitive, but to be thorough let's review them. We can see that our two things - identified by "Thing 1" and "Thing 2" - have the properties name, animalType and friendsWith.

From this, we can see that "Thing 1"'s name is Bengie, and "Thing 2"'s name is Bonnie. "Thing 1" is a dog, and "Thing 2" is a cat. And finally, both are friends with each other (implied by the friendsWith property pointing in both directions).

Important Point: The arrows in the above diagram are properties, sometimes in RDF terminology called predicates. Remember for now that the terms property and predicate are interchangeable, and that it is the arrows that describe the properties in the graph.

Before formally introducing simple RDF, let's give a quick example to give you a flavor of what it looks like.
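Here is a minimal sketch of the Bengie/Bonnie graph in Turtle, one common RDF syntax. The ex: namespace and the resource identifiers are illustrative assumptions; the property names come from the graph above:

    @prefix ex: <http://example.org/> .

    ex:thing1 ex:name "Bengie" ;
              ex:animalType "dog" ;
              ex:friendsWith ex:thing2 .

    ex:thing2 ex:name "Bonnie" ;
              ex:animalType "cat" ;
              ex:friendsWith ex:thing1 .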



1.4 On what type media can Recover My Files be used?

Recover My Files will work on all types of computer storage media. This includes:

  • Hard drives, including external USB drives
  • USB sticks, Thumb Drives, Pen drives or other USB media
  • Camera cards
  • Hardware and software RAID (JBOD, RAID 0,1,5)
  • iPods, MP3 players and Dictaphones

Or any other storage device which is shown under Windows as a hard drive. (Recover My Files does NOT support recovery from iPhone or iPad storage, as Apple restricts access to these devices.)


A Data Science Profile

In the class, Rachel handed out index cards and asked everyone to profile themselves (on a relative rather than absolute scale) with respect to their skill levels in the following domains:

Communication and presentation skills

As an example, Figure 1-2 shows Rachel’s data science profile.

Figure 1-2. Rachel's data science profile, which she created to illustrate trying to visualize oneself as a data scientist; she wanted students and guest lecturers to "riff" on this: to add buckets or remove skills, use a different scale or visualization method, and think about the drawbacks of self-reporting.

We taped the index cards to the blackboard and got to see how everyone else thought of themselves. There was quite a bit of variation, which is cool—lots of people in the class were coming from social sciences, for example.

Where is your data science profile at the moment, and where would you like it to be in a few months, or years?

As we mentioned earlier, a data science team works best when different skills (profiles) are represented across different people, because nobody is good at everything. It makes us wonder if it might be more worthwhile to define a “data science team”—as shown in Figure 1-3—than to define a data scientist.

Figure 1-3. Data science team profiles can be constructed from data scientist profiles; there should be alignment between the data science team profile and the profile of the data problems they try to solve.

Lesson 1: Zoo data

The term 'data' is introduced through an animal-themed activity that involves identifying the number of animals at a zoo and developing visual ways to represent the numbers.

To represent data in different ways

Lesson 2: Picture data

Use of online software to visually represent the zoo animal data from the previous lesson, developing and creating a pictogram or chart.

To use technology to represent data in different ways

Lesson 3: Minibeast hunt

Using an area of the school, go on a minibeast hunt and use the data collected to create a visual representation, such as a chart or pictogram, with the use of a computer.


Watch the video: 1. Introduction to jamovi (December 2021).