Homework 5 Due Saturday, Nov. 2, 2019
Some string basics
1a. Define two string variables, equal to "Statistical Computing" and 'Statistical Computing', and
check whether they are equal. What do you conclude about the use of double versus single quotation
marks for creating strings in R? Give an example that shows why we might prefer to use double
quotation marks as the standard (think of apostrophes).
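As a small illustration of the quoting question, here is a sketch using hypothetical strings (the variable names and values below are not part of the assignment). It compares the two quote styles and shows how an apostrophe is written in each.

```r
# Hypothetical example strings; names and values are illustrative only.
s.double = "data analysis"
s.single = 'data analysis'
same.string = identical(s.double, s.single)  # compare the two quote styles
print(same.string)

# An apostrophe needs no escaping inside double quotes...
with.double = "it's easy"
# ...but has to be escaped with a backslash inside single quotes.
with.single = 'it\'s easy'
print(with.double == with.single)
```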
1b. Consider the string vector presidents of length 5 below, containing the last names of past US
presidents. Define a string vector first.letters to contain the first letters of each of these 5 last
names. Hint: use substr(), and take advantage of vectorization; this should only require one line
of code. Define first.letters.scrambled to be the output of sample(first.letters). Lastly,
reset the first letter of each last name stored in presidents according to the scrambled letters in
first.letters.scrambled. Hint: use substr() again, and take advantage of vectorization; this should
only take one line of code. Display these new last names.
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
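The vectorized use of substr(), both for extraction and in its replacement form, can be sketched on a separate toy vector. The vector cities and the use of rev() instead of sample() are assumptions made here so that the result is deterministic.

```r
# Hypothetical vector; not the presidents vector from the assignment.
cities = c("Paris", "Tokyo", "Lima")

# Extraction is vectorized: one call returns the first letter of every element.
initials = substr(cities, 1, 1)

# The replacement form is vectorized too: each element gets its own new letter.
substr(cities, 1, 1) = rev(initials)
print(cities)
```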
1c. Now consider the string phrase defined below. Using substr(), replace the first four characters in
phrase by "Provide". Print phrase to the console, and describe the behavior you are observing. Using
substr() again, replace the last five characters in phrase by "kit" (don't use the length of phrase as a
magic constant in the call to substr(); instead, compute the length using nchar()). Print phrase to
the console, and describe the behavior you are observing.
phrase = "Give me a break"
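To see how nchar() can stand in for a hard-coded length, here is a minimal sketch on a hypothetical string (word is not part of the assignment). It addresses the last four characters by position relative to the end, then assigns a shorter replacement.

```r
# Hypothetical string, used only to show indexing relative to the end.
word = "notebook"
n = nchar(word)                  # length computed, not hard-coded

last.four = substr(word, n - 3, n)
print(last.four)

# Assign a 3-character value to a 4-character span, then inspect the result.
substr(word, n - 3, n) = "pad"
print(word)
print(nchar(word))
```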
1d. Consider the string ingredients defined below. Using strsplit(), split this string up into a
string vector of length 5, with elements "chickpeas", "tahini", "olive oil", "garlic", and "salt". Using
paste(), combine this string vector into a single string "chickpeas + tahini + olive oil + garlic + salt".
Then produce a final string of the same format, but where the ingredients are sorted in alphabetical
(increasing) order.
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
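The split-then-recombine pattern can be sketched on a different, hypothetical string (colors.string is an assumption for illustration). Note that strsplit() returns a list, so [[1]] is needed to get the string vector.

```r
# Hypothetical input; the separator is a comma followed by a space.
colors.string = "red, light blue, green"

# strsplit() returns a list with one element per input string.
colors.vec = strsplit(colors.string, split=", ")[[1]]

# paste() with collapse joins a vector into one string.
combined = paste(colors.vec, collapse=" + ")
combined.sorted = paste(sort(colors.vec), collapse=" + ")
print(combined)
print(combined.sorted)
```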
Shakespeare's complete works
Project Gutenberg offers over 50,000 free online books, especially old books (classic literature), for which
copyright has expired. We're going to look at the complete works of William Shakespeare, taken from the
Project Gutenberg website.
To avoid hitting the Project Gutenberg server over and over again, we've grabbed a text file from them that
contains the complete works of William Shakespeare. You just need to skim through this text file a little bit
to get a sense of what it contains (a whole lot!).
Reading in text, basic exploratory tasks
2a. Read in the Shakespeare data linked above into your R session with readLines(). Call the result
shakespeare.lines. This should be a vector of strings, each element representing a line of text.
Print the first 5 lines. How many lines are there? How many characters in the longest line? What is the
average number of characters per line? How many lines are there with zero characters (empty lines)?
2b. Remove all empty lines from shakespeare.lines (i.e., lines with zero characters). Check that
the new length of shakespeare.lines makes sense to you.
2c. Collapse the lines in shakespeare.lines into one big string, separating each line by a space in
doing so, using paste(). Call the resulting string shakespeare.all. How many characters does this
string have? How does this compare to the sum of characters in shakespeare.lines, and does this
make sense to you?
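The relationship between the collapsed string and the individual lines can be checked on a tiny hypothetical vector (toy.lines is an assumption; the real data will of course be much larger).

```r
# Hypothetical three-line text.
toy.lines = c("to be", "or not", "to be")

# Collapse into one string, with a single space between consecutive lines.
joined = paste(toy.lines, collapse=" ")

# Compare the two character counts and the number of separators added.
n.joined = nchar(joined)
n.summed = sum(nchar(toy.lines))
n.separators = length(toy.lines) - 1
print(c(n.joined, n.summed, n.separators))
```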
2d. Split up shakespeare.all into words, using strsplit() with split=" ". Call the resulting
string vector (note: here we are asking you for a vector, not a list) shakespeare.words. How long
is this vector, i.e., how many words are there? Using the unique() function, compute and store the
unique words as shakespeare.words.unique. How many unique words are there?
2e. Plot a histogram of the number of characters of the words in shakespeare.words.unique. You
will have to set a large value of the breaks argument (say, breaks=50) in order to see in more detail
what is going on. What does the bulk of this distribution look like to you? Why is the x-axis on the
histogram extended so far to the right (what does this tell you about the right tail of the distribution)?
2f. Reminder: the sort() function sorts a given vector into increasing order; its close friend, the
order() function, returns the indices that put the vector into increasing order. Using the order()
function, find the indices that correspond to the top 5 longest words in shakespeare.words.unique.
Then, print the top 5 longest words themselves. Do you recognize any of these as actual words?
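The difference between sort() and order() can be sketched on a hypothetical word vector (toy.words is an assumption). Here order() is applied to the word lengths, and the returned indices are used to look up the words themselves.

```r
# Hypothetical words; lengths are 4, 3, 9, 4, 6.
toy.words = c("pear", "fig", "pineapple", "kiwi", "banana")
toy.lens = nchar(toy.words)

# order() returns positions, here sorted from longest to shortest.
idx = order(toy.lens, decreasing=TRUE)[1:2]
print(idx)

# Indexing with those positions recovers the longest words.
print(toy.words[idx])
```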
Computing word counts
3a. Using table(), compute counts for the words in shakespeare.words, and save the result as
shakespeare.wordtab. How long is shakespeare.wordtab, and is this equal to the number of unique
words (as computed above)? Using named indexing, answer: how many times does the word "thou"
appear? The word "rumour"? The word "gloomy"? The word "assassination"?
3b. How many words did Shakespeare use just once? Twice? At least 10 times? More than 100 times?
3c. Sort shakespeare.wordtab so that its entries (counts) are in decreasing order, and save the result
as shakespeare.wordtab.sorted. Print the 25 most commonly used words, along with their counts.
What is the most common word? Second and third most common words?
3d. What you should have seen in the last question is that the most common word is the empty
string "". This is just an artifact of splitting shakespeare.all by spaces, using strsplit(). Redefine
shakespeare.words so that all empty strings are deleted from this vector. Then recompute
shakespeare.wordtab and shakespeare.wordtab.sorted. Check that you have done this right by
printing out the new 25 most commonly used words, and verifying (just visually) that it overlaps with
your solution to the last question.
3e. Produce a plot of the word counts (y-axis) versus the ranks (x-axis) in shakespeare.wordtab.sorted.
Set xlim=c(1,1000) as an argument to plot(); this restricts the plotting window to just the first
1000 ranks, which is helpful here to see the trend more clearly. Do you see Zipf's law in action, i.e.,
does it appear that $\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a$ (for some $C, a$)? Challenge: either programmatically,
or manually, determine reasonably-well-fitting values of $C, a$ for the Shakespeare data set; then draw
the curve $y = C(1/x)^a$ on top of your plot as a red line to show how well it fits.
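One possible way to fit $C$ and $a$ programmatically is a linear regression on the log scale, since $\log y = \log C - a \log x$. This is only a sketch under stated assumptions: the assignment does not prescribe this method, and the counts below are synthetic values generated from a known power law (not the Shakespeare data) so that the fit can be checked.

```r
# Synthetic counts following an exact power law with C = 1000, a = 0.9.
# (Assumption for illustration; real word counts would be noisy integers.)
ranks = 1:200
counts = 1000 * (1 / ranks)^0.9

# On the log scale the power law is a straight line:
# log(counts) = log(C) - a * log(ranks)
fit = lm(log(counts) ~ log(ranks))
C.hat = exp(unname(coef(fit)[1]))
a.hat = -unname(coef(fit)[2])

# Plot counts versus rank and overlay the fitted curve in red.
plot(ranks, counts, xlim=c(1, 200), xlab="Rank", ylab="Count")
curve(C.hat * (1 / x)^a.hat, add=TRUE, col="red")
```

With real data, the fit is usually better judged over a limited range of ranks, because the very rarest words tend to deviate from the straight line.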
A tiny bit of regular expressions
4a. There are a couple of issues with the way we've built our words in shakespeare.words. The first
is that capitalization matters; from Q3c, you should have seen that "and" and "And" are counted
as separate words. The second is that many words contain punctuation marks (and so, aren't really
words in the first place); to see this, retrieve the count corresponding to "and," in your word table
shakespeare.wordtab. Fix these problems to define new words shakespeare.words.new. Then, delete
all empty strings from this vector, and compute a word table from it, called shakespeare.wordtab.new.
4b. Compare the length of shakespeare.words.new to that of shakespeare.words; also compare
the length of shakespeare.wordtab.new to that of shakespeare.wordtab. Explain what you are
observing.
4c. Compute the unique words in shakespeare.words.new, calling the result shakespeare.words.new.unique.
Then repeat the queries in Q2e and Q2f on shakespeare.words.new.unique. Comment on the
histogram: is it different in any way than before? How about the top 5 longest words?
4d. Sort shakespeare.wordtab.new so that its entries (counts) are in decreasing order, and save the
result as shakespeare.wordtab.sorted.new. Print out the 25 most common words and their counts,
and compare them (informally) to what you saw in Q3d. Also, produce a plot of the new word counts,
as you did in Q3e. Does Zipf's law look like it still holds?
Where are Shakespeare's plays, in this massive text?
5a. Let's go back to shakespeare.lines. Take a look at lines 19 through 23 of this vector: you should
see a bunch of spaces preceding the text in lines 21, 22, and 23. Redefine shakespeare.lines by
setting it equal to the output of calling the function trimws() on shakespeare.lines. Print out lines
19 through 23 again, and describe what's happened.
5b. Open the file shakespeare.txt with your notebook application and just skim through this text
file. Near the top you'll see a table of contents. Note that "THE SONNETS" is the first play, and
"VENUS AND ADONIS" is the last. Using which(), find the indices of the lines in shakespeare.lines
that equal "THE SONNETS", report the index of the first such occurrence, and store it as toc.start.
Similarly, find the indices of the lines in shakespeare.lines that equal "VENUS AND ADONIS",
report the index of the first such occurrence, and store it as toc.end.
5c. Define n = toc.end - toc.start + 1, and create an empty string vector of length n called
titles. Using a for() loop, populate titles with the titles of Shakespeare's plays as ordered in
the table of contents list, with the first being "THE SONNETS", and the last being "VENUS AND
ADONIS". Print out the resulting titles vector to the console. Hint: if you define the counter variable
i in your for() loop to run between 1 and n, then you will have to index shakespeare.lines carefully
to extract the correct titles. Think about the following. When i=1, you want to extract the title of
the first play in shakespeare.lines, which is located at index toc.start. When i=2, you want to
extract the title of the second play, which is located at index toc.start + 1. And so on.
5d. Use a for() loop to find out, for each play, the index of the line in shakespeare.lines at which
this play begins. It turns out that the second occurrence of "THE SONNETS" in shakespeare.lines
is where this play actually begins (the first occurrence is in the table of contents), and so on, for
each play title. Use your for() loop to fill out an integer vector called titles.start, containing the
indices at which each of Shakespeare's plays begins in shakespeare.lines. Print the resulting vector
titles.start to the console.
5e. Challenge. Define titles.end to be an integer vector of the same length as titles.start, whose
first element is the second element in titles.start minus 1, whose second element is the third element
in titles.start minus 1, and so on. What this means: we are considering the line before the second
play begins to be the last line of the first play, and so on. Define the last element in titles.end to
be the length of shakespeare.lines. You can solve this question either with a for() loop, or with
proper indexing and vectorization.
5f. Challenge: it's not really correct to set the last element in titles.end to be the length of
shakespeare.lines, because there is a footer at the end of the Shakespeare data file. By looking
at the data file visually, come up with a way to programmatically determine the index of the last
line of the last play, and implement it.
5g. In Q5d, you should have seen that the starting index of Shakespeare's 38th play, "THE TWO
NOBLE KINSMEN", was computed to be NA, in the vector titles.start. Why? If you run
which(shakespeare.lines == "THE TWO NOBLE KINSMEN") in your console, you will see that there
is only one occurrence of "THE TWO NOBLE KINSMEN" in shakespeare.lines, and this occurs in
the table of contents. So there was no second occurrence, hence the resulting NA value.
But now take a look at line 118,463 in shakespeare.lines: you will see that it is "THE TWO NOBLE
KINSMEN:", so this is really where this play starts, but because of the colon ":" at the end of
the string, this doesn't exactly match the title "THE TWO NOBLE KINSMEN", as we were looking
for. The advantage of using the grep() function, versus checking for exact equality of strings, is that
grep() allows us to match substrings. Specifically, grep() returns the indices of the strings in a vector
for which a substring match occurs, e.g.,
grep(pattern="cat",
     x=c("cat", "canned goods", "batman", "catastrophe", "tomcat"))
## [1] 1 4 5
so we can see that in this example, grep() was able to find substring matches to "cat" in the first, fourth,
and fifth strings in the argument x. Redefine titles.start by repeating the logic in your solution to
Q5d, but replacing the which() command in the body of your for() loop with an appropriate call to
grep(). Also, redefine titles.end by repeating the logic in your solution to Q5e. Print out the new
vectors titles.start and titles.end to the console; they should be free of NA values.
Extracting and analysing a couple of plays
6a. Let's look at two of Shakespeare's most famous tragedies. Programmatically find the index at which
"THE TRAGEDY OF HAMLET, PRINCE OF DENMARK" occurs in the titles vector. Use this
to find the indices at which this play starts and ends, in the titles.start and titles.end vectors,
respectively. Call the lines of text corresponding to this play shakespeare.lines.hamlet. How many
such lines are there? Do the same, but now for the play "THE TRAGEDY OF ROMEO AND JULIET",
and call the lines of text corresponding to this play shakespeare.lines.romeo. How many such lines
are there?
6b. Repeat the analysis, outlined in Q4, on shakespeare.lines.hamlet. (This should mostly just
involve copying and pasting code as needed.) That is, to be clear:
- collapse shakespeare.lines.hamlet into one big string, separated by spaces;
- convert this string into all lower case characters;
- divide this string into words, by splitting on spaces or on punctuation marks, using
  split="[[:space:]]|[[:punct:]]" in the call to strsplit();
- remove all empty words (equal to the empty string ""), and report how many words remain;
- compute the unique words, report the number of unique words, and plot a histogram of their
  numbers of characters;
- report the 5 longest words;
- compute a word table, and report the 25 most common words and their counts;
- finally, produce a plot of the word counts versus rank.
6c. Repeat the same task as in the last part, but on shakespeare.lines.romeo. (Again, this should
just involve copying and pasting code as needed. P.S. Isn't this getting tiresome? You'll be happy when
we learn functions, next week!) Comment on any similarities/differences you see in the answers.
6d. Challenge. Using a for() loop and the titles.start, titles.end vectors constructed above,
answer the following questions. What is Shakespeare's longest play (in terms of the number of words)?
What is Shakespeare's shortest play? In which play did Shakespeare use his longest word (in terms of
the number of characters)? Are there any plays in which "the" is not the most common word?
Getting lines of text play-by-play
7a. Below is the function get.wordtab.from.txt(). Modify this function so that the string vectors lines
and words are both included as named components in the returned list. For good practice, update
the documentation in comments to reflect your changes. Then call this function on the text file for
Shakespeare's complete works (with the rest of the arguments at their default values) and save
the result as shakespeare.wordobj. Using head(), display the first several elements of (definitely not
all of!) the lines, words, and wordtab components of shakespeare.wordobj, just to check that the
output makes sense to you.
# get.wordtab.from.txt: get a word table from text
# Inputs:
# - str.txt: string, specifying the file name
# - split: string, specifying what to split on. Default is the regex pattern
#   "[[:space:]]|[[:punct:]]"
# - tolower: Boolean, TRUE if words should be converted to lower case before
#   the word table is computed. Default is TRUE
# - keep.nums: Boolean, TRUE if words containing numbers should be kept in the
#   word table. Default is FALSE
# Output: list, containing lines, words, word table, and some basic summaries
get.wordtab.from.txt = function(str.txt, split="[[:space:]]|[[:punct:]]",
                                tolower=TRUE, keep.nums=FALSE) {
  lines = readLines(str.txt)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  # Get rid of words with numbers, if we're asked to
  if (!keep.nums)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  # Compute the word table
  wordtab = table(words)
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab),
              longest.word=words[which.max(nchar(words))]))
}
7b. Go back and look at Q5, where you located Shakespeare's plays in the lines of text for Shakespeare's
complete works. Set shakespeare.lines = shakespeare.wordobj$lines, and then rerun
your solution code (or rerun the official solution code, if you'd like) for Q5 on the lines of text
stored in shakespeare.lines. You should end up with two vectors titles.start and titles.end,
containing the start and end positions of each of Shakespeare's plays in shakespeare.lines. Print out
titles.start and titles.end to the console.
7c. Create a list shakespeare.lines.by.play of length equal to the number of Shakespeare's plays (a
number you should have already computed in the solution to the last question). Using a for() loop, and
relying on titles.start and titles.end, extract the appropriate subvector of shakespeare.lines
for each of Shakespeare's plays, and store it as a component of shakespeare.lines.by.play.
That is, shakespeare.lines.by.play[[1]] should contain the lines for Shakespeare's first play,
shakespeare.lines.by.play[[2]] should contain the lines for Shakespeare's second play, and so on.
Name the components of shakespeare.lines.by.play according to the titles of the plays.
7d. Using one of the apply functions, along with head(), print the first 4 lines of each of Shakespeare's
plays to the console (sorry graders . . . ). This should only require one line of code.
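Passing an extra argument through an apply function can be sketched on a small hypothetical list (toy.lines.by.group is an assumption, standing in for a list of per-play line vectors). Arguments placed after the function name are forwarded to that function on every call.

```r
# Hypothetical named list of string vectors.
toy.lines.by.group = list(first=c("a", "b", "c"),
                          second=c("d", "e", "f"))

# lapply() calls head(x, 2) on each component and keeps the names.
first.two = lapply(toy.lines.by.group, head, 2)
print(first.two)
```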
Getting word tables play-by-play
8a. Define a function get.wordtab.from.lines() to have the same argument structure as
get.wordtab.from.txt(), except that the first argument of get.wordtab.from.lines() should
be lines, a string vector passed by the user that contains lines of text to be processed. The
body of get.wordtab.from.lines() should be the same as get.wordtab.from.txt(), except
that lines is passed and does not need to be computed using readLines(). The output of
get.wordtab.from.lines() should be the same as get.wordtab.from.txt(), except that lines
does not need to be returned as a component. For good practice, include documentation for your
get.wordtab.from.lines() function in comments.
8b. Using a for() loop or one of the apply functions (your choice here), run the get.wordtab.from.lines()
function on each of the components of shakespeare.lines.by.play, (with the rest of the arguments
at their default values). Save the result in a list called shakespeare.wordobj.by.play. That is,
shakespeare.wordobj.by.play[[1]] should contain the result of calling this function on the lines for
the first play, shakespeare.wordobj.by.play[[2]] should contain the result of calling this function
on the lines for the second play, and so on.
8c. Using one of the apply functions, compute numeric vectors shakespeare.total.words.by.play
and shakespeare.unique.words.by.play, that contain the number of total words and number of
unique words, respectively, for each of Shakespeare's plays. Each vector should only require one line
of code to compute. Hint: `[[`() is actually a function that allows you to extract a named
component of a list; e.g., try `[[`(shakespeare.wordobj, "number.total.words"), and you'll see
this is the same as shakespeare.wordobj[["number.total.words"]]; you should take advantage of
this functionality in your apply call. What are the 5 longest plays, in terms of total word count? The 5
shortest plays?
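The hint about `[[`() can be sketched on a small hypothetical list of lists (toy.objs and its component names total and unique are assumptions, standing in for the per-play word objects).

```r
# Hypothetical list of lists, each with a named numeric component.
toy.objs = list(a=list(total=10, unique=4),
                b=list(total=25, unique=9))

# `[[` is an ordinary function, so sapply() can apply it to every component,
# passing the component name as an extra argument.
totals = sapply(toy.objs, `[[`, "total")
print(totals)

# Sorting the named vector ranks the components by their totals.
print(sort(totals, decreasing=TRUE))
```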
8d. Plot the number of unique words versus number of total words, across Shakespeare's plays. Set
the title and label the axes appropriately. Is there a consistent trend you notice?
Refactoring the word table functions
9. Look back at get.wordtab.from.lines() and get.wordtab.from.txt(). Note that they overlap
heavily, i.e., their bodies contain a lot of the same code. Redefine get.wordtab.from.txt() so that
it just calls get.wordtab.from.lines() in its body. Your new get.wordtab.from.txt() function
should have the same inputs as before, and produce the same output as before. So externally, nothing
will have changed; we are just changing the internal structure of get.wordtab.from.txt() to clean
up our code base (specifically, to avoid code duplication in our case). This is an example of code
refactoring.
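The general shape of this kind of refactoring can be sketched with a pair of much simpler, hypothetical functions (count.chars.from.lines() and count.chars.from.txt() are invented here for illustration and are not the word table functions). The file-based function only reads the file and delegates the real work to the line-based one.

```r
# Hypothetical worker: operates on a string vector that is already in memory.
count.chars.from.lines = function(lines) {
  sum(nchar(lines))
}

# Hypothetical wrapper: reads the file, then reuses the worker instead of
# duplicating its body.
count.chars.from.txt = function(str.txt) {
  count.chars.from.lines(readLines(str.txt))
}

# Small usage check on a temporary file.
tmp = tempfile(fileext=".txt")
writeLines(c("to be", "or not"), tmp)
n.chars = count.chars.from.txt(tmp)
print(n.chars)
```

A change to the counting logic now needs to be made in one place only, which is the main benefit of the wrapper design.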
Call your new get.wordtab.from.txt() function on the txt for Shakespeare's complete works, saving
the result as shakespeare.wordobj2. Compare some of the components of shakespeare.wordobj2 to
those of shakespeare.wordobj (which was computed using the old function definition) to check that
your new implementation works as it should.
Challenge. Check using all.equal() whether shakespeare.wordobj and shakespeare.wordobj2
are the same. Likely, this will not return TRUE. (If it does, then you've already solved this challenge
question!) Modify your get.wordtab.from.txt() function from the last question, so that it still calls
get.wordtab.from.lines() to do the hard work, but produces an output exactly the same as the
original shakespeare.wordobj object. Demonstrate your suc