| Title: | Digital History Measures |
|---|---|
| Description: | Provides statistical functions to aid in the analysis of contemporary and historical corpora. These transparent functions may be useful to anyone, and were designed with the social sciences and humanities in mind. JSD (Jensen-Shannon Divergence) is a measure of the distance between two probability distributions. The JSD and Original JSD functions expand on existing functions, by calculating the JSD for distributions of words in text groups for all pairwise groups provided (Drost (2018) <doi:10.21105/joss.00765>). The Log Likelihood function is inspired by the work of digital historian Jo Guldi (Guldi (2022) <https://github.com/joguldi/digital-history>). Also includes helper functions that can count word frequency in each text grouping, and remove stop words. |
| Authors: | Ryan Schaefer [aut] (ORCID: <https://orcid.org/0000-0001-7694-3994>), Steph Buongiorno [aut, cre] (ORCID: <https://orcid.org/0000-0002-6965-0787>) |
| Maintainer: | Steph Buongiorno <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0 |
| Built: | 2026-05-20 09:54:55 UTC |
| Source: | https://github.com/stephbuon/dhmeasures |
Converts a data frame with columns for text and grouping variables into a data frame with each word and the count of each word in each group.
```r
count_tokens(data, group = NA, text = "text")
```
| Argument | Description |
|---|---|
| data | Data frame containing the raw data |
| group | The name of the column(s) containing the grouping variable. If not defined, the text will not be grouped. Can be given as either a string or a vector of strings. |
| text | The name of the column containing the text that needs to be tokenized |
Data frame with columns for the word, the group(s), and the count, labeled 'word', the group column name(s), and 'n' respectively.
```r
test = data.frame(
  myText = c(
    "Hello! This is the first sentence I am using to test this function!",
    "This is the second sentence!"
  ),
  myGroup = c("group1", "group2")
)

count_tokens(test, text = "myText", group = "myGroup")
```
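To make the shape of the returned data frame concrete, the following base R sketch builds the same kind of word-by-group count table without the package. The tokenization rule used here (lower-casing, then splitting on anything other than letters, digits, and apostrophes) is an assumption for illustration; count_tokens may tokenize differently.

```r
# Illustrative only: a base R approximation of a word-by-group count table.
docs <- data.frame(
  myText = c("Hello! This is the first sentence.", "This is the second sentence!"),
  myGroup = c("group1", "group2")
)

# Assumed tokenization rule: lower-case, then split on non-alphanumeric runs.
tokens <- strsplit(tolower(docs$myText), "[^a-z0-9']+")

# One row per token, carrying its group along; n = 1 marks a single occurrence.
long <- data.frame(
  myGroup = rep(docs$myGroup, lengths(tokens)),
  word = unlist(tokens),
  n = 1L
)
long <- long[long$word != "", ]

# Sum the per-token ones to get a count per (group, word) pair.
counts <- aggregate(n ~ myGroup + word, data = long, FUN = sum)
counts
```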
Provides statistical and other helper functions to aid in the analysis of historical corpora.
The current statistical functions include Log Likelihood (log_likelihood), JSD (jsd), and Original JSD (original_jsd). The current helper functions include count_tokens and remove_stop_words.
Steph Buongiorno and Ryan Schaefer
Hansard corpus data for the decade beginning in 1870. This data has been pre-formatted to contain word counts by speaker. Stop words have been removed from the data. To access the raw Hansard data, install the package hansardr. The variables are as follows:
```r
hansard_1870_example
```
A data frame with 482923 rows and 3 variables:
| Variable | Description |
|---|---|
| speaker | The name of the speaker as originally recorded in the transcriptions of the debates. |
| word | A word spoken by the given speaker. |
| n | The number of times the given word was spoken by the given speaker. |
../data/hansard_1870_example.RData
Buongiorno, Steph; Kalescky, Robert; Godat, Eric; Cerpa, Omar Alexander; Guldi, Jo (2021)
```r
data(hansard_1870_example)
```
Calculates the JSD score for each word between group pairings. To use this function, the user must provide a data frame with a column for words, a column for the text group, and a column for the count of the word in that group. The default column names are "word", "group", and "n", but these can be changed using the parameters word, group, and n. The default settings will calculate the JSD for all words between the first two groups in the data set. However, the user can provide a list of words using the word_list parameter and/or a list of groups using the group_list parameter. If more than two groups are given, the function will provide the JSD scores for all pairs of groups.
```r
jsd(
  text,
  group_list = as.character(c()),
  word_list = as.character(c()),
  group = "group",
  word = "word",
  n = "n"
)
```
| Argument | Description |
|---|---|
| text | Data frame containing data |
| group_list | Vector containing all groups to find pairwise JSD scores for |
| word_list | Vector containing all words to find JSD scores for |
| group | Name of data frame column containing text group |
| word | Name of data frame column containing words |
| n | Name of data frame column containing word count in text group |
Data frame with a column of unique words and one column of JSD scores for each group pair
```r
# Load example Hansard 1870 dataset
data(hansard_1870_example)
head(hansard_1870_example)

# Calculate JSD for given words and groups
output = jsd(
  hansard_1870_example,
  group = "speaker",
  group_list = c("MR. GLADSTONE", "MR. DISRAELI"),
  word_list = c("trade", "press", "industry")
)
head(output)
```
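For readers who want to see what a per-word JSD score can look like numerically, the base R sketch below splits the Jensen-Shannon divergence between two groups into one term per word. This is one common decomposition, offered as an assumption; it is not necessarily how jsd computes its scores. The helper name jsd_by_word and the toy counts are hypothetical.

```r
# Illustrative only: per-word terms of the Jensen-Shannon divergence (base 2).
# jsd_by_word is a hypothetical helper, not a dhmeasures function.
jsd_by_word <- function(count_a, count_b) {
  # Turn raw counts into probability distributions over the same words.
  p <- count_a / sum(count_a)
  q <- count_b / sum(count_b)
  m <- (p + q) / 2  # mixture distribution

  # One Kullback-Leibler term per word, treating 0 * log(0) as 0.
  kl_term <- function(x) ifelse(x > 0, x * log2(x / m), 0)

  0.5 * kl_term(p) + 0.5 * kl_term(q)
}

# Toy counts for three words in two groups.
words <- c("trade", "press", "industry")
parts <- jsd_by_word(c(6, 3, 1), c(2, 2, 6))
names(parts) <- words

parts       # contribution of each word
sum(parts)  # total JSD between the two groups
```

Under this decomposition, each word's term is non-negative and the terms sum to the total divergence, which lies between 0 and 1 when the logarithm is base 2.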
Calculates word distinctiveness using the log likelihood algorithm. You input a data frame with columns for the word, the text group, and the number of times that word appears in that group. The column names are set to "word", "group", and "n" by default but they can be changed using the parameters word, group, and n. If any of these columns are not found, the function will not work. The output will be a new data frame with a column called "word" containing all unique words and subsequent columns for all unique groups with the name of that group. The data frame will contain the log likelihood scores for each word in each group. The larger a log likelihood score is, the more distinctive that word is to that group.
```r
log_likelihood(
  text,
  group_list = as.character(c()),
  word_list = as.character(c()),
  group = "group",
  word = "word",
  n = "n"
)
```
| Argument | Description |
|---|---|
| text | Data frame containing data |
| group_list | Vector containing all groups to find log likelihood scores for |
| word_list | Vector containing all words to find log likelihood scores for |
| group | Name of data frame column containing text group |
| word | Name of data frame column containing words |
| n | Name of data frame column containing word count in text group |
Data frame with a column of unique words and one column of log likelihood scores for each group
```r
# Load example Hansard 1870 dataset
data(hansard_1870_example)
head(hansard_1870_example)

# Compute log likelihood
output = log_likelihood(
  hansard_1870_example,
  group = "speaker",
  group_list = c("MR. GLADSTONE", "MR. DISRAELI"),
  word_list = c("trade", "press", "industry")
)
head(output)
```
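The idea of scoring a word's distinctiveness can be illustrated with the widely used two-corpus log likelihood (G2) statistic, which compares a word's observed count in one group against the count expected if the word were spread evenly across all groups. The sketch below is an assumption about the general technique, not a description of the exact formula log_likelihood uses; the helper name g2_score and the toy counts are hypothetical.

```r
# Illustrative only: a two-corpus log likelihood (G2) score for one word.
# g2_score is a hypothetical helper, not a dhmeasures function.
counts <- data.frame(
  word  = c("trade", "press", "industry", "trade", "press", "industry"),
  group = c("g1", "g1", "g1", "g2", "g2", "g2"),
  n     = c(30, 10, 5, 10, 30, 5)
)

g2_score <- function(counts, target_word, target_group) {
  in_group <- counts$group == target_group
  is_word  <- counts$word == target_word

  a <- sum(counts$n[is_word & in_group])   # word count inside the group
  b <- sum(counts$n[is_word & !in_group])  # word count in all other groups
  size_in  <- sum(counts$n[in_group])      # total tokens inside the group
  size_out <- sum(counts$n[!in_group])     # total tokens elsewhere

  # Expected counts if the word were equally frequent everywhere.
  expected_in  <- size_in  * (a + b) / (size_in + size_out)
  expected_out <- size_out * (a + b) / (size_in + size_out)

  # Observed * log(observed / expected), treating 0 * log(0) as 0.
  term <- function(obs, expected) if (obs > 0) obs * log(obs / expected) else 0

  2 * (term(a, expected_in) + term(b, expected_out))
}

g2_score(counts, "trade", "g1")     # unevenly distributed word: large score
g2_score(counts, "industry", "g1")  # evenly distributed word: score of zero
```

Note that this plain G2 statistic is non-negative and does not by itself say which group overuses the word, so an implementation that reports per-group distinctiveness may add a sign or other adjustment.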
Calculates the JSD score between text groups. To use this function, the user must provide a data frame with a column for words, a column for the text group, and a column for the count of the word in that group. The default column names are "word", "group", and "n", but these can be changed using the parameters word, group, and n. The default settings will calculate the JSD for all words between the first two groups in the data set. However, the user can provide a list of words using the word_list parameter and/or a list of groups using the group_list parameter. If more than two groups are given, the function will provide the JSD scores for all pairs of groups.
```r
original_jsd(
  text,
  group_list = as.character(c()),
  word_list = as.character(c()),
  group = "group",
  word = "word",
  n = "n"
)
```
| Argument | Description |
|---|---|
| text | Data frame containing data |
| group_list | Vector containing all groups to find pairwise JSD scores for |
| word_list | Vector containing all words that should be used to calculate JSD |
| group | Name of data frame column containing text group |
| word | Name of data frame column containing words |
| n | Name of data frame column containing word count in text group |
Data frame with a column of unique words and one column of JSD scores for each group pair
```r
# Load example Hansard 1870 dataset
data(hansard_1870_example)
head(hansard_1870_example)

# Calculate original JSD for given words and groups
output = original_jsd(
  hansard_1870_example,
  group = "speaker",
  group_list = c("MR. GLADSTONE", "MR. DISRAELI"),
  word_list = c("trade", "press", "industry")
)
head(output)
```
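The "all pairs of groups" behavior described for this function can be pictured with a small base R sketch that computes one Jensen-Shannon divergence per pair of groups from a long table of word counts. It is a minimal illustration under assumed toy data, not the package's implementation; the helper name jsd_pair and the group labels are hypothetical.

```r
# Illustrative only: one JSD value per pair of groups, from long-format counts.
counts <- data.frame(
  word  = rep(c("trade", "press", "industry"), times = 3),
  group = rep(c("g1", "g2", "g3"), each = 3),
  n     = c(6, 3, 1, 2, 2, 6, 6, 3, 1)
)

# Word-by-group count table, then column-wise probabilities.
wide  <- xtabs(n ~ word + group, data = counts)
probs <- prop.table(wide, margin = 2)

# jsd_pair is a hypothetical helper: base-2 JSD between two distributions.
jsd_pair <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(x) sum(ifelse(x > 0, x * log2(x / m), 0))
  0.5 * kl(p) + 0.5 * kl(q)
}

# Enumerate every unordered pair of groups and score each one.
group_pairs <- combn(colnames(probs), 2)
scores <- apply(group_pairs, 2, function(g) jsd_pair(probs[, g[1]], probs[, g[2]]))
names(scores) <- apply(group_pairs, 2, paste, collapse = " vs ")
scores
```

In this toy data g1 and g3 have identical word distributions, so their pair scores zero while the pairs involving g2 score above zero.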
Removes stop words and fully numeric words from a column in a data frame that contains words. It is assumed that the text has been tokenized (see dhmeasures::count_tokens) prior to using this function.
```r
remove_stop_words(
  data,
  words = "word",
  stop_words = dhmeasures::stop_word,
  remove_numbers = TRUE
)
```
| Argument | Description |
|---|---|
| data | Data frame containing your data |
| words | The name of the column where stop words should be searched for |
| stop_words | Vector of stop words. Uses dhmeasures::stop_word as the default |
| remove_numbers | Set to TRUE (default) to remove all numeric values from the words column |
Data frame with all prior data, but without rows containing stop words
```r
test = data.frame(
  myText = c(
    "Hello! This is the first sentence I am using to test this function!",
    "This is the second sentence!"
  ),
  myGroup = c("group1", "group2")
)

test2 = count_tokens(test, text = "myText", group = "myGroup")
test2

remove_stop_words(test2)
```
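The filtering step can also be sketched in base R to show which rows are expected to disappear. The short stop word vector and the regular expression for "fully numeric" below are assumptions for illustration; remove_stop_words uses dhmeasures::stop_word by default and may detect numbers differently.

```r
# Illustrative only: dropping stop words and fully numeric tokens in base R.
tokens <- data.frame(
  word = c("the", "trade", "1870", "press", "is"),
  n    = c(10, 4, 2, 3, 7)
)

# Hypothetical, very short stop word list (the package ships a longer one).
stops <- c("the", "is")

is_stop   <- tolower(tokens$word) %in% stops
is_number <- grepl("^[0-9]+$", tokens$word)  # assumed definition of "fully numeric"

# Keep only rows that are neither stop words nor numbers.
cleaned <- tokens[!is_stop & !is_number, , drop = FALSE]
cleaned
```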
A list of common words (known as stop words) that can be used to remove insignificant words from your data.
```r
stop_word
```
A character vector with 478 values
../data/stop_word.RData
Buongiorno, Steph (2021)
```r
data(stop_word)
```