In this chapter, we discuss datasets in detail.
CSV files
A CSV (Comma Separated Values) file is a plain text file that uses commas to separate the fields of each record. These files have the extension .csv. Julia provides various methods to perform operations on CSV files.
Import a .CSV file in Julia
To import a .csv file, we first need to install the CSV package. Use the following commands to do so −
using Pkg
Pkg.add("CSV")
Reading data
To read data from a CSV file in Julia, we use the read() method from the CSV package and pass DataFrame (from the DataFrames package) as the sink type, as follows −
julia> using CSV, DataFrames
julia> CSV.read("C://Users//Leekha//Desktop//Iris.csv", DataFrame)
150×6 DataFrame
│ Row │ Id    │ SepalLengthCm │ SepalWidthCm │ PetalLengthCm │ PetalWidthCm │ Species        │
│     │ Int64 │ Float64       │ Float64      │ Float64       │ Float64      │ String         │
├─────┼───────┼───────────────┼──────────────┼───────────────┼──────────────┼────────────────┤
│ 1   │ 1     │ 5.1           │ 3.5          │ 1.4           │ 0.2          │ Iris-setosa    │
│ 2   │ 2     │ 4.9           │ 3.0          │ 1.4           │ 0.2          │ Iris-setosa    │
│ 3   │ 3     │ 4.7           │ 3.2          │ 1.3           │ 0.2          │ Iris-setosa    │
│ 4   │ 4     │ 4.6           │ 3.1          │ 1.5           │ 0.2          │ Iris-setosa    │
│ 5   │ 5     │ 5.0           │ 3.6          │ 1.4           │ 0.2          │ Iris-setosa    │
│ 6   │ 6     │ 5.4           │ 3.9          │ 1.7           │ 0.4          │ Iris-setosa    │
│ 7   │ 7     │ 4.6           │ 3.4          │ 1.4           │ 0.3          │ Iris-setosa    │
│ 8   │ 8     │ 5.0           │ 3.4          │ 1.5           │ 0.2          │ Iris-setosa    │
│ 9   │ 9     │ 4.4           │ 2.9          │ 1.4           │ 0.2          │ Iris-setosa    │
│ 10  │ 10    │ 4.9           │ 3.1          │ 1.5           │ 0.1          │ Iris-setosa    │
⋮
│ 140 │ 140   │ 6.9           │ 3.1          │ 5.4           │ 2.1          │ Iris-virginica │
│ 141 │ 141   │ 6.7           │ 3.1          │ 5.6           │ 2.4          │ Iris-virginica │
│ 142 │ 142   │ 6.9           │ 3.1          │ 5.1           │ 2.3          │ Iris-virginica │
│ 143 │ 143   │ 5.8           │ 2.7          │ 5.1           │ 1.9          │ Iris-virginica │
│ 144 │ 144   │ 6.8           │ 3.2          │ 5.9           │ 2.3          │ Iris-virginica │
│ 145 │ 145   │ 6.7           │ 3.3          │ 5.7           │ 2.5          │ Iris-virginica │
│ 146 │ 146   │ 6.7           │ 3.0          │ 5.2           │ 2.3          │ Iris-virginica │
│ 147 │ 147   │ 6.3           │ 2.5          │ 5.0           │ 1.9          │ Iris-virginica │
│ 148 │ 148   │ 6.5           │ 3.0          │ 5.2           │ 2.0          │ Iris-virginica │
│ 149 │ 149   │ 6.2           │ 3.4          │ 5.4           │ 2.3          │ Iris-virginica │
│ 150 │ 150   │ 5.9           │ 3.0          │ 5.1           │ 1.8          │ Iris-virginica │
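The same call also accepts an in-memory buffer, which makes it easy to experiment without a file on disk. The following is a minimal sketch, assuming a recent CSV.jl release in which the sink type is passed as the second argument. The three data rows are copied from the Iris listing above, and the names csv_text and iris_head are illustrative only.

```julia
using CSV, DataFrames

# Three rows copied from the Iris listing; the first line names the columns.
csv_text = """
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# CSV.read takes the source first and the sink type second.
iris_head = CSV.read(IOBuffer(csv_text), DataFrame)

println(size(iris_head))          # (rows, columns)
println(names(iris_head))         # column names taken from the header line
println(iris_head.SepalLengthCm)  # one column as a vector
```

Reading from a buffer and reading from a path go through the same parser, so options tried out this way should carry over to real files.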
Creating new CSV file
To create a new CSV file, we can first create an empty file with touch(), which is a built-in Julia function rather than part of the CSV package (CSV.write() would also create the file on its own). We then build the content as a DataFrame, using the DataFrames package, and write it to the new file with CSV.write() −
julia> using DataFrames
julia> using CSV
julia> touch("1234.csv")
"1234.csv"
julia> new = open("1234.csv", "w")
IOStream(<file 1234.csv>)
julia> close(new)
julia> new_data = DataFrame(Name = ["Gaurav", "Rahul", "Aarav", "Raman", "Ravinder"], RollNo = [1, 2, 3, 4, 5], Marks = [54, 67, 90, 23, 95])
5×3 DataFrame
│ Row │ Name     │ RollNo │ Marks │
│     │ String   │ Int64  │ Int64 │
├─────┼──────────┼────────┼───────┤
│ 1   │ Gaurav   │ 1      │ 54    │
│ 2   │ Rahul    │ 2      │ 67    │
│ 3   │ Aarav    │ 3      │ 90    │
│ 4   │ Raman    │ 4      │ 23    │
│ 5   │ Ravinder │ 5      │ 95    │
julia> CSV.write("1234.csv", new_data)
"1234.csv"
julia> CSV.read("1234.csv", DataFrame)
5×3 DataFrame
│ Row │ Name     │ RollNo │ Marks │
│     │ String   │ Int64  │ Int64 │
├─────┼──────────┼────────┼───────┤
│ 1   │ Gaurav   │ 1      │ 54    │
│ 2   │ Rahul    │ 2      │ 67    │
│ 3   │ Aarav    │ 3      │ 90    │
│ 4   │ Raman    │ 4      │ 23    │
│ 5   │ Ravinder │ 5      │ 95    │
HDF5
HDF5 stands for Hierarchical Data Format version 5. Following are some of its properties −
- A “group” is similar to a directory, and a “dataset” is like a file.
- To associate metadata with a particular group, it uses attributes.
- It uses ASCII names for different objects.
- Language wrappers are often known as “low level” or “high level”.
Opening HDF5 files
HDF5 files can be opened with h5open command as follows −
fid = h5open(filename, mode)
The following table describes the modes −
Sl.No | Mode | Meaning |
---|---|---|
1 | “r” | read-only |
2 | “r+” | read-write − It will preserve any existing contents. |
3 | “cw” | read-write − It will create the file if it does not exist. It will also preserve existing contents. |
4 | “w” | read-write − It will destroy any existing contents. |
The above command returns a file object. In the HDF5.jl release described here its type is HDF5File, a subtype of the abstract type DataFile; newer releases name the type HDF5.File.
Closing HDF5 files
Once finished with a file, we should close it as follows −
close(fid)
It will also close all the objects in the file.
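Opening and closing can also be combined with Julia's do-block form of h5open, which closes the file when the block ends. The following is a minimal sketch, assuming the HDF5.jl package is installed. The file name demo.h5 and the dataset name numbers are hypothetical.

```julia
using HDF5

# Hypothetical file in a temporary directory so nothing on disk is overwritten.
path = joinpath(mktempdir(), "demo.h5")

# "w" creates the file (destroying any existing contents); the do-block
# closes fid automatically when the body finishes.
h5open(path, "w") do fid
    fid["numbers"] = [1, 2, 3]
end

# "r" reopens the same file read-only.
numbers = h5open(path, "r") do fid
    read(fid, "numbers")
end

println(numbers)
```

With this form there is no separate close(fid) call to forget, which is why it is often preferred for short read or write operations.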
Opening HDF5 objects
Suppose we have a file object named fid that contains a group called object1. The group can be opened as follows −
obj1 = fid["object1"]
Closing HDF5 objects
close(obj1)
Reading data
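Suppose a group g contains a dataset with path “dtset”, and we have opened the dataset as dset1 = g["dtset"]. We can read the information in the following ways −
ABC = read(dset1)          # read the whole dataset through its handle
ABC = read(g, "dtset")     # read the whole dataset through the group
Asub = dset1[2:3, 1:3]     # read only a part of the dataset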
Writing data
We can create a dataset in either of the following ways −
g["dset1"] = rand(3,5)
write(g, "dset1", rand(3,5))
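The reading and writing steps can be put together in one round trip. This is a sketch under stated assumptions: a recent HDF5.jl release (which provides create_group), a hypothetical temporary file groups.h5, and a matrix A with known values so that the result can be checked.

```julia
using HDF5

path = joinpath(mktempdir(), "groups.h5")   # hypothetical temporary file
A = reshape(collect(1.0:15.0), 3, 5)        # 3×5 matrix with known values

h5open(path, "w") do fid
    g = create_group(fid, "g")              # group, comparable to a directory
    g["dtset"] = A                          # dataset, comparable to a file
end

whole, part = h5open(path, "r") do fid
    g = fid["g"]
    dset1 = g["dtset"]
    (read(dset1), dset1[2:3, 1:3])          # full read and a partial read
end

println(size(whole))
println(part == A[2:3, 1:3])
```

Indexing the dataset handle reads only the requested block from disk, which matters when the dataset is too large to load in full.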
XML files
Here we discuss the LightXML.jl package, which is a lightweight Julia wrapper for libxml2. It provides the following functionalities −
- Parsing an XML file
- Accessing XML tree structure
- Creating an XML tree
- Exporting an XML tree to a string
Example
Suppose we have an xml file named new.xml as follows −
<Hello>
   <to>Gaurav</to>
   <from>Rahul</from>
   <heading>Reminder to meet</heading>
   <body>Friend, Don't forget to meet this weekend!</body>
</Hello>
Now, we can parse this file by using LightXML as follows −
julia> using LightXML
# below code will parse this xml file
julia> xdoc = parse_file("C://Users//Leekha//Desktop//new.xml")
<?xml version="1.0" encoding="utf-8"?>
<Hello>
   <to>Gaurav</to>
   <from>Rahul</from>
   <heading>Reminder to meet</heading>
   <body>Friend, Don't forget to meet this weekend!</body>
</Hello>
The following example shows how to get the root element and traverse its child nodes. In the output, node type 1 denotes an element node and node type 3 denotes a text node (here, the whitespace between the elements) −
julia> xroot = root(xdoc);
julia> println(name(xroot))
Hello
# Traversing all the child nodes and also print element names
julia> for c in child_nodes(xroot) # c is an instance of XMLNode
          println(nodetype(c))
          if is_elementnode(c)
             e = XMLElement(c) # this makes an XMLElement instance
             println(name(e))
          end
       end
3
1
to
3
1
from
3
1
heading
3
1
body
3
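The remaining two functionalities, creating an XML tree and exporting it to a string, can be sketched as follows. This is a minimal example, assuming LightXML.jl is installed. It rebuilds part of the Hello document in memory rather than reading new.xml, and the variable names are illustrative.

```julia
using LightXML

# Build part of the example document: a root <Hello> with two children.
xdoc = XMLDocument()
xroot = create_root(xdoc, "Hello")
add_text(new_child(xroot, "to"), "Gaurav")
add_text(new_child(xroot, "from"), "Rahul")

# Export the tree to a string, then parse that string back into a document.
xml_text = string(xdoc)
parsed = parse_string(xml_text)

# find_element returns the first child element with the given name.
recipient = content(find_element(root(parsed), "to"))
println(recipient)

# Release the memory held by libxml2 for both documents.
free(xdoc)
free(parsed)
```

To write the tree to disk instead of a string, LightXML also offers save_file(xdoc, filename).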
RDatasets
Julia has the RDatasets.jl package, which provides an easy way to use and experiment with most of the standard datasets available in the core of R. To load and work with one of the datasets included in the RDatasets package, we need to install RDatasets as follows −
julia> using Pkg
julia> Pkg.add("RDatasets")
Subsetting the data
For example, we will load the Gcsemv dataset from the mlmRev group and look at its first six rows as follows −
julia> using RDatasets
julia> GetData = dataset("mlmRev","Gcsemv");
julia> summary(GetData);
julia> first(GetData, 6)
6×5 DataFrame
│ Row │ School       │ Student      │ Gender       │ Written  │ Course   │
│     │ Categorical… │ Categorical… │ Categorical… │ Float64⍰ │ Float64⍰ │
├─────┼──────────────┼──────────────┼──────────────┼──────────┼──────────┤
│ 1   │ 20920        │ 16           │ M            │ 23.0     │ missing  │
│ 2   │ 20920        │ 25           │ F            │ missing  │ 71.2     │
│ 3   │ 20920        │ 27           │ F            │ 39.0     │ 76.8     │
│ 4   │ 20920        │ 31           │ F            │ 36.0     │ 87.9     │
│ 5   │ 20920        │ 42           │ M            │ 16.0     │ 44.4     │
│ 6   │ 20920        │ 62           │ F            │ 36.0     │ missing  │
We can select the data for a particular school by comparing the School column with the school's code as follows −
julia> GetData[GetData.School .== "68137", :]
104×5 DataFrame
│ Row │ School       │ Student      │ Gender       │ Written  │ Course   │
│     │ Categorical… │ Categorical… │ Categorical… │ Float64⍰ │ Float64⍰ │
├─────┼──────────────┼──────────────┼──────────────┼──────────┼──────────┤
│ 1   │ 68137        │ 1            │ F            │ 18.0     │ 56.4     │
│ 2   │ 68137        │ 2            │ F            │ 23.0     │ 55.5     │
│ 3   │ 68137        │ 3            │ F            │ 25.0     │ missing  │
│ 4   │ 68137        │ 4            │ F            │ 29.0     │ 73.1     │
│ 5   │ 68137        │ 5            │ F            │ missing  │ 66.6     │
│ 6   │ 68137        │ 9            │ F            │ 20.0     │ 60.1     │
│ 7   │ 68137        │ 11           │ F            │ 34.0     │ 63.8     │
│ 8   │ 68137        │ 12           │ F            │ 60.0     │ 89.8     │
│ 9   │ 68137        │ 13           │ F            │ 44.0     │ 76.8     │
│ 10  │ 68137        │ 14           │ F            │ 20.0     │ 58.3     │
⋮
│ 94  │ 68137        │ 252          │ M            │ missing  │ 75.9     │
│ 95  │ 68137        │ 254          │ M            │ 35.0     │ missing  │
│ 96  │ 68137        │ 255          │ M            │ 36.0     │ 62.0     │
│ 97  │ 68137        │ 258          │ M            │ 23.0     │ 61.1     │
│ 98  │ 68137        │ 260          │ M            │ 25.0     │ missing  │
│ 99  │ 68137        │ 261          │ M            │ 46.0     │ 89.8     │
│ 100 │ 68137        │ 264          │ M            │ 50.0     │ 70.3     │
│ 101 │ 68137        │ 268          │ M            │ 15.0     │ 43.5     │
│ 102 │ 68137        │ 270          │ M            │ missing  │ 73.1     │
│ 103 │ 68137        │ 272          │ M            │ 43.0     │ 78.7     │
│ 104 │ 68137        │ 273          │ M            │ 35.0     │ 60.1     │
Sorting the data
With the help of the sort!() function, we can sort the data in place. For example, here we sort the dataset in ascending order of the written examination scores; rows with a missing score are placed last −
julia> sort!(GetData, :Written)
1905×5 DataFrame
│ Row  │ School       │ Student      │ Gender       │ Written  │ Course   │
│      │ Categorical… │ Categorical… │ Categorical… │ Float64⍰ │ Float64⍰ │
├──────┼──────────────┼──────────────┼──────────────┼──────────┼──────────┤
│ 1    │ 22710        │ 77           │ F            │ 0.6      │ 41.6     │
│ 2    │ 68137        │ 65           │ F            │ 2.5      │ 50.0     │
│ 3    │ 22520        │ 115          │ M            │ 3.1      │ 9.25     │
│ 4    │ 68137        │ 80           │ F            │ 4.3      │ 50.9     │
│ 5    │ 68137        │ 79           │ F            │ 7.5      │ 27.7     │
│ 6    │ 22710        │ 57           │ F            │ 11.0     │ 73.1     │
│ 7    │ 64327        │ 19           │ F            │ 11.0     │ 87.0     │
│ 8    │ 68137        │ 85           │ F            │ 11.0     │ 27.7     │
│ 9    │ 68137        │ 97           │ F            │ 11.0     │ 57.4     │
│ 10   │ 68137        │ 100          │ F            │ 11.0     │ 61.1     │
⋮
│ 1895 │ 74874        │ 83           │ F            │ missing  │ 81.4     │
│ 1896 │ 74874        │ 86           │ F            │ missing  │ 92.5     │
│ 1897 │ 76631        │ 79           │ F            │ missing  │ 84.2     │
│ 1898 │ 76631        │ 193          │ M            │ missing  │ 72.2     │
│ 1899 │ 76631        │ 221          │ F            │ missing  │ 76.8     │
│ 1900 │ 77207        │ 5001         │ F            │ missing  │ 82.4     │
│ 1901 │ 77207        │ 5062         │ M            │ missing  │ 75.0     │
│ 1902 │ 77207        │ 5063         │ F            │ missing  │ 79.6     │
│ 1903 │ 84772        │ 17           │ M            │ missing  │ 88.8     │
│ 1904 │ 84772        │ 49           │ M            │ missing  │ 74.0     │
│ 1905 │ 84772        │ 85           │ F            │ missing  │ 90.7     │
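Subsetting and sorting can be tried out on a small table without downloading anything. The sketch below assumes a recent DataFrames.jl release. The scores table is made up for illustration and only mimics the shape of the Gcsemv columns.

```julia
using DataFrames

# Made-up scores with the same column names as the Gcsemv data used above.
scores = DataFrame(School  = ["68137", "68137", "20920", "20920"],
                   Written = [23.0, missing, 39.0, 16.0],
                   Course  = [55.5, 66.6, 76.8, 44.4])

one_school = scores[scores.School .== "68137", :]   # keep rows matching a boolean mask
ascending  = sort(scores, :Written)                 # returns a sorted copy; missing sorts last
descending = sort(scores, :Written, rev = true)     # reverse order
complete   = dropmissing(scores, :Written)          # drop rows without a written score

println(one_school)
println(ascending.Written)
println(nrow(complete))
```

sort() leaves the original table untouched, whereas sort!() reorders it in place, so the non-mutating form is the safer choice while exploring data.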
Statistics in Julia
Julia has the StatsBase.jl package, which provides an easy way to do simple statistics. To use it, we need to install the StatsBase package as follows −
julia> using Pkg
julia> Pkg.add("StatsBase")
Simple Statistics
StatsBase provides methods to define weights and calculate means.
We can use the Weights() constructor (or the equivalent weights() function) to define a weight vector as follows −
julia> using StatsBase
julia> WV = Weights([10.,11.,12.])
3-element Weights{Float64,Float64,Array{Float64,1}}:
 10.0
 11.0
 12.0
We can use the isempty() function to check whether the weight vector is empty or not −
julia> isempty(WV)
false
We can check the element type of the weight vector with the help of the eltype() function as follows −
julia> eltype(WV)
Float64
We can check the length of the weight vector with the help of the length() function as follows −
julia> length(WV)
3
There are different ways to calculate the mean −
- Harmonic mean − We can use the harmmean() function to calculate the harmonic mean.
julia> A = [3, 5, 6, 7, 8, 2, 9, 10]
8-element Array{Int64,1}:
  3
  5
  6
  7
  8
  2
  9
 10
julia> harmmean(A)
4.764831009217679
- Geometric mean − We can use the geomean() function to calculate the geometric mean.
julia> geomean(A)
5.555368605381863
- Arithmetic mean − We can use the mean() function to calculate the ordinary (arithmetic) mean.
julia> mean(A)
6.25
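To see what the three functions compute, the means can also be written out from their textbook definitions and compared with the StatsBase results. This is only an illustrative sketch; the variable names harmonic, geometric and arithmetic are not part of any API.

```julia
using StatsBase

A = [3, 5, 6, 7, 8, 2, 9, 10]
n = length(A)

harmonic   = n / sum(1 ./ A)          # n divided by the sum of the reciprocals
geometric  = exp(sum(log.(A)) / n)    # exponential of the mean logarithm
arithmetic = sum(A) / n               # sum divided by the number of elements

# Compare the hand-written formulas with the library functions.
println(harmonic ≈ harmmean(A))
println(geometric ≈ geomean(A))
println(arithmetic == mean(A))
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean, matching the three values shown above.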
Descriptive Statistics
Descriptive statistics is the branch of statistics in which information is extracted from data and summarized. This information describes the essential features of the data.
Calculating variance
We can use var() function to calculate the variance of a vector as follows −
julia> B = [1., 2., 3., 4., 5.];
julia> var(B)
2.5
Calculating weighted variance
We can calculate the weighted variance of a vector with respect to a weight vector as follows −
julia> B = [1., 2., 3., 4., 5.];
julia> a = aweights([4., 2., 1., 3., 1.])
5-element AnalyticWeights{Float64,Float64,Array{Float64,1}}:
 4.0
 2.0
 1.0
 3.0
 1.0
julia> var(B, a)
2.066115702479339
Calculating standard deviation
We can use the std() function to calculate the standard deviation of a vector as follows −
julia> std(B)
1.5811388300841898
Calculating weighted standard deviation
We can calculate the weighted standard deviation of a vector with respect to a weight vector as follows −
julia> std(B,a)
1.4373989364401725
Calculating mean and standard deviation
We can calculate the mean and standard deviation in a single command as follows −
julia> mean_and_std(B,a)
(2.5454545454545454, 1.4373989364401725)
Calculating mean and variance
We can calculate the mean and variance in a single command as follows −
julia> mean_and_var(B,a)
(2.5454545454545454, 2.066115702479339)
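The weighted results above can be reproduced by hand in plain Julia, which shows what the weights do. The sketch below assumes the uncorrected formula (dividing by the sum of the weights), which is what matches the values printed above. The names w, wmean, wvar and wstd are illustrative.

```julia
B = [1., 2., 3., 4., 5.]
w = [4., 2., 1., 3., 1.]   # same values as the aweights vector used above

wmean = sum(w .* B) / sum(w)                    # weighted mean
wvar  = sum(w .* (B .- wmean) .^ 2) / sum(w)    # weighted variance, no bias correction
wstd  = sqrt(wvar)                              # weighted standard deviation

println((wmean, wvar, wstd))
```

Here the weighted mean is 28/11 and the weighted variance is 250/121, which agree with the 2.5454... and 2.0661... returned by mean_and_var(B,a).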
Samples and Estimations
Sampling is the branch of statistics in which sample units are selected from a large population set for analysis.
Following are the ways in which we can do sampling −
Taking a random sample is the simplest way of sampling. Here we draw a random element from the array, i.e., the population set. The function for this purpose is sample().
Example
julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
  8.0
 12.0
 23.0
 54.5
julia> sample(A)
12.0
Next, we can take “n” elements as random samples.
Example
julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
  8.0
 12.0
 23.0
 54.5
julia> sample(A, 2)
2-element Array{Float64,1}:
 23.0
 54.5
We can also write the sampled elements to a pre-allocated array of length “n”. The function to do this task is sample!(); it draws from its first argument and overwrites its second argument with the result.
Example
julia> B = [1., 2., 3., 4., 5.];
julia> X = [2., 1., 3., 2., 5.];
julia> sample!(B,X)
5-element Array{Float64,1}:
 2.0
 2.0
 4.0
 1.0
 3.0
Another way is direct sampling, which randomly picks numbers from a population set and stores them in another array. The function to do this task is direct_sample!(). It is not exported, so it is called with the StatsBase prefix.
Example
julia> StatsBase.direct_sample!(B, X)
5-element Array{Float64,1}:
 1.0
 4.0
 4.0
 4.0
 5.0
Knuth’s algorithm is another way of random sampling, in which the elements are drawn without replacement.
Example
julia> StatsBase.knuths_sample!(B, X)
5-element Array{Float64,1}:
 5.0
 3.0
 4.0
 2.0
 1.0
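The difference between sampling with and without replacement can also be controlled directly through the replace keyword of sample(). The following is a minimal sketch, assuming StatsBase is installed. The seed 42 and the variable names are arbitrary choices, and no output is shown because the drawn values depend on the random number generator.

```julia
using StatsBase, Random

rng = MersenneTwister(42)          # arbitrary seed, chosen only for repeatability
population = [8., 12., 23., 54.5]

# With replacement (the default): the same element may be drawn more than once,
# so the sample can be longer than the population.
with_repl = sample(rng, population, 6)

# Without replacement: every drawn element is distinct,
# so at most length(population) elements can be requested.
without_repl = sample(rng, population, 3; replace = false)

println(with_repl)
println(without_repl)
```

Passing an explicit random number generator as the first argument makes an experiment repeatable from run to run on the same Julia version.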