MLearn – Data

The MLearn Data tab loads data from a .csv file on your local drive. The maximum file size is 5 megabytes. Default datasets are also available for demonstration purposes: click the dropdown under ‘load data from app’ to choose one.

To load your own data, click ‘Browse’, select the .csv file that contains your data, click ‘choose’, and then click ‘Submit Change’.

Once loading is complete, the file will appear under the ‘Load Data’ section.

The new data will appear in a table in the ‘Data Input’ section.
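The loading step can be pictured with a minimal sketch in plain Python. This is not MLearn's own code: the function name `load_csv`, the reading of "5 megabytes" as 5 × 1024 × 1024 bytes, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Assumed interpretation of the 5 MB limit; the app may count bytes differently.
MAX_BYTES = 5 * 1024 * 1024


def load_csv(raw: bytes, max_bytes: int = MAX_BYTES) -> list:
    """Parse CSV bytes into a list of row dicts, rejecting oversized input."""
    if len(raw) > max_bytes:
        raise ValueError(f"file is {len(raw)} bytes; limit is {max_bytes}")
    text = raw.decode("utf-8")
    # The first line is treated as the header row.
    return list(csv.DictReader(io.StringIO(text)))


# Made-up file content standing in for an uploaded .csv file.
sample = b"Date,Value\n10/04/2012,3.5\n10/05/2012,4.1\n"
rows = load_csv(sample)
print(rows[0])  # {'Date': '10/04/2012', 'Value': '3.5'}
```

Every value is read as text here; converting columns to numbers or dates is a separate step.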

A portion of the data can be used for analysis instead of the full dataset. Data can be subset by rows, by columns, or by date if the dataset has a date column. For example, to select the first 5 rows for analysis, check ‘Subset Row’ under ‘Extract Data’ and enter 1 and 5 as the beginning and ending row numbers, respectively.

Click ‘Submit Change’ to apply the selection.
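As a rough illustration of what the row subset does, the sketch below assumes that row numbers count from 1 and that both the beginning and ending rows are included, which is what the "1 and 5 gives the first 5 rows" example suggests. The helper `subset_rows` and the sample data are hypothetical.

```python
def subset_rows(rows, start, end):
    """Return rows start..end, counting from 1 and including both ends."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Python slices count from 0 and exclude the end index, hence start - 1.
    return rows[start - 1:end]


# Made-up table of ten rows.
data = [{"id": i} for i in range(1, 11)]
first_five = subset_rows(data, 1, 5)
print([row["id"] for row in first_five])  # [1, 2, 3, 4, 5]
```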

Specific columns can be selected for further analysis in a similar manner. As an alternative to subsetting by rows, data can be extracted by date if the dataset has a date column. To do so, first uncheck ‘Subset Row’ and click ‘Submit Change’ to restore the original dataset. Then check ‘Subset Date’ and select the ‘date column’, the ‘date format’, and finally the ‘start date’ and ‘end date’. In this example, the date column is ‘Date’ and the format is ‘month/date/year’; to select the first 5 rows, choose the start date ‘2012-10-04’ and the end date ‘2012-10-10’.

Click ‘Submit Change’ to apply the selection.
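The date subset can be sketched in the same spirit. The code below assumes that ‘month/date/year’ corresponds to the Python format string `%m/%d/%Y` and that both the start date and the end date are included in the result. The function `subset_by_date` and the sample rows are made up for illustration and are not taken from the app's demo data.

```python
from datetime import date, datetime


def subset_by_date(rows, column, fmt, start, end):
    """Keep rows whose parsed date lies between start and end, inclusive."""
    kept = []
    for row in rows:
        day = datetime.strptime(row[column], fmt).date()
        if start <= day <= end:
            kept.append(row)
    return kept


# Hypothetical rows with month/day/year dates.
rows = [
    {"Date": "10/04/2012"},
    {"Date": "10/05/2012"},
    {"Date": "10/08/2012"},
    {"Date": "10/09/2012"},
    {"Date": "10/10/2012"},
    {"Date": "10/11/2012"},
]
selected = subset_by_date(
    rows, "Date", "%m/%d/%Y", date(2012, 10, 4), date(2012, 10, 10)
)
print(len(selected))  # 5
```

Parsing with an explicit format matters: under a day/month/year reading, ‘10/04/2012’ would mean 10 April rather than 4 October.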

Any missing value, error, or NA in the dataset is converted to NA (not available). NA values can be handled either by removing the rows that contain them or by replacing them with the column mean. For a demonstration, click ‘Refresh’ at the right end of the page to reset it to its default settings. Then, under ‘load data from app’, select the ‘airquality’ data and click ‘Submit Change’.

In the first 10 rows, there are missing values in the 5th, 6th, and 10th rows. To remove rows containing NA, select the ‘remove NA rows’ radio button and click ‘Submit Change’.

This removes every row that contains an NA.

The other option is to select ‘column mean’ under ‘replace NA with’ and click ‘Submit Change’. Each NA is then replaced with the mean of its column.
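The two NA options can be compared with a small sketch. It assumes that NA is represented by `None`, that a row is dropped if any of its cells is NA, and that the column mean is computed over the non-missing values only. Both helper functions and the sample values are hypothetical.

```python
from statistics import mean


def drop_na_rows(rows):
    """Remove every row that has at least one missing (None) cell."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fill_na_with_column_mean(rows):
    """Replace each missing cell with the mean of its column's present values."""
    columns = list(rows[0].keys())
    means = {}
    for col in columns:
        present = [row[col] for row in rows if row[col] is not None]
        # A column with no present values has no mean; its cells stay None.
        means[col] = mean(present) if present else None
    return [
        {col: means[col] if row[col] is None else row[col] for col in columns}
        for row in rows
    ]


# Made-up table with two missing cells.
rows = [
    {"a": 1.0, "b": 10.0},
    {"a": None, "b": 20.0},
    {"a": 3.0, "b": None},
]
print(drop_na_rows(rows))  # [{'a': 1.0, 'b': 10.0}]
print(fill_na_with_column_mean(rows)[1])  # {'a': 2.0, 'b': 20.0}
```

Removing rows shrinks the dataset (here from three rows to one), while mean replacement keeps every row but fills the gaps with an estimate.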

For regression and classification, a train/test split can be selected under the ‘Split Data’ section. The default train/test split is 80/20: in a dataset of 100 rows, 80 rows are used to train the model and 20 rows are used to test it. The split is performed so that the relative proportions of the different labels in the response variable (Y-variable) are preserved.
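A split that preserves label proportions is usually called a stratified split. The sketch below shows one simple way to do it for a categorical response and is not necessarily how MLearn implements it: rows are grouped by label, each group is shuffled, and 80% of each group goes to the training set. The function `stratified_split`, the seed, and the sample labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), splitting each label group separately."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for indices in groups.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Made-up response variable: 70 'yes' rows and 30 'no' rows.
labels = ["yes"] * 70 + ["no"] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20
print(sum(labels[i] == "yes" for i in train_idx))  # 56
```

With 70 ‘yes’ and 30 ‘no’ rows, the training set receives 56 and 24 of them and the test set 14 and 6, so both sets keep the original 70/30 ratio.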