FAQ - Demonstrator

What is a web demonstrator?

A demonstrator is a way to showcase research work.

In which format should my data be?

DyClee performs clustering exclusively on data stored in .csv format (comma-separated values).

Your data must also satisfy the following constraints:

1. The dataset must have a header which contains the column names (typically the names of sensors and a time column).
2. The first column of the dataset must be the time/timestamp column and must be named "time".
3. The dataset must be normalized.
4. There can't be any missing or NULL values in the dataset. Make sure your data is valid.
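
The header, time-column and missing-value constraints above can be checked before uploading. The following is a minimal sketch using only the Python standard library; it is not part of DyClee, and the function name check_dataset and the set of strings treated as missing ("", "NULL", "NaN") are assumptions made for illustration. It does not check normalization.

```python
import csv
import io


def check_dataset(csv_text):
    """Return a list of problems found; an empty list means the checks passed."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or not rows[0]:
        return ["file is empty or has no header"]
    header, data = rows[0], rows[1:]
    problems = []
    # Constraint 2: the first column must be named "time".
    if header[0].strip() != "time":
        problems.append(f"first column is named {header[0]!r}, expected 'time'")
    # Constraint 4: no missing or NULL values anywhere in the data rows.
    for index, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(
                f"row {index}: expected {len(header)} values, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            # Assumption: empty cells and the strings NULL / NaN count as missing.
            if value.strip() == "" or value.strip().upper() in {"NULL", "NAN"}:
                problems.append(f"row {index}: missing value in column {name!r}")
    return problems
```

For example, check_dataset("time,s1\n1,0.1\n") returns an empty list, while a file whose first column is named "t" or that contains an empty cell yields one message per problem.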

Should my data be normalized before clustering?

Yes. The data in the .csv file must be normalized before you upload it to the DyClee interface (Demos); otherwise the clustering cannot be executed. The algorithm does not work on non-normalized data, and if you try to upload a non-normalized dataset, the DyClee parser reports an error and prints the index of the first row in which it finds non-normalized data.

At present DyClee has no built-in preprocessing step that normalizes the data for you, so you must do this manually.
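
One possible way to normalize a dataset by hand is sketched below. This FAQ does not state which normalization DyClee expects, so the sketch assumes min-max scaling of every sensor column to the range [0, 1] and leaves the first (time) column untouched; both choices are assumptions, and min_max_normalize is a hypothetical helper name. The input is expected to contain at least one data row and only numeric sensor values.

```python
import csv
import io


def min_max_normalize(csv_text):
    """Scale every column except the first one to [0, 1] and return new CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Collect the sensor columns (everything except the first, time column).
    columns = [[float(row[i]) for row in data] for i in range(1, len(header))]
    scaled = []
    for values in columns:
        low, high = min(values), max(values)
        span = high - low
        # A constant column has no range; map it to 0.0 to avoid dividing by zero.
        scaled.append([(v - low) / span if span else 0.0 for v in values])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for r, row in enumerate(data):
        # Keep the time value as it was and append the scaled sensor values.
        writer.writerow([row[0]] + [column[r] for column in scaled])
    return out.getvalue()
```

With this sketch, a column holding 10, 20 and 30 becomes 0.0, 0.5 and 1.0.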

How important is the choice of the window size parameter?

The window size does not affect the final clusters produced by the algorithm, but it matters for the algorithm's dynamic behavior. It splits the clustering process into multiple steps, each of which can be visualized, so that new data samples arrive at every step. If the window size is not set, it defaults to the number of samples in the dataset, so only one step takes place and the dynamic behavior is lost. For data that arrives quickly, meaning the intervals between consecutive samples are relatively short, a smaller window size works better, and vice versa.
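
The relationship between window size and the number of steps can be illustrated with a short sketch. It assumes that the window size is counted in samples and that windows do not overlap; the second point is an assumption, and split_into_windows is a hypothetical helper, not a DyClee function.

```python
def split_into_windows(samples, window_size=None):
    """Split a list of samples into consecutive windows of window_size samples."""
    # Default described above: a single window holding the whole dataset.
    if window_size is None:
        window_size = len(samples)
    if window_size <= 0:
        raise ValueError("window_size must be a positive number of samples")
    # Each slice is one step; the last window may be shorter than the others.
    return [
        samples[start:start + window_size]
        for start in range(0, len(samples), window_size)
    ]
```

With 10 samples, leaving the window size unset gives a single step, whereas a window size of 4 gives three steps holding 4, 4 and 2 samples.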

Where can I find examples of the DyClee clustering process?

Under the Démos tab, you can find examples of DyClee applied to datasets of different sizes and kinds. There are 5 predefined DyClee configurations, each run on a specific dataset, together with the results of that run. You can also alter these configurations as you like (changing one or more parameters) and run the demo again to see whether the final clustering changes.

What preprocessing steps do I need to do before clustering?

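Based on the constraints listed above, you must carry out the following preprocessing steps before using DyClee:

1. Add a header containing the column names.
2. Make the time column the first column and name it "time".
3. Normalize the data.
4. Make sure there are no missing or NULL values.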

Do I need to have a header in my .csv file?

Yes. If you are using DyClee via our interface, your dataset must have a header containing the names of its columns. The header is needed to parse the file properly and to let you choose which columns to use in the clustering algorithm.

However, if you are using DyClee from the command line, the header is not necessary and you can simply specify the indices of the columns you would like to use (except the first column, which is always used by the algorithm because it is automatically treated as the time column).

Do I need to have a time column?

Yes. Each dataset must contain a time column that gives either the time at which each sample was taken (an actual timestamp or an amount of time in seconds, minutes or hours) or simply the order of arrival of the samples (1, 2, 3, ...).

Furthermore, the time column must be the first column in the dataset and must be named "time". DyClee automatically treats the first column as the column containing time or timestamps, and the DyClee parser reports an error if the first column is not named "time". When you use the DyClee parser via our interface, the parser prints all column names so you can check that the first column is indeed time. Additionally, when you choose which columns to process with DyClee, you cannot check or uncheck the time column, because it is always taken into account.

The time column, as previously stated, can contain time in any format including timestamps, integers (order of arrival) or floating point values (time measured in seconds).
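
If a dataset has no time information at all, an order-of-arrival column can be added in front of the existing columns. The following is a minimal sketch using the Python standard library; add_arrival_order_column is a hypothetical helper name, and the sketch assumes the input already has a header row.

```python
import csv
import io


def add_arrival_order_column(csv_text):
    """Prepend a 'time' column holding 1, 2, 3, ... and return new CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # The new column goes first and is named "time", as required above.
    writer.writerow(["time"] + header)
    for order, row in enumerate(data, start=1):
        writer.writerow([order] + row)
    return out.getvalue()
```

Applied to a file with columns s1 and s2, this produces a file with columns time, s1 and s2, where time counts the rows starting from 1.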

How can I select which columns I want to use in the clustering?

The DyClee parser processes the file once you upload it to the interface. After successful parsing, it offers a choice of columns based on the header in the .csv, and you can check off the columns you would like to use in the clustering algorithm. If your dataset does not have a header containing column names, the parser returns an error instead, and you cannot proceed with the column choice or the clustering algorithm.
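
The selection behavior described above can be imitated offline, for example to prepare a reduced file. The sketch below is not the DyClee parser; select_columns is a hypothetical helper that keeps the time column unconditionally and rejects files whose first column is not named "time".

```python
import csv
import io


def select_columns(csv_text, wanted):
    """Return the data rows restricted to 'time' plus the wanted column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if not header or header[0] != "time":
        raise ValueError("the first column must be named 'time'")
    unknown = [name for name in wanted if name not in header]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    # The time column is always kept; the others follow the header order.
    kept = ["time"] + [name for name in header[1:] if name in wanted]
    return [[row[name] for name in kept] for row in reader]
```

Values are returned as strings, exactly as read from the file, so they can be written back with csv.writer or converted to numbers as needed.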