Tree Clustering Algorithm Project

Project Overview

The objectives of this project were to:

Create a clustering algorithm for tree maps based on multiple factors.
Enable forest owners to analyze tree samples from specific regions, such as dense or mixed wood areas.
Collaborate closely with forest experts to determine optimal thresholds for clustering.
Work with both company-specific and public datasets to implement the algorithm for a forest management tool.

This clustering approach matters in forest management because it helps forest owners understand tree distribution, identify areas of interest, and manage resources effectively. The project was carried out in close coordination with forest experts and is intended to contribute to the forest management software developed by the Liechtenstein Group and Forest Mapping LLC.

Learn more about the forest management tool: FMM Forest Mapping

Determining the Optimal Number of Clusters

To determine the optimal number of clusters, I used silhouette scores, which measure how similar a sample is to its own cluster compared with the nearest other cluster. Scores range from -1 to 1, and higher values indicate better-defined clusters. Here’s the approach used, which computes the mean score for 2 to 10 clusters:


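# Determine the optimal number of clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

silhouette_scores = []
for n_clusters in range(2, 11):  # candidate cluster counts 2..10
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(weighted_features)  # Use weighted features
    # Mean silhouette over all samples for this cluster count
    silhouette_avg = silhouette_score(weighted_features, cluster_labels)
    silhouette_scores.append(silhouette_avg)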
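
The loop above relies on `weighted_features`, which is not defined in this write-up. The following is only a minimal sketch of how such a matrix could be built, assuming each feature is standardized and then multiplied by a weight. The helper `build_weighted_features`, the example columns (longitude, latitude, tree height), and the weights are all hypothetical and are not the thresholds agreed with the forest experts. Because KMeans uses Euclidean distance, a larger weight gives a feature more influence on the clusters.

```python
import numpy as np


def build_weighted_features(features, weights):
    """Standardize each column, then scale it by its weight (hypothetical helper)."""
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if features.ndim != 2 or weights.ndim != 1 or features.shape[1] != weights.shape[0]:
        raise ValueError("expected a 2-D feature matrix and one weight per column")
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (features - mean) / std * weights


# Made-up sample data: longitude, latitude, tree height in metres
example_features = np.array([
    [9.52, 47.14, 18.0],
    [9.53, 47.15, 22.0],
    [9.55, 47.13, 30.0],
    [9.56, 47.16, 12.0],
])
example_weights = [1.0, 1.0, 0.5]  # assumed weights, for illustration only

weighted_features = build_weighted_features(example_features, example_weights)
print(weighted_features.shape)
```

In this sketch the height column ends up with half the spread of the coordinate columns, so position dominates the clustering.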

Plotting Silhouette Scores

The silhouette scores are plotted below against the number of clusters; the highest point on the curve marks the best-scoring cluster count:


# Optional: Plot silhouette scores
import matplotlib.pyplot as plt

# The x-axis matches the cluster counts tried in the loop (2..10)
plt.plot(range(2, 11), silhouette_scores, marker="o")
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score for Different Numbers of Clusters")
plt.show()
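
To make the plotted quantity concrete, here is a small from-scratch sketch of the mean silhouette score. For each sample, `a` is the mean distance to the other members of its own cluster and `b` is the lowest mean distance to any other cluster; the sample's score is `(b - a) / max(a, b)`. The function name `mean_silhouette` and the four toy points are assumptions for illustration, and scoring single-member clusters as 0 is a common convention rather than something stated in this project.

```python
import numpy as np


def mean_silhouette(points, labels):
    """Mean silhouette score with Euclidean distance (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            scores.append(0.0)  # single-member cluster: score 0 by convention
            continue
        a = dists[i, same].mean()
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


# Two tight pairs of points, far apart from each other
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good_score = mean_silhouette(toy_points, [0, 0, 1, 1])  # pairs kept together
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1])   # pairs split apart
print(good_score > bad_score)
```

Grouping the nearby points together gives a score close to 1, while splitting them gives a negative score, which is the behaviour the plot relies on.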

Clustering and Conversion to GeoDataFrame

Once the optimal number of clusters was identified, I performed clustering with KMeans and converted the dataset to a GeoDataFrame for further analysis in tools like QGIS. This process is summarized in the following code:


import geopandas as gpd
import pandas as pd

# Perform clustering with the optimal number of clusters
# (index 0 of silhouette_scores corresponds to 2 clusters, hence the + 2)
n_clusters = silhouette_scores.index(max(silhouette_scores)) + 2
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(weighted_features)

# Add the cluster labels to the original dataset
trees = pd.DataFrame(tree_locations, columns=["longitude", "latitude"])
trees["cluster"] = cluster_labels

# Convert the dataset to a GeoDataFrame
# (EPSG:4326 is an assumption: coordinates taken to be WGS 84 longitude/latitude)
trees_gdf = gpd.GeoDataFrame(
    trees,
    geometry=gpd.points_from_xy(trees.longitude, trees.latitude),
    crs="EPSG:4326",
)

Application and Implementation

The final clustered dataset can be visualized in QGIS or similar tools, providing forest managers with insights into tree density, types, and locations. By defining distinct clusters, the algorithm helps identify regions with specific characteristics, allowing targeted forest management strategies.
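
As one possible way to get a quick overview before opening the data in QGIS, the clustered table can be summarized per cluster with pandas. This is a sketch under stated assumptions: the `trees` table below is a made-up stand-in with the same `longitude`, `latitude`, and `cluster` columns as above, and the summary column names are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the clustered tree table
trees = pd.DataFrame({
    "longitude": [9.52, 9.53, 9.55, 9.56, 9.57],
    "latitude": [47.14, 47.15, 47.13, 47.16, 47.12],
    "cluster": [0, 0, 1, 1, 1],
})

# Tree count and mean position for each cluster
cluster_summary = (
    trees.groupby("cluster")
    .agg(
        tree_count=("longitude", "size"),
        center_longitude=("longitude", "mean"),
        center_latitude=("latitude", "mean"),
    )
    .reset_index()
)
print(cluster_summary)
```

A table like this gives a first impression of how many trees fall into each cluster and roughly where each cluster sits.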